could fstat(1) show files in use by vnd(4)?
I had a XEN3_DOM0 kernel crash after doing "umount -f /build" (which I did because "fstat /build" didn't find anything using that filesystem). However there were several vnd(4) devices open by Xen domUs using files on /build that I had completely forgotten about!

    [ 178445.6211804] fatal integer divide fault in supervisor mode
    [ 178445.6211804] trap type 8 code 0 rip 0x80936fb7 cs 0xe030 rflags 0x10246 cr2 0x9f82f9e21000 ilevel 0 rsp 0x9f8305c9ee00
    [ 178445.6211804] curlwp 0x9f8020cd81c0 pid 0.1268 lowest kstack 0x9f8305c9a2c0
    kernel: integer divide fault trap, code=0
    Stopped in pid 0.1268 (system) at netbsd:vndthread+0x677: idivq ff78(%rbp),%rax
    vndthread() at netbsd:vndthread+0x677
    ds   edf0   es  0     fs  81c0          gs  800
    rdi  0      rsi 6     rbp 9f8305c9eef0  rbx e0b0dc0
    rdx  0      rcx 135   rax 0
    r8   0      r9  97a7c r10 9f806119f040  r11 fffe
    r12  0      r13 800   r14 9f8021e3c800  r15 0
    rip  80936fb7 vndthread+0x677  cs e030  rflags 10246  rsp 9f8305c9ee00  ss e02b
    netbsd:vndthread+0x677: idivq ff78(%rbp),%rax
    db{4}>

After a reboot we can see the vnd(4) uses:

    # vndconfig -l
    vnd0: /build (/dev/mapper/vg0-build) inode 861956
    vnd1: /build (/dev/mapper/vg0-build) inode 861966
    vnd2: /build (/dev/mapper/vg0-build) inode 861953
    vnd3: not in use

So, might it be possible to have fstat show these somehow? (Perhaps with the/a kernel thread identified as having them open.)

Also, is this a crash that should be fixed, or is "umount -f" always a Buyer-Beware operation with expected "undefined behaviour"?

-- 
Greg A. Woods
Kelowna, BC  +1 250 762-7675  RoboHack
Planix, Inc.  Avoncote Farms
kernel level multilink PPP and maybe (re)porting FreeBSD netgraph
ng "portability" in mind in the design and implementation of Netgraph too.

I encourage anyone who's read this far, but who doesn't yet know so much about Netgraph, to have a look at Archie Cobbs' DaemonNews article and Julian's slides describing what's been worked on in Netgraph more recently:

    http://people.freebsd.org/~julian/netgraph.html
    http://people.freebsd.org/~julian/BAFUG/talks/Netgraph/Netgraph.pdf

(BTW, Kohler's "Click Modular Router" is another interesting project!)

-- 
Greg A. Woods
Planix, Inc.  +1 416 218 0099  http://www.planix.com/
Re: kernel level multilink PPP and maybe (re)porting FreeBSD netgraph
At Fri, 29 Jan 2010 14:43:38 -0600, David Young wrote:
Subject: Re: kernel level multilink PPP and maybe (re)porting FreeBSD netgraph
>
> On Fri, Jan 29, 2010 at 02:56:31PM -0500, Greg A. Woods wrote:
> > I need advanced kernel-level multilink PPP (MLPPP) support, including
> > the ability to create bundle links via UDP (and maybe TCP) over IP.
>
> Why do you need "kernel-level multilink PPP" support?  Do you need to
> interoperate with existing multilink PPP systems?

Partly, but the biggest concern is performance. I.e.:

1. We absolutely do need to use MLPPP. We do control both ends of the connection, and we may someday look at other protocols, but our current production head-end concentrators are using MLPPP.

2. We also need to do it over multiple connections that are up to many tens of megabits/sec each, perhaps sometimes even 100 Mbps each. Home cable connections are now 10-50 Mbps down or more in many places, and truly high-speed ADSL2 is also growing in availability. We aggregate such connections for both speed and reliability reasons.

Our current low-end FreeBSD-based CPE device, which has a board with a 500 MHz AMD Geode LX800 on it, when connected to a 50 Mbps + 2 Mbps cable connection that has been split into two tunnels, can achieve 8 Mbps max (download) with userland MLPPP, period; but as much as 34 Mbps with MPD using Netgraph MLPPP via UDP, and that was just a quick&dirty test without tuning anything or using truly independent connections.

As I'm sure you know, it's just not feasible to move data fast enough in and out of userland to split and reassemble packets on commodity CPE devices. We also need to do IPsec (with hardware crypto), ipfilter, ethernet bridging and VLANs, etc., all on the same little processors.
Re: kernel level multilink PPP and maybe (re)porting FreeBSD netgraph
At Sat, 30 Jan 2010 11:37:41 +0900, Masao Uebayashi wrote:
Subject: Re: kernel level multilink PPP and maybe (re)porting FreeBSD netgraph
>
> What you need is something like npppd/pipex which OpenBSD has just imported?

Not as it is, as far as I can tell. (I don't see any new documentation imported for it -- just a couple of kernel files and the usr.sbin/npppd stuff, also without manual pages it seems, sigh.)

Does it actually do MLPPP? I only find mention of Multilink PPP (which they abbreviate "MP" for some silly reason) in usr.sbin/npppd/npppd/ppp.h. usr.sbin/npppd seems to be server-only. I need client code first, then eventually server support.

The kernel code (if indeed it has any client code -- not sure yet) doesn't seem to allow forwarding through UDP or TCP. It does mention PPTP and PPPoE in places, but those don't really help me directly.

The document I eventually found here:

    http://www.seil.jp/download/eng/doc/npppd_pipex.pdf

confirms that this seems to be server/concentrator only. (That link sure would have helped me figure this out faster!)

The more I think about it, the more I want the simple way Netgraph modules can be composed into any graph that meets one's current requirements, all without recompiling anything.
Re: kernel level multilink PPP and maybe (re)porting FreeBSD netgraph
At Sat, 30 Jan 2010 15:11:03 -0600, David Young wrote:
Subject: Re: kernel level multilink PPP and maybe (re)porting FreeBSD netgraph
>
> On Sat, Jan 30, 2010 at 03:59:29PM -0500, Greg A. Woods wrote:
> > The kernel code (if indeed it has any client code -- not sure yet)
> > doesn't seem to allow forwarding through UDP or TCP.  It does mention
> > PPTP, and PPPoE in places but those don't really help me directly.
>
> You can operate gre(4) over UDP without involving userland, does that
> help any?

Well, if/when whatever does client-side MLPPP can be configured to use GRE tunnels as members of a bundle, and assuming I can convince MPD on the server side to stick an ng_gre node in before the ng_ppp node on each incoming bundle, then yes, it would help. Ideally, though, I just want to encapsulate the PPP frames in UDP to be directly compatible with MPD on the server side.

> Is MLPPD necessary/desirable for some reason?

I'm not sure what MLPPD is -- did you mean MLPPP? If so, then yes: MLPPP is, currently, a core feature of the project I'm working on. (MLPP is something else entirely, I think -- the closest thing to network protocols I can find is MLPP-over-IP.)
Re: kernel level multilink PPP and maybe (re)porting FreeBSD netgraph
At Sun, 31 Jan 2010 10:35:44 +0900 (JST), Yasuoka Masahiko wrote:
Subject: Re: kernel level multilink PPP and maybe (re)porting FreeBSD netgraph
>
> npppd and pipex don't support multilink PPP.  "MP" in ppp.h have been
> drived from RFC 3145.

Thank you for confirming!
Re: kernel level multilink PPP and maybe (re)porting FreeBSD netgraph
At Sat, 30 Jan 2010 19:35:47 -0500, Thor Lancelot Simon wrote:
Subject: Re: kernel level multilink PPP and maybe (re)porting FreeBSD netgraph
>
> As far as I know, the standard *is* "MP".  MLPPP -- in my years-ago
> experience anyway -- was Livingston's proprietary predecessor of the
> standard protocol; they don't interoperate.

Well, long ago there was RFC 1717, which was written by authors from Newbridge, UCB, and Lloyd Internetworking, and indeed the title of that RFC appears to abbreviate "PPP Multilink Protocol" to "MP" (though perhaps it should be called "PPP-MP"). There was also a protocol from Ascend called Multichannel Protocol Plus (MP+), and I don't know if/how it was related to PPP-MP. Livingston did support RFC 1717 and they also called it "MP", or sometimes "multi-line load balancing". If I remember correctly, Lucent bought Livingston, then Ascend.

Initially I need to inter-operate with a concentrator running MPD on FreeBSD using Netgraph, thus ng_ppp(4), which implements the RFC 1990 PPP Multilink Protocol, probably using UDP encapsulation. (RFC 1990 obsoletes RFC 1717.)

Porting Netgraph still seems to be the best solution all round, though perhaps not the one with the fastest result, unless I can get help on the FreeBSD side at making the code more portable.
Re: blocksizes
At Mon, 1 Feb 2010 15:34:39 -0500 (EST), der Mouse wrote:
Subject: Re: blocksizes
>
> > This can easily happen if you copy the image between disks with
> > different block sizes.
>
> Now _there_ is a valid argument for doing everything in terms of bytes
> (as was discussed briefly upthread).

Indeed. Or at least using only _one_ logical block size that's consistent for the system across all hardware it can use. Otherwise one must have a working equivalent NetBSD system that can make use of both kinds of disks in order to copy an image from one kind of disk to another.

Instead I think it would be best to be able to use any kind of host system to make an image copy of a NetBSD disk, even across disks with different sector sizes, i.e. without having to use a system which understands both the on-disk filesystem and how it deals with different hardware sector sizes.

In the pure sense of trying to do what's optimal for a given system on a given type of hardware, I can understand the desire to use the hardware sector size, or multiples thereof, in the disk driver and to map logical sectors to match. However, for a portable system I think the on-disk filesystem representation should use a single logical sector size across all hardware.

I hesitate to say even this much, never mind any more, because I still feel like I'm sitting firmly and safely on the fence. :-)
Re: (Semi-random) thoughts on device tree structure and devfs
At Sun, 7 Mar 2010 20:50:03 +, Quentin Garnier wrote:
Subject: Re: (Semi-random) thoughts on device tree structure and devfs
>
> On Sun, Mar 07, 2010 at 06:43:49PM +0900, Masao Uebayashi wrote:
> [...]
> You're barking up the wrong tree.  What's annoying is not that the
> numbering changes.  It is that the numbering is relevant to the use of
> the device.  I expect dk(4) devices to be given names (be it real names
> or GUIDs), and I expect to be able to use that whenever I currently have
> to use a string of the form "dkN".

Indeed. This needs carving in stone somewhere, since folks seem to forget it. I think even I have been known to forget it sometimes. ;-)

> Wrong.  Device numbers should be irrelevant to anything but operations
> on device_t objects.

Indeed.
Re: (Semi-random) thoughts on device tree structure and devfs
At Wed, 10 Mar 2010 08:56:36 + (GMT), Iain Hibbert wrote:
Subject: Re: (Semi-random) thoughts on device tree structure and devfs
>
> So, you want to be able to mount a disk by the label:
>
>   $ mount -t msdosfs -o label "foobar" /external_disk_foobar

Yes, something like that, using fs_volname of course. I've wanted this kind of feature for decades.

And of course all the other filesystem tools should have this interface as well. It's no good if it's not uniformly usable. newfs and tunefs need to be able to set and change fs_volname to start with. Disk tools could be made to work with disklabel names too for added fun, but let us not confuse fs_volname with pack names, disklabel names, etc.

Naturally this should not replace the use of the device file, but rather be an optional, additional way to specify the ultimate device used to access the filesystem. In fact I'd much rather see lots of work go into this feature than into anything even remotely related to devfs.

BTW, we don't want to end up with the horrid mess some GNU/Linux systems now use when their kernel configs specify root=LABEL=xxx -- I think we can do much better.

> or, if you know the UUID
>
>   $ mount -t msdosfs -o uuid 3478374923723423 ~/thumb_drive

I think UUIDs, as I understand them so far (fs_id, right?), are really too fragile, too meaningless and difficult to read, and too dangerous to use for this purpose. They are not actually unique, to start with, so labelling them so is just plain wrong. Search Google for Russell Coker's discussion on Label vs. UUID.

Filesystem volume names can be said to have many of the same problems, except that we know and understand from the start that they're not unique, and we can assign human meaning to them and make them memorable.

Let's at least get filesystem access by volume names working right; then we can go on to think about other things, if they still seem worthwhile.
Re: (Semi-random) thoughts on device tree structure and devfs
At Thu, 11 Mar 2010 10:22:29 +0900, Masao Uebayashi wrote:
Subject: Re: (Semi-random) thoughts on device tree structure and devfs
>
> On Thu, Mar 11, 2010 at 4:33 AM, Greg A. Woods wrote:
> > At Wed, 10 Mar 2010 08:56:36 + (GMT), Iain Hibbert wrote:
> > Subject: Re: (Semi-random) thoughts on device tree structure and devfs
> >>
> >> So, you want to be able to mount a disk by the label:
> >>
> >>   $ mount -t msdosfs -o label "foobar" /external_disk_foobar
> >
> > Yes, something like that, using fs_volname of course.  I've wanted this
> > kind of feature for decades.
>
> While I understand usefulness of human-readable labels, I don't think
> it should be handled in kernel.  Because labels are arbitrary.  They
> are not ensured to be unique.

The fs_id value is _NOT_ going to be any more unique than the fs_volname value. The fs_id value is also not guaranteed to be unique to start with, especially not across the operational lifetime of a filesystem. There are a plethora of ways the fs_id can be duplicated, and just about as many ways for it to get lost (or changed without change control) too.

Sure, labels are arbitrary -- at least to the machine. They are not necessarily arbitrary to the human who creates them, though. In any case the label doesn't have to be _guaranteed_ unique to be useful to both the human and the machine. Also, the filesystem identifier doesn't have to be a meaningless, lengthy, impossible-to-memorize string of digits to be useful to the system -- a human-created, human-meaningful label can be just as useful to the machine.

> I think labels should be resolved by some name service.  It's not
> different than /etc/hosts -> IP address.

Sorry, but I'm flabbergasted! What the heck does that mean in this context of filesystem identification? Do you really want to add more complexity, goo, mess, and places for errors to happen by adding a translation layer? First off, there's really nowhere to store your magical mappings.

K.I.S.S. Please!

We do have a place to store a human readable/meaningful filesystem identifier. Let the human provide this label. If the system finds duplicate labels, then tell the human which devices have conflicting labels and where those filesystems were last mounted, and let the human decide which device should be used. (I.e. the labels do need to be unique for a successful automatic initialisation of the system, but there needs to be a manual way to work around them not being unique, regardless of what data they consist of.)

In my opinion the fs_id value is truly useless anywhere outside of the on-disk storage of a single filesystem copy, where its sole valid use is (IIUC) to help match valid backup superblock copies. In fact I'm not even sure it's safe or sane to derive the NFS filehandle from it in any way.
Re: (Semi-random) thoughts on device tree structure and devfs
At Fri, 12 Mar 2010 00:35:24 +0900, Masao Uebayashi wrote:
Subject: Re: (Semi-random) thoughts on device tree structure and devfs
>
> Speaking of tracking state... I've found that keeping track of state
> in devfsd is very wrong.

Indeed -- I do agree with that much at least!

I've had diskless systems running for a long while now (since 2003) where /dev is created by init(8) on every boot (by running /sbin/MAKEDEV, as I've renamed it).

In the extremely rare cases where I've wanted to change permissions or similar on a device node I can just use the normal commands:

    chmod 666 /dev/tty001

and if I want to make such a change persistent across boots I just add that exact same command to /etc/rc.local. There's no magic needed.

I think the only key feature necessary is that devfs handle the normal permissions and ownership changes, but to do so of course with no more persistence than tmpfs, md, or mfs.
Re: (Semi-random) thoughts on device tree structure and devfs
At Fri, 12 Mar 2010 20:22:25 +0100, Manuel Bouyer wrote:
Subject: Re: (Semi-random) thoughts on device tree structure and devfs
>
> On Thu, Mar 11, 2010 at 05:54:11PM -0500, Greg A. Woods wrote:
> > Indeed -- I do agree with that much at least!
> >
> > I've had diskless systems running for a long while now (since 2003)
> > where /dev is created by init(8) on every boot (by running
> > /sbin/MAKEDEV, as I've renamed it).
> >
> > In the extremely rare cases where I've wanted to change permissions or
> > similar on a device node I can just use the normal commands:
> >
> >     chmod 666 /dev/tty001
> >
> > and if I want to make such a change persistent across boots I just add
> > that exact same command to /etc/rc.local.
> >
> > There's no magic needed.
> >
> > I think the only key feature necessary is that devfs handle the normal
> > permissions and ownership changes, but to do so of course with no more
> > persistence than tmpfs, md. or mfs.
>
> This wouldn't work very well for hot-plug devices.
> As I understand it, nodes would be created at plug time, and removed at
> unplug time (correct me if I'm wrong).  So you would need to run you chmod
> when your e.g. USB device is plugged (which is also the time at which you
> know where it will how up in the device space).

Hmmm, well, we have had "hot plug" devices of a sort ever since 1.6 or earlier (when I began using MFS /dev). The only magic trick there is to be able to predict all the possible major and minor numbers at the time you write your MAKEDEV script, or at least be able to update that script as necessary. In the past this has been sufficient, e.g. with SCSI probe and scan detecting new devices.

However, even that kind of magic really isn't truly necessary. Indeed, without devfs it could be as easy as the kernel simply spitting out a message saying that "a device at major N, minor Y" was available to be used (when it was detected), and then leaving it entirely up to the user, or some agent of the user (e.g. a script monitoring for such messages), to run "mknod" as appropriate, perhaps adjusting permissions and ownerships at the same time, possibly even updating /etc/MAKEDEV.local.

In fact I've wanted the kernel to tell me what major/minor number(s) to use for new SCSI devices, though to some extent, the way MAKEDEV is written to use unit numbers, it works well enough.

Obviously there are other ways for the kernel to notify userland of events such as device attach/detach besides having a script monitor /dev/console output or kernel syslog messages. Perhaps kqueue() monitoring /dev itself is sufficient, though perhaps then only for a "flat" file tree in /dev.

So, with a devfs implementation that creates the new /dev file node automatically, the agent script could still be responsible for changing permissions and ownerships as desired. I.e. no magic for persistence of filesystem metadata is necessary in devfs, so long as there are ways to monitor for and handle events that indicate changes have happened in the live state of the devfs filesystem.
Re: config(5) break down
At Tue, 16 Mar 2010 10:22:42 +, Andrew Doran wrote:
Subject: Re: config(5) break down
>
> Correctamundo.  95% of downloads in the week following the release of 5.0
> were for x86.  It doesn't say much about embedded but does tell us that
> a very large segment of the user population does commodity hardware.
>
> (What the figures also revealed was that a number of the ports had as close
> to zero downloads as matters.  Which is, to be frank, a red flag for
> those that are not maintained.)

Please do not even think about using downloads as a measure of which ports are used and how much they are used! That's a completely invalid measurement of how NetBSD might be used. Many of us just download the source. We don't tell you which parts of it we use or don't use.

Even port-* mailing list subscriptions aren't a truly valid hint of which ports are used or how much they are used.
Re: NetBSD & binary [was Re: config(5) break down]
At Tue, 16 Mar 2010 13:35:34 -0400 (EDT), der Mouse wrote:
Subject: Re: NetBSD & binary [was Re: config(5) break down]
>
> Yes, this excludes the people who don't understand and don't want to.
> To steal a term from marketing, I don't think NetBSD should try to
> serve that market segment; it's already well-served by others, and I
> see no percentage in trying to join them.  It doesn't serve them better
> (indeed, by adding yet another alternative they neither are nor want to
> be competent to choose among, it serves them worse) and it doesn't
> serve NetBSD (people who don't even want to understand are among the
> least likely to turn into developers and contribute back).  So what's
> to like?

Thank you! Very well said.

I don't know if it helps to go any further, but perhaps I could add that we don't really want that market segment anyway -- they would only increase our "costs". We need to let them go waste other people's time.
MIPS SoC systems (was: Dead ports [Re: config(5) break down])
At Fri, 19 Mar 2010 21:23:35 +, Herb Peyerl wrote:
Subject: Re: Dead ports [Re: config(5) break down]
>
> On Fri, Mar 19, 2010 at 05:19:47PM -0400, Thor Lancelot Simon wrote:
> > Have a look at
> > http://www.rmicorp.com/assets/docs/2070SG_XLR_XLS_Product_Selection_Guide_2008-12-16.pdf
> > specifically at the bottom few rows on the "XLS" chart.  You're looking at
> > parts that have 3 or 4 Gig-E interfaces, tons of useful hardware offload,
> > and are, by published reports, way down in the sub-$50 range.  You can
> > get very similar stuff from Cavium.
>
> Last time I bought a cavium board it was >$5k USD... An Octeon 3850
> was $700 for 1521 piece part... I didn't think they had anything
> reasonable down below $500?  (and as far as I remember, they already
> had FreeBSD running on the Octeons).  Admittedly it's been a few
> years.

FreeBSD is re-doing all its MIPS support, with quite a bit of work going into the Atheros and Cavium ports. Atheros is running, and some Cavium are running too, but not yet all the most interesting ones. Check out Warner Losh's postings:

    http://bsdimp.blogspot.com/search/label/mips

I'm interested in bringing over some of those ports to NetBSD (though if I try to do it for my day job I'll need to bring over Netgraph first).

Here's one company making Cavium-based systems at a reasonable price:

    http://www.portwell.com/products/detail.asp?CUSTCHAR1=CAM-0100

This one doesn't run FreeBSD yet, but someone is working on it and they are very close (it's not much different from the Cavium eval board Warner shows booting). They have a bunch of higher-end systems based on Cavium CPUs too (and some other CPUs):

    http://www.portwell.com/products/MIPS.asp

This company isn't as low-priced, but has similar devices. This one is just under $500, single unit:

    http://www.lannerinc.com/Network_Application_Platforms/Network_Processor_Platforms/Desktop_NPU_Platforms/MR-320

and they also have a wide product range:

    http://lannerinc.com/Network_Application_Platforms/Network_Processor_Platforms

One of the cheapest Atheros boards is the Ubiquiti RouterStation series. You can get one in a case with power supply from various vendors now for just over $100, single unit pricing (the board is $80):

    http://www.ubnt.com/rspro

This is one that FreeBSD runs on already, and I think adapting our AR53xx port to also work on its AR71xx SoC would be relatively easy. It's pretty snappy, but it has a poorly supported Ethernet switch chip that as yet limits it for use in my day job.

When you start looking at what the GNU/Linux OpenWRT project supports, there are dozens of very interesting little systems available at relatively low prices:

    http://oldwiki.openwrt.org/TableOfHardware.html

Routerboard.com (MikroTik) sell a bunch of interesting boards that, even including their own "proprietary" GNU/Linux port licensing, are still quite cost effective. Most of the more powerful ones are AR71xx based.
Re: Hardware RAID problem with NetBSD 5?
At Tue, 30 Mar 2010 00:38:05 + (UTC), John Klos wrote:
Subject: Hardware RAID problem with NetBSD 5?
>
> ataraid0: found 1 RAID volume
> ld0 at ataraid0 vendtype 3 unit 0: nVidia ATA RAID-1 array
> ld0: 931 GB, 121601 cyl, 255 head, 63 sec, 512 bytes/sect x 1953525120
> sectors

I guess ataraid(4) is broken in NetBSD-5, as it is in NetBSD-4 and -current. See PRs #42985 and #38273 for starters.

> Strange... Does anyone have any ideas?  Has anyone seen behaviour like
> this, particularly the reset button getting disabled?

I booted today's kernel and encountered a rather harder lockup than previously (DDB hung doing a backtrace, sending BREAK to the serial console had no effect), though the reset button, at least on my machine downstairs, still worked fine.

I can imagine some machines where the reset button is more of a software-controlled feature -- I've seen that kind of design mistake several times before -- but I don't know any details of your MSI board (and I can't find any manuals or other information about it on MSI's site).
kernel network interface incompatibilities between netbsd-4 and netbsd-5?
Are there known kernel network interface incompatibilities between netbsd-4 and netbsd-5?

I mention this because, in considering upgrading one of my servers from netbsd-4 to netbsd-5, I noticed that a statically linked arpwatch binary built on netbsd-4 was complaining about bogons on my network even though they were not bogons -- they were all in the same subnet:

    Oct 15 12:36:39 historically arpwatch: bogon 204.92.254.6 8:0:20:21:99:db
    Oct 15 12:36:48 historically last message repeated 10 times
    Oct 15 12:36:49 historically arpwatch: bogon 204.92.254.244 0:f:d3:0:5:83
    Oct 15 12:36:50 historically arpwatch: bogon 204.92.254.6 8:0:20:21:99:db
    Oct 15 12:37:05 historically last message repeated 14 times

I also noticed that unbound wasn't answering DNS queries either, though it didn't make any complaints. Unfortunately the old netbsd-4 fstat is useless against a netbsd-5 kernel, so I couldn't see whether unbound was actually listening on the right interfaces.

-- 
Greg A. Woods
+1 416 218-0098  VE3TCP  RoboHack
Planix, Inc.  Secrets of the Weird
getrusage() problems with user vs. system time reporting
all
    count_bits()       = 0.1968 us/c user, 0.0019 us/c sys, 0.0729 us/c wait, 0.2716 us/c wall
    count_ul_bits()    = 0.2629 us/c user, 0.0026 us/c sys, 0.0835 us/c wait, 0.3489 us/c wall

similarly good with clang (again with -O0): Apple clang version 1.7 (tags/Apple/clang-77) (based on LLVM 2.9svn)

    time()             = 0.1796 us/c user, 0.0016 us/c sys, 0.0175 us/c wait, 0.1987 us/c wall
    nulltime()         = 0.1841 us/c user, 0.0014 us/c sys, 0.0185 us/c wait, 0.2040 us/c wall
    countbits_sparse() = 0.2145 us/c user, 0.0019 us/c sys, 0.0308 us/c wait, 0.2472 us/c wall
    countbits_dense()  = 0.3065 us/c user, 0.0026 us/c sys, 0.0744 us/c wait, 0.3835 us/c wall
    COUNT_BITS()       = 0.1918 us/c user, 0.0016 us/c sys, 0.0407 us/c wait, 0.2341 us/c wall
    count_bits()       = 0.1961 us/c user, 0.0018 us/c sys, 0.0929 us/c wait, 0.2907 us/c wall
    count_ul_bits()    = 0.2548 us/c user, 0.0029 us/c sys, 0.1576 us/c wait, 0.4153 us/c wall

Interestingly, totally as an aside, with clang -O3 the differences between the algorithms are so far into the noise as to be invisible, and it's almost as if the compiler recognised every one of my functions and just did something magic instead:

    tcountbits: now running each algorithm for 3000 iterations
    time()             = 0.1782 us/c user, 0.0010 us/c sys, 0.0059 us/c wait, 0.1851 us/c wall
    nulltime()         = 0.1782 us/c user, 0.0011 us/c sys, 0.0057 us/c wait, 0.1850 us/c wall
    countbits_sparse() = 0.1782 us/c user, 0.0010 us/c sys, 0.0073 us/c wait, 0.1864 us/c wall
    countbits_dense()  = 0.1782 us/c user, 0.0011 us/c sys, 0.0058 us/c wait, 0.1852 us/c wall
    COUNT_BITS()       = 0.1782 us/c user, 0.0011 us/c sys, 0.0051 us/c wait, 0.1844 us/c wall
    count_bits()       = 0.1782 us/c user, 0.0010 us/c sys, 0.0076 us/c wait, 0.1869 us/c wall
    count_ul_bits()    = 0.1782 us/c user, 0.0011 us/c sys, 0.0073 us/c wait, 0.1866 us/c wall

I want to try using OS X's mach_absolute_time() on my Mac instead of gettimeofday(), and perhaps also in parallel with getrusage(), since on OS X the gettimeofday() calls used to seed the values are done without context switches (due to the magic of OS X's COMMPAGE), and thus I think I can safely assume there will be approximately no measurable system time used for each iteration. That might give me reliable timing of each algorithm to compare with getrusage().

I have also already looked at gprof(1) results to compare here as well, but use of gprof(1) isn't always possible, and it doesn't necessarily meet the same needs either -- sometimes "time -l" is all you've got, and that means one must be able to trust getrusage() to give reproducible results.

-- 
Greg A. Woods
Planix, Inc.  +1 250 762-7675  http://www.planix.com/


    #include <sys/types.h>
    #include <sys/time.h>
    #include <sys/resource.h>
    #include <assert.h>
    #include <err.h>
    #include <errno.h>
    #include <limits.h>
    #include <math.h>
    #include <signal.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    /*
     * WARNING: time can appear to have gone backwards with getrusage(2)!
     *
     * See NetBSD Problem Report #30115 (and PR#10201).
     * See FreeBSD Problem Report #975 (and PR#10402).
     *
     * Problem has existed in all *BSDs since 4.4BSD if not earlier.
     *
     * Only FreeBSD has implemented a "fix" (as of rev.1.45 (svn r44725) of
     * kern_resource.c (etc.) on April 13, 1999)
     *
     * But maybe it is even worse than that -- distribution of time between
     * user and system doesn't seem to match reality!
     *
     * See the GNU MP Library (GMP)'s tune/time.c code for better timing?
     */

    #if defined(__APPLE__)
    # define MILLIONS 10		/* my mac is way faster!  :-) */
    #else
    # define MILLIONS 1
    #endif

    static unsigned int iter = MILLIONS * 100UL;

    char *argv0 = "progname";

    /*
     * for info about the worker algorithms used here see:
     *     <http://graphics.stanford.edu/~seander/bithacks.html>
     */

    /*
     * do nothing much, but make sure you do it!
     */
    unsigned int nullfunc(unsigned long);

    unsigned int
    nullfunc(unsigned long v)
    {
    	volatile unsigned int bc = (unsigned int) v;

    	return bc;
    }

    /*
     * return the number of bits set to one in a value
     *
     * old-fashioned bit-by-bit bit-twiddling -- very slow!
     */
    unsigned int count_ul_bits(unsigned long);

    unsigned int
    count_ul_bits(unsigned long v)
    {
    	unsigned int c;

    	c = 0;
    	/*
    	 * we optimize away any high-order zero'd bits...
    	 */
    	while (v) {
    		c += (v & 1);
    		v >>= 1;
    	}

    	return c;
    }

    /*
     * return the number of bits set to one in a value
     *
     * Subtraction of 1 from a number toggles all the bits (from right to
     * left) up to and including the rightmost set bit.
     *
     * So, if we decrement a number by 1 and do a bitwise and (&) with itself
     * ((n-1) & n), we will clear the rightmost set bit in the number.
     *
     * Therefore if we do
Re: getrusage() problems with user vs. system time reporting
On Thu, Oct 27, 2011 at 07:24:03 +0300, Jukka Ruohonen wrote:
>
> This is a well-known bug that is over 15 years old.  The much simpler tests in
> atf(7) replicate it well.  The used tracker PR is kern/30115.  Michael van
> Elst suggested therein a couple of reasonable (IMO) solutions.

Part of the point of this new discussion is that I am attempting, perhaps poorly, to show that PR#30115 and its historical counterpart, and similar reports in the PR databases of other *BSDs, represent a separate, unique problem. It is possible that the problem I'm trying to show here shares, or is at least related to, the same cause as the problem shown in PR#30115. That's part of what I'm trying to discover here. However FreeBSD's solution to PR#30115 is not in any way a valid solution to the problem I'm trying to show here, regardless of whether the two problems share a cause. That solution will prevent the little wobbles that the simplistic tests demonstrate, but it won't make overall getrusage() timing results any more meaningful or consistent. Indeed it may even make them a wee bit more wrong, though I'm not sure this last part matters so much. From what I understand currently, especially if the root causes of these problems are related, David's proposed solution would be on the right track:

On Fri, 28 Oct 2011 08:48:19 +0100, David Laight wrote:
>
> If you are willing to take the cost of getting the timestamp (in
> some units) on every kernel entry/exit (as well as the process switch)
> then the time in usr/sys can be added to the clock tick counts and
> used when the actual execution time is split.
> (Doing it that way means the units don't have to be THAT accurate)

Hmmm... if we could save the current time on every kernel entry, and then increment a new "l_systime" variable with the elapsed time on every return to user mode, and of course use the same clock as is used for l_rtime (i.e.
binuptime()), then the only wild-card variable left is interrupt time. Just how expensive is updatertime() and the associated bookkeeping it needs? Hmmm...

So, then user time would be the difference between the sum of thread runtimes and the sum of thread systimes, less some value for interrupt time. Ideally interrupt time would also be measured similarly (using the same clock again) by the interrupt dispatcher and accumulated against whatever thread (kernel or user) was interrupted (e.g. in l_intrtime). However I don't quite see how this could be done safely, especially in conjunction with SMP, though I'm not familiar enough with the details of the locking that might be required to know for sure. If I'm wrong and it is possible, then directly measuring and accounting for interrupt time would also be a very good thing (assuming it wouldn't be so costly as to radically change overall system performance).

In any case, with the current state of affairs I'm beginning to think the interrupt ticks are the real wild-cards here, and I'm wanting to modify getrusage() to return a new ru_itime value as well (or add a new system call to return the raw p_rtime and p_*ticks values along with stathz). After all, how likely is it that the average of time accounted to p_iticks will actually match the true time used by interrupts? I'm guessing average interrupt service times are far less than stathz intervals.

I'm also wondering if I can force "stathz=0" at runtime, perhaps with a sysctl, so that I can also avoid the perturbations caused by having a different (and possibly changing) statclock rate. It's all well and good to try to reduce the cost of statclock handling by giving it a much lower rate than hardclock, but in the end that just makes the division of p_rtime as returned by getrusage() effectively meaningless, and thus some of the work done by statclock may as well simply not be done at all in the first place when stathz is non-zero.
It would be much less misleading, to say the least.

-- Greg A. Woods Planix, Inc. +1 250 762-7675 http://www.planix.com/
Re: getrusage() problems with user vs. system time reporting
At Mon, 31 Oct 2011 21:10:40 +0000, David Laight wrote:
Subject: Re: getrusage() problems with user vs. system time reporting
>
> There is a kernel option in i386 (and maybe amd64) to do some
> per-syscall stats.  One of those counts 'time in syscall' and IIRC
> could easily be used to weight the tick counts so that getrusage
> gives more accurate times.

I had no idea there was a SYSCALL_TIMES_PROCTIMES option as well! (I see it's been sitting there undocumented since 2007, so it is already in the netbsd-5 sources I'm experimenting with!) This is exciting! This is what I was looking for! I had ignored SYSCALL_TIMES because it seemed from the manual pages to be lacking per-process hooks, though I was getting to the place where I might have noticed that this would be an appropriate framework in which to add per-process support. :-)

So, if I understand the way SYSCALL_TIMES_PROCTIMES is implemented, it effectively converts the struct proc p_*ticks fields into cycle counts instead of stathz tick counts. (Though it seems enabling this does not disable the additional accumulation of stathz ticks, nor does it adjust the calculations in kern_resource.c to give expected values.)

It looks like SYSCALL_TIMES is indeed on both i386 and amd64 at this time, which will do fine for me for now, but given how it seems to work it looks like it could be made to work on alpha, mips, powerpc, sparc64, and ia64 with relative ease.

> The problem is that getting an accurate timestamp is relatively
> expensive.  It has been almost the dominant part of process switch.

Because they do use cpu_counter32(), I'm surprised they would be that expensive to keep. If one were to get rid of the big syscall_counts and syscall_times tables and just use the bits necessary for SYSCALL_TIMES_PROCTIMES, would that help reduce the overhead to a more acceptable level?
BTW, have you ever built and tested a kernel with appropriate instances of SYSCALL_TIME_ISR_ENTRY() and SYSCALL_TIME_ISR_EXIT() put into place? If so, do you have suggestions as to where I could try putting those macros, especially in a netbsd-5 kernel? In my estimation it's useless to try to make getrusage() show more accurate user time without also firmly accounting for ISR times. Relatively speaking, I don't mind at all taking a small, equitable hit in context switching if as a result I can get relatively accurate user and system (and ISR) times per process.

-- Greg A. Woods Planix, Inc. +1 250 762-7675 http://www.planix.com/
Re: getrusage() problems with user vs. system time reporting
At Mon, 31 Oct 2011 23:28:49 +0000, David Laight wrote:
Subject: Re: getrusage() problems with user vs. system time reporting
>
> On Mon, Oct 31, 2011 at 04:08:13PM -0700, Greg A. Woods wrote:
> >
> > So, if I understand the way SYSCALL_TIMES_PROCTIMES is implemented it
> > effectively converts the struct proc p_*ticks fields into cycle counts
> > instead of stathz tick counts.  (though it seems enabling this does not
> > disable the additional accumulation of stathz ticks, nor does it adjust
> > the calculations in kern_resource.c to give expected values)
>
> It doesn't matter.  With the cycle counter values, the stathz ticks
> are noise.  The counts are then a bit like doing the stathz count
> on every tick of the cycle counter!

Ah, yes, of course! I realized that shortly after posting, while I was adding #ifdef's to turn off the counting in statclock(). :-)

> > Because they do use cpu_counter32(), I'm surprised they would be that
> > expensive to keep.
>
> If a cpu's TSC rate changes (eg with power saving) then they'll give
> different results.  So you'd really need a nanotime function.
> OTOH using the values to split the total execution time is probably
> always better than the current code.

Wikipedia's entry on "Time Stamp Counter" (and Intel's app-note about using RDTSC for performance monitoring) also mentions that on any processor since the Pentium Pro, with out-of-order execution, an accurate cycle count can only be obtained by preceding the RDTSC instruction with a serialising instruction such as CPUID, or, on CPUs which support it, by using the RDTSCP instruction. It is also mentioned that some processors run the time-stamp cycle counter at a constant rate (not the actual current CPU clock rate) (though apparently this quirk can be identified...).

And of course there's the issue of multiple processors, since as I understand it the TSCs on different cores are not synchronised.
Finally, though I'm still learning more about virtual TSCs on VMware and VirtualBox, I'm not so sure the TSC will be at all useful in such a virtual machine environment. Indeed, with all this doom and gloom about TSC it seems it might be better just to use binuptime() -- though that probably won't be as fast. Perhaps if I'm inspired tomorrow I'll try to re-do the syscall_stats.h macros to use it instead of cpu_counter32(), and then use this real measure of system time in calcru() instead of pretending the stathz ticks mean anything.

> Getting the TSC is (IIRC) 30-40 clocks on i386 - because it is a
> synchronising instruction.  But it might be the delays only really
> affect back to back reads.  ad@ knows more - it will be in the
> archives somewhere.

I found some references saying that it could be 150-200 cycles, and another saying that it was closer to 80 cycles.

BTW, I don't seem to have any luck identifying the CPU in the VirtualBox VM that I'm running NetBSD-5 in:

# cpuctl identify 0
cpuctl: cpuset_create: Cannot allocate memory

ktrace seems to say the error is coming from sysctl(), not calloc():

   354      1 cpuctl   CALL  __sysctl(0xbfbfe954,2,0x80758e0,0xbfbfe95c,0,0)
   354      1 cpuctl   RET   __sysctl 0
   354      1 cpuctl   CALL  open(0x806b261,2,0x51)
   354      1 cpuctl   NAMI  "/dev/cpuctl"
   354      1 cpuctl   RET   open 3
   354      1 cpuctl   CALL  __sysctl(0xbfbfe888,2,0xbfbfe8e0,0xbfbfe8e4,0,0)
   354      1 cpuctl   RET   __sysctl 0
   354      1 cpuctl   CALL  __sysctl(0xbfbfe858,2,0xbfbfe8b0,0xbfbfe8b4,0,0)
   354      1 cpuctl   RET   __sysctl 0
   354      1 cpuctl   CALL  __sysctl(0x8072130,2,0xbfbfe8e0,0xbfbfe8e4,0,0)
   354      1 cpuctl   RET   __sysctl -1 errno 12 Cannot allocate memory

I was about to try to copy over a sysctl.debug symbol file from my build machine, after turning on the network, when I got a crash as I started the rcp. It's the first time I've seen such a crash, and the only difference is that I've turned on SYSCALL_TIMES_PROCTIMES et al.
Mutex error: lockdebug_barrier: spin lock held

lock address : 0xc0d4de54  type     : spin
initialized  : 0xc04e7086
shared holds : 0           exclusive: 1
shares wanted: 0           exclusive: 0
current cpu  : 0           last held: 0
current lwp  : 0xd3d02840  last held: 0xd3d02840
last locked  : 0xc04e622c  unlocked : 0xc04e624b
owner field  : 0x00010700  wait/spin: 0/1

panic: LOCKDEBUG
Begin traceback...
copyright(c0d50643,0,0,c0c36d90,d2a69c40,d2a69bd8,c0d4de54,c0c33a24,d3d02840,c04e622c) at 0xc0b8d29d
End traceback...
dumping to dev 0,1 offset 2000263
dump 51 50 49 48 47 46 45 44 43 42 41 40 39 38 37 36 35 34 33 32 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17
questions about clocks, and statclock() in particular....
+		 * time it can be highly inaccurate, especially
+		 * for interrupt service routines which may
+		 * routinely be much shorter-running than
+		 * stathz.
+		 *
+		 * Statistically though the likelihood of a
+		 * statclock() call coming while another ISR
+		 * is already executing is therefore relatively
+		 * low, unless perhaps the interrupt rate is
+		 * really high, in which case taking a whole
+		 * stathz tick ``hit'' for ISRs might, on
+		 * average, be OK.
+		 */
			p->p_iticks++;
+#endif
		}
		spc->spc_cp_time[CP_INTR]++;
	} else if (p != NULL) {
+#ifndef SYSCALL_PROCTIMES
+		/*
+		 * Unfortunately since this allocates a whole stathz
+		 * interval to the process as system time it can be
+		 * highly inaccurate, especially for some system calls
+		 * which may take far less than a stathz interval to
+		 * finish.
+		 *
+		 * Unlike interrupts though, some system calls may
+		 * actually run fairly long.  Does that make this fair?
+		 */
		p->p_sticks++;
+#endif
		spc->spc_cp_time[CP_SYS]++;
	} else {
		spc->spc_cp_time[CP_IDLE]++;
	}
	}
-	spc->spc_pscnt = psdiv;

	if (p != NULL) {
+		struct vmspace *vm = p->p_vmspace;
+		long rss;
+
+		/*
+		 * If the CPU is currently scheduled to a non-idle process,
+		 * then charge that process with the appropriate VM resource
+		 * utilization for a tick.
+		 *
+		 * Assume that the current process has been running the entire
+		 * last tick, and account for VM use regardless of whether in
+		 * user mode or system mode (XXX or interrupt mode?).
+		 *
+		 * rusage VM stats are expressed in kilobytes *
+		 * ticks-of-execution.
+		 */
+		/* based on code from 4.3BSD kern_clock.c and from FreeBSD ... */
+
+# define pg2kb(n)	(((n) * PAGE_SIZE) / 1024)
+
+		p->p_stats->p_ru.ru_idrss += pg2kb(vm->vm_dsize); /* unshared data */
+		p->p_stats->p_ru.ru_isrss += pg2kb(vm->vm_ssize); /* unshared stack */
+		p->p_stats->p_ru.ru_ixrss += pg2kb(vm->vm_tsize); /* "shared" text? */
+
+		rss = pg2kb(vm_resident_count(vm));
+		if (rss > p->p_stats->p_ru.ru_maxrss)
+			p->p_stats->p_ru.ru_maxrss = rss;
+
+		/* finally account overall for one stathz tick */
		atomic_inc_uint(&l->l_cpticks);
+
+		/* we're done with mucking with proc stats fields */
		mutex_spin_exit(&p->p_stmutex);
	}
+
+	/*
+	 * reset the profhz divisor counter
+	 *
+	 * Note we must use the global variable here, not spc->spc_psdiv, since
+	 * the statclock rate may already have been lowered by another CPU
+	 */
+	spc->spc_pscnt = psdiv;
 }

-- Greg A. Woods Planix, Inc. +1 250 762-7675 http://www.planix.com/
Re: getrusage() problems with user vs. system time reporting
At Tue, 01 Nov 2011 01:43:23 -0700, "Greg A. Woods" wrote:
Subject: Re: getrusage() problems with user vs. system time reporting
>
> Indeed with all this doom and gloom about TSC it seems it might be
> better just to use binuptime() -- that probably won't be as fast
> though...  Perhaps if I'm inspired tomorrow I'll try to re-do the
> syscall_stats.h macros to use it instead of cpu_counter32(), and then use
> this real measure of system time in calcru() instead of pretending the
> stathz ticks mean anything.

So, I've done that now. (Including removing the dependency on SYSCALL_COUNTS and SYSCALL_TIMES, etc.; all except for figuring out how to hook in the ISR hooks...)

Seems binuptime() is indeed way too expensive to run at every mi_switch(), syscall(), etc. (it more than doubles the time it takes for gettimeofday() to run), but getbinuptime() seems to be sufficiently low-cost to use in these situations.

Unfortunately getbinuptime() isn't immediately looking a whole lot better than the statistical sampling in statclock(), though perhaps, with enough benchmark runtime, it is, as expected, being _much_ more fair at splitting between user and system time. At least this is the case on a VirtualBox VM. I need to see it on real hardware next.

With some further analysis, and with the addition of new time values to struct proc so that statclock() ticks can also be accounted (right now I re-use the p_*ticks storage for 64-bit nanosecond values), it may be possible to come up with a simple algorithm so that calcru() can balance out the difference between the getbinuptime() values and the true binuptime() stored in {l,p}_rtime. Storing the raw getbinuptime() values would also avoid having to do 64-bit adds in mi_switch(), syscall(), et al.

If anyone's interested in more details I can post some of my results, as well as the changes I've made. Any comments about this would be appreciated!
One thing that's confusing me is that though normally for short-running processes I'm seeing the getbinuptime() values be either zero, or somewhat less than the binuptime() value from p_rtime, on rare occasions I also see "vastly" larger getbinuptime() values. For example (this from calcru(), as it is called in kern_exit.c):

exit|tty: atrun[377]: rt=2216 ms, u+s=0 ms (u=0 ns / s=0 ns), it=0 ticks
exit|tty: atrun[361]: rt=2066 ms, u+s=0 ms (u=0 ns / s=0 ns), it=0 ticks
exit|tty: atrun[409]: rt=2207 ms, u+s=0 ms (u=0 ns / s=0 ns), it=0 ticks
exit|tty: atrun[130]: rt=2272 ms, u+s=1694 ms (u=0 ns / s=1694070 ns), it=0 ticks
exit|tty: atrun[434]: rt=4048 ms, u+s=10151 ms (u=10151849 ns / s=0 ns), it=0 ticks
exit|tty: atrun[162]: rt=3209 ms, u+s=0 ms (u=0 ns / s=0 ns), it=0 ticks
exit|tty: atrun[458]: rt=2576 ms, u+s=0 ms (u=0 ns / s=0 ns), it=0 ticks

"rt" is the real-time (p_rtime) value calculated by calcru(), in ms
"u" is the accumulated user time from getbinuptime() calls, in ns
"s" is the accumulated system time from getbinuptime() calls, in ns
"it" is the old-style statistically gathered p_iticks value
"u+s" is of course "u" + "s", converted to ms

(The longest-running sample above was when the VM was idle except for cron, but the VirtualBox host, my desktop, may have been quite busy.)

-- Greg A. Woods Planix, Inc. +1 250 762-7675 http://www.planix.com/
Re: getrusage() problems with user vs. system time reporting
BTW, there's a bug in my code -- min() should be max(), I think.

At Sun, 6 Nov 2011 14:26:02 +0000 (UTC), mlel...@serpens.de (Michael van Elst) wrote:
Subject: Re: getrusage() problems with user vs. system time reporting
>
> wo...@planix.ca ("Greg A. Woods") writes:
>
> >I tried using hardclock_ticks in sys/syscall_stats.h, but even with
> >HZ=1000, the resolution was too fine compared to the time taken by
> >some system calls, and even the average time slice for user mode.
> >It is _way_ better than statclock ticks though, and "almost" free.
>
> Isn't statclock there to have measurements that are _not_ synchronized
> with hardclock?

Yes, sort of. If statistical sampling of the program counter is being done by the clock interrupt(s) (i.e. SYSCALL_* options are not enabled), then yes, definitely. Using hardclock() alone can allow a process to accidentally or purposefully become synchronised to the system clock, leading to either inaccurate resource utilisation or even deliberate resource hiding. Note that on i386, so far as I can tell, stathz is always zero, so hardclock() is always used to collect usage samples anyway.

With SYSCALL_* and using hardclock_ticks as the timer, I'm not sure. Perhaps an evil process could call a cheap system call such as getpid() just before a hardclock_tick would be incremented, which would potentially give it an extra tick of user-mode CPU time. That's why I would like to use a higher-resolution timer with SYSCALL_*.

Using the CPU time-stamp counter is much higher resolution of course, and in theory with it it would be impossible for any process to avoid resource usage detection. However, since the timer is so fast that I am forced to somehow divide the counter down to a reasonable rate such that the calculations using it don't overflow, there may still be room for problems.
It's really too bad that the timecounter infrastructure seems only geared to providing one timer, and there's no easy/obvious way (that I can see) for any other kernel subsystem to reach in and access any other potentially usable monotonic timer in a more machine-independent way (but without having to use all of binuptime() or similar). Obviously I could do this in a platform-specific way where possible, but...

With a true monotonic timer, one that gives known units of time, the SYSCALL_* feature could account for the actual time used in each mode, using context switches to mark the divisions, instead of having to do any statistical sampling, and instead of having to use some timer with an unknown rate (such as the CPU TSC) as a ticker that's then used to divide out p_rtime between system, user, and interrupt time.

I wonder if I could get by using the frequency discovered by the timecounter code for the CPU TSC and then from that calculate an estimate for the amount of real time spent in each mode? The issues with CPU timestamp counters would still remain, but perhaps that's better than trying to use only some bits of the TSC sums.

-- Greg A. Woods Planix, Inc. +1 250 762-7675 http://www.planix.com/
clues to getrusage() problems with user vs. system time reporting
count_bits()       = 0.0515 us/c user, 9.6292 us/c sys, 0.0128 us/c wait, 9.6935 us/c wall
count_ul_bits()    = 0.0561 us/c user, 9.8984 us/c sys, 0.0280 us/c wait, 9.9826 us/c wall
       68.87 real         0.37 user        68.47 sys
     20356  maximum resident set size
         7  average shared memory size
       408  average unshared data size
         7  average unshared stack size
        59  page reclaims
         0  page faults
         0  swaps
         0  block input operations
         0  block output operations
         0  messages sent
         0  messages received
         0  signals received
         0  voluntary context switches
        45  involuntary context switches

Would anyone who bothered to read this far, and who knows a thing or two about context switching and such, care to try to help see if the hooks can be better placed or otherwise improved?

FYI, with random() my test program is still showing wonky, wobbly results on my iMac, and though they are not terrible, they're not really usable at this resolution either:

$ /usr/bin/time -l ./tcountbits -t -r -i 100
tcountbits: now running each algorithm for 1 iterations
random()           = 0.0056 us/c user, 0. us/c sys, 0.0002 us/c wait, 0.0058 us/c wall
nulltime()         = 0.0113 us/c user, 0. us/c sys, 0.0006 us/c wait, 0.0119 us/c wall
countbits_sparse() = 0.0616 us/c user, 0.0001 us/c sys, 0.0028 us/c wait, 0.0645 us/c wall
countbits_dense()  = 0.1372 us/c user, 0.0004 us/c sys, 0.0102 us/c wait, 0.1479 us/c wall
COUNT_BITS()       = 0.0152 us/c user, 0. us/c sys, 0.0006 us/c wait, 0.0158 us/c wall
count_bits()       = 0.0205 us/c user, 0. us/c sys, 0.0007 us/c wait, 0.0212 us/c wall
count_ul_bits()    = 0.0949 us/c user, 0.0003 us/c sys, 0.0044 us/c wait, 0.0996 us/c wall
       36.71 real        34.62 user         0.09 sys
    368640  maximum resident set size
         0  average shared memory size
         0  average unshared data size
         0  average unshared stack size
       110  page reclaims
         0  page faults
         0  swaps
         0  block input operations
         0  block output operations
         0  messages sent
         0  messages received
         0  signals received
         0  voluntary context switches
     25945  involuntary context switches

-- Greg A. Woods Planix, Inc. +1 250 762-7675 http://www.planix.com/
Re: language bindings (fs-independent quotas)
At Fri, 18 Nov 2011 16:27:53 +0200, Alan Barrett wrote:
Subject: Re: language bindings (fs-independent quotas)
>
> On Fri, 18 Nov 2011, Manuel Bouyer wrote:
> >> Assuming that there's no need to handle fields with embedded
> >> spaces, perl's split() function will DTRT.
> >
> > No, it does not, because there are fields that can be empty.
>
> The common way of dealing with that is to have a placeholder like "-"
> for empty fields.

I dunno (and don't want to know :-)) about perl, but it's easy enough to insert proper field separators into fixed-width columnar input with Awk and then go about using split() or whatever uses FS normally.

-- Greg A. Woods Planix, Inc. +1 250 762-7675 http://www.planix.com/
Re: Lost file-system story
At Tue, 6 Dec 2011 12:44:16 -0500, Donald Allen wrote:
Subject: Re: Lost file-system story
>
> much more clear.  When I read this before the fun started, I took it to
> mean, perhaps unjustifiably, what I know to be true -- there is some
> non-zero probability that fsck of an async file-system will not be
> able to verify and/or restore the filesystem to correctness after a
> crash.  You are saying that the probability, in the case of NetBSD, is
> 1.  If that's true, that there's no periodic sync, I would say that's
> *really* a mistake.  It should be there with a knob the administrator
> can turn to adjust the sync frequency.

Just to be clear: there is such a knob, or rather a binary switch. It's called umount(2).

sync(2) might work too, but I seem to vaguely remember something about it not working for async-mounted filesystems, and some obscure reason why it wouldn't/couldn't work for them, though that doesn't seem logical to me any more. sync(2) should, IMHO, even go so far as to cause the dirty flag to be cleared on the disk once all the writes to flush all necessary updates have completed (assuming of course that no further changes of any kind are made to the filesystem after sync(2) has scheduled all the writes, and assuming of course that writes cached in the storage interface controller or in the drive controller will be written out in order).

In theory "mount -u -r" should work too, but then there's PR#30525. Steve Bellovin asked a question some time ago on netbsd-users about why umount(2) works but "mount -u -r" doesn't, and to the best of my understanding it hasn't been answered yet (though mention was made of a possible fix to be found in FreeBSD, followed by some musings on how hard it is to find and use such fixes in the diverging code bases of FreeBSD and NetBSD).
Perhaps sync(2) will fail for async-mounted filesystems, or even without MNT_ASYNC, for the same reason that "mount -u -r" fails, though that's pure speculation based on my vague ideas, and is not based on anything in the code. The question was asked in PR#30525 about "mount -u -r" vs. filesystems mounted with MNT_SYNC, but nobody knew if that would make any significant difference or not (and I would naively suspect not).

Perhaps the superblock should also record when a filesystem has been mounted with MNT_ASYNC so that fsck(8) can print a warning such as:

	"FS is dirty and was mounted async.  Demons will fly out of your nose"

-- Greg A. Woods Planix, Inc. +1 250 762-7675 http://www.planix.com/
Re: Lost file-system story
At Fri, 9 Dec 2011 15:50:35 -0500, Donald Allen wrote:
Subject: Re: Lost file-system story
>
> "does not guarantee to keep a consistent file system structure on the
> disk" is what I expected from NetBSD.  From what I've been told in this
> discussion, NetBSD pretty much guarantees that if you use async and
> the system crashes, you *will* lose the filesystem if there's been any
> writing to it for an arbitrarily long period of time, since apparently
> meta-data for async filesystems doesn't get written as a matter of
> course.

I'm not sure what the difference is. You seem to be quibbling over minor differences and perhaps one-off experiences. Both OpenBSD and NetBSD also say that you should not use the "async" flag unless you are prepared to recreate the file system from scratch if your system crashes. That means use newfs(8) [and, by implication, something like restore(8)], not fsck(8), to recover after a crash. You got lucky with your test on OpenBSD.

> And then there's the matter of NetBSD fsck apparently not
> really being designed to cope with the mess left on the disk after
> such a crash.  Please correct me if I've misinterpreted what's been
> said here (there have been a few different stories told, so I'm trying
> to compute the mean).

That's been true of Unix (and many unix-like) filesystems and their fsck(8) commands since the beginning of Unix. fsck(8) is designed to rely on the possible states of on-disk filesystem metadata, because that's how Unix-based filesystems have been guaranteed to work (barring use of MNT_ASYNC, obviously). And that's why, by default and by very strong recommendation, filesystem metadata for Unix-based filesystems (sans WAPBL) should always be written synchronously to the disk if you ever hope to even try to use fsck(8).

> I am not telling the OpenBSD story to rub NetBSD peoples' noses in it.
> I'm simply pointing out that that system appears to be an example of
> ffs doing what I thought it did and what I know ext2 and journal-less
> ext4 do -- do a very good job of putting the world into operating
> order (without offering an impossible guarantee to do so) after a
> crash when async is used, after having been told that ffs and its fsck
> were not designed to do this.

You seem to be very confused about what MNT_ASYNC is and is not. :-)

Unix filesystems, including the Berkeley Fast File System variant, have never made any guarantees about the recoverability of an async-mounted filesystem after a crash. You seem to have inferred some impossible capability based on your experience with other non-Unix filesystems that have a completely different internal structure and implementation from the Unix-based filesystems in NetBSD.

Perhaps the BSD manuals have assumed some knowledge of Unix history, but even the NetBSD-1.6 mount(8) manual, from 2002, is _extremely_ clear about the dangers of the "async" flag, with strong emphasis in the formatted text on the relevant warning:

     async   All I/O to the file system should be done asynchronously.
             In the event of a crash, _it_is_impossible_for_the_system_
             _to_verify_the_integrity_of_data_on_a_file_system_mounted_
             _with_this_option._  You should only use this option if
             you have an application-specific data recovery mechanism,
             or are willing to recreate the file system from scratch.

According to CVS that wording has not changed since October 1, 2002, and the emphasised text has been there unchanged since September 16, 1998.

> So I'd love it if my experience encourages someone to improve NetBSD
> ffs and fsck to make use of async practical

As others have already said, this has already been done. It's called WAPBL. See wapbl(4) for more information. Use "mount -o log" to enable it.
(BTW, I personally don't think you would want to use softdep -- it can suffer almost as badly as async after a crash, though perhaps without totally invalidating fsck(8)'s ability to at least recover files and directories which were static since mount; and it does also offer vastly improved performance in many use cases, but as the manual says it should still be used with care (i.e. recognition of the risks of less-tested, much more complex code, and vastly changed internal implementation semantics implying radically different recovery modes).)

-- Greg A. Woods Planix, Inc. +1 250 762-7675 http://www.planix.com/
Re: Lost file-system story
At Fri, 9 Dec 2011 22:12:25 -0500, Donald Allen wrote:
Subject: Re: Lost file-system story
>
> On Fri, Dec 9, 2011 at 8:43 PM, Greg A. Woods wrote:
> > At Fri, 9 Dec 2011 15:50:35 -0500, Donald Allen wrote:
> > Subject: Re: Lost file-system story
> > >
> > > "does not guarantee to keep a consistent file system structure on the
> > > disk" is what I expected from NetBSD.  From what I've been told in this
> > > discussion, NetBSD pretty much guarantees that if you use async and
> > > the system crashes, you *will* lose the filesystem if there's been any
> > > writing to it for an arbitrarily long period of time, since apparently
> > > meta-data for async filesystems doesn't get written as a matter of
> > > course.
> >
> > I'm not sure what the difference is.
>
> You would be sure if you'd read my posts carefully.  The difference is
> whether the probability of an async-mounted filesystem is near zero or
> near one.

I think perhaps the misunderstanding between you and everyone else is because you haven't fully appreciated what everyone has been trying to tell you about the true meaning of "async" in Unix-based filesystems, and in particular about NetBSD's current implementation of Unix-based filesystems, and what that all means for implementing algorithms that can reliably repair the on-disk image of a filesystem after a crash.

I would have thought the warning given in the description of "async" in mount(8) would be sufficient, but apparently you haven't read it that way. Perhaps the problem is that the last occurrence of the word "or" in the last sentence of that warning should be changed to "and". To me that would at least make the warning a bit stronger.

> > And that's why by default, and by very strong recommendation, filesystem
> > metadata for Unix-based filesystems (sans WAPBL) should always be
> > written synchronously to the disk if you ever hope to even try to use
> > fsck(8).
>
> That's simply not true.
Have you ever used Linux in all the years that > ext2 was the predominant filesystem? ext2 filesystems were routinely > mounted async for many years; everything -- data, meta-data -- was > written asynchronously with no regard to ordering. DO NOT confuse any Linux-based filesystem with any Unix-based filesystem. They may have nearly identical semantics from the user programming perspective (i.e. POSIX), but they're all entirely different under the hood. Unix-based filesystems (sans WABPL, and ignoring the BSD-only LFS) have never ever Ever EVER given any guarantee about the repairability of the filesystem after a crash if it has been mounted with MNT_ASYNC. Indeed it is more or less _impossible_ by design for the system to make any such guarantee given what MNT_ASYNC actually means for Unix-based filesystems, and especially what it means in the NetBSD implementation. > > Unix filesystems, including the Berkeley Fast File System variant, have > > never made any guarantees about the recoverability of an async-mounted > > filesystem after a crash. > > I never thought or asserted otherwise. Well, from my perspective, especially after carefully reading your posts, you do indeed seem to think that async-mounted Unix-based filesystems should be able to be repaired, at least some of the time, despite the documentation, and all the collected wisdom of those who've replied to your posts so far, saying otherwise. > > You seem to have inferred some impossible capability based on your > > experience with other non-Unix filesystems that have a completely > > different internal structure and implementation from the Unix-based > > filesystems in NetBSD. > > Nonsense -- I have inferred no such thing. Instead of referring you to > previous posts for a re-read, I'll give you a little summary. I am > speaking about probabilities. 
I completely understand that no > filesystem mounted async (or any other way, for that matter), whether > Linux or NetBSD or OpenBSD, is GUARANTEED to survive a crash. OK, let's try stating this once more in what I hope are the same terms you're trying to use: The probability of any Unix-based filesystem being repairable after a crash is zero (0) if it has been mounted with MNT_ASYNC, and if there was _any_ activity that affected its structure since mount time up to the time of the crash. It still might survive after some types of changes, but it _probably_ won't. There are no guarantees. Use "newfs" and "restore" to recover. Linux ext2 is not a Unix-based filesystem and Linux itself is not a Unix-based kernel. The meaning of "async" to ext2 is apparently very different than it is to any Unix-based filesystem.
Re: Lost file-system story
At Mon, 12 Dec 2011 15:08:40 +0000, David Holland wrote: Subject: Re: Lost file-system story > > On Sun, Dec 11, 2011 at 06:53:26PM -0800, Greg A. Woods wrote: > No, as far as I can tell he understands perfectly well; he just > doesn't consider the behavior acceptable. > > It appears that currently a ffs volume mounted -oasync never writes > back metadata. I don't think this behavior is acceptable either. I agree there are conditions and operations which _should_ guarantee that the on-disk state of the filesystem is identical to what the user perceives and thus that the filesystem is 100% consistent and secure. It seems umount(2) works to make this guarantee, for example. The two other most important of these that come to mind are: mount -u -r /async-mounted-fs and mount -u -o noasync /async-mounted-fs It is my understanding that neither works at the moment, and that this is a known and reported and accepted bug, as I outlined in an earlier post to this thread. I think sync(2) should probably also work, but _only_ if the filesystem is made entirely quiescent from before the time sync() is called, and until after the time all the writes it has scheduled have completed, all the way to the disk media. (and of course once activity starts on the filesystem again, all guarantees are lost again) It might be nice if sync(2) could schedule all the needed writes to happen in an order which would ensure consistency and repairability of the on-disk image at any given time, but I'm guessing this might be too much to ask, at least without some more significant effort. However without enforcing the "synchronous" ordering of writes, sync(2) is effectively useless for the purposes Mr. Allen appears to have, though perhaps his level of risk tolerance would still make it useful to him while others of us would still be unable to tolerate its dangers in any scenarios where we were not prepared to use newfs to recover. 
Besides, the only way I know to guarantee a filesystem remains quiescent is to unmount it, so if you do that first then there's nothing for sync(2) to do afterwards, so nothing new to implement. :-) > > DO NOT confuse any Linux-based filesystem with any Unix-based > > filesystem. They may have nearly identical semantics from the user > > programming perspective (i.e. POSIX), but they're all entirely different > > under the hood. > > > > Unix-based filesystems (sans WABPL, and ignoring the BSD-only LFS) have > > never ever Ever EVER given any guarantee about the repairability of the > > filesystem after a crash if it has been mounted with MNT_ASYNC. > > What on earth do you mean by "Unix-based filesystems" such that this > statement is true? I mean exactly what it sounds like -- nothing more. Having almost no knowledge about ext2 or any other non-Unix-based filesystems, I'm trying to be careful to avoid making any claims about those non-Unix-based filesystems. I included FFS as a Unix-based filesystem because I know for sure that it shares many of the attributes of the original Unix filesystems with respect to the issues surrounding MNT_ASYNC. > > Perhaps this sentence from McKusick's memo about fsck will help you to > > understand: "fsck is able to repair corrupted file systems using > > procedures based upon the order in which UNIX honors these file system > > update requests." This is true for all Unix-based filesystems. > > No, it is true for ffs, and possibly for our ext2 implementation > (which shares a lot of code with ffs) but nothing else. Well, if you follow what I mean by Unix-based filesystems, and you ignore LFS and options like WABPL, as I've said, then I believe it is entirely true, since within my definition that leaves just FFS. And V7, though it didn't have MNT_ASYNC, would suffer the same as if MNT_ASYNC were implemented for it -- indeed I'm guessing that NetBSD's reimplementation of v7fs will have the same problems with MNT_ASYNC. 
As I say, I don't know enough about the non-Unix-based filesystems in NetBSD, such as those compatible with AmigaDOS, Acorn, Windows NT, or even MS-DOS, to know if they would be adversely affected by MNT_ASYNC. Indeed I'm not even sure if they all have reasonable filesystem repair tools (NetBSD has none, except maybe for ext2fs and msdos, though in my experience NetBSD's MS-DOS filesystem implementation is very fragile and it does not have a truly useful fsck_msdos, even without trying to use MNT_ASYNC with it). SysVbfs may suffer too, but I don't know enough about it either despite it being by definition Unix-based, and we don't have an fsck for it in any case. I'd also be guessing about EFS, and I'm not sure I'd categorize it as Unix-based any more than I do LFS. -- Greg A. Woods Planix, Inc. +1 250 762-7675 http://www.planix.com/
Re: Lost file-system story
At Mon, 12 Dec 2011 11:09:44 -0500 (EST), Mouse wrote: Subject: Re: Lost file-system story > > They _can_ be repaired...some of the time. When they can, it is > because, by coincidence, it just so happens that the stuff that got > written produces a filesystem fsck can repair. That's totally irrelevant. Possibilities other than zero or one are not useful in manual pages, and they are only useful to an end user as a very last resort -- equivalent to calling out the army to put Humpty Dumpty back together again. For all useful intents and purposes any probability of irreparable damage of greater than zero is, for the end user, and for all planning purposes, as good as a probability of one. Plan to use "newfs" and "restore" after every crash and you'll be OK. Plan otherwise and you will eventually be disappointed. > That's not how I feel about it when I've lost a filesystem. I'll take > a filesystem with a nonzero probability of recovering something useful > from over one that guarantees to trash everything any day (other things > being equal, of course). Heh. Yup, there are those of us who will find it a challenge to see just how much we can recover from a damaged file system no matter how useful the outcome may be. You don't put that in the manual page though, and you never give the end user that expectation (unless it's already too late for them and they've got yolk all over their face). -- Greg A. Woods Planix, Inc. +1 250 762-7675 http://www.planix.com/
Re: Lost file-system story
At Sun, 11 Dec 2011 23:23:33 -0500, Donald Allen wrote: Subject: Re: Lost file-system story > > How can you possibly say such a thing and hope to be taken seriously? > What you just said means that P(survival) = .999 is the same as > P(survival) = 0. > > There are a LOT of situations (e.g., mine) where P(survival) = .999 > would be very acceptable and P(survival) = 0 would not. The manual page must not give probabilities or even speak of possibilities. So, as-is you have been warned properly by the manual page. For planning purposes you _must_ expect that your filesystem will be damaged beyond repair after a crash and that you will have to use "newfs" and restore to recover. Learn these expectations well and you will be happier in the long run. Fail to learn them and you have no recourse but to wallow in your own sorrows. I.e. you can't come to the mailing list and say that you expected something better just because you say you can get something better from something else entirely different. You have false expectations based on your experiences with entirely foreign environments. Maybe Humpty Dumpty can be put back together again, sometimes, but even if you have all the King's horses and all the King's men on call to respond to a disaster at a moment's notice, you must not expect that you can have the egg put back together successfully, even just once, even if it does look like just a minor crack this time. -- Greg A. Woods Planix, Inc. +1 250 762-7675 http://www.planix.com/
Re: Lost file-system story
At Mon, 12 Dec 2011 14:23:40 -0600, Eric Haszlakiewicz wrote: Subject: Re: Lost file-system story > > Donald, don't listen to Greg. Just in case it needs to be repeated, you're > not the only one that thinks it is reasonable to expect a non-0 probability > that things will be recoverable, even if something goes wrong. Eric, what part of MNT_ASYNC don't you understand? -- Greg A. Woods Planix, Inc. +1 250 762-7675 http://www.planix.com/
Re: Lost file-system story
At Mon, 12 Dec 2011 14:17:35 -0600, Eric Haszlakiewicz wrote: Subject: Re: Lost file-system story > > On Mon, Dec 12, 2011 at 11:39:38AM -0800, Greg A. Woods wrote: > > Having almost no knowledge about ext2 or any other non-Unix-based > > filesystems, I'm trying to be careful to avoid making any claims about > > those non-Unix-based filesystems. > > hmm.. so then how can you claim that it is "entirely different" (as you did > in an earlier email)? It sounds like you're talking out of your, ahem.. > depth. As I said, I'm trying to be careful to avoid making claims one way or another about non-Unix-based filesystems. I'm also trying to keep in mind that MNT_ASYNC can be an attribute of the OS implementation well above the filesystems, and I'm also trying to avoid making claims about non-Unix filesystem structures which may be faced with this "feature" for the first time. Once upon a time I was quite familiar with the use of the tools that came before fsck. I have a great deal of experience with the on-disk structure of V7fs, SysVfs, and many of the minor variants of these filesystems. I'm experienced with many of the things that can go wrong with these filesystems and I'm moderately experienced with how they can be repaired as best as is humanly possible with low-level bit manipulating tools when bugs in either the kernel or fsck cause unexpected failures (not unlike what can happen when MNT_ASYNC is used). I'm moderately experienced with more modern filesystems such as SysVr4's native FS and Berkeley FFS, though less experienced with low-level on-disk repair of those filesystems (since on these modern Unix-based filesystems the standard repair tools, especially fsck, have been vastly improved; and kernel bugs which destroy the ordered writing of metadata have effectively been eliminated). 
> > I included FFS as a Unix-based filesystem because I know for sure that > > it shares many of the attributes of the original Unix filesystems with > > respect to the issues surrounding MNT_ASYNC. > > Have you tried actually comparing the current NetBSD ffs sources against > whatever "Unix" sources you are talking about? While I'm sure that there > are many "attributes" that are shared, if you even compare the current NetBSD > sources with those from, say, 1994, you will find a ton of differences. This has nothing to do with any given pile of source code per se. The issues that affect repairability of a Unix-based filesystem are higher level design considerations that are common to the implementations of fsck and the filesystems they can repair from the v7 addenda tape all the way through to the implementation of modern day NetBSD's fsck_ffs(8). You might find McKusick and Kowalski's paper about BSD FFS fsck enlightening. (I can supply a copy if you can't find it elsewhere. It would be nice if it could be included in the NetBSD distribution, even if not cleaned up to reflect the current implementation. It was in 4.4BSD-Lite2, after all.) Like I said earlier: Perhaps the superblock(s) should also record when a filesystem has been mounted with MNT_ASYNC so that fsck(8) can print a warning such as: "FS is dirty and was mounted async. Demons will fly out of your nose" -- Greg A. Woods Planix, Inc. +1 250 762-7675 http://www.planix.com/
Re: Lost file-system story
At Mon, 12 Dec 2011 16:15:56 -0500, Greg Troxel wrote: Subject: Re: Lost file-system story > > Donald came here not complaining, just surprised that things were > somewhat worse than one would have expected. And he's right - "async" > doesn't mean "and data might never be written indefinitely", just that > there are no ordering or completion guarantees. Are you sure? I see nothing which says MNT_ASYNC will write anything at all out at any time before a umount(2) call. Personally I think that's a good thing -- it is, perhaps, an indication that MNT_ASYNC is being as efficient as it can possibly be, though of course it may also just be accidental that NetBSD's implementation doesn't behave more as some folks seem to expect it to do. In any case I'm not so sure it matters in the long run. The relative damage to the filesystem is all a matter of circumstance. The fact is that use of MNT_ASYNC means the filesystem is very easily damaged beyond the ability of fsck's algorithms to repair it in a useful manner. One can concoct special circumstances where NetBSD's implementation fares worse than others, but equally it is possible to concoct circumstances where no true fully async implementation can ever do very well. > I'm not 100% clear what > is wrong, but it seems likely that this discussion has surfaced a bug or > two The only real bug I see is that "mount -u -o noasync" might not work (just as "mount -u -r" is known not to work). But I seem to be the only one really focusing on that side of the issue here. Indeed it might be nice if an otherwise idle system would _eventually_ clean all its dirty buffers one way or another even if they are part of a filesystem that has been mounted with MNT_ASYNC, but I don't see that as a requirement of MNT_ASYNC, and I certainly wouldn't want that to give allowance for the manual to be less foreboding than it already is. 
Indeed I would still want to see fsck spit out the warning I suggested, or at least one with as much or even more force in setting the user's expectations for failure. Perhaps it is so simple that fixing the known accepted bug(s), i.e. such that "mount -u -r" and "mount -u -o noasync" work, will have the fallout of also making MNT_ASYNC mounted filesystems eventually gain better consistency on idle systems. :-) (I am waffling though on whether I think sync(2) should have any beneficial effect on the consistency of MNT_ASYNC-mounted filesystems.) -- Greg A. Woods Planix, Inc. +1 250 762-7675 http://www.planix.com/
Re: Lost file-system story
At Wed, 14 Dec 2011 09:06:23 +1030, Brett Lymn wrote: Subject: Re: Lost file-system story > > On Tue, Dec 13, 2011 at 01:38:57PM +0100, Joerg Sonnenberger wrote: > > > > fsck is supposed to handle *all* corruptions to the file system that can > > occur as part of normal file system operation in the kernel. It is doing > > best effort for others. It's a bug if it doesn't do the former and a > > potential missing feature for the latter. > > > > There are a lot of slips twixt cup and lip. If you are really unlucky > you can get an outage at just the wrong time that will cause the > filesystem to be hosed so badly that fsck cannot recover it. Sure, fsck > can run to completion but all you have is most of your FS in lost+found > which you have to be really really desperate to sort through. I have > been working with UNIX for over 20 years now and I have only seen this > happen once and it was with a commercial UNIX. I've seen that happen more than once unfortunately. SunOS-4 once, I think. I agree 100% with Joerg here though. I'm pretty sure at least some of the times I've seen fsck do more damage than good it was due to a kernel bug or the like breaking assumptions about ordered operations. There have of course also been some pretty serious bugs in various fsck implementations across the years and vendors. -- Greg A. Woods Planix, Inc. +1 250 762-7675 http://www.planix.com/
Re: Lost file-system story
At Mon, 12 Dec 2011 18:49:31 -0500 (EST), "Matt W. Benjamin" wrote: Subject: Re: Lost file-system story > > Why would sync not be effective under MNT_ASYNC? Use of sync is not > required to lead to consistency except with respect to an arbitrary > point in time, but I don't think anyone ever believed otherwise. > However, there should be no question of metadata never being written > out if sync was run? Well sync(2) _could_ be effective even in the face of MNT_ASYNC, though I'm not sure it will, or indeed even should be required to, have a guaranteed ongoing beneficial effect on the on-disk consistency of a filesystem that was mounted with MNT_ASYNC while activity continues to proceed on the filesystem. I.e. I don't expect sync(2) to suddenly enforce order on the writes that it schedules to a MNT_ASYNC-mounted filesystem. The ordering _may_ be a natural result of the implementation, but if it's not then I wouldn't consider that to be a bug, and I certainly wouldn't write any documentation that suggested it might be a possible outcome. MNT_ASYNC means, to me at least, that even sync(2) can get away with doing writes to a filesystem mounted with that flag in an order other than one which would guarantee on-disk consistency to a level where fsck could repair it. I.e. sync(2) could possibly make things worse for MNT_ASYNC mounted filesystems before it makes them better, and I don't see how that could be considered to be a bug. I do agree that IFF the filesystem is made quiescent, AND all writes necessary and scheduled by sync(2) are allowed to come to completion, THEN the on-disk state of an MNT_ASYNC-mounted filesystem must be consistent (and all data blocks must be flushed to the disk too). However if you're going to go to that trouble (i.e. 
close all files open on the MNT_ASYNC-mounted filesystem and somehow prevent any other file operations of any kind on that filesystem until such time that you think the sync(2) scheduled writes are all done), then it should be just as easy, if not even easier, to do a "mount -u -r" (or "mount -u -o noasync", or even "umount"), in which case you'll not only be sure that the filesystem is consistent and secure, but you'll know when it reaches this state (i.e. you won't have to guess about when sync(2)'s scheduled work completes). -- Greg A. Woods Planix, Inc. +1 250 762-7675 http://www.planix.com/
Re: Lost file-system story
At Wed, 14 Dec 2011 07:50:37 +0000 (UTC), mlel...@serpens.de (Michael van Elst) wrote: Subject: Re: Lost file-system story > > wo...@planix.ca ("Greg A. Woods") writes: > > >easy, if not even easier, to do a "mount -u -r" > > Does this work again? Not that I know of, and PR#30525 concurs, as does the commit mentioned in that PR to prevent it from falsely appearing to work, a change which remains in netbsd-5 and -current to date. See my discussion of this issue earlier in this thread. -- Greg A. Woods Planix, Inc. +1 250 762-7675 http://www.planix.com/
retrocomputing NetBSD style
At Fri, 29 May 2015 10:22:35 +0000, David Holland wrote: Subject: Re: Removing ARCNET stuffs > > There's one other thing I ought to mention here, which is that I have > never entirely understood the point of running a modern OS on old > hardware; if you're going to run a modern OS, you can run it on modern > hardware and you get the exact same things as on old hardware, except > faster and smoother. It's always seemed to me that running vintage > OSes (on old hardware or even new) is more interesting, because that > way you get a complete vintage environment with its own, substantively > different, set of things. This does require maintaining the vintage > OSes, but that's part of the fun... nonetheless, because I don't > understand this point I may be suggesting something that makes no > sense to people who do, so take all the above with that grain of > salt. You're quite right that it is interesting to run classic software on classic hardware, to the extent that retrocomputing is about preserving a bit of history, or living in the past, or whatever, and to the extent that one might enjoy such a thing. However there were, and are, a lot of us who want(ed) a modern OS to run on our old hardware because we want(ed) to re-purpose that fine old hardware to do something new and exciting with it. I.e. I am/was not building a museum, but rather trying to get things done and learn new things. For example I started running NetBSD on Sun-3 and early sparc systems because that's the hardware I had, and it was good and capable hardware. However the original SunOS-4 was broken and decrepit for the uses I wanted to put it to, and I didn't have source so I couldn't really fix it. NetBSD opened the door to doing modern things without paying high-end prices for the latest commercial hardware and software. At that time the older hardware really was built better too, and it was more "operational" -- i.e. 
it had proper serial console support, and once I got to using Alphas, proper 24x7-Lights-Out support with the ability to power cycle it and reset it remotely without extra control hardware. In many respects I still do the same thing, but because of the things you were saying about how the pace of hardware change has dropped significantly in recent years, now the hardware I use is just an older variant of the same stuff you can buy new -- e.g. my new-to-me servers are Dell PE2950's -- they're replacing a PE2650, but they're not really all that much different from a brand new R710 or similar. The old 2650 is really feeling dated now and its processors are missing a number of features I want, but with the 2950 I can run the very same binaries across quite a range of hardware from the latest greatest back to these older second-hand systems. Also, w.r.t. supporting older and less-capable systems, I would now treat them exactly the same as modern embedded systems with similar limitations. I don't expect I'll ever do many, if any, full builds on my RPi or BBB, and hopefully not even build many packages on them either, but rather I will cross-compile for them on my far more beefy big build server. Were I to try to run the latest NetBSD on an old Micro-VAX, Sun3, etc., I would never expect to actually do self-hosted builds on such systems. I really don't understand anyone who has the desire to try to run build.sh on a VAX-750 to build even just a kernel, let alone the whole distribution. I won't even bother trying that on my Soekris board! -- Greg A. Woods Planix, Inc. +1 250 762-7675 http://www.planix.com/
Re: Removing ARCNET stuffs
At Tue, 2 Jun 2015 11:47:01 -0700, Dennis Ferguson wrote: Subject: Re: Removing ARCNET stuffs > > It's too long an argument, but I think any approach to a > multiprocessor network stack that attempts to get there starting > with the existing network L2/L3/interface code as a base is likely > not on the table. I would offer the rather herculean effort spent > on FreeBSD to attempt to do exactly that, and the fairly mediocre > result it produced, as evidence. The resources to match that > probably don't exist, and if there were a better, easier way to do > this it would have been done already. I think the least cost way to > produce a better result is actually to make a big change, preserving > the device drivers and the transport protocol code (which needs to run > single-threaded per-socket in any case) and any non-IP protocol code > that still works (running single-threaded) but doing a wholesale > replacement of the code that moves packets between those things with > something that can operate without locks. Doing it this way has some > risks, not the least of which is that it would leave you with networking > code unlike anyone else's (though if it were well done I'm not sure this > would last, everyone has trouble with the network stack), but I think > this makes the problem tractable and has a good chance of producing > something that scales quite well even without a lot of Linux-style > micro-optimization effort. Dennis, if you are able I wonder if you could comment on how well you think the NetGraph implementation in FreeBSD fares with respect to being part of a multiprocessor network stack, and if you think it offers any advantages (and/or has any disadvantages) in an SMP environment. I understand that NetGraph gained some finer-grained SMP support as early as FreeBSD-5.x. I also read about some NetGraph locking and performance issues in the 201309DevSummit notes, but I don't know any of the details. 
What if NetGraph was the _only_ network stack in the kernel? And what about Luigi Rizzo's netmap? (which claims to be specifically targeted at multi-core machines) (I'm going to try to learn a bit more about netmap at BSDCan this year.) And finally, what about the possibilities for a more formal STREAMS-like implementation, or at least something that would be compatible with existing STREAMS modules at the API (DDI/DKI) level, w.r.t. SMP? This would maybe allow independent maintenance and testing of less widely used protocol modules (and perhaps even drivers) by third parties. -- Greg A. Woods Planix, Inc. +1 250 762-7675 http://www.planix.com/
Re: retrocomputing NetBSD style
At Wed, 3 Jun 2015 09:23:37 -0400 (EDT), Mouse wrote: Subject: Re: retrocomputing NetBSD style > > GAW Wrote: > > I really don't understand anyone who has the desire to try to run > > build.sh on a VAX-750 to build even just a kernel, let alone the > > whole distribution. > > I recall a time where NetBSD/vax was broken for a long time because > everyone was cross-building; as soon as a native build was attempted, > the brokenness showed up. > > I native build on _everything_. If it can't native build, it isn't > really part of my stable, so to speak. Yes, there is that issue! See, for instance, my recent posts comparing assembler output from kernel compiles done by the same compiler when run on amd64 vs. i386. However those are the kinds of bugs one might hope can be caught by decent enough regression tests of the compiler and its toolchain. Unfortunately these are tests which we don't have now, in part because in a sense we treat the whole system as the regression test, thus forcing users to do native compiles to prove there are no noticeable regressions. Of course if we did have a proper cross-compiler regression test suite then we would only have to build and run such tests on those less capable machines. In some sense though, since I don't intend to use my Soekris board (or RPi, or BBB, etc.) as development systems, I only really care that the cross compiler generates working code for them, and we do have an increasingly useful whole-system regression test suite that I do intend to run on those smaller systems to prove they work well when their binaries have been built on my build server. However this issue does have me wanting to do builds on my RPi and BBB and to dig my Alpha and another Sparc server out of storage, and find a couple of MIPS systems of each type, just so I can try cross-compiling on them all and prove that any future fixes to the compiler will then result in identical code no matter what host it runs on, including self-hosted. 
So, I guess until/unless we have a good compiler regression test suite, then another awesome use for older and very different hardware from the current melange of almost-identical i386 derivatives is to serve as a test base for the toolchain. -- Greg A. Woods Planix, Inc. +1 250 762-7675 http://www.planix.com/
Not Groff! Heirloom Doctools!
At Thu, 04 Jun 2015 14:53:56 +0200, Johnny Billquist wrote: Subject: Re: Groff > > On 2015-06-04 12:44, Robert Swindells wrote: > > > > Johnny Billquist wrote: > > > > > What happened to the original roff? I mean, groff is just a gnu > > > replacement for roff. Maybe switch back to the original? > > > > The sources to all of DWB are available from AT&T: > > > > <http://www2.research.att.com/~astopen/download/> > > > > It needs a bit of work to get it to build on NetBSD though. > > Hmm. What about roff from 2.11BSD? That shouldn't be so hard to get > building on NetBSD... Have my posts since 2009 about Heirloom Doctools somehow mostly been going into a black hole or something!?!?!?! I get responses of "yes, please!" on the lists, but nothing happens and people still keep posting truly lame suggestions as if they've never heard of Heirloom Doctools. I posted about it in a response to this very thread just three days ago (though I redirected to tech-userlevel then too)! Yes, sorry Johnny, but your suggestion really is poor. Ancient troff was a poor fit for "modern" use even 25 years ago with psroff to generate PostScript from its C/A/T output -- it's full of bugs and missing tons of features (beyond not being device independent), and still written in what's basically PDP11 assembler dressed up as C (i.e. it's missing all of BWK's extensive rework), never mind that it's not actually in the original 2.11BSD release, which contains just Berkeley's bits (and the same small bits are in the 4.4BSD release too). Heirloom Doctools _is_ the original troff, in its very latest form! (well, there's a fork on github that's got a bunch more bug fixes) A better place to get the original troff, in modern form, with an open-source license would be Plan-9. However Heirloom Doctools is equivalent to the Plan-9 version, but without Plan-9 dependencies, and with more fixes and features. I.e. 
Heirloom Doctools is the very most up-to-date code from the very people who wrote and maintained it since the beginning (sans Joe Ossanna, of course). Back before 2009 it already produced PDFs and handled UTF-8. Heirloom Doctools already builds and works on NetBSD just fine, and has done so since before 2009 (advertised as working on 2.0 in 2007). Heirloom Doctools is essentially the complete set of tools from the AT&T Documenter's Work Bench suite -- i.e. it contains all the other _necessary_ pre-processors like eqn, pic, tbl, grap, refer, and vgrind, and it contains the back-end drivers and font tables for PostScript and PDF and other printers. The only things it's really missing are the papers from /usr/{share/}doc, but those are freely available elsewhere, including from the DWB release. As I discussed back in 2009, Heirloom Doctools is of better quality and far more feature-full than the last DWB release, and arguably has a much better license, and of course DWB is probably never going to see another public maintenance release now that Glen Fowler has retired. The only thing DWB has over Heirloom Doctools is arguably better PostScript support (oh, and 'pm', but it's C++ :-)). Why do people keep forgetting about it, and WTF are we still waiting for? (once again re-directing to tech-userlevel where this discussion is more apropos) -- Greg A. Woods Planix, Inc. +1 250 762-7675 http://www.planix.com/ pgpD3ABW7JR_U.pgp Description: PGP signature
Re: semaphores options
At Mon, 8 Apr 2019 21:19:32 +0300, Dima Veselov wrote: Subject: semaphores options > > Greetings! > Sorry for posting so many questions recently, but my production > server failed to start PostgreSQL after system upgrade (8-STABLE). > > This was caused by semaphores, which I like to set in kernel options, > which now are not working. Better say some are working, some are > not. > > I solved the problem setting them via sysctl but I wonder what happened > with options(4)? > It seems that SEMMNI, SEMMNS, SEMMNU, NOFILE and CHILD_MAX do not > work anymore, but SHMSEG and NMBCLUSTERS are good. I beleive they > were always working because the system worked long time and had > sysctl.conf empty. Any recent changes? Indeed, something seems to have changed, and the problem continues with -current as of late January (8.99.32). I think the culprit was this change, which somehow didn't have an accompanying change to any documentation (most notably options(4) still documents all the removed settings): RCS file: /cvs/master/m-NetBSD/main/src/sys/conf/param.c,v revision 1.65 date: 2015-05-12 19:06:25 -0700; author: pgoyette; state: Exp; lines: +4 -2; commitid: G8nWAd1qbrsX8ely; Create a new sysv_ipc module to contain the SYSVSHM, SYSVSEM, and SYSVMSG options. Move associated variables out of param.c and into the module's source file. This commit adds a great big ugly "#if XXX_PRG" around all the related SysV IPC settings in sys/conf/param.c, i.e. it entirely removes all support for "options SEMMNI=NNN" and related. Perhaps this only affects kernels which have the SysV IPC code baked in, though I've no idea how the so-called modular world is supposed to work for pre-set definitions -- I guess it doesn't, though perhaps there's still some hook for config(1)?. 
The real underlying problem may be that none of the SysV IPC options from options(4) were ever properly set up with "defflag" or "defparam" in the appropriate "files" file (sys/kern/files.kern probably), or as we used to say, they were never "defopt'ed" for config. See config(5). Having "options FOO=1234" worked without "defparam" if the use was in sys/conf/param.c, but it doesn't seem to work with the new regime. Maybe it would work again if "defparam" lines were added to the right place. FYI, I have had the following in my kernel configs (in this particular case edited into XEN3_DOMU) since a very long time ago (before 1.6), and they continued to work up to and including 5.2_STABLE: # System V compatible IPC subsystem. (msgctl(2), semctl(2), and shmctl(2)) # # Note: SysV IPC parameters could be changed dynamically, see sysctl(8). # options SYSVMSG # System V-like message queues # options MSGMNI=200 # max number of message queue identifiers (default 40) options MSGMNB=32768 # max size of a message queue (default 2048) options MSGTQL=512 # max number of messages in the system (default 40) options MSGSSZ=128 # size of a message segment (must be 2^n, n>4) (default 8) options MSGSEG=16384 # max number of message segments in the system # (must be less than 32767) (default 2048) # options SYSVSEM # System V-like semaphores options SEMMNI=200 # max number of semaphore identifiers in system (def=10) options SEMMNS=600 # max number of semaphores in system (def=60) options SEMMNU=300 # number of undo structures in system (def=30) options SEMUME=100 # max number of undo entries per process (def=10) # options SYSVSHM # System V-like memory sharing options SHMMAXPGS=16384 # Size of shared memory map (def=2048) But on my 8.99.32 XEN3_DOMU kernel these only give me: # sysctl kern.ipc kern.ipc.sysvmsg = 1 kern.ipc.sysvsem = 1 kern.ipc.sysvshm = 1 kern.ipc.shmmax = 2097152000 kern.ipc.shmmni = 128 kern.ipc.shmseg = 128 kern.ipc.shmmaxpgs = 512000 kern.ipc.shm_use_phys = 0 
kern.ipc.msgmni = 200 kern.ipc.msgseg = 16384 kern.ipc.semmni = 10 kern.ipc.semmns = 60 kern.ipc.semmnu = 30 FYI, to show it did/does work on an older system: 23:02 [0.185] # uname -a NetBSD central 5.2_STABLE NetBSD 5.2_STABLE (XEN3_DOMU) #0: Sun Jun 5 16:33:15 PDT 2016 woods@building:/build/woods/building/netbsd-5-amd64-amd64-obj/work/woods/m-NetBSD-5/sys/arch/amd64/compile/XEN3_DOMU amd64 23:02 [0.186] # sysctl kern.ipc kern.ipc.sysvmsg = 1 kern.ipc.sysvsem = 1 kern.ipc.sysvshm = 1 kern.ipc.shmmax = 67108864 kern.ipc.shmmni = 128 kern.ipc.shmseg = 128 kern.ipc.shmmaxpgs = 16384 kern.ipc.shm_use_phys = 0 kern.ipc.msgmni = 200 kern.ipc.msgseg = 16384 kern.ipc.semmni = 200 kern.ipc.semmns = 600 kern.ipc.semmnu = 300 -- Greg A. Woods +1 250 762-7675 RoboHack Planix, Inc. Avoncote Farms pgphcHNkIPNT0.pgp Description: OpenPGP Digital Signature
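As noted above, the workaround is to set the same limits at run time via sysctl(8). A sketch of /etc/sysctl.conf entries equivalent to the kernel options shown (the variable names are taken from the sysctl output above; the values are just the ones from my config):

```
# /etc/sysctl.conf -- applied at boot by the rc.d sysctl script
kern.ipc.semmni=200    # max number of semaphore identifiers
kern.ipc.semmns=600    # max number of semaphores in system
kern.ipc.semmnu=300    # number of undo structures in system
```

This restores the old limits on every boot without rebuilding the kernel, which is probably the intended path in the modular world.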
Re: semaphores options
At Mon, 08 Apr 2019 20:37:39 -0700, "Greg A. Woods" wrote: Subject: Re: semaphores options > > RCS file: /cvs/master/m-NetBSD/main/src/sys/conf/param.c,v > > revision 1.65 > date: 2015-05-12 19:06:25 -0700; author: pgoyette; state: Exp; lines: +4 > -2; commitid: G8nWAd1qbrsX8ely; > Create a new sysv_ipc module to contain the SYSVSHM, SYSVSEM, and > SYSVMSG options. Move associated variables out of param.c and into > the module's source file. > > > This commit adds a great big ugly "#if XXX_PRG" around all the related > SysV IPC settings in sys/conf/param.c, i.e. it entirely removes all > support for "options SEMMNI=NNN" and related. Note also this change only appears in NetBSD-8.0 in terms of releases. The netbsd-7 branch and all its releases preserve the original behaviour where these "options" worked -- or at least that's what I understand from CVS. I haven't tested this -- my only currently running 7.x kernel didn't have my custom config edits. -- Greg A. Woods +1 250 762-7675 RoboHack Planix, Inc. Avoncote Farms pgpPmvsEu7Au3.pgp Description: OpenPGP Digital Signature
Re: NULL pointer arithmetic issues
C's new undefined behaviour rules as interpreted by some compiler maintainers now allow the compiler to STUPIDLY assume that since the programmer has knowingly put a supposed de-reference of a pointer on the first line of the function, then any comparisons of that pointer with NULL further on are OBVIOUSLY never ever going to be true and so it can SILENTLY wipe out the whole damn security check. I guess I'm saying that modern compiler maintainers are not sane, and at least some of the more recent C Standards Committee are definitely NOT sane and/or friendly and considerate. C's primitive nature leads the programmer to think in terms of what the target machine is going to do, and as such it is extremely sad and disheartening that the standards committee chose to endanger users in so many ways. [[ in modern "Standard C" ]] It's not that evaluating something like (1<<32) might have an unpredictable result, but rather that the entire execution of any program that evaluates such an expression is ENTIRELY meaningless! Indeed according to "Standard C" the execution is not even meaningful up to the point where undefined behaviour is encountered. Undefined behaviour trumps ALL other behaviours of the C abstract machine. And it is all in the name of attempting comprehensive maximum possible optimization of all code at any expense INCLUDING correct operation of the program. Not all so-called "undefined behaviours" are quite this bad, yet, but in general we would be infinitely better off with a more completely defined abstract machine that might force some target architectures to jump through hoops instead of forcing EVERY programmer to ALWAYS be more careful than EVERY conceivable optimizer. As Phil Pennock said: If I program in C, I need to defend against the compiler maintainers. [[ and future standards committee members!!! ]] If I program in Go, the language maintainers defend me from my mistakes. 
And I say: Modern "Standard C" is actually "Useless C" and "Unusable C" Indeed I now say if "Standard C" follows C++ then it will be safe to say that a good optimizing compiler will soon be able to turn all C programs into "abort()" calls. -- Greg A. Woods Kelowna, BC +1 250 762-7675 RoboHack Planix, Inc. Avoncote Farms pgpdFaG78xjZR.pgp Description: OpenPGP Digital Signature
Re: NULL pointer arithmetic issues
At Mon, 24 Feb 2020 22:15:22 -0500 (EST), Mouse wrote: Subject: Re: NULL pointer arithmetic issues > > > Greg A. Woods wrote: > > > > NO MORE "undefined behaviour"!!! Pick something sane and stick to it! > > > > The problem with modern "Standard" C is that instead of refining > > the definition of the abstract machine to match the most common > > and/or logical behaviour of existing implementations, the standards > > committee chose to throw the baby out with the bath water and make > > whole swaths of conditions into so-called "undefined behaviour" > > conditions. > > Unfortunately for your argument, they did this because there are > "existing implementations" that disagree severely over the points in > question. I don't believe that's quite right. True "Undefined Behaviour" is not usually the explanation for differences between implementations. That's normally what the Language Lawyers call "Implementation Defined" behaviour. "Undefined behaviour" is used for things like dereferencing a nil pointer. There's little disagreement about that being "undefined by definition" -- even ignoring the Language Lawyers. We can hopefully agree upon that even using the original K&R edition's language: "C guarantees that no pointer that validly points at data will contain zero" The problem though is that C gives more rope than you might ever think possible in some situations, such as for example, the chances of dereferencing a nil pointer with poorly written code. The worse problem though is when compiler writers, what I'll call "Optimization Warrior Lawyers", start abusing any and every possible instance of "Undefined Behaviour" to their advantage. This is worse than ignoring Hoare's advice -- this is the very epitome of premature optimization -- this is pure evil. This is breaking otherwise readable and usable code. 
I give you again my example: > > An excellent example are the data-flow optimizations that are now > > commonly abused to elide security/safety-sensitive code: > > > int > > foo(struct bar *p) > > { > > char *lp = p->s; > > > > if (p == NULL || lp == NULL) { > > return -1; > > } > > This code is, and always has been, broken; it is accessing p->s before > it knows that p isn't nil. How do you know for sure? How does the compiler know? Serious questions. What if all calls to foo() are written as such: if (p) foo(p); I agree this might not be "fail-safe" code, or in any other way advisable, but it was perfectly fine in the world before UB Optimization Warriors, however today's "Standard C" gives compilers license to replace "foo()" with a trap or call to "abort()", etc. I.e. it takes a real "C Language Lawyer(tm)" to know that past certain optimization levels the sequence points prevent this from happening. In the past I could equally assume the optimizer would rewrite the first bit of foo() as: if (! p || ! p->s) return -1; In 35 years of C programming I've never before had to pay such close attention to such minute details. I need tools now to audit old code for such things, and my current experience to date suggests UBSan is not up to this task -- i.e. runtime reports are useless (perhaps even with high-code-coverage unit tests). This is the main point of my original rant. "Undefined Behaviour" as it has been interpreted by Optimization Warriors has given us an unusable language. -- Greg A. Woods Kelowna, BC +1 250 762-7675 RoboHack Planix, Inc. Avoncote Farms pgpytwmS13Epg.pgp Description: OpenPGP Digital Signature
Re: NULL pointer arithmetic issues
At Wed, 26 Feb 2020 00:12:49 -0500 (EST), Mouse wrote: Subject: Re: NULL pointer arithmetic issues > > > This is the main point of my original rant. "Undefined Behaviour" as > > it has been interpreted by Optimization Warriors has given us an > > unusable language. > > I'd say that it's given you unusuable implementations of the language. > The problem is not the language; it's the compiler(s). (Well, unless > you consider the language to be the problem because it's possible to > implement it badly. I don't.) I don't think the C language (in all lower-case, un-quoted, plainly) is the problem -- I think the problem is the wording of the modern standard, and the unfortunate choice to use the phrase "undefined behaviour" for certain things. This has given "license" to optimization warriors -- and their over-optimization is the root of the evil I see in current compilers. It is this unfortunate choice of describing things as "undefined" within the language that has made modern "Standard C" unusable (especially for any and all legacy code, which is most of it, right?). If we outlawed the use of the phrase "undefined behaviour" and made all instances of it into "implementation defined behaviour", with a very specific caveat that such instances did not, would not, and could not, ever allow optimizers to even think of violating any possible conceivable principle of least astonishment, then most of this damage would be undone. E.g. in the example I gave, the only thing allowed would be for the implementation to do as it pleases IFF and when the pointer passed was actually a nil pointer at runtime (and perhaps in this case with a strong hint that the best and ideal behaviour would be something akin to calling abort()). -- Greg A. Woods Kelowna, BC +1 250 762-7675 RoboHack Planix, Inc. Avoncote Farms pgpEf9I3wNR2o.pgp Description: OpenPGP Digital Signature
Re: NULL pointer arithmetic issues
At Mon, 9 Mar 2020 17:36:24 +0100, Joerg Sonnenberger wrote: Subject: Re: NULL pointer arithmetic issues > > I consider it as something even worse. Just like the case of passing > NULL pointers to memcpy and friends with zero as size, this > interpretation / restriction in the standard is actively harmful to some > code for the sake of potential optimisation opportunities in other code. > It seems to be a poor choice at that. I.e. it requires adding > conditional branches for something that behaves sanely everywhere but > maybe the DS9k. Indeed. I say the very existence of anything called "Undefined Behaviour" and its exploitation by optimizers is evil. (by definition, especially if we accept as valid the claim that "Premature optimization is the root of all evil in programming" -- this is of course a little bit of a stretch since my claim could be twisted to say that any and all automatic optimization by a compiler or toolchain is evil, but of course that's not exactly my intent -- normal optimization which does not change the behaviour and intent of the code is, IMO, OK, but defining "intent" is obviously the problem) So in Standard C all "Undefined Behaviour" should be changed to "Implementation Defined" and it should be required that the implementation is not allowed to abuse any such things for the evil purpose of premature optimization. For this kind of thing adding an integer to a pointer (or the equivalent, e.g. taking the address of a field in a struct pointed to by a nil pointer) should always do just that, even if the pointer can be proven to be a nil pointer at compile time. It is wrong to do anything else, and absolute insanity to remove any other code just because the compiler assumes SIGSEGV would/should/could happen before the other code gets a chance to run. -- Greg A. Woods Kelowna, BC +1 250 762-7675 RoboHack Planix, Inc. Avoncote Farms pgpGIvFrvwd81.pgp Description: OpenPGP Digital Signature
So it seems "umount -f /nfs/mount" still doesn't work.....
I? 0:00.01 rshd -L 12 16008 9130 85 0 7860 1140 kqueue I? 0:00.01 pickup -l -t unix -u 0 16131 10040 85 0 2620644 select I? 0:00.01 rshd -L 1000 20090 152700 85 0 25056 7028 select Is ? 0:00.24 xterm -class UXTerm 1000 1768 6090 85 0 11708 1056 ttyraw Is+ pts/1 0:00.09 -ksh 0 2940 75840 117 0 17264236 nfscn2 D+ pts/2 0:00.14 umount -f /future/build 1000 4103 40070 85 0 4120 1064 pause Is pts/2 0:00.09 -ksh 0 7584 41030 85 0 7656 1088 pause Ipts/2 0:00.35 ksh 0 6722 276390 127 0 16768440 tstile D+ pts/3 0:00.00 fstat /future/build 1000 21172 14064 489 85 0 3728 1060 pause Is pts/3 0:00.09 -ksh 0 27639 211720 85 0 9648 1048 pause Ipts/3 0:00.08 ksh 1000 13000 20090 722 85 0 3600 1056 ttyraw Is+ pts/4 0:00.11 -ksh 0 3707 19523 1057 85 0 11736 1044 pause Spts/5 0:00.08 ksh 0 4176 3707 1057 43 0 12900624 - O+ pts/5 0:00.00 ps -alx 1000 19523 1002 3550 85 0 3188 1056 pause Ss pts/5 0:00.09 -ksh 0 1013 1 527 85 0 2660652 ttyraw Is+ ttyE0 0:00.08 -ksh 0 822 1 126 85 0 2524412 ttyraw Is+ ttyE1 0:00.00 /usr/libexec/getty Ws ttyE1 0 828 1 126 85 0 2524416 ttyraw Is+ ttyE2 0:00.00 /usr/libexec/getty Ws ttyE2 0 957 1 126 85 0 2528412 ttyraw Is+ ttyE3 0:00.00 /usr/libexec/getty Ws ttyE3 0 862 1 126 85 0 4188408 ttyraw Is+ ttyE4 0:00.00 /usr/libexec/getty Ws ttyE4 0 1023 1 126 85 0 2524424 ttyraw Is+ ttyE5 0:00.00 /usr/libexec/getty Ws ttyE5 0 1050 1 126 85 0 2524428 ttyraw Is+ ttyE6 0:00.00 /usr/libexec/getty Ws ttyE6 0 668 10 85 0 2528416 ttyraw Is+ xencons0:00.01 /usr/libexec/getty console constty 12:00 [1.61] # crash Crash version 8.99.32, image version 8.99.32. Output from a running system is unreliable. 
crash> bt /t 0t2940 trace: pid 2940 lid 1 at 0xaf808a4748f0 sleepq_block() at sleepq_block+0xfd kpause() at kpause+0xdf nfs_reconnect() at nfs_reconnect+0x8b nfs_request() at nfs_request+0xf3a nfs_getattr() at nfs_getattr+0x175 VOP_GETATTR() at VOP_GETATTR+0x49 vn_stat() at vn_stat+0x3d do_sys_statat() at do_sys_statat+0x97 sys___lstat50() at sys___lstat50+0x25 syscall() at syscall+0x9c --- syscall (number 441) --- 43292a: crash> bt /t 0t6722 trace: pid 6722 lid 1 at 0xaf808a488920 sleepq_block() at sleepq_block+0x99 turnstile_block() at turnstile_block+0x337 rw_vector_enter() at rw_vector_enter+0x169 genfs_lock() at genfs_lock+0x3c VOP_LOCK() at VOP_LOCK+0x71 vn_lock() at vn_lock+0x90 nfs_root() at nfs_root+0x2b lookup_once() at lookup_once+0x38e namei_tryemulroot() at namei_tryemulroot+0x453 namei() at namei+0x29 fd_nameiat.isra.2() at fd_nameiat.isra.2+0x54 do_sys_statat() at do_sys_statat+0x87 sys___stat50() at sys___stat50+0x28 syscall() at syscall+0x9c --- syscall (number 439) --- 43c94a: crash> So it would seem that even though umount is trying to force an unmount of an NFS mount, the kernel is first trying to reconnect to the server! BTW, I have another system running a quite recent i386 build where crash(8) is unable to do a backtrace: # ktrace crash Crash version 9.99.64, image version 9.99.64. Kernel compiled without options LOCKDEBUG. Output from a running system is unreliable. crash> trace /t 0t4003 crash: kvm_read(0x4, 4): kvm_read: Bad address trace: pid 4003 lid 4003 crash> -- Greg A. Woods Kelowna, BC +1 250 762-7675 RoboHack Planix, Inc. Avoncote Farms pgp65J9WmWX7L.pgp Description: PGP signature
Re: So it seems "umount -f /nfs/mount" still doesn't work.....
At Tue, 30 Jun 2020 12:52:37 -0700, "Greg A. Woods" wrote: Subject: So it seems "umount -f /nfs/mount" still doesn't work. > Curiously the kernel now does something I didn't quite expect when one tries to reboot a system with a stuck mount. I was able to see this as I was running a kernel that verbosely documents all its shutdown unmounts and detaches. In prior times I had reached for the power switch. At first it just hangs: lilbit# reboot -q [ 1131744.8297338] syncing disks... 3 3 done [ 1131744.9797408] unmounting 0xc1f27000 /more/work (more.local:/work)... [ 1131744.9907053] ok [ 1131744.9907053] unmounting 0xc1f24000 /more/archive (more.local:/archive)... [ 1131745.0004431] ok [ 1131745.0004431] unmounting 0xc1f21000 /more/home (more.local:/home)... [ 1131745.0097426] ok [ 1131745.0097426] unmounting 0xc1f1f000 /once/build (once.local:/build)... [ 1131745.0097426] ok [ 1131745.0210854] unmounting 0xc1f1b000 /future/build (future.local:/build)... [ 1131745.0210854] ok [ 1131745.0304676] unmounting 0xc1f11000 /building/build (building.local:/build)... this is me hitting ^T to try to see what's going on [ 1131753.2800902] load: 0.52 cmd: reboot 7414 [fstcnt] 0.00u 0.16s 0% 424k [ 1132107.6651517] load: 0.48 cmd: reboot 7414 [fstcnt] 0.00u 0.16s 0% 424k [ 1133247.8436109] load: 0.48 cmd: reboot 7414 [fstcnt] 0.00u 0.16s 0% 424k then I hit ^C and immediately it proceeded ^C[ 1133249.3636755] unmounting 0xc1f0f000 /proc (procfs)... [ 1133249.3636755] ok [ 1133249.3636755] unmounting 0xc1f0d000 /dev/pts (ptyfs)... [ 1133249.3788641] unmounting 0xc1ecb000 /kern (kernfs)... [ 1133249.3843127] ok [ 1133249.3843127] unmounting 0xc1ec9000 /cache (/dev/wd1a)... [ 1133249.7636916] ok [ 1133249.7636916] unmounting 0xc1ec6000 /home (/dev/wd0g)... [ 1133249.7736976] unmounting 0xc1dd7000 /usr/pkg (/dev/wd0f)... [ 1133250.0737098] unmounting 0xc1ab1000 /var (/dev/wd0e)... [ 1133250.1537121] unmounting 0xc1804000 / (/dev/wd0a)... 
[ 1133251.0337515] unmounting 0xc1f11000 /building/build (building.local:/build)... [ 1133251.0469644] unmounting 0xc1f0d000 /dev/pts (ptyfs)... [ 1133251.0469644] unmounting 0xc1ec6000 /home (/dev/wd0g)... [ 1133251.0579007] unmounting 0xc1dd7000 /usr/pkg (/dev/wd0f)... [ 1133251.0637673] unmounting 0xc1ab1000 /var (/dev/wd0e)... [ 1133251.0637673] unmounting 0xc1804000 / (/dev/wd0a)... [ 1133251.0750403] sd0: detached [ 1133251.0750403] scsibus0: detached [ 1133251.0750403] gpio1: detached [ 1133251.0853614] sysbeep0: detached [ 1133251.0853614] midi0: detached [ 1133251.0853614] wd1: detached [ 1133251.0949369] uhub0: detached [ 1133251.0949369] com1: detached [ 1133251.0949369] usb0: detached [ 1133251.1045456] gpio0: detached [ 1133251.1045456] ohci0: detached [ 1133251.1045456] pchb0: detached [ 1133251.1151702] unmounting 0xc1f11000 /building/build (building.local:/build)... [ 1133251.1151702] unmounting 0xc1f0d000 /dev/pts (ptyfs)... [ 1133251.1279509] unmounting 0xc1ec6000 /home (/dev/wd0g)... [ 1133251.1279509] unmounting 0xc1dd7000 /usr/pkg (/dev/wd0f)... [ 1133251.1393918] unmounting 0xc1ab1000 /var (/dev/wd0e)... [ 1133251.1448739] unmounting 0xc1804000 / (/dev/wd0a)... [ 1133251.1448739] forcefully unmounting /building/build (building.local:/build)... [ 1133251.1587138] forceful unmount of /building/build failed with error -3 [ 1133251.1653872] rebooting... So it seems there's some contention between the internal attempt to unmount the stuck NFS filesystem(s), and the reboot system call itself, but if the reboot command is interrupted, then the kernel can get on with its shutdown procedures, and eventually it actually forces the unmount of the stuck NFS filesystem. Another interesting thing to note is that /future/build was also stuck as future.local is offline at this time. However that's the filesystem I tried to clear first by hand with "umount -f /future/build", but that was stuck, apparently in the same call to nfs_reconnect(). 
It seems it had done enough that when the reboot() triggered unmounting that it could complete the unmount without problems. (The other mounts on more.local and once.local were responding so they unmounted normally.) -- Greg A. Woods Kelowna, BC +1 250 762-7675 RoboHack Planix, Inc. Avoncote Farms pgpSyQ4PZfAFq.pgp Description: PGP signature
Re: So it seems "umount -f /nfs/mount" still doesn't work.....
At Tue, 30 Jun 2020 14:28:38 -0700, "Greg A. Woods" wrote: Subject: Re: So it seems "umount -f /nfs/mount" still doesn't work. > > At Tue, 30 Jun 2020 12:52:37 -0700, "Greg A. Woods" wrote: > Subject: So it seems "umount -f /nfs/mount" still doesn't work. > > > So, I should have mentioned that "umount -f nfs.server:/remotefs" does work (i.e. it does not hang waiting for the server to reconnect, and provided that there are no processes with cwd or open files on the remote filesystem, it can unmount the filesystem). I.e. the problem is in how umount(8) looks up the parameters of the mount point. If it looks at the mount point it hangs, but if it looks through the mount table, it works. -- Greg A. Woods Kelowna, BC +1 250 762-7675 RoboHack Planix, Inc. Avoncote Farms pgp_1vuSFtbtv.pgp Description: OpenPGP Digital Signature
USB storage transfers halt when usbdevs is run: hardware bug or software bug?
USB storage device transfers freeze when usbdevs is run: hardware bug or software bug? While I was doing a "gzcat < *.gz > /dev/rsd2d", where sd2 was a USB memory stick, I happened to run "usbdevs -dv" and the writes to the USB device froze, and indeed the writing process was stuck in the kernel (I couldn't even stop it with ^Z). Luckily yanking the stick out seemed to unfreeze and kill the process and clean everything up nicely and I was able to re-insert it and re-do the write to it without incident. This is on an amd64 server running 9.99.64. Upon removal and subsequent re-insertion the kernel said the following (but was silent before this when usbdevs ran): [ 193334.306434] umass0: BBB reset failed, IOERROR [ 193334.306434] umass0: BBB bulk-in clear stall failed, IOERROR [ 193334.318288] umass0: BBB bulk-out clear stall failed, IOERROR [ 193334.318288] umass0: BBB reset failed, IOERROR [ 193334.329223] umass0: BBB bulk-in clear stall failed, IOERROR [ 193334.329223] umass0: BBB bulk-out clear stall failed, IOERROR [ 193334.341024] umass0: BBB reset failed, IOERROR [ 193334.341024] umass0: BBB bulk-in clear stall failed, IOERROR [ 193334.351781] umass0: BBB bulk-out clear stall failed, IOERROR [ 193334.357775] sd2d: error writing fsbn 4053632 of 4053632-4053759 (sd2 bn 4053632; cn 4021 tn 7 sn 23) [ 193334.366963] umass0: BBB reset failed, IOERROR [ 193334.366963] umass0: BBB bulk-in clear stall failed, IOERROR [ 193334.378283] umass0: BBB bulk-out clear stall failed, IOERROR [ 193334.378283] umass0: BBB reset failed, IOERROR [ 193334.389225] umass0: BBB bulk-in clear stall failed, IOERROR [ 193334.389225] umass0: BBB bulk-out clear stall failed, IOERROR [ 193334.401026] umass0: BBB reset failed, IOERROR [ 193334.401026] umass0: BBB bulk-in clear stall failed, IOERROR [ 193334.411782] umass0: BBB bulk-out clear stall failed, IOERROR [ 193334.417780] umass0: BBB reset failed, IOERROR [ 193334.417780] sd2(umass0:0:0:0): generic HBA error [ 193334.426444] sd2: 
detached [ 193334.426444] scsibus1: detached [ 193334.426444] umass0: detached [ 193334.436445] umass0: at uhub6 port 2 (addr 5) disconnected reinsertion: [ 193341.516925] umass0 at uhub6 port 2 configuration 1 interface 0 [ 193341.516925] umass0: SMI Corporation (0x090c) USB DISK (0x1000), rev 2.00/11.00, addr 5 [ 193341.526926] umass0: using SCSI over Bulk-Only [ 193341.526926] scsibus1 at umass0: 2 targets, 1 lun per target [ 193342.366983] sd2 at scsibus1 target 0 lun 0: disk removable [ 193342.376985] sd2: 7712 MB, 15744 cyl, 16 head, 63 sec, 512 bytes/sect x 15794176 sectors [ 193342.386986] sd2: GPT GUID: d1e3490c-b0e6-42e9-9d9e-3ac286a0f7e0 [ 193342.396989] dk6 at sd2: "EFI system", 262144 blocks at 2048, type: msdos [ 193342.396989] dk7 at sd2: "d3aa0396-d911-4aac-baa8-f2478557d31a", 7544832 blocks at 264192, type: ffs I'm guessing it's a software bug with bad locking order somewhere. -- Greg A. Woods Kelowna, BC +1 250 762-7675 RoboHack Planix, Inc. Avoncote Farms pgppIi4jGdYQ5.pgp Description: OpenPGP Digital Signature
Re: style change: explicitly permit braces for single statements
At Sun, 12 Jul 2020 10:01:36 +1000, Luke Mewburn wrote: Subject: style change: explicitly permit braces for single statements > > I propose that the NetBSD C style guide in to /usr/share/misc/style > is reworded to more explicitly permit braces around single statements, > instead of the current discourgement. > > IMHO, permitting braces to be consistently used: > - Adds to clarity of intent. > - Aids code review. > - Avoids gotofail: > https://en.wikipedia.org/wiki/Unreachable_code#goto_fail_bug Well, if you s/permit/require/g, I strongly concur (with possibly one tiny exception allowed in rare cases -- when there's no newline). Personally I don't think there's any good excuse for not always putting braces around all single-statement blocks. The only bad excuse is that the language doesn't strictly require them. People are lazy, I get that (I am too), but in my opinion C is just not really safe without them. -- Greg A. Woods Kelowna, BC +1 250 762-7675 RoboHack Planix, Inc. Avoncote Farms pgplEU3KR6NQt.pgp Description: OpenPGP Digital Signature
Re: style change: explicitly permit braces for single statements
At Mon, 13 Jul 2020 09:48:07 -0400 (EDT), Mouse wrote: Subject: Re: style change: explicitly permit braces for single statements > > Slavishly always > adding them makes it difficult to keep code from walking into the right > margin: These days one really should consider the right margin to be a virtual concept -- there's really no valid reason not to have and use horizontal scrolling (any code editor I'll ever use can do it on any display), and even most any small-ish laptop can have a nice readable font at 50x132, or even 50x160. (i.e. that's another style guide rule that should die) -- Greg A. Woods Kelowna, BC +1 250 762-7675 RoboHack Planix, Inc. Avoncote Farms pgpezCK6Ft2EX.pgp Description: OpenPGP Digital Signature
python3.7 rebuild stuck in kernel in "entropy" during an "import" statement
So I've been running a pkg_rolling-replace and one of the packages being rebuilt is python3.7, and it has got stuck, apparently on an "entropy" wait in the kernel, and it's been in this state for over 24hrs as you can see. The only things the process has open appear to be its stdio descriptors, two of which are open on the log file I was directing all output to. This is on a Xen domU of a machine running: $ uname -a NetBSD xentastic 9.99.81 NetBSD 9.99.81 (XEN3_DOM0) #1: Tue Mar 23 14:39:55 PDT 2021 woods@xentastic:/build/woods/xentastic/current-amd64-amd64-obj/build/src/sys/arch/amd64/compile/XEN3_DOM0 amd64 09:51 [504] $ ps -lwwp 19875 UID PID PPID CPU PRI NI VSZ RSS WCHAN STAT TTY TIME COMMAND 0 19875 11551 0 85 0 55412 11324 entropy I pts/0 0:00.27 ./python -E -Wi /var/package-obj/root/lang/python37/work/.destdir/usr/pkg/lib/python3.7/compileall.py -d /usr/pkg/lib/python3.7 -f -x bad_coding|badsyntax|site-packages|lib2to3/tests/data /var/package-obj/root/lang/python37/work/.destdir/usr/pkg/lib/python3.7 09:51 [505] $ ps -uwwp 19875 USER PID %CPU %MEM VSZ RSS TTY STAT STARTED TIME COMMAND root 19875 0.0 0.1 55412 11324 pts/0 I 9:09PM 0:00.27 ./python -E -Wi /var/package-obj/root/lang/python37/work/.destdir/usr/pkg/lib/python3.7/compileall.py -d /usr/pkg/lib/python3.7 -f -x bad_coding|badsyntax|site-packages|lib2to3/tests/data /var/package-obj/root/lang/python37/work/.destdir/usr/pkg/lib/python3.7 09:51 [506] $ fstat -p 19875 USER CMD PID FD MOUNT INUM MODE SZ|DV R/W root python 19875 wd /build 10645634 drwxr-xr-x 1024 r root python 19875 0 /dev/pts 3 crw--- pts/0 rw root python 19875 1 /build 3721223 -rw-r--r-- 28287492 w root python 19875 2 /build 3721223 -rw-r--r-- 28287492 w 09:51 [507] $ find /build -inum 3721223 /build/packages/root/pkg_roll.out 09:51 [508] $ It was killable -- I sent SIGINT from the tty and it died as expected. 
Running "make replace" gets it stuck in the same place again, and the SIGINT shows the following stack trace:

PYTHONPATH=/var/package-obj/root/lang/python37/work/.destdir/usr/pkg/lib/python3.7 LD_LIBRARY_PATH=/build/package-obj/root/lang/python37/work/Python-3.7.1 ./python -E -Wi /var/package-obj/root/lang/python37/work/.destdir/usr/pkg/lib/python3.7/compileall.py -d /usr/pkg/lib/python3.7 -f -x 'bad_coding|badsyntax|site-packages|lib2to3/tests/data' /var/package-obj/root/lang/python37/work/.destdir/usr/pkg/lib/python3.7
^T
[ 563859.5589422] load: 0.39  cmd: make 15726 [wait] 0.23u 0.07s 0% 9184k
make: Working in: /build/package-obj/root/lang/python37/work/Python-3.7.1
make[1]: Working in: /work/woods/m-NetBSD-pkgsrc-current/lang/python37
make: Working in: /work/woods/m-NetBSD-pkgsrc-current/lang/python37
^T
[ 563866.4606073] load: 0.36  cmd: make 15726 [wait] 0.23u 0.07s 0% 9184k
make: Working in: /work/woods/m-NetBSD-pkgsrc-current/lang/python37
make: Working in: /build/package-obj/root/lang/python37/work/Python-3.7.1
make[1]: Working in: /work/woods/m-NetBSD-pkgsrc-current/lang/python37
^?Traceback (most recent call last):
  File "/var/package-obj/root/lang/python37/work/.destdir/usr/pkg/lib/python3.7/compileall.py", line 20, in
    from concurrent.futures import ProcessPoolExecutor
  File "", line 1032, in _handle_fromlist
  File "/build/package-obj/root/lang/python37/work/.destdir/usr/pkg/lib/python3.7/concurrent/futures/__init__.py", line 43, in __getattr__
    from .process import ProcessPoolExecutor as pe
  File "/build/package-obj/root/lang/python37/work/.destdir/usr/pkg/lib/python3.7/concurrent/futures/process.py", line 53, in
    import multiprocessing as mp
  File "/build/package-obj/root/lang/python37/work/.destdir/usr/pkg/lib/python3.7/multiprocessing/__init__.py", line 16, in
    from . import context
  File "/build/package-obj/root/lang/python37/work/.destdir/usr/pkg/lib/python3.7/multiprocessing/context.py", line 5, in
    from . import process
  File "/build/package-obj/root/lang/python37/work/.destdir/usr/pkg/lib/python3.7/multiprocessing/process.py", line 363, in
    _current_process = _MainProcess()
  File "/build/package-obj/root/lang/python37/work/.destdir/usr/pkg/lib/python3.7/multiprocessing/process.py", line 347, in __init__
    self._config = {'authkey': AuthenticationString(os.urandom(32)),
KeyboardInterrupt
*** Error code 1 (ignored)
*** Signal 2
*** Signal 2

--
Greg A. Woods
Kelowna, BC  +1 250 762-7675  RoboHack
Planix, Inc.  Avoncote Farms
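[Editorial aside: the traceback above shows the hang happening inside os.urandom(32). A minimal sketch (not part of the original thread, and hypothetical in its function name) of how an application could probe the kernel RNG first, using getrandom(2)'s GRND_NONBLOCK where Python exposes it, instead of blocking indefinitely:]

```python
import errno
import os

def entropy_ready(nbytes=32):
    # Probe getrandom(2) with GRND_NONBLOCK where the platform and the
    # Python build expose it; EAGAIN means the pool is not yet ready --
    # the state the python3.7 build above was stuck waiting out.
    if hasattr(os, "getrandom") and hasattr(os, "GRND_NONBLOCK"):
        try:
            os.getrandom(nbytes, os.GRND_NONBLOCK)
        except BlockingIOError:
            return False
        except OSError as e:
            if e.errno == errno.EAGAIN:
                return False
            raise
    return True

if entropy_ready():
    key = os.urandom(32)  # safe now: the kernel pool can satisfy this
else:
    print("kernel entropy pool not ready; seed it before continuing")
```

On a system in the state described above this would report the pool as not ready rather than hanging inside an import.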
nothing contributing entropy in Xen domUs? (causing python3.7 rebuild to get stuck in kernel in "entropy" during an "import" statement)
) blocking due to lack of entropy
[ 563844.834413] entropy: pid 7903 (python) blocking due to lack of entropy
[ 566365.511377] entropy: pid 9001 (python) blocking due to lack of entropy
[ 577473.897830] entropy: pid 9350 (python) blocking due to lack of entropy
[ 579179.381600] entropy: pid 25728 (od) blocking due to lack of entropy
[ 579186.994440] entropy: pid 11107 (cat) blocking due to lack of entropy
[ 579202.264290] entropy: pid 7248 (cat) blocking due to lack of entropy
[ 579669.831978] entropy: ready

--
Greg A. Woods
Kelowna, BC  +1 250 762-7675  RoboHack
Planix, Inc.  Avoncote Farms

At Tue, 30 Mar 2021 10:06:19 -0700, "Greg A. Woods" wrote:
Subject: python3.7 rebuild stuck in kernel in "entropy" during an "import" statement
> So I've been running a pkg-rolling_replace and one of the packages being
> rebuilt is python3.7, and it has got stuck, apparently on an "entropy"
> wait in the kernel, and it's been in this state for over 24hrs as you
> can see.
>
> The only things the process has open appear to be its stdio descriptors,
> two of which are open on the log file I was directing all output to.
> > This is on a Xen domU of a machine running: > > $ uname -a > NetBSD xentastic 9.99.81 NetBSD 9.99.81 (XEN3_DOM0) #1: Tue Mar 23 14:39:55 > PDT 2021 > woods@xentastic:/build/woods/xentastic/current-amd64-amd64-obj/build/src/sys/arch/amd64/compile/XEN3_DOM0 > amd64 > > > 09:51 [504] $ ps -lwwp 19875 > UID PID PPID CPU PRI NI VSZ RSS WCHAN STAT TTY TIME COMMAND > 0 19875 11551 0 85 0 55412 11324 entropy Ipts/0 0:00.27 ./python -E > -Wi > /var/package-obj/root/lang/python37/work/.destdir/usr/pkg/lib/python3.7/compileall.py > -d /usr/pkg/lib/python3.7 -f -x > bad_coding|badsyntax|site-packages|lib2to3/tests/data > /var/package-obj/root/lang/python37/work/.destdir/usr/pkg/lib/python3.7 > 09:51 [505] $ ps -uwwp 19875 > USER PID %CPU %MEM VSZ RSS TTY STAT STARTEDTIME COMMAND > root 19875 0.0 0.1 55412 11324 pts/0 I 9:09PM 0:00.27 ./python -E -Wi > /var/package-obj/root/lang/python37/work/.destdir/usr/pkg/lib/python3.7/compileall.py > -d /usr/pkg/lib/python3.7 -f -x > bad_coding|badsyntax|site-packages|lib2to3/tests/data > /var/package-obj/root/lang/python37/work/.destdir/usr/pkg/lib/python3.7 > 09:51 [506] $ fstat -p 19875 > USER CMD PID FD MOUNT INUM MODE SZ|DV R/W > root python 19875 wd /build10645634 drwxr-xr-x1024 r > root python 198750 /dev/pts 3 crw--- pts/0 rw > root python 198751 /build 3721223 -rw-r--r-- 28287492 w > root python 198752 /build 3721223 -rw-r--r-- 28287492 w > 09:51 [507] $ find /build -inum 3721223 > /build/packages/root/pkg_roll.out > 09:51 [508] $ > > > It was killable -- I sent SIGINT from the tty and it died as expected. 
> > > Running "make replace" gets it stuck in the same place again, an the > SIGINT shows the following stack trace: > > PYTHONPATH=/var/package-obj/root/lang/python37/work/.destdir/usr/pkg/lib/python3.7 > LD_LIBRARY_PATH=/build/package-obj/root/lang/python37/work/Python-3.7.1 > ./python -E -Wi > /var/package-obj/root/lang/python37/work/.destdir/usr/pkg/lib/python3.7/compileall.py > -d /usr/pkg/lib/python3.7 -f -x > 'bad_coding|badsyntax|site-packages|lib2to3/tests/data' > /var/package-obj/root/lang/python37/work/.destdir/usr/pkg/lib/python3.7 > ^T > [ 563859.5589422] load: 0.39 cmd: make 15726 [wait] 0.23u 0.07s 0% 9184k > make: Working in: /build/package-obj/root/lang/python37/work/Python-3.7.1 > make[1]: Working in: /work/woods/m-NetBSD-pkgsrc-current/lang/python37 > make: Working in: /work/woods/m-NetBSD-pkgsrc-current/lang/python37 > ^T > [ 563866.4606073] load: 0.36 cmd: make 15726 [wait] 0.23u 0.07s 0% 9184k > make: Working in: /work/woods/m-NetBSD-pkgsrc-current/lang/python37 > make: Working in: /build/package-obj/root/lang/python37/work/Python-3.7.1 > make[1]: Working in: /work/woods/m-NetBSD-pkgsrc-current/lang/python37 > ^?Traceback (most recent call last): > File > "/var/package-obj/root/lang/python37/work/.destdir/usr/pkg/lib/python3.7/compileall.py", > line 20, in > from concurrent.futures import ProcessPoolExecutor > File "", line 1032, in _handle_fromlist > File > "/build/package-obj/root/lang/python37/work/.destdir/usr/pkg/lib/python3.7/concurrent/futures/__init__.py", > line 43, in __getattr__ > from .process import ProcessPoolExecutor as pe > File > "/build/package-obj/root/lang/python37/work/.destdir/usr/pkg/lib/python3.7/concurrent/futures
Re: nothing contributing entropy in Xen domUs? (causing python3.7 rebuild to get stuck in kernel in "entropy" during an "import" statement)
At Tue, 30 Mar 2021 23:53:43 +0200, Manuel Bouyer wrote:
Subject: Re: nothing contributing entropy in Xen domUs? (causing python3.7 rebuild to get stuck in kernel in "entropy" during an "import" statement)
> On Tue, Mar 30, 2021 at 02:40:18PM -0700, Greg A. Woods wrote:
> > [...]
> > Perhaps the answer is that nothing seems to be contributing anything to
> > the entropy pool.  No matter what device I exercise, none of the numbers
> > in the following changes:
>
> yes, it's been this way since the rnd rototill. Virtual devices are
> not trusted.
>
> The only way is to manually seed the pool.

Ah, so that is definitely not what I expected!  Previously wasn't it up to the local admin to decide what to trust?

I guess throwing bits into /dev/random is one way to play that game, but I have to trust the dom0 implicitly and utterly anyway, so why not trust the devices it presents?  This is especially true for xbd block devices.  All my blocks are belong to dom0.

The network device is in effect no different than if it were real hardware, so if I want to trust network traffic, then I should be able to enable it, just as I could if it were real hardware.

The CPUs are also probably the least "virtual" things in Xen, so why not trust them?  (Though I'm not sure I understand what entropy they can offer in the first place.)

Finally, if the system isn't actually collecting entropy from a device, then why the heck does it allow me to think it is (i.e. by allowing me to enable it and show it as enabled and collecting via "rndctl -l")?

--
Greg A. Woods
Kelowna, BC  +1 250 762-7675  RoboHack
Planix, Inc.  Avoncote Farms
Re: nothing contributing entropy in Xen domUs? (causing python3.7 rebuild to get stuck in kernel in "entropy" during an "import" statement)
[[ Sorry, I've not been catching up on mailing list discussions as fast as I had hoped to, and I'm way behind on following the entropy rototill. ]]

At Wed, 31 Mar 2021 00:12:31 +0000, Taylor R Campbell wrote:
Subject: Re: nothing contributing entropy in Xen domUs? (causing python3.7 rebuild to get stuck in kernel in "entropy" during an "import" statement)
> This is false.  If the VM host provided a viornd(4) device then NetBSD
> would automatically collect, and count, entropy from the host, with no
> manual intervention.

I'll leave that idea to others more up-to-date on Xen PV drivers to respond to.  Booting a -current GENERIC kernel (which has both Xen PV and virtio(4) devices configured into it) in a "type='pvh'" domU only attaches the xenbus PV devices, no virtio devices, so adding virtio might be a much bigger task that will need further support on at least the backend, and perhaps on the front-end too, especially to do it without QEMU.  I haven't tried whether virtio devices show up in an HVM domU, precisely because I'm trying to avoid having to run and rely on QEMU (never mind any performance implications of HVM).

> > Finally, if the system isn't actually collecting entropy from a device,
> > then why the heck does it allow me to think it is (i.e. by allowing me
> > to enable it and show it as enabled and collecting via "rndctl -l")?
>
> The system does collect samples from all those devices.  However, they
> are not designed to be unpredictable and there is no good reliable
> model for just how unpredictable they are, so the system doesn't
> _count_ anything from them.  See https://man.NetBSD.org/entropy.4 for
> a high-level overview.

I'm not sure the word "count" appears in entropy(4) in any context I can make sense of w.r.t. what it means to "collect" but not "count" entropy from those devices.  Worse, the "Flags" shown by "rndctl -l" don't seem to be directly documented (i.e.
they're not described in rndctl(8)), and even on a kernel running on real hardware I don't see the word "count" showing there.  After looking at the source I'm not sure the descriptions of the RND_FLAG_* values in rnd(4) help me much either.

Based on my vague understanding of all of this, perhaps you meant to say "estimate" instead of "count"?  That would make more sense in the context of what I read in rnd(4) and rndctl(8), though "estimate" still seems a little vague in meaning to me.

In any case, I don't see why an xbd disk, or a xennet interface, can't be treated exactly as if they were real hardware (i.e. in terms of extracting entropy from their behaviour).  This is exactly what virtualization is all about to me -- even for paravirtualization.  After all, in a threat-free world (i.e. specifically where I also trust other domUs) their entropy is going to reflect (though maybe not exactly mirror) the entropy of the underlying hardware and/or network traffic.

So (but maybe not by default) if I as the admin want to trust the entropy available from an xbd(4) or xennet(4) device, then I should be able to enable it with rndctl(8) and have it "count".

More importantly though, the system shouldn't mislead me into thinking it is "counting" entropy from a device when it is actually not.  If I had seen that there were no sources estimating/counting/whatever entropy, and I had tried to enable one and been given a nice error message about this not being possible, then I would have looked elsewhere to find out how to give the system more bits of entropy.  As is, in my Xen domU system the output of "rndctl -l" leads me to believe all of my devices are collecting both timing and value samples, and using either one or the other to gather entropy (though with '-v' I don't see that any bits of entropy have been added from any of those many millions of collected samples).

--
Greg A. Woods
Kelowna, BC  +1 250 762-7675  RoboHack
Planix, Inc.  Avoncote Farms
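[Editorial aside: the collect-versus-count distinction argued over above can be made concrete with a toy model (my own illustration, not NetBSD kernel code): every enabled source's samples are mixed into the pool, but only sources flagged to estimate reduce the number of bits the kernel still considers itself to need -- which is why a domU can mix in millions of xbd/xennet samples and still show kern.entropy.needed = 256:]

```python
NEEDED = 256  # bits, as in kern.entropy.needed

def feed(sources):
    # sources: (name, collect, estimate, samples, claimed_bits)
    mixed = 0
    needed = NEEDED
    for name, collect, estimate, samples, bits in sources:
        if not collect:
            continue
        mixed += samples                    # samples are always mixed in
        if estimate:
            needed = max(0, needed - bits)  # only "estimating" sources count
    return mixed, needed

# Hypothetical domU sources: xbd/xennet collect but don't estimate.
srcs = [
    ("xbd0",    True, False, 1_000_000, 0),
    ("xennet0", True, False, 500_000,   0),
    ("seed",    True, True,  1,         256),
]
print(feed(srcs[:2]))  # (1500000, 256): millions of samples, still blocked
print(feed(srcs))      # (1500001, 0): only the seed is counted
```

The numbers of samples and bits here are made up; the point is only that "collecting" and "counting" are separate axes, matching the behaviour described in the thread.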
Re: nothing contributing entropy in Xen domUs? or dom0!!!
Vendor ID: "GenuineIntel"; CPUID level 11

Intel-specific functions:
Version 000206c2:
Type 0 - Original OEM
Family 6 - Pentium Pro
Model 12 - Stepping 2
Reserved 8

Extended brand string: "Intel(R) Xeon(R) CPU E5645 @ 2.40GHz"
CLFLUSH instruction cache line size: 8
Initial APIC ID: 34
Hyper threading siblings: 32

Feature flags 1fc9cbf5:
FPU    Floating Point Unit
DE     Debugging Extensions
TSC    Time Stamp Counter
MSR    Model Specific Registers
PAE    Physical Address Extension
MCE    Machine Check Exception
CX8    COMPXCHG8B Instruction
APIC   On-chip Advanced Programmable Interrupt Controller present and enabled
SEP    Fast System Call
MCA    Machine Check Architecture
CMOV   Conditional Move and Compare Instructions
FGPAT  Page Attribute Table
CLFSH  CFLUSH instruction
ACPI   Thermal Monitor and Clock Ctrl
MMX    MMX instruction set
FXSR   Fast FP/MMX Streaming SIMD Extensions save/restore
SSE    Streaming SIMD Extensions instruction set
SSE2   SSE2 extensions
SS     Self Snoop
HT     Hyper Threading

TLB and cache info:
5a: unknown TLB/cache descriptor
03: Data TLB: 4KB pages, 4-way set assoc, 64 entries
55: unknown TLB/cache descriptor
ff: unknown TLB/cache descriptor
b2: unknown TLB/cache descriptor
f0: unknown TLB/cache descriptor
ca: unknown TLB/cache descriptor
Processor serial: 0002-06C2----

I noted today though that entropy doesn't seem to be accumulating even in the dom0, despite there being many useful sources configured to both collect and "estimate", _and_ despite the fact there's a valid-looking $random_file that was saved and reloaded by /etc/rc.d/random_seed (and saved again every day by /etc/security):

# /etc/rc.d/random_seed rcvar
# random_seed
random_seed=YES
# ls -l /etc/entropy-file
-rw-------  1 root  wheel  536 Mar 31 04:15 /etc/entropy-file
# rndctl -l
Source                 Bits Type      Flags
ipmi0-Temp0               0 env       estimate, collect, v, t, dv, dt
ipmi0-Temp1               0 env       estimate, collect, v, t, dv, dt
ipmi0-Temp2               0 env       estimate, collect, v, t, dv, dt
ipmi0-Temp3               0 env       estimate, collect, v, t, dv, dt
ipmi0-Ambient-T           0 env       estimate, collect, v, t, dv, dt
ipmi0-Planar-Te           0 env       estimate, collect, v, t, dv, dt
ipmi0-FAN-MOD-1           0 env       estimate, collect, v, t, dv, dt
ipmi0-FAN-MOD-1           0 env       estimate, collect, v, t, dv, dt
ipmi0-FAN-MOD-2           0 env       estimate, collect, v, t, dv, dt
ipmi0-FAN-MOD-2           0 env       estimate, collect, v, t, dv, dt
ipmi0-FAN-MOD-3           0 env       estimate, collect, v, t, dv, dt
ipmi0-FAN-MOD-3           0 env       estimate, collect, v, t, dv, dt
ipmi0-FAN-MOD-4           0 env       estimate, collect, v, t, dv, dt
ipmi0-Status              0 ???       estimate, collect, t, dt
ipmi0-Voltage             0 power     estimate, collect, v, t, dv, dt
ipmi0-Voltage1            0 power     estimate, collect, v, t, dv, dt
ipmi0-Status1             0 ???       estimate, collect, t, dt
ipmi0-Intrusion           0 ???       estimate, collect, t, dt
ipmi0-Temp4               0 env       estimate, collect, v, t, dv, dt
ipmi0-Temp5               0 env       estimate, collect, v, t, dv, dt
ipmi0-Temp6               0 env       estimate, collect, v, t, dv, dt
ipmi0-FAN-MOD-4           0 env       estimate, collect, v, t, dv, dt
ipmi0-FAN-MOD-5           0 env       estimate, collect, v, t, dv, dt
ipmi0-FAN-MOD-5           0 env       estimate, collect, v, t, dv, dt
ipmi0-Ambient-T           0 env       estimate, collect, v, t, dv, dt
ipmi0-Ambient-T           0 env       estimate, collect, v, t, dv, dt
ums0                      0 tty       estimate, collect, v, t, dt
ukbd0                     0 tty       estimate, collect, v, t, dt
/dev/random               0 ???       estimate, collect, v
sd2                       0 disk      estimate, collect, v, t, dt
sd1                       0 disk      estimate, collect, v, t, dt
sd0                       0 disk      estimate, collect, v, t, dt
cpu0                      0 vm        estimate, collect, v, t, dv
hardclock                 0 skew      estimate, collect, t
pckbd0                    0 tty       estimate, collect, v, t, dt
system-power              0 power     estimate, collect, v, t, dt
autoconf                  0 ???       estimate, collect, t
seed                      0 ???       estimate, collect, v
# sysctl kern.entropy
kern.entropy.collection = 1
kern.entropy.depletion = 0
kern.entropy.consolidate = -23552
kern.entropy.gather = -23552
kern.entropy.needed = 256
kern.entropy.pending = 0
kern.entropy.epoch = 19

--
Greg A. Woods
Kelowna, BC  +1 250 762-7675  RoboHack
Planix, Inc.  Avoncote Farms
Re: nothing contributing entropy in Xen domUs? or dom0!!!
At Thu, 1 Apr 2021 04:13:59 +0000 (UTC), RVP wrote:
Subject: Re: nothing contributing entropy in Xen domUs? or dom0!!!
> Does this /etc/entropy-file match what's there in your /boot.cfg?
>
> On my laptop $random_file is left at the default which is:
> /var/db/entropy-file

Yes, I did change that as well (as /var isn't part of the root partition).  However that's not the problem for the dom0.

"rndseed" isn't currently used (at least not by me or any documentation I'm aware of) when loading (multibooting) a Xen kernel and a NetBSD dom0 kernel.  /etc/rc.d/random_seed will do this (again) later anyway.

However, since as I showed the hardware doesn't seem to be providing entropy that can be "counted" ("estimated"), there's nothing to save, and so nothing to load on the next boot either.

I know how to seed it -- but that's not the problem -- the hardware should be providing plenty of entropy.

--
Greg A. Woods
Kelowna, BC  +1 250 762-7675  RoboHack
Planix, Inc.  Avoncote Farms
Re: UVM behavior under memory pressure
At Thu, 1 Apr 2021 21:03:37 +0200, Manuel Bouyer wrote:
Subject: UVM behavior under memory pressure
> Of course the system is very slow
> Shouldn't UVM choose, in this case, to reclaim pages from the file cache
> for the process data ?
> I'm using the default vm.* sysctl values.

I almost never use the default vm.* values.

I would guess the main problem for your system's memory requirements, at the time you showed it, is that the default for vm.anonmin is way too low, and so raising vm.anonmin might help.  If vm.anonmin isn't high enough then the pager won't sacrifice other requirements already in play for anon pages.

Lowering vm.filemax (and maybe also vm.filemin) might also help, since your system, at that time, appeared to be doing far less I/O on large numbers of files than, say, a web server or a compile server might be doing.  However with almost 3G dedicated to the file cache it would seem your system did recently trawl through a lot of file data, and so with a lower vm.filemax less of it would have been kept as pressure for other types of memory increased.

Here are the values I use, with comments about why, from my default /etc/sysctl.conf.  These have worked reasonably well for me for years, though I did have a virtual machine struggle to do some builds when I ran too many make jobs in parallel and then a gargantuan compiler job came along and needed too much memory.  However there was enough swap and eventually it thrashed its way through; more importantly I was still able to run commands, albeit slowly, and my one large interactive process (emacs) sometimes took quite a while to wake up and respond.

# N.B.: On a live system make sure to order changes to these values so that you
# always lower any values from their default first, and then raise any that are
# to be raised above their defaults.  This way, the sum of the minimums will
# stay within the 95% limit.

# the minimum percentage of memory always (made) available for the
# file data cache
#
# The default is 10, which is much too high, even for a large-memory
# system...
#
vm.filemin=5

# the maximum percentage of memory that will be reclaimed from other uses for
# file data cache
#
# The default is 50, which may be too high for small-memory systems but may be
# about right for large-memory systems...
#
#vm.filemax=25

# the minimum percentage of memory always (made) available for anonymous pages
#
# The default is 10, which is way too low...
#
vm.anonmin=40

# the maximum percentage of memory that will be reclaimed from other uses for
# anonymous pages
#
# The default is 80, which seems just about right, but then again it's unlikely
# that the majority of inactive anonymous pages will ever be reactivated so
# maybe this should be lowered?
#
#vm.anonmax=80

# the minimum percentage of memory always (made) available for text pages
#
# The default is 5, which may be far too low on small-RAM systems...
#
vm.execmin=20

# the maximum percentage of memory that will be reclaimed from other uses for
# text pages
#
# The default is 30, which may be too low, esp. for big programs on small-memory
# systems...
#
vm.execmax=40

# It may also be useful to set the bufmem high-water limit to a number which may
# actually be less than 5% (vm.bufcache / options BUFCACHE) on large-memory
# systems (as BUFCACHE cannot be set below 5%).
#
# note this value is given in bytes.
#
#vm.bufmem_hiwater=

--
Greg A. Woods
Kelowna, BC  +1 250 762-7675  RoboHack
Planix, Inc.  Avoncote Farms
Re: UVM behavior under memory pressure
At Thu, 1 Apr 2021 23:15:42 +0200, Manuel Bouyer wrote: Subject: Re: UVM behavior under memory pressure > > Yes, I understand this. But, in an emergency situation like this one (there > is no free ram, swap is full, openscad eventually gets killed), > I would expect the pager to reclaim pages where it can; > like file cache (down to vm.filemin, I agree it shouldn't go down to 0). > > In my case, vm.anonmax is at 80%, and I suspect it was not reached > (I tried to increase it to 90% but this didn't change anything). As I understand things there's no point to increasing any vm.*max value unless it is already way too low and you want more memory to be used for that category and there's not already more use in other categories (i.e. where a competing vm.*max value is too high). It is the vm.*min value for the desired category that isn't high enough to allow that category to claim more pages from the less desired categories. I.e. if vm.anonmin is too low, and I believe the default of 10% is way too low, then when file I/O gets busy for whatever reason, (and with the default rather high vm.filemax value) large processes _will_ get partially paged out as only 10% of their memory will be kept activated. Simultaneously decreasing vm.filemax and increasing vm.anonmin should guarantee more memory can be dedicated to processes needing it as opposed to allowing file caching to take over. I think in general the vm.*max limits (except maybe vm.filemax) are only really interesting on very small memory systems and/or on systems with very specific types of uses which might demand more pages of one category or the other. The default vm.filemax value on the other hand may be too high for systems that don't _constantly_ do a lot of file I/O _and_ access many of the same files more than once. 
So if you regularly run large processes that don't necessarily do a whole lot of file I/O then you want to reduce vm.filemax, perhaps quite a lot, maybe even to just barely above vm.filemin; and of course you want to increase vm.anonmin.

One early guide suggested (with my comments):

vm.execmin=2	# this is too low if your progs are huge code
vm.execmax=4	# but this should probably be as much as 20
vm.filemin=0
vm.filemax=1	# too low for compiling, web serving, etc.
vm.anonmin=70
vm.anonmax=95

Note that increasing vm.anonmin won't dedicate memory to anon pages if they're not currently needed, of course, but it will guarantee at least that much memory will be made available, and kept available, when and if pressure for anon pages increases.

So all of these limits are not "hard limits", nor are they dedicated allocations per se.  A given category can use more pages than its max limit, at least until some other category experiences pressure, i.e. until the page daemon is woken.  (Just keep in mind that the sum of the lower (vm.*min) limits cannot currently exceed 95%.  The total of the upper (vm.*max) limits can be more than 100%, but there are caveats to such a state.)

Also, if you have a really large memory machine and you don't have processes that wander through huge numbers of files, then you might also want to lower vm.bufcache so that it's not wasted.

--
Greg A. Woods
Kelowna, BC  +1 250 762-7675  RoboHack
Planix, Inc.  Avoncote Farms
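[Editorial aside: the constraints described in this message -- the vm.*min values may sum to at most 95%, and each min should not exceed its corresponding max -- can be sanity-checked before committing a new set of values to sysctl.conf. A small sketch (the helper name is mine, not a NetBSD tool), using the example values from the early guide quoted above:]

```python
def check_vm_tuning(settings):
    # settings: bare names ("anonmin", "filemax", ...) -> percentage
    mins = {k: v for k, v in settings.items() if k.endswith("min")}
    total = sum(mins.values())
    if total > 95:
        raise ValueError("sum of vm.*min is %d%%, exceeds the 95%% limit" % total)
    for k, lo in mins.items():
        hi = settings.get(k[:-3] + "max")
        if hi is not None and lo > hi:
            raise ValueError("%s=%d exceeds %smax=%d" % (k, lo, k[:-3], hi))
    return total

guide = {
    "execmin": 2,  "execmax": 4,
    "filemin": 0,  "filemax": 1,
    "anonmin": 70, "anonmax": 95,
}
print(check_vm_tuning(guide))  # 72: total of the minimums, well under 95%
```

This only encodes the two rules stated in the thread; it says nothing about whether the chosen percentages suit any particular workload.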
regarding the changes to kernel entropy gathering
So, I'm not sure what to say here.  I'm very surprised, quite confused, more than a little perturbed, and even somewhat angry.  It's taken me quite some time to write this.

Now temper this with knowing that I do know I'm running -current, not a release, and that I accept the challenges this might cause (thus see the patch below).

Updating a system, even on -current, shouldn't cause what I can only describe as _intentional_ breakage, even for matters so important as system security and integrity, and especially not without clear mention in UPDATING, and perhaps also with documented and referenced tools to assist in undoing said breakage.

Updating a system, even on -current, shouldn't create a long-lived situation where the system documentation and the behaviour and actions of system commands are completely out of sync with the behaviour of the kernel, and in fact lie to the administrator about the abilities of the system.

In any case, the following patch (and in particular the last hunk) fixes all my problems and complaints in this domain.  It is fully tested, and it works A-OK with Xen in both domU and dom0 kernels.  My systems once again have consistent documentation, and tools that don't lie, and are able to function as before w.r.t. matters related to /dev/random and getrandom(2).

Now I'm not proposing this as the final solution -- I think there's some middle ground to be found, but at least this gets things back to working.

--- sys/kern/kern_entropy.c.~1.30.~	2021-03-07 17:23:05.0 -0800
+++ sys/kern/kern_entropy.c	2021-04-03 11:25:31.667067667 -0700
@@ -1306,7 +1306,7 @@
 	/* Wait for some entropy to come in and try again. */
 	KASSERT(E->stage >= ENTROPY_WARM);
-	printf("entropy: pid %d (%s) blocking due to lack of entropy\n",
+	printf("entropy: pid %d (%s) blocking due to lack of entropy\n", /* xxx uprintf() instead/also? */
 	    curproc->p_pid, curproc->p_comm);
 	if (ISSET(flags, ENTROPY_SIG)) {
@@ -1577,6 +1577,16 @@
 	KASSERT(i == __arraycount(extra));
 	entropy_enter(extra, sizeof extra, 0);
 	explicit_memset(extra, 0, sizeof extra);
+
+	aprint_verbose("entropy: %s attached as an entropy source (", rs->name);
+	if (!(flags & RND_FLAG_NO_COLLECT)) {
+		printf("collecting");
+		if (flags & RND_FLAG_NO_ESTIMATE)
+			printf(" without estimation");
+	}
+	else
+		printf("off");
+	printf(")\n");
 }
 
 /*
@@ -1610,6 +1620,8 @@
 	/* Free the per-CPU data. */
 	percpu_free(rs->state, sizeof(struct rndsource_cpu));
+
+	aprint_verbose("entropy: %s detached as an entropy source\n", rs->name);
 }
 
 /*
@@ -1754,21 +1766,21 @@
 rnd_add_uint32(struct krndsource *rs, uint32_t value)
 {
 
-	rnd_add_data(rs, &value, sizeof value, 0);
+	rnd_add_data(rs, &value, sizeof value, sizeof value * NBBY);
 }
 
 void
 _rnd_add_uint32(struct krndsource *rs, uint32_t value)
 {
 
-	rnd_add_data(rs, &value, sizeof value, 0);
+	rnd_add_data(rs, &value, sizeof value, sizeof value * NBBY);
 }
 
 void
 _rnd_add_uint64(struct krndsource *rs, uint64_t value)
 {
 
-	rnd_add_data(rs, &value, sizeof value, 0);
+	rnd_add_data(rs, &value, sizeof value, sizeof value * NBBY);
 }
 
 /*

--
Greg A. Woods
Kelowna, BC  +1 250 762-7675  RoboHack
Planix, Inc.  Avoncote Farms
Re: regarding the changes to kernel entropy gathering
At Sun, 4 Apr 2021 09:49:58 +0000, Taylor R Campbell wrote:
Subject: Re: regarding the changes to kernel entropy gathering
> > Date: Sat, 03 Apr 2021 12:24:29 -0700
> > From: "Greg A. Woods"
> >
> > Updating a system, even on -current, shouldn't create a long-lived
> > situation where the system documentation and the behaviour and actions
> > of system commands is completely out of sync with the behaviour of the
> > kernel, and in fact lies to the administrator about the abilities of the
> > system.
>
> It would help if you could identify specifically what you are calling
> a lie.
>
> > @@ -1754,21 +1766,21 @@
> >  rnd_add_uint32(struct krndsource *rs, uint32_t value)
> >  {
> >
> > -	rnd_add_data(rs, &value, sizeof value, 0);
> > +	rnd_add_data(rs, &value, sizeof value, sizeof value * NBBY);
> >  }
>
> The rnd_add_uint32 function is used by drivers to feed in data from
> sources _with no known model for their entropy_.

Indeed -- that's the idea.

> It's how drivers
> toss in data that might be helpful but might be totally predictable, and
> the driver has no way to know.

Yeah, so?  They don't need to know this.  I'm not actually asking random drivers to decide the amount of physical entropy they can collect.  That is controlled elsewhere.

> Your change _creates_ the lie that every bit of data entered this way
> is drawn from a source with independent uniform distribution.

No, my change _allows_ the administrator to decide which devices can be used as estimating/counting entropy sources.  For example I know that many of the devices on almost all of my machines (virtual or otherwise) are equally good sources of entropy for their uses.

An additional change, one which I would also find totally acceptable, would be to disable the current default of allowing "estimation" on devices which are not true hardware RNGs.  I.e. maybe this simple change would suffice (though I haven't checked beyond a quick grep to see that this flag is the most commonly used one -- perhaps some real RNG devices could also be changed to use explicit flags to enable estimation by default):

--- sys/sys/rndio.h.~1.2.~	2016-07-23 14:36:45.0 -0700
+++ sys/sys/rndio.h	2021-04-04 12:39:15.609936311 -0700
@@ -91,8 +91,7 @@
 #define RND_FLAG_ESTIMATE_TIME	0x4000	/* estimate entropy on time */
 #define RND_FLAG_ESTIMATE_VALUE	0x8000	/* estimate entropy on value */
 #define RND_FLAG_HASENABLE	0x0001	/* has enable/disable fns */
-#define RND_FLAG_DEFAULT	(RND_FLAG_COLLECT_VALUE|RND_FLAG_COLLECT_TIME|\
-	RND_FLAG_ESTIMATE_TIME)
+#define RND_FLAG_DEFAULT	(RND_FLAG_COLLECT_VALUE|RND_FLAG_COLLECT_TIME)
 #define RND_TYPE_UNKNOWN	0	/* unknown source */
 #define RND_TYPE_DISK		1	/* source is physical disk */

There are a vast number of ways this re-tooling of entropy collection could have been done better.  I'm asking for discussion on what amount to some VERY simple changes which completely and totally solve many real-world uses of this code while at the same time not just allowing, but defaulting to, the very strict and secure operation for special situations.

--
Greg A. Woods
Kelowna, BC  +1 250 762-7675  RoboHack
Planix, Inc.  Avoncote Farms
Re: regarding the changes to kernel entropy gathering
At Sun, 04 Apr 2021 23:47:10 +0700, Robert Elz wrote:
Subject: Re: regarding the changes to kernel entropy gathering
> If we want really good security, I'd submit we need to disable
> the random seed file, and RDRAND (and anything similar) until we
> have proof that they're perfect.

Indeed, I concur.

I trust the randomness and in-observability and isolation of the behaviour of my system's fans far more than I would trust Intel's RDRAND or RDSEED instructions.  I even trust the randomness of the timings of the virtual disks in my Xen domU virtual machines more so, even with multiple sibling guests, even if some of those other guests can be influenced by untrusted third parties at critical times.

> Personally, I'm happy with anything that your average high school
> student is unlikely to be able to crack in an hour.  I don't run
> a bank, or a military installation, and I'm not the NSA.  If someone
> is prepared to put in the effort required to break into my systems,
> then let them, it isn't worth the cost to prevent that tiny chance.
> That's the same way that my house has ordinary locks - I'm sure they
> can be picked by someone who knows what they're doing, and better security
> is available, at a price, but a nice happy medium is what fits me best.

Indeed again.

--
Greg A. Woods
Kelowna, BC  +1 250 762-7675  RoboHack
Planix, Inc.  Avoncote Farms
Re: regarding the changes to kernel entropy gathering
At Sun, 04 Apr 2021 21:14:31 +0200 (CEST), Havard Eidnes wrote: Subject: Re: regarding the changes to kernel entropy gathering > > Do note, the existing randomness sources are still being sampled and > mixed into the pool, so even if the starting state from the saved > entropy may be known (by violating the security of the storage), > it's still not possible to predict the complete stream of randomness > data once the system has seen a bit of uptime (given that there are > actual other sources of (unverified) entropy which aren't all of too > low quality). No amount of uptime and activity was increasing the entropy in my system before I patched it. /dev/random remained blocked after days of busy system activity. I would argue that most, if not all, of the sources of entropy identified by rndctl(8) on my systems are high-quality and secure sources in my circumstances and for my uses. Perhaps the unpatched implementation isn't doing exactly what you think it is? The unpatched implementation completely and entirely prevents the system from ever using any of those sources, despite showing that they are enabled for use. > However, in the new scheme of things, because most of the > traditional sources have unknown quality, and we have no reliable > method to estimate how much "actual entropy" those sources > provide, they no longer count towards the *estimate* of what is > now a lower bound on the "real" entropy available in the pool. It really doesn't matter what can be determined in general and from a distance. What matters is what a given administrator can determine in particular for a given application in a given circumstance. Before my patch the system was not behaving as documented and could not be made to behave as the documentation said it could be made to behave. With my patch I can choose which to trust from amongst the available sources. Without that patch my choices are ignored and the system lies to me about using my choices. 
I would argue my patch fixes a critical bug. > Besides, the implementation has been thoroughly vetted. E.g. the > reference [7] from the wikipedia article states in the conclusion on > page 20 > >Overall, the Ivy Bridge RNG is a robust design with a large >margin of safety that ensures good random data is generated even >if the Entropy Source is not operating as well as predicted. "design" != implementation -- Greg A. Woods Kelowna, BC +1 250 762-7675 RoboHack Planix, Inc. Avoncote Farms pgpjs3QaPXmot.pgp Description: OpenPGP Digital Signature
Re: regarding the changes to kernel entropy gathering
At Sun, 4 Apr 2021 16:39:11 -0400 (EDT), Mouse wrote: Subject: Re: regarding the changes to kernel entropy gathering > > > No amount of uptime and activity was increasing the entropy in my > > system before I patched it. > > As I understand it, entropy was being contributed. What wasn't > happening was the random driver code recognizing and acknowledging that > entropy, because it had no way to tell how much of it there really was. Clearly there was no entropy being contributed in any way shape or form. It wasn't the driver code at fault. It was the code I fixed with my patch that was at fault. I told the system to "count" the entropy being gathered by the appropriate driver(s), but it was being ignored entirely. After my fix the system behaved as I told it to. -- Greg A. Woods Kelowna, BC +1 250 762-7675 RoboHack Planix, Inc. Avoncote Farms pgpKRv3dDs3Kt.pgp Description: OpenPGP Digital Signature
Re: regarding the changes to kernel entropy gathering
At Mon, 05 Apr 2021 00:07:49 +0200 (CEST), Havard Eidnes wrote:
Subject: Re: regarding the changes to kernel entropy gathering
>
> Indeed, that's also compatible with what I wrote.  The samples
> from whatever sources you have are still being mixed into the
> pool, but they are not being counted as contributing to the
> entropy estimate, because the quality of the samples is at best
> unknown.

Perhaps we're talking past each other?

Until I made the fix, no amount of time or activity, or of me telling
the system to make use of the driver inputs, was unblocking getrandom(2)
or /dev/random, so it doesn't really matter if anything was being "mixed
into the pool", so to speak, as the pool was effectively empty.

> A possible workaround is, once you have some uptime and some bits
> mixed into the pool, you can do:

I don't need a work-around -- I found a fix.  I corrected some code that
was purposefully ignoring my orders for how it should behave.

> I am still of the fairly firm belief that the mistrust in the
> hardware vendors' ability to make a reasonable and robust
> implementation is without foundation.

Well, there are still millions of systems out there without the fancy
newer hardware RNGs available to make them more secure than Fort Knox.
At least a small handful of them run NetBSD for me, and I want them to
work for my needs; I was, and am, quite happy with using entropy that
can be collected from various devices that my systems (virtual and
real) actually have.

-- Greg A. Woods Kelowna, BC +1 250 762-7675 RoboHack Planix, Inc. Avoncote Farms

pgpw8NF4N8YCU.pgp
Description: OpenPGP Digital Signature
Re: regarding the changes to kernel entropy gathering
At Mon, 05 Apr 2021 00:14:30 +0200 (CEST), Havard Eidnes wrote: Subject: Re: regarding the changes to kernel entropy gathering > > > What about architectures that have nothing like RDRAND/RDSEED? Are > > they, effectively, totally unsupported now? > > Nope, not entirely. But they have to be seeded once. If they > have storage which survives reboots, and entropy is saved and > restored on reboot, they will be ~fine. BTW, to me reusing the same entropy on every reboot seems less secure. > Systems without persistent storage and also without RDRAND/RDSEED > will however be ... a more challenging problem. Leaving things like that would be totally silly. With my patch the old way of gathering entropy from devices works just fine as it always did, albeit with the second patch it does require a tiny bit of extra configuration. -- Greg A. Woods Kelowna, BC +1 250 762-7675 RoboHack Planix, Inc. Avoncote Farms pgpgeBbtqrqWg.pgp Description: OpenPGP Digital Signature
Re: regarding the changes to kernel entropy gathering
At Mon, 5 Apr 2021 01:05:58 +0200, Joerg Sonnenberger wrote:
Subject: Re: regarding the changes to kernel entropy gathering
>
> Part of the problem here is that most of the non-RNG data sources are
> easily observable either from the local system (e.g. any malicious user)
> or other VMs on the same machine (in case of a hypervisor) or local
> machines on the same network (in case of network interrupts).

It _Just_ _Doesn't_ _Matter_ (i.e. for many of us, most of the time).

Now ideally in the hypervisor scenario we would have a backend device
that read from /dev/random and offered it to the VM guest as a virtual
hardware RNG.  Or maybe it's as simple as passing those few bytes
through a custom Xenstore string and having a script in the VM read
them and inject them into /dev/random.  But that's not been done yet.

BTW, personally, on at least some machines, I don't have any worry
whatsoever at the moment about one VM guest spying on, or influencing,
the PRNG in another.  Zero worry.  They're all _me_.  I don't need some
theoretically perfect level of protection from myself.

-- Greg A. Woods Kelowna, BC +1 250 762-7675 RoboHack Planix, Inc. Avoncote Farms

pgpFPOplfhwSl.pgp
Description: OpenPGP Digital Signature
Re: regarding the changes to kernel entropy gathering
At Sun, 4 Apr 2021 23:09:18 +, Taylor R Campbell wrote: Subject: Re: regarding the changes to kernel entropy gathering > > If you know this (and this is something I certainly can't confidently > assert!), you can write 32 bytes to /dev/random, save a seed, and be > done with it. I don't have random data easily available at install time. I don't have random data easily available every time I boot a machine with non-persistent storage (e.g. a test ISO image). I _do_ trust well enough the sources of randomness in some device drivers to provide me with a secure enough amount of entropy, for my purposes. And so with my fix(es) I don't need to feed supposedly random data to every system on every install and/or every reboot. What's worse? My fixes, or something like this in /etc/rc.local: echo -n "" > /dev/random > But users who don't go messing around with obscure rndctl settings in > rc.conf will be proverbially shot in the foot by this change -- except > they won't notice because there is practically guaranteed to be no > feedback whatsoever for a security disaster until their systems turn > up in a paper published at Usenix like <https://factorable.net/>. You're really stretching your argument thinly if you are assuming everyone _needs_ perfect entropy here. Also, that's only if the default RND_FLAG_ESTIMATE_* bits are turned off. AND only if the system doesn't have some true hardware RNG. > What your change does is equivalent to going around to every device > driver that previously said `this provides zero entropy, or I don't > know how much entropy it provides' and replacing that claim by `this > is a sample of an independent and perfectly uniform random string of > bits', which is a much stronger (and falser) claim than even the old > `entropy estimation' confabulation that NetBSD used to do. No, only if the default RND_FLAG_ESTIMATE_* bits are ***NOT*** turned off. 
AND only if the user is like me and stuck with some poor second-grade ancient hardware that doesn't have some fancy new true hardware RNG. In the mean time a more productive approach would be to figure out what's best for those of us who don't need perfection every time and/or to fix those device drivers that could feed sufficiently random data to the entropy pool, and then to recommend a suitable value for rndctl_flags in /etc/rc.conf. -- Greg A. Woods Kelowna, BC +1 250 762-7675 RoboHack Planix, Inc. Avoncote Farms pgpnOADtmWrjC.pgp Description: OpenPGP Digital Signature
Re: regarding the changes to kernel entropy gathering
At Sun, 4 Apr 2021 18:47:23 -0700, Brian Buhrow wrote:
Subject: Re: regarding the changes to kernel entropy gathering
>
> Hello.  As I understand it, Greg ran into this problem on a xen domu.
> In checking my NetBSD-9 system running as a domu under xen-4.14.1,
> there is no rdrand or rdseed feature exposed to domu's by xen.  This
> observation is confirmed by looking at the xen command line reference
> page: https://xenbits.xen.org/docs/unstable/misc/xen-command-line.html

The problem in the domU was really just the very tip of the iceberg.
The dom0 exhibits the exact same problem and for the same reasons.

> and NetBSD doesn't trust the random sources provided by the xennet(4)
> and xbd(4) drivers.  Therefore, the only solution to get randomness
> working for the first time on a newly installed domu is to write 32
> bytes to /dev/random.

It's not that the xbd(4) devices, etc. are not trusted as entropy
sources -- the new entropy system doesn't trust anything, real or
virtual, despite the documentation saying that it can be made to do so.
My patch fixes that bug.  It was very obvious once I understood the
root of the issue.  As a result my patch fixes the bug for Xen dom0 and
domU.

Writing randomness to /dev/random is _NOT_ a general solution (though
it could be IFF it can be reliably taken from /dev/urandom AND IFF the
rest of the system and documentation is completely and adequately fixed
to match the new regime).

What perturbs me the most and makes me rather angry is that the rest of
the system, and the system documentation, continued to lie and mislead
me for days (and it didn't help that nobody who knew this was pointing
helpfully and clearly at the root of the problem).  So, my patch ALSO
restores the kernel's behaviour to match the documentation and tools
(specifically rndctl).  That the core of it is just a two-line patch
makes this fix extremely satisfying.

-- Greg A. Woods Kelowna, BC +1 250 762-7675 RoboHack Planix, Inc.
Avoncote Farms pgpWiXqui7McJ.pgp Description: OpenPGP Digital Signature
Re: regarding the changes to kernel entropy gathering
At Mon, 5 Apr 2021 10:46:19 +0200, Manuel Bouyer wrote: Subject: Re: regarding the changes to kernel entropy gathering > > If I understood it properly, there's no need for such a knob. > echo 0123456789abcdef0123456789abcdef > /dev/random > > will get you back to the state we had in netbsd-9, with (pseudo-)randomness > collected from devices. Well, no, not quite so much randomness. Definitely pseudo though! My patch on the other hand can at least inject some real randomness into the entropy pool, even if it is observable or influenceable by nefarious dudes who might be hiding out in my garage. -- Greg A. Woods Kelowna, BC +1 250 762-7675 RoboHack Planix, Inc. Avoncote Farms pgpdkEisDB6Js.pgp Description: OpenPGP Digital Signature
Re: regarding the changes to kernel entropy gathering
At Mon, 5 Apr 2021 16:13:55 +1200, Lloyd Parkes wrote:
Subject: Re: regarding the changes to kernel entropy gathering
>
> The current implementation prints out a message whenever it blocks a
> process that wants randomness, which immediately makes this
> implementation superior to all others that I have ever seen.  The
> number of times I've logged into systems that have stalled on boot and
> made them finish booting by running "ls -lR /" over the past 20 years
> are too many to count.  I don't know if I just needed to wait longer
> for the boot to finish, or if generating entropy was the fix, and I
> will never know.  This is nuts.

Indeed!

> We can use the message to point the system administrator to a manual
> page that tells them what to do, and by "tells them what to do", I
> mean in plain simple language, right at the top of the page, without
> scaring them.

Excellent idea!  :-)

However I have been wondering if sending the message just to the
console, and logging it, say in /var/log/kern, is sufficient.  It still
took me a very long time to find the existing new message because I
don't hang out on the console -- this is a VM, after all, and it's
running in a city almost exactly 4200km driving distance from me too!
As-is I feel I hang out on the console more often than the average
admin who doesn't use a physical console, and of course infinitely more
often than any user who doesn't admin his own server.

I have added the following comment to the kernel to remind me to think
more about this, as a uprintf(9) at the same time would pop right up on
the actual user's session too:

--- kern_entropy.c.~1.30.~	2021-03-07 17:23:05.0 -0800
+++ kern_entropy.c	2021-04-03 11:25:31.667067667 -0700
@@ -1306,7 +1306,7 @@
 	/* Wait for some entropy to come in and try again. */
 	KASSERT(E->stage >= ENTROPY_WARM);
-	printf("entropy: pid %d (%s) blocking due to lack of entropy\n",
+	printf("entropy: pid %d (%s) blocking due to lack of entropy\n",	/* xxx uprintf() instead/also? */
 	    curproc->p_pid, curproc->p_comm);
 	if (ISSET(flags, ENTROPY_SIG)) {

-- Greg A. Woods Kelowna, BC +1 250 762-7675 RoboHack Planix, Inc. Avoncote Farms

pgpbil_4h9ofy.pgp
Description: OpenPGP Digital Signature
Re: regarding the changes to kernel entropy gathering
At Mon, 5 Apr 2021 03:02:42 +0200, Joerg Sonnenberger wrote: Subject: Re: regarding the changes to kernel entropy gathering > > Except that's not what the system is doing. It removes the seed file on > boot and creates a new one on shutdown. That's not exactly what the documentation says it does (from rndctl(8)): -L Load saved entropy from file save-file and overwrite it with a seed derived by hashing it together with output from /dev/urandom so that the new seed has at least as much entropy as either the old seed had or the system already has. If interrupted, either the old seed or the new seed will be in place. The code seems to concur. Also the system re-saves the $random_file via /etc/security (unconditionally, i.e. always, but only if $random_file is set). -- Greg A. Woods Kelowna, BC +1 250 762-7675 RoboHack Planix, Inc. Avoncote Farms pgpJ2gB7j21GX.pgp Description: OpenPGP Digital Signature
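Conceptually, the -L update rule quoted above derives the new seed from both the old seed and fresh kernel randomness, so neither input alone determines it.  A rough sketch of that idea (with "sha256sum" standing in for whatever construction rndctl actually uses -- an assumption, not the real implementation):

```shell
# Hypothetical sketch of the rndctl -L update rule quoted above:
#   new_seed = H(old_seed || fresh_randomness)
# so the new seed is at least as unpredictable as the stronger of its
# two inputs.  "sha256sum" is only a stand-in for rndctl's real
# construction, and the file paths are for demonstration only.
printf 'dummy-old-seed' > /tmp/seed.old          # pretend on-disk seed
dd if=/dev/urandom of=/tmp/fresh.bin bs=32 count=1 2>/dev/null
cat /tmp/seed.old /tmp/fresh.bin | sha256sum | cut -d' ' -f1 > /tmp/seed.new
mv /tmp/seed.new /tmp/seed.old                   # replace old with new
```

Note the final rename, which mirrors the "if interrupted, either the old seed or the new seed will be in place" property described in the manual text.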
Re: regarding the changes to kernel entropy gathering
ources is needed to "stir" the pot in the first place, then why not just "count" it as "real" entropy and be done with it -- at least then it is obvious when enough entropy has been gathered and the currently implemented algorithms handle things properly and securely and all inside the kernel. I.e. the admin doesn't have to put a "sleep 30" or whatever in front of it and hope that's enough and that it's still not too predictable. -- Greg A. Woods Kelowna, BC +1 250 762-7675 RoboHack Planix, Inc. Avoncote Farms pgpxsHTzqoenJ.pgp Description: OpenPGP Digital Signature
Re: regarding the changes to kernel entropy gathering
At Mon, 5 Apr 2021 15:37:49 -0400, Thor Lancelot Simon wrote:
Subject: Re: regarding the changes to kernel entropy gathering
>
> On Sun, Apr 04, 2021 at 03:32:08PM -0700, Greg A. Woods wrote:
> >
> > BTW, to me reusing the same entropy on every reboot seems less secure.
>
> Sure.  But that's not what the code actually does.
>
> Please, read the code in more depth (or in this case, breadth), then argue
> about it.

Sorry, I was alluding to the idea of sticking the following in
/etc/rc.local as the brain-dead way to work around the problem:

	echo -n "" > /dev/random

However I have not yet read and understood enough of the code to know if:

	dd if=/dev/urandom of=/dev/random bs=32 count=1

is any more "secure" -- I'm guessing (hoping?) it depends on exactly
when this might be run, and also depends on which, if any, other device
sources are enabled for "collecting".  If in some rare case none were
enabled, or if it were run before any were able to "stir the pool",
then I'm guessing it would be no more secure than writing a fixed
string.

-- Greg A. Woods Kelowna, BC +1 250 762-7675 RoboHack Planix, Inc. Avoncote Farms

pgpgF42U_yi8i.pgp
Description: OpenPGP Digital Signature
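The timing dependence described above can be made explicit with a trivial guard.  In this sketch the caller supplies the current value of kern.entropy.needed (the sysctl shown later in the thread; 0 means the kernel already considers the pool full), and the dd command is only echoed, not run -- an illustration of the concern, not a recommended boot script:

```shell
# Sketch of the "when should this be run?" concern above: re-seeding
# /dev/random from /dev/urandom only helps once the kernel already
# believes it has real entropy.  The caller passes the current
# kern.entropy.needed value; the dd command is echoed, not executed.
maybe_seed() {
	if [ "$1" -eq 0 ]; then
		echo "dd if=/dev/urandom of=/dev/random bs=32 count=1"
	else
		echo "not yet: $1 more bits of entropy needed"
	fi
}

maybe_seed 0	# pool full: urandom output should be unpredictable
maybe_seed 256	# too early: urandom may be little better than a fixed string
```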
Re: regarding the changes to kernel entropy gathering
x70, ver=426 kern.entropy.consolidate (1.1260.1263): CTLTYPE_INT, size 4, flags 0x70, func=0x8083f151, ver=427 kern.entropy.gather (1.1260.1264): CTLTYPE_INT, size 4, flags 0x70, func=0x8083dd4c, ver=428 kern.entropy.needed (1.1260.1265): CTLTYPE_INT, size 4, flags 0x100, ver=429 kern.entropy.pending (1.1260.1266): CTLTYPE_INT, size 4, flags 0x100, ver=430 kern.entropy.epoch (1.1260.1267): CTLTYPE_INT, size 4, flags 0x100, ver=431 Perhaps function pointer values shouldn't be printed as integers? And there are no text descriptions for some of the kern.entropy values: 17:27 [1.831] # sysctl -d kern.entropy.needed kern.entropy.needed: (no description) 17:27 [1.832] # sysctl -d kern.entropy.pending kern.entropy.pending: (no description) 17:27 [1.833] # sysctl -d kern.entropy.epoch kern.entropy.epoch: (no description) -- Greg A. Woods Kelowna, BC +1 250 762-7675 RoboHack Planix, Inc. Avoncote Farms pgpE52Jkajvwh.pgp Description: OpenPGP Digital Signature
Re: regarding the changes to kernel entropy gathering
At Tue, 6 Apr 2021 12:08:54 +, Taylor R Campbell wrote:
Subject: Re: regarding the changes to kernel entropy gathering
>
> The main issue that hits people is that the traditional mechanism by
> which the OS reports a potential security problem with entropy is for
> it to make applications silently hang -- and the issue is getting
> worse now that getrandom() is more widely used, e.g. in Python when
> you do `import multiprocessing'.

I think adding a uprintf(9) call, whose output the user who started the
blocked process (i.e. not just the admin) has a better chance of seeing
directly, would be one step closer, and it should be extremely easy to
do.

-- Greg A. Woods Kelowna, BC +1 250 762-7675 RoboHack Planix, Inc. Avoncote Farms

pgpOvi5MZvUCj.pgp
Description: OpenPGP Digital Signature
Re: regarding the changes to kernel entropy gathering
At Tue, 6 Apr 2021 20:21:43 +0200, Martin Husemann wrote:
Subject: Re: regarding the changes to kernel entropy gathering
>
> On Tue, Apr 06, 2021 at 10:54:51AM -0700, Greg A. Woods wrote:
> >
> > And the stock implementation has no possibility of ever providing an
> > initial seed at all on its own (unlike previous implementations, and of
> > course unlike what my patch _affords_).
>
> Isn't it as simple as:
>
> 	dd bs=32 if=/dev/urandom of=/dev/random

No, that still leaves the question of _when_ to run it.  (And, at least
at the moment, where to put it.  /etc/rc.local?)

Isn't something like the following better (assuming you choose your
devices carefully):

	echo 'rndctl_flags="-t env;-t disk;-t tty"' >> /etc/rc.conf

That's what my patches fix and allow, and this way you don't have to
guess when you can safely use /dev/urandom as an entropy seed -- the
seeding happens in real time, and only as entropy bits are made
available from those given devices.

That can also be done by sysinst, assuming a reasonably well worded
question can be answered, and that it might only need to be asked if
there are no "rng" type devices already.

Doing this also requires no network access (ever).  It can even be
done, ahead of time, for use on immutable systems.

-- Greg A. Woods Kelowna, BC +1 250 762-7675 RoboHack Planix, Inc. Avoncote Farms

pgpW4B04umieR.pgp
Description: OpenPGP Digital Signature
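For the curious, here is a sketch of how a boot-time script could apply a semicolon-separated rndctl_flags value like the one suggested above.  Whether the stock rc.d script parses the variable exactly this way is an assumption, and "echo" stands in for actually running rndctl:

```shell
# Hypothetical parser for a semicolon-separated rndctl_flags setting
# like the one suggested above.  Each semicolon-delimited chunk would
# become one rndctl invocation; "echo" stands in for running rndctl,
# and the exact rc.d parsing is an assumption, not the real script.
rndctl_flags="-t env;-t disk;-t tty"

echo "$rndctl_flags" | tr ';' '\n' | while read -r flags; do
	echo "rndctl -e $flags"
done
```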
Re: regarding the changes to kernel entropy gathering
At Wed, 7 Apr 2021 09:52:29 +0200, Martin Husemann wrote: Subject: Re: regarding the changes to kernel entropy gathering > > On Tue, Apr 06, 2021 at 03:12:45PM -0700, Greg A. Woods wrote: > > > Isn't it as simple as: > > > > > > dd bs=32 if=/dev/urandom of=/dev/random > > > > No, that still leaves the question of _when_ to run it. (And, at least > > at the moment, where to put it. /etc/rc.local?) > > Of course not! > > You run it once. Manually. And never again. Nope, sorry, that's not a good enough answer. It doesn't solve the problem of dealing with a lack of mutable storage. A system _MUST_ be able to be booted and with no user intervention be able to (eventually) get to the state where /dev/random and getrandom(2) WILL NOT block, and it _MUST_ be able to do so without the help of any hardware RNG, and without the ability to store (and read) a seed from a file or other storage device. I.e. we _MUST_ be _ABLE_ to choose to use other devices as sources for entropy, even if they are not perfect. We had this, it works fine, we still need it. -- Greg A. Woods Kelowna, BC +1 250 762-7675 RoboHack Planix, Inc. Avoncote Farms pgpeaL6Xd0CAO.pgp Description: OpenPGP Digital Signature
Re: regarding the changes to kernel entropy gathering
At Wed, 7 Apr 2021 22:47:39 +0200, Martin Husemann wrote:
Subject: Re: regarding the changes to kernel entropy gathering
>
> When you create a custom setup like that, you will have to replace
> etc/rc.d/entropy with a custom solution (e.g. mounting some flash storage).

No storage means "NO storage".

> Or you ignore the issue and do the dd at each boot - hopefully not generating
> any strong keys on that machine then (but you would have no good storage
> for those anyway).

Or I don't ignore the issue and instead I fix the code so that it's
still possible to get entropy estimates from non-hardware-RNG devices,
and then things keep working the way they used to, and there's still
some possibility of _real_ entropy being used to seed the PRNGs.

From what I've seen here so far I'm far from alone in wanting that
ability.

What's most confusing is why there's such animosity and stubborn
unwillingness to even consider that the old way of getting some entropy
from a few less-than-perfect sources was good enough for many, or even
most, of us.  It's better than no entropy when there are no "perfect"
sources, and that's also a situation that includes many of us.  It
doesn't have to be the default.

-- Greg A. Woods Kelowna, BC +1 250 762-7675 RoboHack Planix, Inc. Avoncote Farms

pgpgg9AaQiU92.pgp
Description: OpenPGP Digital Signature
I think I've found why Xen domUs can't mount some file-backed disk images! (vnd(4) hides labels!)
\0 \0 \0 \0 \0 \0 \0 \0 \0 * 0001760 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 U 252 0002000 # dd if=/dev/rvnd0d count=17 msgfmt=quiet| od -c 000 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 * 002 \0 \0 \0 \0 \0 \0 \0 \0 \b \0 \0 \0 020 \0 \0 \0 0020020 030 \0 \0 \0 230 005 \0 \0 \0 \0 \0 \0 377 377 377 377 0020040 367 360 p ` \0 \0 \0 007 200 037 \0 027 \0 \0 \0 0020060 \0 @ \0 \0 \0 \b \0 \0 \b \0 \0 \0 005 \0 \0 \0 0020100 \0 \0 \0 \0 < \0 \0 \0 \0 300 377 377 \0 370 377 377 0020120 016 \0 \0 \0 013 \0 \0 \0 004 \0 \0 \0 \0 020 \0 \0 0020140 003 \0 \0 \0 002 \0 \0 \0 \0 \b \0 \0 \0 \0 \0 \0 0020160 \0 \0 \0 \0 \0 020 \0 \0 200 \0 \0 \0 004 \0 \0 \0 0020200 \0 \0 \0 \0 300 220 005 \0 001 \0 \0 \0 \0 \0 \0 \0 0020220 367 360 p ` _ ` A q 230 005 \0 \0 \0 \b \0 \0 0020240 \0 @ \0 \0 \0 \0 \0 \0 300 220 005 \0 300 220 005 \0 0020260 027 \0 \0 \0 001 \0 \0 \0 \0 X \0 \0 0 d 001 \0 0020300 001 \0 \0 \0 377 357 003 \0 375 347 007 \0 016 \0 \0 \0 0020320 \0 001 \0 200 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 0020340 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 * 0021000 In fact the vnd0d device seems to give garbage forever -- it seems to have been completely confused by trying to access a real disk image! As a side note unfortunately even though access to this LVM-backed mini-memstick.img file now seems OK enough to get the install booted and a shell running, access to other FreeBSD xbd(4) devices is still not working from FreeBSD (i.e. a fresh newfs'ed FS appears corrupt to an immediate fsck, without mounting, and even fsck of the mounted root in this IMG fails enormously). # df Filesystem 512-blocks Used Avail Capacity Mounted on /dev/ufs/FreeBSD_Install 782968 737016 -16680 102%/ devfs 2 2 0 100%/dev tmpfs 65536232 65304 0%/var tmpfs 40960 8 40952 0%/tmp # fsck /dev/ufs/FreeBSD_Install ** /dev/ufs/FreeBSD_Install SAVE DATA TO FIND ALTERNATE SUPERBLOCKS? [yn] n ADD CYLINDER GROUP CHECK-HASH PROTECTION? 
[yn] n ** Last Mounted on ** Root file system ** Phase 1 - Check Blocks and Sizes PARTIALLY TRUNCATED INODE I=28 SALVAGE? [yn] n PARTIALLY TRUNCATED INODE I=112 SALVAGE? [yn] ^Cda0: disk error cmd=write 8145-8152 status: fffe # * FILE SYSTEM MARKED DIRTY * # -- Greg A. Woods Kelowna, BC +1 250 762-7675 RoboHack Planix, Inc. Avoncote Farms pgpELwDHrgUjQ.pgp Description: OpenPGP Digital Signature
a working patch to _allow_ non-hardware-RNG entropy sources
RND_FLAG_COLLECT_VALUE|RND_FLAG_HASCB);
}

Index: sys/rndio.h
===================================================================
RCS file: /cvs/master/m-NetBSD/main/src/sys/sys/rndio.h,v
retrieving revision 1.2
diff -u -r1.2 rndio.h
--- sys/rndio.h	6 Sep 2015 06:01:02 -	1.2
+++ sys/rndio.h	9 Apr 2021 18:01:03 -
@@ -91,8 +91,29 @@
 #define RND_FLAG_ESTIMATE_TIME	0x4000	/* estimate entropy on time */
 #define RND_FLAG_ESTIMATE_VALUE	0x8000	/* estimate entropy on value */
 #define	RND_FLAG_HASENABLE	0x0001	/* has enable/disable fns */
-#define RND_FLAG_DEFAULT	(RND_FLAG_COLLECT_VALUE|RND_FLAG_COLLECT_TIME|\
-				 RND_FLAG_ESTIMATE_TIME)
+#define RND_FLAG_DEFAULT	(RND_FLAG_COLLECT_VALUE|RND_FLAG_ESTIMATE_VALUE| \
+				 RND_FLAG_COLLECT_TIME|RND_FLAG_ESTIMATE_TIME)
+/*
+ * N.B.:  It would appear from the above value that by default all devices using
+ * RND_FLAG_DEFAULT will be enabled directly to collect _and_ estimate(count)
+ * entropy based on both deltas in values they submit, and the time delta
+ * between submissions.  HOWEVER this is moderated by a switch in
+ * kern_entropy.c:rnd_attach_source() which will add either the NO_COLLECT
+ * and/or the NO_ESTIMATE flag depending on what type the device is.
+ *
+ * By default only RND_TYPE_SKEW, RND_TYPE_ENV, RND_TYPE_POWER, and RND_TYPE_RNG
+ * will avoid both of these flags being set.
+ *
+ * Network devices will be entirely disabled (from both collection and
+ * estimating) as they can possibly be easily influenced externally.
+ *
+ * All other devices will be given the NO_ESTIMATE flag such that they are not
+ * used to estimate(count) entropy by default.
+ *
+ * In any case either or both of the RND_FLAG_NO_* flags can be turned off at
+ * runtime by the RNDCTL ioctl on rnd(4), i.e. by rndctl(8), such that entropy
+ * collection and estimation can be enabled on a per-device or per-type basis.
+ */
 
 #define	RND_TYPE_UNKNOWN	0	/* unknown source */
 #define	RND_TYPE_DISK		1	/* source is physical disk */

Index: sys/rndsource.h
===================================================================
RCS file: /cvs/master/m-NetBSD/main/src/sys/sys/rndsource.h,v
retrieving revision 1.7
diff -u -r1.7 rndsource.h
--- sys/rndsource.h	30 Apr 2020 03:28:19 -	1.7
+++ sys/rndsource.h	8 Apr 2021 18:15:01 -
@@ -45,8 +45,6 @@
 
 /*
  * struct rnd_delta_estimator
- *
- * Unused.  Preserved for ABI compatibility.
  */
 typedef struct rnd_delta_estimator {
 	uint64_t		x;
@@ -68,8 +66,8 @@
 struct krndsource {
 	LIST_ENTRY(krndsource)	list;		/* the linked list */
 	char			name[16];	/* device name */
-	rnd_delta_t		time_delta;	/* unused */
-	rnd_delta_t		value_delta;	/* unused */
+	rnd_delta_t		time_delta;	/* */
+	rnd_delta_t		value_delta;	/* */
 	uint32_t		total;		/* number of bits added while cold */
 	uint32_t		type;		/* type, RND_TYPE_* */
 	uint32_t		flags;		/* flags, RND_FLAG_* */
@@ -89,8 +87,10 @@
 		    uint32_t);
 void		rnd_detach_source(struct krndsource *);
 
+#if 0
 void		_rnd_add_uint32(struct krndsource *, uint32_t);	/* legacy */
 void		_rnd_add_uint64(struct krndsource *, uint64_t);	/* legacy */
+#endif
 void		rnd_add_uint32(struct krndsource *, uint32_t);
 void		rnd_add_data(struct krndsource *, const void *, uint32_t,
 		    uint32_t);

Index: uvm/uvm_page.c
===================================================================
RCS file: /cvs/master/m-NetBSD/main/src/sys/uvm/uvm_page.c,v
retrieving revision 1.250
diff -u -r1.250 uvm_page.c
--- uvm/uvm_page.c	20 Dec 2020 11:11:34 -	1.250
+++ uvm/uvm_page.c	8 Apr 2021 21:41:20 -
@@ -983,8 +983,7 @@
 	 * Attach RNG source for this CPU's VM events
 	 */
 	rnd_attach_source(&ucpu->rs, ci->ci_data.cpu_name, RND_TYPE_VM,
-	    RND_FLAG_COLLECT_TIME|RND_FLAG_COLLECT_VALUE|
-	    RND_FLAG_ESTIMATE_VALUE);
+	    RND_FLAG_DEFAULT);
 }
 
 /*

-- Greg A. Woods Kelowna, BC +1 250 762-7675 RoboHack Planix, Inc. Avoncote Farms

pgpMtaoK9AMBK.pgp
Description: OpenPGP Digital Signature
Re: I think I've found why Xen domUs can't mount some file-backed disk images! (vnd(4) hides labels!)
At Sat, 10 Apr 2021 18:44:32 -0700, Brian Buhrow wrote:
Subject: Re: I think I've found why Xen domUs can't mount some file-backed disk images! (vnd(4) hides labels!)
>
> hello.  This must be some kind of regression that's been around a
> while.  I'm running a xen dom0 with NetBSD-5.2 and xen-3.3.2, very
> old, but vnd(4) does expose the entire file to the domu's including
> FreeBSD 11 and 12 without any corruption or booting issues.  Do you
> know when this trouble began?

I don't know -- I think I've only ever successfully used ISO files, and
I think I gave up on some IMG file(s) previously (possibly not just
from FreeBSD) without trying to understand why they didn't work.

Have you tried specifically with a recent FreeBSD mini-memstick.img
file?

I'm thinking (esp. given what I see from "od -c < /dev/rvnd0d") that
what's wrong is that the vnd(4) driver is (also?) imposing some
mis-interpreted idea about the number of cylinders and heads or
something like that, especially given that "fdisk vnd0" is so totally
confused about what's in there.  There's a definite pattern of
corruption anyway -- I just can't explain it well enough yet.

-- Greg A. Woods Kelowna, BC +1 250 762-7675 RoboHack Planix, Inc. Avoncote Farms

pgpbUW36DVCRL.pgp
Description: OpenPGP Digital Signature
Re: I think I've found why Xen domUs can't mount some file-backed disk images! (vnd(4) hides labels!)
On the other hand NetBSD's own .img files work OK.  However, interestingly, there's a small but apparently insignificant (because it works OK) difference between how fdisk sees the disk image and the vnd0 device:

# fdisk -F images/NetBSD-9.99.81-amd64-live.img
Disk: images/NetBSD-9.99.81-amd64-live.img
NetBSD disklabel disk geometry:
cylinders: 972, heads: 255, sectors/track: 63 (16065 sectors/cylinder)
total sectors: 15624192, bytes/sector: 512

BIOS disk geometry:
cylinders: 973, heads: 255, sectors/track: 63 (16065 sectors/cylinder)
total sectors: 15624192

Partitions aligned to 2048 sector boundaries, offset 2048

Partition table:
0: NetBSD (sysid 169)
    start 2048, size 15622144 (7628 MB, Cyls 0-972/143/3), Active
1:
2:
3:
Bootselector disabled.
First active partition: 0
Drive serial number: 0 (0x)

# vndconfig -cv vnd0 images/NetBSD-9.99.81-amd64-live.img
/dev/rvnd0: 7999586304 bytes on images/NetBSD-9.99.81-amd64-live.img

# fdisk vnd0
Disk: /dev/rvnd0
NetBSD disklabel disk geometry:
cylinders: 7629, heads: 64, sectors/track: 32 (2048 sectors/cylinder)
total sectors: 15624192, bytes/sector: 512

BIOS disk geometry:
cylinders: 973, heads: 255, sectors/track: 63 (16065 sectors/cylinder)
total sectors: 15624192

Partitions aligned to 2048 sector boundaries, offset 2048

Partition table:
0: NetBSD (sysid 169)
    start 2048, size 15622144 (7628 MB, Cyls 0-972/143/3), Active
1:
2:
3:
Bootselector disabled.
First active partition: 0
Drive serial number: 0 (0x)

# disklabel vnd0
# /dev/rvnd0:
type: ESDI
disk: image
label:
flags:
bytes/sector: 512
sectors/track: 32
tracks/cylinder: 64
sectors/cylinder: 2048
cylinders: 7629
total sectors: 15624192
rpm: 3600
interleave: 1
trackskew: 0
cylinderskew: 0
headswitch: 0           # microseconds
track-to-track seek: 0  # microseconds
drivedata: 0

8 partitions:
#        size    offset     fstype [fsize bsize cpg/sgs]
 a:  15622144      2048     4.2BSD   1024  8192    16  # (Cyl. 1 - 7628)
 c:  15622144      2048     unused      0     0        # (Cyl. 1 - 7628)
 d:  15624192         0     unused      0     0        # (Cyl. 0 - 7628)

# disklabel images/NetBSD-9.99.81-amd64-live.img
# images/NetBSD-9.99.81-amd64-live.img:
type: ESDI
disk: image
label:
flags:
bytes/sector: 512
sectors/track: 32
tracks/cylinder: 64
sectors/cylinder: 2048
cylinders: 7629
total sectors: 15624192
rpm: 3600
interleave: 1
trackskew: 0
cylinderskew: 0
headswitch: 0           # microseconds
track-to-track seek: 0  # microseconds
drivedata: 0

8 partitions:
#        size    offset     fstype [fsize bsize cpg/sgs]
 a:  15622144      2048     4.2BSD   1024  8192    16  # (Cyl. 1 - 7628)
 c:  15622144      2048     unused      0     0        # (Cyl. 1 - 7628)
 d:  15624192         0     unused      0     0        # (Cyl. 0 - 7628)

From inside the NetBSD live image:

[ 1.4412586] xbd4 at xenbus0 id 4: Xen Virtual Block Device Interface
[ 1.4422594] xbd4: using event channel 20
[ 1.7112647] entropy: xbd4 attached as an entropy source (collecting without estimation)
[ 1.7112647] xbd4: 7629 MB, 512 bytes/sect x 15624192 sectors
[ 1.7112647] xbd4: backend features 0x9

# df
Filesystem   1K-blocks     Used    Avail %Cap Mounted on
/dev/xbd4a     7562414  4699114  2485180  65% /
ptyfs                1        1        0 100% /dev/pts

# fdisk xbd4
Disk: /dev/rxbd4
NetBSD disklabel disk geometry:
cylinders: 7629, heads: 1, sectors/track: 2048 (2048 sectors/cylinder)
total sectors: 15624192, bytes/sector: 512

BIOS disk geometry:
cylinders: 973, heads: 255, sectors/track: 63 (16065 sectors/cylinder)
total sectors: 15624192

Partitions aligned to 2048 sector boundaries, offset 2048

Partition table:
0: NetBSD (sysid 169)
    start 2048, size 15622144 (7628 MB, Cyls 0-972/143/3), Active
1:
2:
3:
Bootselector disabled.
First active partition: 0
Drive serial number: 0 (0x)

The NetBSD live.img root filesystem seems fine and clean:

# fsck -n /dev/rxbd4a
** /dev/rxbd4a (NO WRITE)
** Last Mounted on /
** Root file system
** Phase 1 - Check Blocks and Sizes
** Phase 2 - Check Pathnames
** Phase 3 - Check Connectivity
** Phase 4 - Check Reference Counts
** Phase 5 - Check Cyl groups
32740 files, 2349557 used, 1431650 free (538 frags, 178889 blocks, 0.0% fragmentation)

--
Greg A. Woods		Kelowna, BC	+1 250 762-7675
RoboHack		Planix, Inc.	Avoncote Farms

pgpqxRB084Uts.pgp
Description: OpenPGP Digital Signature
Re: I think I've found why Xen domUs can't mount some file-backed disk images! (vnd(4) hides labels!)
At Sun, 11 Apr 2021 16:06:27 - (UTC), mlel...@serpens.de (Michael van Elst) wrote:
Subject: Re: I think I've found why Xen domUs can't mount some file-backed disk images! (vnd(4) hides labels!)
>
> k...@munnari.oz.au (Robert Elz) writes:
>
> >  Date:        Sun, 11 Apr 2021 14:25:40 - (UTC)
> >  From:        mlel...@serpens.de (Michael van Elst)
> >  Message-ID:
> >
> >  | +       dg->dg_secperunit = vnd->sc_size / DEV_BSIZE;
> >
> > While it shouldn't make any difference for any properly created image
> > file, make it be
> >
> >       (vnd->sc_size + DEV_BSIZE - 1) / DEV_BSIZE;
> >
> > so that any trailing partial sector remains in the image.
>
> The trailing partial sector is already ignored.  Fortunately no disk image
> can even have a partial trailing sector and some magically implicit
> padding would have unexpected side effects.
>
> But the code also needs to be adjusted for different sector sizes.

So since vnd->sc_size is in units of disk blocks:

	dg->dg_secperunit = ((vnd->sc_size * DEV_BSIZE) + DEV_BSIZE - 1)
	    / vnd->sc_geom.vng_secsize;

right?

--
Greg A. Woods		Kelowna, BC	+1 250 762-7675
RoboHack		Planix, Inc.		Avoncote Farms

pgpHppeDklPmd.pgp
Description: OpenPGP Digital Signature
Re: I think I've found why Xen domUs can't mount some file-backed disk images! (vnd(4) hides labels!)
ls 0/50/2-386/18/17), Active
2:
3:
First active partition: 1
Drive serial number: 2425393296 (0x90909090)

So, as you can see below, I think it's better to round the device out to a full number of cylinders if we're still going to play the CHS silliness.  But for vnd(4) in particular I think it does beg the questions I ask in the new comments below, especially the first one.

--- vnd.c.~1.278.~	2021-03-07 17:18:43.0 -0800
+++ vnd.c	2021-04-11 11:00:52.147530152 -0700
@@ -1480,20 +1480,41 @@
 		}
 	} else if (vnd->sc_size >= (32 * 64)) {
 		/*
-		 * Size must be at least 2048 DEV_BSIZE blocks
-		 * (1M) in order to use this geometry.
+		 * The file's size must be at least 2048 DEV_BSIZE
+		 * blocks (1M) in order to use this (fake) geometry.
+		 *
+		 * XXX why ever use this arbitrary fake setup instead of the next
 		 */
 		vnd->sc_geom.vng_secsize = DEV_BSIZE;
 		vnd->sc_geom.vng_nsectors = 32;
 		vnd->sc_geom.vng_ntracks = 64;
-		vnd->sc_geom.vng_ncylinders = vnd->sc_size / (64 * 32);
+		vnd->sc_geom.vng_ncylinders = (vnd->sc_size + (64 * 32) - 1) / (64 * 32);
 	} else {
+		/*
+		 * XXX is there anything that pretends which is worse:
+		 * rotational delay, or seeking?  Does it matter for < 1M?
+		 */
+#if 1
+		/* else pretend it's just one big platter of single-sector cylinders */
 		vnd->sc_geom.vng_secsize = DEV_BSIZE;
 		vnd->sc_geom.vng_nsectors = 1;
 		vnd->sc_geom.vng_ntracks = 1;
 		vnd->sc_geom.vng_ncylinders = vnd->sc_size;
+#else
+		/* else pretend it's just one big cylinder */
+		vnd->sc_geom.vng_secsize = DEV_BSIZE;
+		vnd->sc_geom.vng_nsectors = vnd->sc_size;
+		vnd->sc_geom.vng_ntracks = 1;
+		vnd->sc_geom.vng_ncylinders = 1;
+#endif
 	}
+	/*
+	 * n.b.:  this will round the disk's size up to an even cylinder
+	 * amount, but (if it is writeable) writing into the partly
+	 * empty cylinder, i.e. past current end of the file, will
+	 * simply extend the file
+	 */
 	vnd_set_geometry(vnd);
 
 	if (vio->vnd_flags & VNDIOF_READONLY) {

--
Greg A. Woods		Kelowna, BC	+1 250 762-7675
RoboHack		Planix, Inc.		Avoncote Farms

pgpqbNpL0b2M3.pgp
Description: OpenPGP Digital Signature
one remaining mystery about the FreeBSD domU failure on NetBSD XEN3_DOM0
So, with the vnd(4) issue more or less sorted, there seems to be one major mystery remaining w.r.t. whatever has gone wrong with the ability of NetBSD-current XEN3_DOM0 to host FreeBSD domUs.

I still can't create a clean filesystem on a writeable disk.  The "newfs" runs fine, but a subsequent "fsck" finds errors and cannot fix them (though the first run does change one or two things).

I can't even get a clean fsck of the running system's root FS (the "ada0: disk error" after I hit ^C is because the underlying disk (vnd0d) is exported read-only to the domU):

# fsck -v /dev/ufs/FreeBSD_Install
start / wait fsck_ufs /dev/ufs/FreeBSD_Install
** /dev/ufs/FreeBSD_Install
SAVE DATA TO FIND ALTERNATE SUPERBLOCKS? [yn] n

ADD CYLINDER GROUP CHECK-HASH PROTECTION? [yn] n

** Last Mounted on
** Root file system
** Phase 1 - Check Blocks and Sizes
PARTIALLY TRUNCATED INODE I=28
SALVAGE? [yn] n

PARTIALLY TRUNCATED INODE I=112
SALVAGE? [yn] ^Cada0: disk error cmd=write 8145-8152 status: fffe

* FILE SYSTEM MARKED DIRTY *
#

Most mysteriously, this filesystem is in use as the root FS and all the files in it can be found and read!  Presumably they are all intact too -- no programs have failed or behaved mysteriously (except fsck), and all the human-readable files I've looked at (e.g. manual pages) seem fine.

In fact it seems to be only fsck that complains, possibly along with any attempt to write to a filesystem.  (Writing to a filesystem appears to corrupt it, though that is only according to fsck.  I believe there was eventually a crash of a system that had been running with active filesystems, but I have not got far enough again since to reproduce this, due to the fsck problem.)
# mount
/dev/ufs/FreeBSD_Install on / (ufs, local, noatime, read-only)
devfs on /dev (devfs, local, multilabel)
tmpfs on /var (tmpfs, local)
tmpfs on /tmp (tmpfs, local)

# df
Filesystem               512-blocks    Used    Avail Capacity  Mounted on
/dev/ufs/FreeBSD_Install     782968  737016   -16680     102%  /
devfs                             2       2        0     100%  /dev
tmpfs                         65536     232    65304       0%  /var
tmpfs                         40960       8    40952       0%  /tmp

# time -l sh -c 'find / -type f | xargs cat > /dev/null'
       38.58 real         1.36 user        18.30 sys
      4872  maximum resident set size
        13  average shared memory size
         5  average unshared data size
       215  average unshared stack size
      1906  page reclaims
         0  page faults
         0  swaps
     14024  block input operations
         0  block output operations
         0  messages sent
         0  messages received
         0  signals received
     12348  voluntary context switches
        33  involuntary context switches

In fact I can put a copy of the FreeBSD img file into an LVM LV, attach it to the running FreeBSD domU, mount it (without an fsck, since the FreeBSD_Install filesystem comes clean from the factory), then do "diff -r -X /mnt -X /dev / /mnt" and find only the expected differences.

So, what could be different about how fsck reads vs. the kernel itself?  If indeed writing to a filesystem corrupts it, how and why?

It seems NetBSD can make sense of the BSD label inside the FreeBSD mini-memstick.img file, e.g. when accessed through vnd(4), but it can't seem to make sense of the filesystem(s) inside (which I guess might be expected?):

# file -s /dev/rvnd0f
/dev/rvnd0f: DOS/MBR boot sector, BSD disklabel

# disklabel vnd0
# /dev/rvnd0:
type: vnd
disk: vnd
label: fictitious
flags:
bytes/sector: 512
sectors/track: 32
tracks/cylinder: 64
sectors/cylinder: 2048
cylinders: 387
total sectors: 791121
rpm: 3600
interleave: 1
trackskew: 0
cylinderskew: 0
headswitch: 0           # microseconds
track-to-track seek: 0  # microseconds
drivedata: 0

6 partitions:
#        size    offset     fstype [fsize bsize cpg/sgs]
 d:    791121         0     unused      0     0     # (Cyl. 0 - 386*)
 e:      1600         1    unknown                  # (Cyl. 0* - 0*)
 f:    789520      1601     4.2BSD      0     0  0  # (Cyl. 0* - 386*)
disklabel: boot block size 0
disklabel: super block size 0

# fsck -n /dev/vnd0f
** /dev/rvnd0f (NO WRITE)
BAD SUPER BLOCK: CAN'T FIND SUPERBLOCK
/dev/rvnd0f: CANNOT FIGURE OUT SECTORS PER CYLINDER

--
Greg A. Woods		Kelowna, BC	+1 250 762-7675
RoboHack		Planix, Inc.		Avoncote Farms

pgpBY33Act0N5.pgp
Description: OpenPGP Digital Signature
Re: one remaining mystery about the FreeBSD domU failure on NetBSD XEN3_DOM0
At Sun, 11 Apr 2021 13:23:31 -0700, "Greg A. Woods" wrote:
Subject: one remaining mystery about the FreeBSD domU failure on NetBSD XEN3_DOM0
>
> In fact it only seems to be fsck that complains, possibly along
> with any attempt to write to a filesystem, that causes problems.

Definitely writing to a FreeBSD domU filesystem, i.e. to a FreeBSD xbd(4) with a new filesystem created on it, is impossible.

I was able to write 500MB of zeros to the LVM LV backed disk, overwriting the copy of the .img file I had put there, and only see 500MB of zeros back on the NetBSD side, so writing directly to the raw /dev/da1 on FreeBSD seems to write data without problem.  However then the following happens when I try to use a new FS there:

# newfs /dev/da1
/dev/da1: 30720.0MB (62914560 sectors) block size 32768, fragment size 4096
	using 50 cylinder groups of 626.09MB, 20035 blks, 80256 inodes.
super-block backups (for fsck_ffs -b #) at:
 192, 1282432, 2564672, 3846912, 5129152, 6411392, 7693632, 8975872,
 10258112, 11540352, 12822592, 14104832, 15387072, 16669312, 17951552,
 19233792, 20516032, 21798272, 23080512, 24362752, 25644992, 26927232,
 28209472, 29491712, 30773952, 32056192, 8432, 34620672, 35902912,
 37185152, 38467392, 39749632, 41031872, 42314112, 43596352, 44878592,
 46160832, 47443072, 48725312, 50007552, 51289792, 52572032, 53854272,
 55136512, 56418752, 57700992, 58983232, 60265472, 61547712, 62829952

# mount /dev/da1 /mnt
# mount
/dev/ufs/FreeBSD_Install on / (ufs, local, noatime, read-only)
devfs on /dev (devfs, local, multilabel)
tmpfs on /var (tmpfs, local)
tmpfs on /tmp (tmpfs, local)
/dev/da1 on /mnt (ufs, local)

# df
Filesystem               512-blocks     Used     Avail Capacity  Mounted on
/dev/ufs/FreeBSD_Install     782968   737016    -16680     102%  /
devfs                             2        2         0     100%  /dev
tmpfs                         65536      608     64928       1%  /var
tmpfs                         40960        8     40952       0%  /tmp
/dev/da1                   60901560       16  56029424       0%  /mnt

# cp /COPYRIGHT /mnt
UFS /dev/da1 (/mnt) cylinder checksum failed: cg 0, cgp: 0xe66de1a4 != bp: 0xf433acbc
UFS /dev/da1 (/mnt) cylinder checksum failed: cg 1, cgp: 0x89ba8532 != bp: 0x3491fbd0
UFS /dev/da1 (/mnt) cylinder checksum failed: cg 3, cgp: 0xdeaf87a7 != bp: 0x3a071e86
UFS /dev/da1 (/mnt) cylinder checksum failed: cg 7, cgp: 0x7085828d != bp: 0xaaae0f19
UFS /dev/da1 (/mnt) cylinder checksum failed: cg 15, cgp: 0x293dfe28 != bp: 0xe2f25f8b
UFS /dev/da1 (/mnt) cylinder checksum failed: cg 31, cgp: 0x9a4d0762 != bp: 0x4119c6e
[[ and on and on ]]
UFS /dev/da1 (/mnt) cylinder checksum failed: cg 49, cgp: 0x931f84e5 != bp: 0xb48687df
/mnt: create/symlink failed, no inodes free
cp: /mnt/COPYRIGHT: No space left on device
# Apr 11 20:37:28 syslogd: last message repeated 4 times
Apr 11 20:37:59 kernel: pid 713 (cp), uid 0 inumber 2 on /mnt: out of inodes

# df -i
Filesystem               512-blocks     Used     Avail Capacity  iused   ifree %iused  Mounted on
/dev/ufs/FreeBSD_Install     782968   737016    -16680     102%  12129     285    98%  /
devfs                             2        2         0     100%      0       0   100%  /dev
tmpfs                         65536      608     64928       1%     75  114613     0%  /var
tmpfs                         40960        8     40952       0%      6   71674     0%  /tmp
/dev/da1                   60901560       16  56029424       0%      2 4012796     0%  /mnt

NetBSD can actually make some sense of this FreeBSD filesystem though:

# fsck -n /dev/mapper/rscratch-fbsd--test.0
** /dev/mapper/rscratch-fbsd--test.0 (NO WRITE)
Invalid quota magic number
CONTINUE? yes

** File system is already clean
** Last Mounted on /mnt
** Phase 1 - Check Blocks and Sizes
** Phase 2 - Check Pathnames
** Phase 3 - Check Connectivity
** Phase 4 - Check Reference Counts
** Phase 5 - Check Cyl groups
SUMMARY INFORMATION BAD
SALVAGE? no

BLK(S) MISSING IN BIT MAPS
SALVAGE? no

** Phase 6 - Check Quotas
CLEAR SUPERBLOCK QUOTA FLAG? no

2 files, 2 used, 7612693 free (21 frags, 951584 blocks, 0.0% fragmentation)

* UNRESOLVED INCONSISTENCIES REMAIN *

I'm not sure if those problems are to be expected with a FreeBSD-created filesystem or not.  Probably the "Invalid quota magic number" is normal, but I'm not sure about the "BLK(S) MISSING IN BIT MAPS".  Have FreeBSD and NetBSD FFS diverged this much?

I won't try to mount it, especially not from the dom0.  Dumpfs shows the following:

file system: /dev/mapper/rscratch-fbsd--test.0
format  FFSv2
endian  little-endian
location 65536  (-b 128)
magic   19540119        time    Sun Apr 11 13:46:15 2021
superblock location     65536   id      [ 60735d32 358197c4 ]
cylgrp  dynamic inodes  FFSv2   sblock  FFSv2   fslevel 5
nbfree  951584  ndir    2       nifree  4012796 nffree  21
ncg     50      size    7864320 blocks  7612695
bsize   32768   sh
Re: one remaining mystery about the FreeBSD domU failure on NetBSD XEN3_DOM0
nt
# fsck /dev/da1
** /dev/da1
** Last Mounted on /mnt
** Phase 1 - Check Blocks and Sizes
PARTIALLY TRUNCATED INODE I=325128
SALVAGE? [yn] n

PARTIALLY TRUNCATED INODE I=877864
SALVAGE? [yn] n

PARTIALLY TRUNCATED INODE I=877866
SALVAGE? [yn] n

PARTIALLY TRUNCATED INODE I=877879
SALVAGE? [yn] ^C

* FILE SYSTEM MARKED DIRTY *

Back on the NetBSD side:

# xl block-detach fbsd-test 2064
# fsck /dev/mapper/rscratch-fbsd--test.0
** /dev/mapper/rscratch-fbsd--test.0
** Last Mounted on /mnt
** Phase 1 - Check Blocks and Sizes
** Phase 2 - Check Pathnames
** Phase 3 - Check Connectivity
** Phase 4 - Check Reference Counts
** Phase 5 - Check Cyl groups
FREE BLK COUNT(S) WRONG IN SUPERBLK
SALVAGE? [yn] n

SUMMARY INFORMATION BAD
SALVAGE? [yn] n

BLK(S) MISSING IN BIT MAPS
SALVAGE? [yn] n

12076 files, 91642 used, 7647797 free (293 frags, 955938 blocks, 0.0% fragmentation)

***** UNRESOLVED INCONSISTENCIES REMAIN *****

--
Greg A. Woods		Kelowna, BC	+1 250 762-7675
RoboHack		Planix, Inc.		Avoncote Farms

pgpUlrkicjNvs.pgp
Description: OpenPGP Digital Signature
Re: one remaining mystery about the FreeBSD domU failure on NetBSD XEN3_DOM0
At Sun, 11 Apr 2021 23:04:29 - (UTC), mlel...@serpens.de (Michael van Elst) wrote:
Subject: Re: one remaining mystery about the FreeBSD domU failure on NetBSD XEN3_DOM0
>
> wo...@planix.ca ("Greg A. Woods") writes:
>
> > SALVAGE? [yn] ^Cada0: disk error cmd=write 8145-8152 status: fffe
>
> That seems to be a message from the disk driver:

Yes, exactly, that's from the FreeBSD kernel as fsck was trying to update the superblock and mark the filesystem as dirty (their fsck_ffs always opens the device for write, even with '-n'); and the error is of course because the backend has attached the disk as a read-only device.

> The latter case should log a message on Dom0 about DIOCCACHESYNC
> failing.

I haven't seen anything like that yet.

> But if you have sectors of DEV_BSIZE like here there is no difference
> and no conflict.

Yes, as far as I've seen the FreeBSD domU reports a sector size of 512 bytes in every xbd(4) device and for every GEOM partition it creates or finds on those devices.

FreeBSD newfs seems to concur that sectors are 512 bytes even when writing to a raw (i.e. un-labeled) /dev/da1 (which has a 30GB LVM LV backing it):

# newfs /dev/da1
/dev/da1: 30720.0MB (62914560 sectors) block size 32768, fragment size 4096
	using 50 cylinder groups of 626.09MB, 20035 blks, 80256 inodes.

$ echo 62914560 \* 512 / 1024 / 1024 | bc -l
30720.
The NetBSD dom0 reported the attachment of this device with a matching number of (512-byte) sectors:

xbd backend: attach device scratch-fbsd--t (size 62914560) for domain 2

> The FreeBSD-12.2-RELEASE-amd64-mini-memstick.img I just fetched
> has two MBR partitions:
>
> Partition table:
> 0: EFI system partition (sysid 239)
>     start 1, size 1600 (1 MB, Cyls 0/0/2-0/50/1)
> 1: FreeBSD or 386BSD or old NetBSD (sysid 165)
>     start 1601, size 789520 (386 MB, Cyls 0/50/2-386/18/17), Active
>
> Making our disklabel program read the FreeBSD disklabel was a bit
> tricky, there is a bug that makes it segfault, but:
>
> type: unknown
> disk:
> label:
> flags:
> bytes/sector: 512
> sectors/track: 1
> tracks/cylinder: 1
> sectors/cylinder: 1
> cylinders: 789520
> total sectors: 789520
> rpm: 3600
> interleave: 0
> trackskew: 0
> cylinderskew: 0
> headswitch: 0           # microseconds
> track-to-track seek: 0  # microseconds
> drivedata: 0
>
> 8 partitions:
> #        size    offset     fstype [fsize bsize cpg/sgs]
>  a:    789504        16     4.2BSD      0     0     0  # (Cyl. 16 - 789519)
>  c:    789520         0     unused      0     0        # (Cyl. 0 - 789519)
>
> Apparently the MBR partition 1 starting at sector 1601 is a disk
> image itself and the disklabel is in sector 1 of that image.

Well, I think in FreeBSD parlance it just is an MBR partition that has a BSD label confined within its limits, and that BSD label further divides its MBR partition into more disk partitions.  That's just the FreeBSD way -- if I understand correctly their BSD labels are restricted to the confines of the MBR partition where they sit.

And yes, FreeBSD's disklabel output matches:

# disklabel da0s2
# /dev/da0s2:
8 partitions:
#        size   offset    fstype   [fsize bsize bps/cpg]
  a:   789504       16    4.2BSD        0     0     0
  c:   789520        0    unused        0     0        # "raw" part, don't edit

So in FreeBSD the filesystem there is at "/dev/da0s2a" -- where "da0" is the "device", "s2" is the second MBR partition, and "a" is of course the BSD label's "a" partition.  They use more or less the same naming for GPT entries as well.
> Adding a wedge to access the partition at offset 16 (+1601) gives:
>
> # dkctl vnd0 addwedge freebsd 1617 789504 ffs
> dk6 created successfully.

I had not thought to try that yet.  It's good to see it works!

Now that I can get vnd0d to export the .img file to FreeBSD I think I've effectively eliminated worries about vnd(4) causing the bigger problems.

Speaking of which, I think this might be evidence that the FreeBSD system was suffering the effects of accessing the corrupted filesystem I was experimenting with.  Note the SIGSEGVs from processes apparently after the kernel has gone into its halt-spin loop (this is the first time I've seen this particular misbehaviour):

# halt -pq
Waiting (max 60 seconds) for system process `vnlru' to stop... done
Waiting (max 60 seconds) for system process `syncer' to stop...
Syncing disks, vnodes remaining... 0 0 done
Waiting (max 60 seconds) for system thread `bufdaemon' to stop... done
Waiting (max 60 seconds) for system thread `bufspacedaemon-0' to stop... done
Waiting (max 60 seconds) for sys
Re: one remaining mystery about the FreeBSD domU failure on NetBSD XEN3_DOM0
At Sun, 11 Apr 2021 13:55:36 -0700, "Greg A. Woods" wrote:
Subject: Re: one remaining mystery about the FreeBSD domU failure on NetBSD XEN3_DOM0
>
> Definitely writing to a FreeBSD domU filesystem, i.e. to a FreeBSD
> xbd(4) with a new filesystem created on it, is impossible.

So, having run out of "easy" ideas, and working under the assumption that this must be a problem in NetBSD-current dom0 (i.e. not likely in Xen or the Xen tools), I've been scanning through changes, and this one, so far, is one that would seem to me to have at least some tiny possibility of being the root cause:

RCS file: /cvs/master/m-NetBSD/main/src/sys/arch/xen/xen/xbdback_xenbus.c,v
revision 1.86
date: 2020-04-21 06:56:18 -0700;  author: jdolecek;  state: Exp;  lines: +175 -47;  commitid: 26JkIx2V3sGnZf5C;

add support for indirect segments, which makes it possible to pass
up to MAXPHYS (implementation limit, interface allows more) using
single request

request using indirect segment requires 1 extra copy hypercall per
request, but saves 2 shared memory hypercalls (map_grant/unmap_grant),
so should be net performance boost due to less TLB flushing

this also effectively doubles disk queue size for xbd(4)

I don't see anything obviously glaringly wrong, and of course this is working A-OK on my same machines with NetBSD-5 and a NetBSD-current (and originally somewhat older NetBSD-8.99) domUs.  However I'm really not very familiar with this code and the specs for what it should be doing, so I'm unlikely to be able to spot anything that's missing.
I did read the following, which mostly reminded me to look in xenstore's db to see what feature-max-indirect-segments is set to by default:

https://xenproject.org/2013/08/07/indirect-descriptors-for-xen-pv-disks/

Here's what is stored for a file-backed device:

backend = ""
 vbd = ""
  3 = ""
   768 = ""
    frontend = "/local/domain/3/device/vbd/768"
    params = "/build/images/FreeBSD-12.2-RELEASE-amd64-mini-memstick.img"
    script = "/etc/xen/scripts/block"
    frontend-id = "3"
    online = "1"
    removable = "0"
    bootable = "1"
    state = "4"
    dev = "hda"
    type = "phy"
    mode = "r"
    device-type = "disk"
    discard-enable = "0"
    vnd = "/dev/vnd0d"
    physical-device = "3587"
    hotplug-status = "connected"
    sectors = "792576"
    info = "4"
    sector-size = "512"
    feature-flush-cache = "1"
    feature-max-indirect-segments = "17"

Here's what's stored for an LVM-LV backed vbd:

  162 = ""
   2048 = ""
    frontend = "/local/domain/162/device/vbd/2048"
    params = "/dev/mapper/vg1-fbsd--test.0"
    script = "/etc/xen/scripts/block"
    frontend-id = "162"
    online = "1"
    removable = "0"
    bootable = "1"
    state = "4"
    dev = "sda"
    type = "phy"
    mode = "r"
    device-type = "disk"
    discard-enable = "0"
    physical-device = "43285"
    hotplug-status = "connected"
    sectors = "83886080"
    info = "4"
    sector-size = "512"
    feature-flush-cache = "1"
    feature-max-indirect-segments = "17"

So "17" seems an odd number, but it is apparently because of "Need to alloc one extra page to account for possible mapping offset".  It is currently the maximum for indirect segments, and it's hard-coded.  (Linux apparently has a max of 256, and the Linux blkfront defaults to only using 32.)  Maybe it should be "16", so matching max_request_size?

I did take a quick gander at the related code in FreeBSD (both the domU code that's talking to this code in NetBSD, and the dom0 code that would be used if dom0 was running FreeBSD), and besides seeing that it is quite different, I also don't see anything obviously wrong or incompatible there either.
(I do note that the FreeBSD equivalent to xbdback(4) has a major advantage of being able to directly access files, i.e. without the need for vnd(4).  Not quite as exciting as maybe full 9pfs mounts through to domUs would be, but still pretty neat!)

FreeBSD's equivalent to xbdback(4) (i.e. sys/dev/xen/blkback/blkback.c) doesn't seem to mention "feature-max-indirect-segments", so apparently they don't offer it yet, though it does mention "feature-flush-cache".  However their front-end code does detect it and seems to make use of it, and has done for some 6 years now according to "git blame" (with no recent fixes beyond fixing a memory leak on their end).

Here we see it live from FreeBSD's sysctl output, thus my concern that this feature may be the source of the problem:

hw.xbd.xbd_enable_indirect: 1
dev.xbd.0.max_request_size: 65536
dev.xbd.0.max_request_segments: 17
dev.xbd.0.max_requests: 32

--
Greg A. Woods		Kelowna, BC	+1 250 762-7675
RoboHack		Planix, Inc.		Avoncote Farms

pgpjqWl9lIxPf.pgp
Description: OpenPGP Digital Signature
Re: one remaining mystery about the FreeBSD domU failure on NetBSD XEN3_DOM0
At Tue, 13 Apr 2021 18:20:39 -0700, "Greg A. Woods" wrote:
Subject: Re: one remaining mystery about the FreeBSD domU failure on NetBSD XEN3_DOM0
>
> So "17" seems an odd number, but it is apparently because of "Need to
> alloc one extra page to account for possible mapping offset".

Nope, changing that to 16 didn't make any difference.

--
Greg A. Woods		Kelowna, BC	+1 250 762-7675
RoboHack		Planix, Inc.		Avoncote Farms

pgpKKEyzDjq3_.pgp
Description: OpenPGP Digital Signature
Re: one remaining mystery about the FreeBSD domU failure on NetBSD XEN3_DOM0
At Wed, 14 Apr 2021 19:53:47 +0200, Jaromír Doleček wrote:
Subject: Re: one remaining mystery about the FreeBSD domU failure on NetBSD XEN3_DOM0
>
> You can test if this is the problem by disabling the feature in
> negotiation in NetBSD xbdback.c - comment out the code which sets
> feature-max-indirect-segments in xbdback_backend_changed().  With the
> feature disabled, FreeBSD DomU should not use indirect segments.

Ah, yes, thanks!  I should have thought of that.  That's especially useful since on the client side it's a read-only flag:

# sysctl -w hw.xbd.xbd_enable_indirect=0
sysctl: oid 'hw.xbd.xbd_enable_indirect' is a read only tunable
sysctl: Tunable values are set in /boot/loader.conf

Apparently in the Linux implementation the number of indirect segments used by a domU can be tuned at boot time, by setting a driver option on the guest kernel command line.  When I first read that it didn't make much sense to me to be giving this kind of control to the domU.  Perhaps it would be better to make this a tunable in xl.cfg(5) so that it can be set on a per-guest basis; setting it to zero for a given guest would then not advertise the feature at all.

I've some other things to do before I can reboot -- I'll report as soon as that's done....

--
Greg A. Woods		Kelowna, BC	+1 250 762-7675
RoboHack		Planix, Inc.		Avoncote Farms

pgp9ArzMYs191.pgp
Description: OpenPGP Digital Signature
Re: one remaining mystery about the FreeBSD domU failure on NetBSD XEN3_DOM0
48
dev.xbd.0.features: flush
dev.xbd.0.ring_pages: 1
dev.xbd.0.max_request_size: 65536
dev.xbd.0.max_request_segments: 17
dev.xbd.0.max_requests: 32
dev.xbd.0.%parent: xenbusb_front0
dev.xbd.0.%pnpinfo:
dev.xbd.0.%location:
dev.xbd.0.%driver: xbd
dev.xbd.0.%desc: Virtual Block Device

For reference the bug behaviour remains the same (at least for this simplest quick and easy test):

# newfs /dev/da0
/dev/da0: 30720.0MB (62914560 sectors) block size 32768, fragment size 4096
	using 50 cylinder groups of 626.09MB, 20035 blks, 80256 inodes.
super-block backups (for fsck_ffs -b #) at:
 192, 1282432, 2564672, 3846912, 5129152, 6411392, 7693632, 8975872,
 10258112, 11540352, 12822592, 14104832, 15387072, 16669312, 17951552,
 19233792, 20516032, 21798272, 23080512, 24362752, 25644992, 26927232,
 28209472, 29491712, 30773952, 32056192, 8432, 34620672, 35902912,
 37185152, 38467392, 39749632, 41031872, 42314112, 43596352, 44878592,
 46160832, 47443072, 48725312, 50007552, 51289792, 52572032, 53854272,
 55136512, 56418752, 57700992, 58983232, 60265472, 61547712, 62829952

# fsck /dev/da0
** /dev/da0
** Last Mounted on
** Phase 1 - Check Blocks and Sizes
** Phase 2 - Check Pathnames
** Phase 3 - Check Connectivity
** Phase 4 - Check Reference Counts
** Phase 5 - Check Cyl groups
CG 0: BAD CHECK-HASH 0x49168424 vs 0xe610ac1b
SUMMARY INFORMATION BAD
SALVAGE? [yn] n

BLK(S) MISSING IN BIT MAPS
SALVAGE? [yn] n

CG 1: BAD CHECK-HASH 0xfa76fceb vs 0xb9e90a55
CG 2: BAD CHECK-HASH 0x41f444c vs 0x5efb290e
CG 3: BAD CHECK-HASH 0xad63fe7e vs 0x7ab3861f
CG 4: BAD CHECK-HASH 0xfd2043f3 vs 0xadb781f4
CG 5: BAD CHECK-HASH 0x545cf9c1 vs 0xcec5661e
CG 6: BAD CHECK-HASH 0xaa354166 vs 0x7dd269d3
CG 7: BAD CHECK-HASH 0x349fb54 vs 0x3078e065
CG 8: BAD CHECK-HASH 0xab23a7c vs 0xc8aa7e98
CG 9: BAD CHECK-HASH 0xa3ce804e vs 0x205a6b0d
CG 10: BAD CHECK-HASH 0x5da738e9 vs 0x604d5ecf
CG 11: BAD CHECK-HASH 0xf4db82db vs 0xfef11ffc
CG 12: BAD CHECK-HASH 0xa4983f56 vs 0xc7e701c8
CG 13: BAD CHECK-HASH 0xde48564 vs 0x42072fba
CG 14: BAD CHECK-HASH 0xf38d3dc3 vs 0xad98cf7b
CG 15: BAD CHECK-HASH 0x5af187f1 vs 0xbacadeb1
CG 16: BAD CHECK-HASH 0xe07abf93 vs 0xe4ca225
CG 17: BAD CHECK-HASH 0x490605a1 vs 0xe2917802
CG 18: BAD CHECK-HASH 0xb76fbd06 vs 0xa895abc
CG 19: BAD CHECK-HASH 0x1e130734 vs 0x6a8bc135
CG 20: BAD CHECK-HASH 0x4e50bab9 vs 0x44719a4a
CG 21: BAD CHECK-HASH 0xe72c008b vs 0xadb0c6e9
CG 22: BAD CHECK-HASH 0x1945b82c vs 0x3aeca102
CG 23: BAD CHECK-HASH 0xb039021e vs 0xb99f957d
CG 24: BAD CHECK-HASH 0xb9c2c336 vs 0xd384be85
CG 25: BAD CHECK-HASH 0x10be7904 vs 0x649e2abf
CG 26: BAD CHECK-HASH 0xeed7c1a3 vs 0x95f7
CG 27: BAD CHECK-HASH 0x47ab7b91 vs 0x3fb02d8b
CG 28: BAD CHECK-HASH 0x17e8c61c vs 0xa2b4ca67
CG 29: BAD CHECK-HASH 0xbe947c2e vs 0x65972e04
CG 30: BAD CHECK-HASH 0x40fdc489 vs 0x4219223f
CG 31: BAD CHECK-HASH 0xe9817ebb vs 0x36eb9a37
CG 32: BAD CHECK-HASH 0x3007c2bc vs 0xd1916e1d
CG 33: BAD CHECK-HASH 0x997b788e vs 0x5204f64d
CG 34: BAD CHECK-HASH 0x6712c029 vs 0xe291bcf0
CG 35: BAD CHECK-HASH 0xce6e7a1b vs 0x136ff032
CG 36: BAD CHECK-HASH 0x9e2dc796 vs 0x78ea85c8
CG 37: BAD CHECK-HASH 0x37517da4 vs 0x40c2cf31
CG 38: BAD CHECK-HASH 0xc938c503 vs 0x9b844ab6
CG 39: BAD CHECK-HASH 0x60447f31 vs 0x23129481
CG 40: BAD CHECK-HASH 0x69bfbe19 vs 0xa81f5e9
CG 41: BAD CHECK-HASH 0xc0c3042b vs 0xbd37ebd1
CG 42: BAD CHECK-HASH 0x3eaabc8c vs 0xfadfd8d1
CG 43: BAD CHECK-HASH 0x97d606be vs 0xf41513bc
CG 44: BAD CHECK-HASH 0xc795bb33 vs 0xad4e6069
CG 45: BAD CHECK-HASH 0x6ee90101 vs 0xbeab94a9
CG 46: BAD CHECK-HASH 0x9080b9a6 vs 0x2688acd1
CG 47: BAD CHECK-HASH 0x39fc0394 vs 0xb5a37e85
CG 48: BAD CHECK-HASH 0x83773bf6 vs 0xd779cc90
CG 49: BAD CHECK-HASH 0xe0d3fd3c vs 0xb8083ca
2 files, 2 used, 7612693 free (21 frags, 951584 blocks, 0.0% fragmentation)

* FILE SYSTEM MARKED DIRTY *
* PLEASE RERUN FSCK *

--
Greg A. Woods		Kelowna, BC	+1 250 762-7675
RoboHack		Planix, Inc.		Avoncote Farms

pgpwH2P4OJhnc.pgp
Description: OpenPGP Digital Signature
Re: one remaining mystery about the FreeBSD domU failure on NetBSD XEN3_DOM0
So I wrote a little awk script so that I could write 512-byte blocks filled with varying byte values.  (Awk is the only decent programming language on the FreeBSD mini-memstick.img that I could think of which would do something close to what I wanted.  I could have combined awk+sh+dd and done things faster, but I had all day to let it run while I worked on some small engine repairs.)

https://github.com/robohack/experiments/blob/master/tblocks.awk

I then used it to write 30GB to two different LVM LVs of identical size, each exported to the domU: one written from the dom0 and the other written from the domU.  Then I ran a cmp of both drives on each of the dom0 and the domU.

On the dom0 side there were no differences.  All 30GB of what was written directly in the dom0 to one of the LVs was identical to what was written in the FreeBSD domU to the other LV.  I.e. the FreeBSD domU side seems to be writing reliably through to the disk.

The FreeBSD domU though is _really_ slow at reading with cmp (perhaps not unexpectedly, given that it is using stdio to do the reads and only managing 4KB requests, at a rate of just under 500 requests per second on each disk).  I'm going to send this and go to bed before it finishes, but I'm guessing it's about 2/3 of the way through (it has run for nearly 11,000 seconds), and so far there are no differences from the FreeBSD domU's point of view either.

Anyway, what the heck are FreeBSD newfs and/or fsck doing differently!?  They're both writing and reading the very same raw device(s) that I wrote and read to/from with awk and cmp.  The awk/cmp tests did very sequential operations, and the data are quite uniform and regular; whereas newfs/fsck write/read a much more complex data structure using operations scattered about the disk.  These tests also wrote then read enough data to flush through the buffer caches in each of the dom0 and domU several times over.
The dom0 has only 4GB and the domU has 8GB, but Xen says it's only using under 2GB.

What else is different? What am I missing? What could be different in NetBSD-current that could cause a FreeBSD domU to (mis)behave this way? Could the fault still be in the FreeBSD drivers? I don't see how, as the same root problem caused corruption in both HVM and PVH domUs.

--
Greg A. Woods
Kelowna, BC	+1 250 762-7675		RoboHack
Planix, Inc.				Avoncote Farms
Re: one remaining mystery about the FreeBSD domU failure on NetBSD XEN3_DOM0
At Fri, 16 Apr 2021 11:44:08 +0100, David Brownlee wrote:
Subject: Re: one remaining mystery about the FreeBSD domU failure on NetBSD XEN3_DOM0
>
> On Fri, 16 Apr 2021 at 08:41, Greg A. Woods wrote:
> >
> > What else is different?  What am I missing?  What could be different in
> > NetBSD current that could cause a FreeBSD domU to (mis)behave this way?
> > Could the fault still be in the FreeBSD drivers -- I don't see how as
> > the same root problem caused corruption in both HVM and PVH domUs.
>
> Random data collection thoughts:
>
> - Can you reproduce it on tiny partitions (to speed up testing)
> - If you newfs, shutdown the DOMU, then copy off the data from the
>   DOM0 does it pass FreeBSD fsck on a native boot
> - Alternatively if you newfs an image on a native FreeBSD box and copy
>   to the DOM0 does the DOMU fsck fail
> - Potentially based on results above - does it still happen with a
>   reboot between the newfs and fsck
> - Can you ktrace whichever of newfs or fsck to see exactly what its
>   writing (tiny *tiny* filesystem for the win here :)

So, the root filesystem is clean (from the factory, and verified by at least NetBSD's fsck as OK), but when '-f' is used it is found to be corrupt.

Unfortunately I don't have any real FreeBSD machines available (though I could possibly get it installed on my MacBookPro again, but that's probably a multi-day effort at this point).

However I've just found a way to reproduce the problem reliably, and with a working comparison against a matching-sized memory disk.
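(As an aside, David's ktrace suggestion might be carried out roughly like this on the FreeBSD domU -- a sketch only: the trace-file names are made up, and -n keeps the fsck runs read-only so they can be repeated:)

```shell
# Trace both fsck runs (descendants included with -i), decode with
# kdump, then diff the two syscall streams.  File names are hypothetical.
ktrace -i -f md0.kt fsck_ufs -n /dev/md0
ktrace -i -f da2.kt fsck_ufs -n /dev/da2
kdump -f md0.kt > md0.syscalls
kdump -f da2.kt > da2.syscalls
diff md0.syscalls da2.syscalls | head -40
```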
First off, attach a tiny 4MB LVM LV to FreeBSD -- that's the smallest LV possible, apparently:

dom0 # lvm lvs
  LV           VG       Attr   LSize   Origin Snap%  Move Log Copy%  Convert
  build        scratch  -wi-a- 250.00g
  fbsd-test.0  scratch  -wi-a-  30.00g
  fbsd-test.1  scratch  -wi-a-  30.00g
  nbtest.pkg   vg0      -wi-a-  30.00g
  nbtest.root  vg0      -wi-a-  30.00g
  nbtest.swap  vg0      -wi-a-   8.00g
  nbtest.var   vg0      -wi-a-  10.00g
  tinytest     vg0      -wi-a-   4.00m
dom0 # xl block-attach fbsd-test format=raw, vdev=sdc, access=rw, target=/dev/mapper/vg0-tinytest

Now a run of the test on the FreeBSD domU (first showing the kernel seeing the device attachment):

# xbd3: 4MB at device/vbd/2080 on xenbusb_front0
xbd3: attaching as da2
xbd3: features: flush
xbd3: synchronize cache commands enabled.
GEOM: new disk da2
# dd if=/dev/zero of=tinytest.fs count=8192
8192+0 records in
8192+0 records out
4194304 bytes transferred in 0.081106 secs (51713998 bytes/sec)
# mdconfig -a -t vnode -f tinytest.fs
md0
# newfs -o space -n md0
/dev/md0: 4.0MB (8192 sectors) block size 32768, fragment size 4096
	using 4 cylinder groups of 1.03MB, 33 blks, 256 inodes.
super-block backups (for fsck_ffs -b #) at:
 192, 2304, 4416, 6528
# newfs -o space -n da2
/dev/da2: 4.0MB (8192 sectors) block size 32768, fragment size 4096
	using 4 cylinder groups of 1.03MB, 33 blks, 256 inodes.
super-block backups (for fsck_ffs -b #) at:
 192, 2304, 4416, 6528
# dumpfs da2 > da2.dumpfs
# dumpfs md0 > md0.dumpfs
# diff md0.dumpfs da2.dumpfs
1,2c1,2
< magic	19540119 (UFS2)	time	Fri Apr 16 18:48:55 2021
< superblock location	65536	id	[ 6079dc17 1006b3b4 ]
---
> magic	19540119 (UFS2)	time	Fri Apr 16 18:49:57 2021
> superblock location	65536	id	[ 6079dc55 348e5947 ]
27c27
< magic	90255	tell	20000	time	Fri Apr 16 18:48:55 2021
---
> magic	90255	tell	20000	time	Fri Apr 16 18:49:57 2021
40c40
< magic	90255	tell	128000	time	Fri Apr 16 18:48:55 2021
---
> magic	90255	tell	128000	time	Fri Apr 16 18:49:57 2021
53c53
< magic	90255	tell	230000	time	Fri Apr 16 18:48:55 2021
---
> magic	90255	tell	230000	time	Fri Apr 16 18:49:57 2021
66c66
< magic	90255	tell	338000	time	Fri Apr 16 18:48:55 2021
---
> magic	90255	tell	338000	time	Fri Apr 16 18:49:57 2021
# fsck md0
** /dev/md0
** Last Mounted on
** Phase 1 - Check Blocks and Sizes
** Phase 2 - Check Pathnames
** Phase 3 - Check Connectivity
** Phase 4 - Check Reference Counts
** Phase 5 - Check Cyl groups
1 files, 1 used, 870 free (14 frags, 107 blocks, 1.6% fragmentation)

* FILE SYSTEM IS CLEAN *
# fsck da2
** /dev/da2
** Last Mounted on
** Phase 1 - Check Blocks and Sizes
** Phase 2 - Check Pathnames
ROOT INODE UNALLOCATED
ALLOCATE? [yn] n

* FILE SYSTEM MARKED DIRTY *

So I ktraced the fsck_ufs run, and though I haven't yet gone over it with a fine-tooth comb and the source open, the only thing that seems a wee bit different about what fsck does is that it opens the device twice with O_RDONLY; then, shortly before it prints the first "** /dev/da2" line, it reopens it O_RDWR a third time, closes the second one, and calls dup() on the third one so that it has the same FD# as the
Xen FreeBSD domU block I/O problem begins somewhere between 8.99.32 (2020-06-09) and 9.99.81 (2021-03-10)
So I was just reminded that I do still have a Xen server that's still running the 8.99.32 kernel and Xen-4.11. I had not been testing on it because it of course still has the vnd(4) CHS size bug (and because it's also hosting my $HOME and /usr/src and I don't want to crash it), and I had not remembered until just now that I can work around that bug by simply padding out the mini-memstick.img file!

And so it works, A-OK, with all other things remaining the same:

# ls -l /dev/xbd0
crw-r-----  1 root  operator  0x3a Apr 17 04:31 /dev/xbd0
# newfs /dev/xbd0
/dev/xbd0: 20480.0MB (41943040 sectors) block size 32768, fragment size 4096
	using 33 cylinder groups of 626.09MB, 20035 blks, 80256 inodes.
super-block backups (for fsck_ffs -b #) at:
 192, 1282432, 2564672, 3846912, 5129152, 6411392, 7693632, 8975872,
 10258112, 11540352, 12822592, 14104832, 15387072, 16669312, 17951552,
 19233792, 20516032, 21798272, 23080512, 24362752, 25644992, 26927232,
 28209472, 29491712, 30773952, 32056192, 33338432, 34620672, 35902912,
 37185152, 38467392, 39749632, 41031872
# fsck /dev/xbd0
** /dev/xbd0
** Last Mounted on
** Phase 1 - Check Blocks and Sizes
** Phase 2 - Check Pathnames
** Phase 3 - Check Connectivity
** Phase 4 - Check Reference Counts
** Phase 5 - Check Cyl groups
2 files, 2 used, 5076797 free (21 frags, 634597 blocks, 0.0% fragmentation)

* FILE SYSTEM IS CLEAN *
#

So the problem is almost certainly in NetBSD-current itself, somewhere in the vast gulf between 8.99.32 (2020-06-09) and 9.99.81 (2021-03-10). Unfortunately I don't have enough Xen-capable hardware up and running well enough to allow me to do any brute-force bisecting.

--
Greg A. Woods
Kelowna, BC	+1 250 762-7675		RoboHack
Planix, Inc.				Avoncote Farms
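[The image-padding workaround mentioned above might be scripted roughly like so -- a sketch only: the 1 MiB rounding granularity is my assumption, and a small stand-in file replaces the real mini-memstick.img:]

```shell
# Sketch: pad an image file with zeros up to the next 1 MiB boundary so
# that a size-derived CHS geometry covers the whole image.  The 1 MiB
# granularity is an assumption; the file name is a stand-in.
img=/tmp/mini-memstick-demo.img
dd if=/dev/zero of="$img" bs=1k count=1000 2>/dev/null	# fake 1024000-byte image
size=$(wc -c < "$img")
pad=$(( (1048576 - size % 1048576) % 1048576 ))
[ "$pad" -gt 0 ] && dd if=/dev/zero bs=1 count="$pad" >> "$img" 2>/dev/null
wc -c "$img"	# now an exact multiple of 1 MiB
```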