could fstat(1) show files in use by vnd(4)?

2024-08-11 Thread Greg A. Woods
I had a XEN3_DOM0 kernel crash after doing "umount -f /build" (which I
did because "fstat /build" didn't find anything using that filesystem).

However there were several vnd(4) devices open by Xen domUs using files
on /build that I had completely forgotten about!

[ 178445.6211804] fatal integer divide fault in supervisor mode
[ 178445.6211804] trap type 8 code 0 rip 0x80936fb7 cs 0xe030 rflags 0x10246 cr2 0x9f82f9e21000 ilevel 0 rsp 0x9f8305c9ee00
[ 178445.6211804] curlwp 0x9f8020cd81c0 pid 0.1268 lowest kstack 0x9f8305c9a2c0
kernel: integer divide fault trap, code=0
Stopped in pid 0.1268 (system) at   netbsd:vndthread+0x677: idivq   ff78(%rbp),%rax
vndthread() at netbsd:vndthread+0x677
ds  edf0
es  0
fs  81c0
gs  800
rdi 0
rsi 6
rbp 9f8305c9eef0
rbx e0b0dc0
rdx 0
rcx 135
rax 0
r8  0
r9  97a7c
r10 9f806119f040
r11 fffe
r12 0
r13 800
r14 9f8021e3c800
r15 0
rip 80936fb7    vndthread+0x677
cs  e030
rflags  10246
rsp 9f8305c9ee00
ss  e02b
netbsd:vndthread+0x677: idivq   ff78(%rbp),%rax
db{4}>

After a reboot we can see the vnd(4) uses:

# vndconfig -l
vnd0: /build (/dev/mapper/vg0-build) inode 861956
vnd1: /build (/dev/mapper/vg0-build) inode 861966
vnd2: /build (/dev/mapper/vg0-build) inode 861953
vnd3: not in use

So, might it be possible to have fstat show these somehow?  (perhaps
with the/a kernel thread identified as having them open)

Also, is this a crash that should be fixed, or is "umount -f" always a
Buyer-Beware operation with expected "undefined behaviour"?

--
    Greg A. Woods 

Kelowna, BC +1 250 762-7675   RoboHack 
Planix, Inc.  Avoncote Farms 




kernel level multilink PPP and maybe (re)porting FreeBSD netgraph

2010-01-29 Thread Greg A. Woods
ng "portability" in mind in the design and implementation of
Netgraph too.

I encourage anyone who's read this far, but who doesn't yet know so much
about Netgraph, to have a look at Archie Cobbs' DaemonNews article and
Julian's slides describing what's been worked on in Netgraph more
recently:

http://people.freebsd.org/~julian/netgraph.html
http://people.freebsd.org/~julian/BAFUG/talks/Netgraph/Netgraph.pdf

(BTW, Kohler's "Click Modular Router" is another interesting project!)

-- 
Greg A. Woods
Planix, Inc.

   +1 416 218 0099    http://www.planix.com/




Re: kernel level multilink PPP and maybe (re)porting FreeBSD netgraph

2010-01-29 Thread Greg A. Woods
At Fri, 29 Jan 2010 14:43:38 -0600, David Young  wrote:
Subject: Re: kernel level multilink PPP and maybe (re)porting FreeBSD netgraph
> 
> On Fri, Jan 29, 2010 at 02:56:31PM -0500, Greg A. Woods wrote:
> > I need advanced kernel-level multilink PPP (MLPPP) support, including
> > the ability to create bundle links via UDP (and maybe TCP) over IP.
> 
> Why do you need "kernel-level multilink PPP" support?  Do you need to
> interoperate with existing multilink PPP systems?

Partly, but the biggest concern is performance.

I.e.:

1. We absolutely do need to use MLPPP.  We do control both ends of the
connection, and we may someday look at other protocols, but our current
production head-end concentrators are using MLPPP.

2. We also need to do it over multiple connections that are up to many
tens of megabits/sec each, perhaps sometimes even 100mbps each.  Home
cable connections are now 10-50mbps down or more in many places, and
truly high-speed ADSL2 is also growing in availability.  We aggregate
such connections for both speed and reliability reasons.

Our current low-end FreeBSD-based CPE device, which has a board with a
500 MHz AMD Geode LX800 on it, when connected to a 50mbps+2mbps cable
connection that has been split into two tunnels, can achieve 8-mbps max
(download) with userland MLPPP, period; but as much as 34mbps with MPD
using Netgraph MLPPP via UDP, and that was just a quick&dirty test
without tuning anything or using truly independent connections.

As I'm sure you know it's just not feasible to move data fast enough in
and out of userland to split and reassemble packets on commodity
CPE devices.  We also need to do ipsec (with hardware crypto), ipfilter,
ethernet bridging and vlans, etc., all on the same little processors.

-- 
        Greg A. Woods
Planix, Inc.

   +1 416 218 0099    http://www.planix.com/




Re: kernel level multilink PPP and maybe (re)porting FreeBSD netgraph

2010-01-30 Thread Greg A. Woods
At Sat, 30 Jan 2010 11:37:41 +0900, Masao Uebayashi wrote:
Subject: Re: kernel level multilink PPP and maybe (re)porting FreeBSD   netgraph
> 
> What you need is something like npppd/pipex which OpenBSD has just imported?


Not as it is, as far as I can tell.  (I don't see any new documentation
imported for it -- just a couple of kernel files and the usr.sbin/npppd
stuff, also without manual pages it seems, sigh.)

Does it actually do MLPPP?  I only find mention of Multilink PPP (which
they abbreviate "MP" for some silly reason) in usr.sbin/npppd/npppd/ppp.h.

usr.sbin/npppd seems to be server-only.  I need client code first, then
eventually server support.

The kernel code (if indeed it has any client code -- not sure yet)
doesn't seem to allow forwarding through UDP or TCP.  It does mention
PPTP, and PPPoE in places but those don't really help me directly.

The document I eventually found here:

http://www.seil.jp/download/eng/doc/npppd_pipex.pdf

confirms that this seems to be server/concentrator only.  (that link
sure would have helped me figure this out faster!)

The more I think about it, the more I highly desire the simple way
Netgraph modules can be composed into any graph that meets one's current
requirements, and it's all done without recompiling anything.

-- 
        Greg A. Woods
Planix, Inc.

   +1 416 218 0099    http://www.planix.com/




Re: kernel level multilink PPP and maybe (re)porting FreeBSD netgraph

2010-01-30 Thread Greg A. Woods
At Sat, 30 Jan 2010 15:11:03 -0600, David Young  wrote:
Subject: Re: kernel level multilink PPP and maybe (re)porting FreeBSD   netgraph
> 
> On Sat, Jan 30, 2010 at 03:59:29PM -0500, Greg A. Woods wrote:
> > The kernel code (if indeed it has any client code -- not sure yet)
> > doesn't seem to allow forwarding through UDP or TCP.  It does mention
> > PPTP, and PPPoE in places but those don't really help me directly.
> 
> You can operate gre(4) over UDP without involving userland, does that
> help any?

Well, if/when whatever does client-side MLPPP can be configured to use
GRE tunnels as members of a bundle, and assuming I can convince MPD on
the server side to stick a ng_gre node in before the ng_ppp node on each
incoming bundle, then yes, it would help.

Ideally though I just want to encapsulate the PPP frames in UDP to be
directly compatible with MPD on the server side.

> Is MLPPD necessary/desirable for some reason?

I'm not sure what MLPPD is -- Did you mean MLPPP?  If so, then yes,
MLPPP is, currently, a core feature of the project I'm working on.

(MLPP is something else entirely I think -- the closest thing to network
protocols I can find is MLPP-over-IP.)

-- 
    Greg A. Woods
Planix, Inc.

   +1 416 218 0099    http://www.planix.com/




Re: kernel level multilink PPP and maybe (re)porting FreeBSD netgraph

2010-01-31 Thread Greg A. Woods
At Sun, 31 Jan 2010 10:35:44 +0900 (JST), Yasuoka Masahiko wrote:
Subject: Re: kernel level multilink PPP and maybe (re)porting FreeBSD netgraph
> 
> npppd and pipex don't support multilink PPP.   "MP" in ppp.h have been
> drived from RFC 3145.

Thank you for confirming!

-- 
        Greg A. Woods
Planix, Inc.

   +1 416 218 0099    http://www.planix.com/




Re: kernel level multilink PPP and maybe (re)porting FreeBSD netgraph

2010-01-31 Thread Greg A. Woods
At Sat, 30 Jan 2010 19:35:47 -0500, Thor Lancelot Simon  wrote:
Subject: Re: kernel level multilink PPP and maybe (re)porting FreeBSD   netgraph
> 
> As far as I know, the standard *is* "MP".  MLPPP -- in my years-ago
> experience anyway -- was Livingston's proprietary predecessor of the
> standard protocol; they don't interoperate.

Well long ago there was RFC 1717, which was written by authors from
Newbridge, UCB, and Lloyd Internetworking, and indeed the title of that
RFC appears to abbreviate "PPP Multilink Protocol" to "MP" (though
perhaps it should be called "PPP-MP").  There was also a protocol from
Ascend called Multichannel Protocol Plus (MP+) and I don't know if/how
it was related to PPP-MP.  Livingston did support RFC 1717 and they also
called it "MP", or sometimes "multi-line load balancing".  If I remember
correctly Lucent bought Livingston, then Ascend.

Initially I need to inter-operate with a concentrator running MPD on
FreeBSD using Netgraph, thus ng_ppp(4), which implements RFC 1990 PPP
Multilink Protocol, probably using UDP encapsulation. (RFC1990 obsoletes
RFC1717)
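
For anyone who hasn't looked at what RFC 1990 actually puts on the wire, here is a minimal sketch of the long sequence number fragment header that follows the 0x003d protocol field: a Beginning bit, an Ending bit, six reserved zero bits, and a 24-bit sequence number.  The names mp_hdr, mp_hdr_encode and mp_hdr_decode are made up for illustration and are not taken from ng_ppp(4) or MPD, and the UDP encapsulation itself is not shown.

```c
#include <assert.h>
#include <stdint.h>

/* RFC 1990 PPP protocol number for Multilink fragments. */
#define PPP_PROTO_MP  0x003d

#define MP_FLAG_BEGIN 0x80  /* (B) first fragment of a packet */
#define MP_FLAG_END   0x40  /* (E) last fragment of a packet */

/* Hypothetical in-memory form of the long-format MP header. */
struct mp_hdr {
    int      begin;  /* B bit */
    int      end;    /* E bit */
    uint32_t seq;    /* 24-bit sequence number */
};

/* Write the 4-byte long-format header into buf, most significant byte first. */
static void
mp_hdr_encode(const struct mp_hdr *h, uint8_t buf[4])
{
    assert(h->seq <= 0xffffffu);
    buf[0] = (uint8_t)((h->begin ? MP_FLAG_BEGIN : 0) |
        (h->end ? MP_FLAG_END : 0));
    buf[1] = (uint8_t)(h->seq >> 16);
    buf[2] = (uint8_t)(h->seq >> 8);
    buf[3] = (uint8_t)h->seq;
}

/* Read the 4-byte long-format header back out of buf. */
static void
mp_hdr_decode(const uint8_t buf[4], struct mp_hdr *h)
{
    h->begin = (buf[0] & MP_FLAG_BEGIN) != 0;
    h->end = (buf[0] & MP_FLAG_END) != 0;
    h->seq = ((uint32_t)buf[1] << 16) | ((uint32_t)buf[2] << 8) | buf[3];
}
```

A receiver reassembles a packet by collecting fragments from all member links in sequence order, starting at one with B set and stopping at one with E set.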

Porting Netgraph still seems to be the most optimal solution all round,
though perhaps not with the fastest result, unless I can get help on the
FreeBSD side at making the code more portable.

-- 
Greg A. Woods
Planix, Inc.

   +1 416 218 0099    http://www.planix.com/




Re: blocksizes

2010-02-01 Thread Greg A. Woods
At Mon, 1 Feb 2010 15:34:39 -0500 (EST), der Mouse wrote:
Subject: Re: blocksizes
> 
> > This can easily happen if you copy the image between disks with
> > different block sizes.
> 
> Now _there_ is a valid argument for doing everything in terms of bytes
> (as was discussed briefly upthread).

Indeed.  Or at least using only _one_ logical block size that's
consistent for the system across all hardware that can be used by the
system.

Otherwise one must have a working equivalent NetBSD system that can make
use of both kinds of disks in order to copy an image from one kind of
disk to another.  Instead I think it would be best to be able to use any
kind of host system to make an image copy of a NetBSD disk even across
disks with different sector sizes, i.e. without having to use a system
which can understand both the on-disk filesystem and how it deals with
different hardware sector sizes.

In the pure sense of trying to do what's most optimal for a given system
on a given type of hardware, I think I can understand the desire to use
the hardware sector size, or multiples thereof, in the disk driver and
to map logical sectors to match.  However for a portable system I think
the on-disk filesystem representation should try to use a single logical
sector size across all hardware.
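
To make the hazard concrete, here is a small sketch (the helper names and the 8192-byte offset are purely illustrative, not taken from any NetBSD filesystem code).  A structure stored at a fixed byte offset lands in a different sector number on a 512-byte-sector disk than on a 4096-byte-sector disk, so a sector number recorded on one disk points somewhere else entirely after a raw image copy to the other.

```c
#include <assert.h>
#include <stdint.h>

/* Convert an on-disk byte offset to a sector number for a given sector size. */
static uint64_t
byte_to_sector(uint64_t byte_off, uint32_t sector_size)
{
    assert(sector_size != 0 && byte_off % sector_size == 0);
    return byte_off / sector_size;
}

/* Convert a sector number back to a byte offset. */
static uint64_t
sector_to_byte(uint64_t sector, uint32_t sector_size)
{
    return sector * sector_size;
}
```

With these, byte_to_sector(8192, 512) is 16 but byte_to_sector(8192, 4096) is 2, and reinterpreting the stored sector number 16 on the 4096-byte disk gives sector_to_byte(16, 4096), i.e. byte 65536 rather than 8192.  References kept in bytes, or in one fixed logical sector size, survive the copy unchanged.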

I hesitate to say even this much, never mind any more, because I still
feel like I'm sitting firmly and safely on the fence.  :-)

-- 
        Greg A. Woods
Planix, Inc.

   +1 416 218 0099    http://www.planix.com/




Re: (Semi-random) thoughts on device tree structure and devfs

2010-03-07 Thread Greg A. Woods
At Sun, 7 Mar 2010 20:50:03 +, Quentin Garnier  wrote:
Subject: Re: (Semi-random) thoughts on device tree structure and devfs
> 
> On Sun, Mar 07, 2010 at 06:43:49PM +0900, Masao Uebayashi wrote:
> [...]
> You're barking up the wrong tree.  What's annoying is not that the
> numbering changes.  It is that the numbering is relevant to the use of
> the device.  I expect dk(4) devices to be given names (be it real names
> or GUIDs), and I expect to be able to use that whenever I currently have
> to use a string of the form "dkN".

Indeed.  This needs carving in stone somewhere, since folks seem to
forget it.  I think even I have been known to forget it sometimes.  ;-)

> Wrong.  Device numbers should be irrelevant to anything but operations
> on device_t objects.

Indeed.

-- 
        Greg A. Woods
Planix, Inc.

   +1 416 218 0099    http://www.planix.com/




Re: (Semi-random) thoughts on device tree structure and devfs

2010-03-10 Thread Greg A. Woods
At Wed, 10 Mar 2010 08:56:36 + (GMT), Iain Hibbert wrote:
Subject: Re: (Semi-random) thoughts on device tree structure and devfs
> 
> So, you want to be able to mount a disk by the label:
> 
>   $ mount -t msdosfs -o label "foobar" /external_disk_foobar

Yes, something like that, using fs_volname of course.  I've wanted this
kind of feature for decades.

And of course all the other filesystem tools should have this interface
as well.  It's no good if it's not uniformly usable.  newfs and tunefs
need to be able to set and change fs_volname to start with.  Disk tools
could be made to work with disk label names too for added fun, but let
us not confuse fs_volname with pack names, disklabel names, etc.

Naturally this should not replace the use of the device file, but rather
be added in addition to it, as an optional way to specify the ultimate
device used to access the filesystem.

In fact I'd much rather see lots of work go into this feature than into
anything even remotely related to devfs.

BTW, we don't want to end up with the horrid mess some GNU/Linux systems
now use when their kernel configs specify root=LABEL=xxx -- I think we
can do much better.


> or, if you know the UUID
> 
>   $ mount -t msdosfs -o uuid 3478374923723423 ~/thumb_drive

I think UUIDs, as I understand them so far (fs_id, right?), are really
too fragile, too meaningless and difficult to read, and too dangerous,
to use for this purpose.

They are not actually unique, to start with, so labelling them so is
just plain wrong.

Search google for Russell Coker's discussion on Label vs. UUID.

Filesystem volume names can be said to have many of the same problems,
except to start with we know and understand that they're not unique
right off the bat, and we can assign human meaning to them and make them
memorable.

Let's at least get filesystem access by volume names working right, then
we can go on to think about other things, if they still seem worthwhile.

-- 
Greg A. Woods
Planix, Inc.

   +1 416 218 0099    http://www.planix.com/




Re: (Semi-random) thoughts on device tree structure and devfs

2010-03-11 Thread Greg A. Woods
At Thu, 11 Mar 2010 10:22:29 +0900, Masao Uebayashi  wrote:
Subject: Re: (Semi-random) thoughts on device tree structure and devfs
> 
> On Thu, Mar 11, 2010 at 4:33 AM, Greg A. Woods  wrote:
> > At Wed, 10 Mar 2010 08:56:36 + (GMT), Iain Hibbert 
> >  wrote:
> > Subject: Re: (Semi-random) thoughts on device tree structure and devfs
> >>
> >> So, you want to be able to mount a disk by the label:
> >>
> >>   $ mount -t msdosfs -o label "foobar" /external_disk_foobar
> >
> > Yes, something like that, using fs_volname of course.  I've wanted this
> > kind of feature for decades.
> 
> While I understand usefulness of human-readable labels, I don't think
> it should be handled in kernel.  Because labels are arbitrary.  They
> are not ensured to be unique.

The fs_id value is _NOT_ going to be any more unique than the fs_volname
value.

The fs_id value is also not guaranteed to be unique to start with,
especially not across the operational lifetime of a filesystem.

There are a plethora of ways the fs_id can be duplicated, and just about
as many ways for it to get lost (or changed without change control) too.

Sure, labels are arbitrary -- at least to the machine.  They are not,
necessarily, arbitrary to the human who creates them though.

In any case the label doesn't have to be _guaranteed_ to be unique to be
useful to both the human and the machine.

Also, the filesystem identifier doesn't have to be a meaningless lengthy
string of impossible to memorize sequences of digits to be useful to the
system either -- a human created, human meaningful, label can be just as
useful to the machine.


> I think labels should be resolved by some name service.  It's not
> different than /etc/hosts -> IP address.

Sorry, but I'm flabbergasted!   What the heck does that mean in this
context of filesystem identification?

Do you really want to add more complexity, goo, and mess, and places for
errors to happen by adding a translation layer?

First off, there's really nowhere to store your magical mappings.

K.I.S.S.  Please!

We do have a place to store a human readable/meaningful filesystem
identifier.

Let the human provide this label.

If the system finds duplicate labels then tell the human which devices
have conflicting labels and where those filesystems were last mounted and
let the human decide which device should be used.  (I.e. the labels do
need to be unique for a successful automatic initialisation of the
system, but there needs to be a manual way to work around them not being
unique regardless of what data they consist of.)
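
As a rough sketch of that policy (the struct and function names here are hypothetical, not an existing NetBSD interface): look the label up in a table of discovered filesystems and report how many matched, so the caller can mount automatically on exactly one hit and fall back to asking the operator on more than one.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical record: one filesystem found while scanning devices. */
struct fs_entry {
    const char *device;   /* e.g. "/dev/sd0a" */
    const char *volname;  /* human-assigned label */
};

/*
 * Look up a label.  Returns the number of matching entries; *match is set
 * to the first match, or NULL if there is none.  A result greater than 1
 * means the caller should list the candidates and let the human choose.
 */
static size_t
find_by_label(const struct fs_entry *tab, size_t n, const char *label,
    const struct fs_entry **match)
{
    size_t i, hits = 0;

    assert(label != NULL && match != NULL);
    *match = NULL;
    for (i = 0; i < n; i++) {
        if (strcmp(tab[i].volname, label) == 0) {
            if (hits == 0)
                *match = &tab[i];
            hits++;
        }
    }
    return hits;
}
```

Nothing here needs the label to be globally unique; it only needs the conflict to be detected and handed to a person.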

In my opinion the fs_id value is truly useless anywhere outside of the
on-disk storage of a single filesystem copy where its sole valid use is
(IIUC) to help to match valid backup superblock copies.  In fact I'm
not even sure it's safe or sane to derive the NFS filesystem filehandle
from it in any way.


-- 
Greg A. Woods
Planix, Inc.

   +1 416 218 0099    http://www.planix.com/




Re: (Semi-random) thoughts on device tree structure and devfs

2010-03-11 Thread Greg A. Woods
At Fri, 12 Mar 2010 00:35:24 +0900, Masao Uebayashi  wrote:
Subject: Re: (Semi-random) thoughts on device tree structure and devfs
> 
> Speaking of tracking state...  I've found that keeping track of state
> in devfsd is very wrong.

Indeed -- I do agree with that much at least!

I've had diskless systems running for a long while now (since 2003)
where /dev is created by init(8) on every boot (by running
/sbin/MAKEDEV, as I've renamed it).

In the extremely rare cases where I've wanted to change permissions or
similar on a device node I can just use the normal commands:

chmod 666 /dev/tty001

and if I want to make such a change persistent across boots I just add
that exact same command to /etc/rc.local.

There's no magic needed.

I think the only key feature necessary is that devfs handle the normal
permissions and ownership changes, but to do so of course with no more
persistence than tmpfs, md, or mfs.

-- 
        Greg A. Woods
Planix, Inc.

   +1 416 218 0099    http://www.planix.com/




Re: (Semi-random) thoughts on device tree structure and devfs

2010-03-13 Thread Greg A. Woods
At Fri, 12 Mar 2010 20:22:25 +0100, Manuel Bouyer wrote:
Subject: Re: (Semi-random) thoughts on device tree structure and devfs
> 
> On Thu, Mar 11, 2010 at 05:54:11PM -0500, Greg A. Woods wrote:
> > Indeed -- I do agree with that much at least!
> > 
> > I've had diskless systems running for a long while now (since 2003)
> > where /dev is created by init(8) on every boot (by running
> > /sbin/MAKEDEV, as I've renamed it).
> > 
> > In the extremely rare cases where I've wanted to change permissions or
> > similar on a device node I can just use the normal commands:
> > 
> > chmod 666 /dev/tty001
> > 
> > and if I want to make such a change persistent across boots I just add
> > that exact same command to /etc/rc.local.
> > 
> > There's no magic needed.
> > 
> > I think the only key feature necessary is that devfs handle the normal
> > permissions and ownership changes, but to do so of course with no more
> > persistence than tmpfs, md. or mfs.
> 
> This wouldn't work very well for hot-plug devices.
> As I understand it, nodes would be created at plug time, and removed at unplug
> time (correct me if I'm wrong). So you would need to run you chmod
> when your e.g. USB device is plugged (which is also the time at which you
> know where it will how up in the device space).

Hmmm well, we have had "hot plug" devices of a sort ever since 1.6
or earlier (when I began using MFS /dev).

The only magic trick there is to be able to predict all the possible
major and minor numbers at the time you write your MAKEDEV script, or at
least be able to update that script as necessary.  In the past this has
been sufficient, eg. with SCSI probe and scan detecting new devices.

However even that kind of magic really isn't truly necessary.

Indeed without devfs it could be as easy as the kernel simply
spitting out a message saying that "a device at major N, minor Y" was
available to be used (when it was detected), and then leave it entirely
up to the user, or some agent of the user (eg. a script monitoring for
such messages), to run "mknod" as appropriate, and perhaps adjusting
permissions and ownerships at the same time, possibly even updating
/etc/MAKEDEV.local.  In fact I've wanted the kernel to tell me what
major/minor number(s) to use for new SCSI devices, though to some extent
the way MAKEDEV is written to use unit numbers, it works well enough.
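
A minimal sketch of the parsing half of such an agent follows.  The message format "new device at major N, minor M" and the function name are invented for illustration; no current kernel prints exactly this.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Parse a HYPOTHETICAL kernel notice of the form
 *     "new device at major 14, minor 3"
 * Returns 1 and fills *majp and *minp on success, 0 otherwise.
 */
static int
parse_dev_notice(const char *line, int *majp, int *minp)
{
    assert(line != NULL && majp != NULL && minp != NULL);
    return sscanf(line, "new device at major %d, minor %d",
        majp, minp) == 2;
}
```

An agent reading the console or syslog stream would feed each line to this, and on a match run mknod (or call mknod(2)) with the two numbers, then apply whatever chmod and chown the local policy calls for.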

Obviously there are other ways for the kernel to notify userland of such
events as device attach/detach besides having a script monitor
/dev/console output or kernel syslog messages.  Perhaps kqueue()
monitoring /dev itself is sufficient, though perhaps then only for a
"flat" file tree in /dev.

So, with a devfs implementation that creates the new /dev file node
automatically, the agent script could still be responsible for changing
permissions and ownerships as desired.

I.e. no magic for persistence of filesystem metadata is necessary in
devfs so long as there are ways to monitor for and handle events that
indicate changes have happened in the live state of devfs filesystem.

-- 
Greg A. Woods
Planix, Inc.

   +1 416 218 0099    http://www.planix.com/




Re: config(5) break down

2010-03-16 Thread Greg A. Woods
At Tue, 16 Mar 2010 10:22:42 +, Andrew Doran  wrote:
Subject: Re: config(5) break down
> 
> Correctamundo.  95% of downloads in the week following the release of 5.0
> were for x86.  It doesn't say much about embedded but does tell us that
> a very large segment of the user population does commodity hardware.
> 
> (What the figures also revealed was that a number of the ports had as close
> to zero downloads as matters.  Which is, to be frank, a red flag for
> those that are not maintained.)

Please do not even think about using downloads as a measure of which
ports are used and how much they are used!

That's a completely invalid measurement of how NetBSD might be used.

Many of us just download the source.  We don't tell you which parts of
it that we use or don't use.

Even port-* mailing list subscriptions aren't a truly valid hint of
which ports are used or how much they are used.

-- 
        Greg A. Woods
Planix, Inc.

   +1 416 218 0099    http://www.planix.com/




Re: NetBSD & binary [was Re: config(5) break down]

2010-03-16 Thread Greg A. Woods
At Tue, 16 Mar 2010 13:35:34 -0400 (EDT), der Mouse wrote:
Subject: Re: NetBSD & binary [was Re: config(5) break down]
> 
> Yes, this excludes the people who don't understand and don't want to.
> To steal a term from marketing, I don't think NetBSD should try to
> serve that market segment; it's already well-served by others, and I
> see no percentage in trying to join them.  It doesn't serve them better
> (indeed, by adding yet another alternative they neither are nor want to
> be competent to choose among, it serves them worse) and it doesn't
> serve NetBSD (people who don't even want to understand are among the
> least likely to turn into developers and contribute back).  So what's
> to like?

Thank you!  Very well said.

I don't know if it helps to go any further, but perhaps I could add that
we don't really want that market segment anyway -- they would only
increase our "costs".  We need to let them go waste other people's time.

-- 
Greg A. Woods
Planix, Inc.

   +1 416 218 0099    http://www.planix.com/




MIPS SoC systems (was: Dead ports [Re: config(5) break down])

2010-03-20 Thread Greg A. Woods
At Fri, 19 Mar 2010 21:23:35 +, Herb Peyerl  wrote:
Subject: Re: Dead ports [Re: config(5) break down]
> 
> On Fri, Mar 19, 2010 at 05:19:47PM -0400, Thor Lancelot Simon wrote:
> > Have a look at
> > http://www.rmicorp.com/assets/docs/2070SG_XLR_XLS_Product_Selection_Guide_2008-12-16.pdf
> > specifically at the bottom few rows on the "XLS" chart.  You're looking at
> > parts that have 3 or 4 Gig-E interfaces, tons of useful hardware offload,
> > and are, by published reports, way down in the sub-$50 range.  You can
> > get very similar stuff from Cavium.
> 
> Last time I bought a cavium board it was >$5k USD... An Octeon 3850
> was $700 for 1521 piece part... I didn't think they had anything
> reasonable down below $500?  (and as far as I remember, they already
> had FreeBSD running on the Octeons).  Admittedly it's been a few 
> years.

FreeBSD is re-doing all its MIPS support, with quite a bit of work going
into the Atheros and Cavium ports.  Atheros is running, and some Cavium
are running too, but not yet all the most interesting ones.  Check out
Warner Losh's postings:

   http://bsdimp.blogspot.com/search/label/mips

I'm interested in bringing over some of those ports to NetBSD (though
if I try to do it for my day job I'll need to bring over Netgraph first).

Here's one company making Cavium-based systems at a reasonable price:

http://www.portwell.com/products/detail.asp?CUSTCHAR1=CAM-0100

This one doesn't run FreeBSD yet, but someone is working on it and they
are very close (it's not much different from the Cavium eval board
Warner shows booting).

They have a bunch of higher-end systems based on Cavium CPUs too (and
some other CPUs as well):

http://www.portwell.com/products/MIPS.asp

This company isn't as low-priced, but has similar devices.  This one is
just under $500, single unit:


http://www.lannerinc.com/Network_Application_Platforms/Network_Processor_Platforms/Desktop_NPU_Platforms/MR-320

and they also have a wide product range:


http://lannerinc.com/Network_Application_Platforms/Network_Processor_Platforms


One of the cheapest Atheros boards is the Ubiquiti RouterStation
series.  You can get one in a case with power supply from various
vendors now for just over $100, single unit pricing (the board is $80).

http://www.ubnt.com/rspro

This is one that FreeBSD runs on already, and I think adapting our
AR53xx port to also work on its AR71xx SoC would be relatively easy.
It's pretty snappy, but it has a poorly supported Ethernet switch chip
that as yet limits it for use in my day job.


When you start looking at what the GNU/Linux OpenWRT project supports,
there are dozens of very interesting little systems available at
relatively low prices.

http://oldwiki.openwrt.org/TableOfHardware.html

Routerboard.com (MikroTik) sell a bunch of interesting boards that even
including their own "proprietary" GNU/Linux port licensing, are still
quite cost effective.  Most of the more powerful ones are AR71xx based.

-- 
Greg A. Woods
Planix, Inc.

   +1 416 218 0099    http://www.planix.com/




Re: Hardware RAID problem with NetBSD 5?

2010-03-29 Thread Greg A. Woods
At Tue, 30 Mar 2010 00:38:05 + (UTC), John Klos  wrote:
Subject: Hardware RAID problem with NetBSD 5?
> 
> ataraid0: found 1 RAID volume
> ld0 at ataraid0 vendtype 3 unit 0: nVidia ATA RAID-1 array
> ld0: 931 GB, 121601 cyl, 255 head, 63 sec, 512 bytes/sect x 1953525120  
> sectors

I guess ataraid(4) is broken in NetBSD-5, as it is in NetBSD-4 and
-current.

See PRs #42985 and #38273 for starters.


> Strange... Does anyone have any ideas? Has anyone seen behaviour like 
> this, particularly the reset button getting disabled?

I booted today's kernel and encountered a rather harder lockup than
previously (DDB hung doing a backtrace, sending BREAK to the serial
console had no effect), though the reset button, at least on my machine
downstairs, still worked fine.

I can imagine some machines where the reset button is more of a software
controlled feature -- I've seen that kind of design mistake several
times before -- but I don't know any details of your MSI board (and I
can't find any manuals or other information about it on MSI's site).

-- 
        Greg A. Woods
Planix, Inc.

   +1 416 218 0099    http://www.planix.com/




kernel network interface incompatibilities between netbsd-4 and netbsd-5?

2010-10-15 Thread Greg A. Woods
Are there known kernel network interface incompatibilities between
netbsd-4 and netbsd-5?

I mention this because in considering upgrading one of my servers from
netbsd-4 to netbsd-5 I noticed that a static-linked arpwatch binary
built on netbsd-4 was complaining about bogons on my network even though
they were not bogons -- they were all in the same subnet:

Oct 15 12:36:39 historically arpwatch: bogon 204.92.254.6 8:0:20:21:99:db
Oct 15 12:36:48 historically last message repeated 10 times
Oct 15 12:36:49 historically arpwatch: bogon 204.92.254.244 0:f:d3:0:5:83
Oct 15 12:36:50 historically arpwatch: bogon 204.92.254.6 8:0:20:21:99:db
Oct 15 12:37:05 historically last message repeated 14 times
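
For reference, the "same subnet" test being claimed here can be sketched as below.  The /24 prefix length is an assumption (the netmask isn't stated above), and this is not arpwatch's own code, just the usual mask-and-compare check.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

/*
 * Return 1 if dotted-quad addresses a and b fall in the same IPv4 subnet
 * of the given prefix length, 0 if not, -1 if either fails to parse.
 */
static int
same_subnet(const char *a, const char *b, unsigned int prefix_len)
{
    struct in_addr ia, ib;
    uint32_t mask;

    assert(prefix_len <= 32);
    if (inet_pton(AF_INET, a, &ia) != 1 || inet_pton(AF_INET, b, &ib) != 1)
        return -1;
    /* Avoid the undefined full-width shift when prefix_len is 0. */
    mask = prefix_len == 0 ? 0 : 0xffffffffu << (32 - prefix_len);
    return (ntohl(ia.s_addr) & mask) == (ntohl(ib.s_addr) & mask);
}
```

Under that /24 assumption, 204.92.254.6 and 204.92.254.244 compare equal after masking, so neither should be flagged if the binary is reading the interface's address and netmask correctly.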


I also noticed that unbound wasn't answering DNS queries either, though
it didn't make any complaints.

Unfortunately the old netbsd-4 fstat is useless against a netbsd-5
kernel to see if unbound was actually listening on the right interfaces.

-- 
        Greg A. Woods

+1 416 218-0098VE3TCP  RoboHack 
Planix, Inc.   Secrets of the Weird 




getrusage() problems with user vs. system time reporting

2011-10-27 Thread Greg A. Woods
all
      count_bits() = 0.1968 us/c user,  0.0019 us/c sys, 0.0729 us/c wait,  0.2716 us/c wall
   count_ul_bits() = 0.2629 us/c user,  0.0026 us/c sys, 0.0835 us/c wait,  0.3489 us/c wall


similarly good with clang (again with -O0):

Apple clang version 1.7 (tags/Apple/clang-77) (based on LLVM 2.9svn)

            time() = 0.1796 us/c user,  0.0016 us/c sys, 0.0175 us/c wait,  0.1987 us/c wall
        nulltime() = 0.1841 us/c user,  0.0014 us/c sys, 0.0185 us/c wait,  0.2040 us/c wall
countbits_sparse() = 0.2145 us/c user,  0.0019 us/c sys, 0.0308 us/c wait,  0.2472 us/c wall
 countbits_dense() = 0.3065 us/c user,  0.0026 us/c sys, 0.0744 us/c wait,  0.3835 us/c wall
      COUNT_BITS() = 0.1918 us/c user,  0.0016 us/c sys, 0.0407 us/c wait,  0.2341 us/c wall
      count_bits() = 0.1961 us/c user,  0.0018 us/c sys, 0.0929 us/c wait,  0.2907 us/c wall
   count_ul_bits() = 0.2548 us/c user,  0.0029 us/c sys, 0.1576 us/c wait,  0.4153 us/c wall


Interestingly, totally as an aside, with clang -O3 the differences in
the algorithms are so far in the noise as to be invisible, and it's
almost as if the compiler recognised every one of my functions and just
did something magic instead:

tcountbits: now running each algorithm for 3000 iterations
            time() = 0.1782 us/c user,  0.0010 us/c sys, 0.0059 us/c wait,  0.1851 us/c wall
        nulltime() = 0.1782 us/c user,  0.0011 us/c sys, 0.0057 us/c wait,  0.1850 us/c wall
countbits_sparse() = 0.1782 us/c user,  0.0010 us/c sys, 0.0073 us/c wait,  0.1864 us/c wall
 countbits_dense() = 0.1782 us/c user,  0.0011 us/c sys, 0.0058 us/c wait,  0.1852 us/c wall
      COUNT_BITS() = 0.1782 us/c user,  0.0011 us/c sys, 0.0051 us/c wait,  0.1844 us/c wall
      count_bits() = 0.1782 us/c user,  0.0010 us/c sys, 0.0076 us/c wait,  0.1869 us/c wall
   count_ul_bits() = 0.1782 us/c user,  0.0011 us/c sys, 0.0073 us/c wait,  0.1866 us/c wall


I want to try using OS X's mach_absolute_time() on my Mac instead of
gettimeofday(), and perhaps also in parallel to getrusage() since on OS
X the gettimeofday() calls to seed the values are done without context
switches (due to the magic of OS X's COMMPAGE) and thus I think I can
safely assume that there will be approximately no measurable system time
used for each iteration.  That might give me reliable timing of each
algorithm to compare with getrusage().

I have also already looked at gprof(1) results to compare here as well,
but use of gprof(1) isn't always possible, and it doesn't necessarily
meet the same needs either -- sometimes "time -l" is all you've got, and
that means one must be able to trust getrusage() to give reproducible
results.

-- 
Greg A. Woods
Planix, Inc.

   +1 250 762-7675   http://www.planix.com/

#include 

#include 

#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 

/*
 * WARNING:  time can appear to have gone backwards with getrusage(2)!
 *
 * See NetBSD Problem Report #30115 (and PR#10201).
 * See FreeBSD Problem Report #975 (and PR#10402).
 *
 * Problem has existed in all *BSDs since 4.4BSD if not earlier.
 *
 * Only FreeBSD has implemented a "fix" (as of rev.1.45 (svn r44725) of
 * kern_resource.c (etc.) on April 13, 1999)
 *
 * But maybe it is even worse than that -- distribution of time between user
 * and system doesn't seem to match reality!
 *
 * See the GNU MP Library (GMP)'s tune/time.c code for better timing?
 */

#if defined(__APPLE__)
# define MILLIONS   10  /* my mac is way faster!  :-) */
#else
# define MILLIONS   1
#endif

static unsigned int iter = MILLIONS * 100UL;

char *argv0 = "progname";


/*
 * for info about the worker algorithms used here see:
 * <http://graphics.stanford.edu/~seander/bithacks.html>
 */

/*
 * do nothing much, but make sure you do it!
 */
unsigned int nullfunc(unsigned long);
unsigned int
nullfunc(unsigned long v)
{
	volatile unsigned int bc = (unsigned int) v;

	return bc;
}

/*
 * return the number of bits set to one in a value
 *
 * old-fashioned bit-by-bit bit-twiddling; very slow!
 */
unsigned int count_ul_bits(unsigned long);
unsigned int
count_ul_bits(unsigned long v)
{
	unsigned int c;

	c = 0;
	/*
	 * we optimize away any high-order zero'd bits...
	 */
	while (v) {
		c += (v & 1);
		v >>= 1;
	}

	return c;
}

/*
 * return the number of bits set to one in a value
 *
 * Subtraction of 1 from a number toggles all the bits (from right to left) up
 * to and including the rightmost set bit.
 *
 * So, if we decrement a number by 1 and do a bitwise and (&) with itself
 * ((n-1) & n), we will clear the rightmost set bit in the number.
 *
 * Therefore if we do 
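
The clear-the-rightmost-set-bit idea described in the comment above can be
sketched as a standalone function.  This is an illustration only and is not
taken from the attachment: count_set_bits_sketch() and its self-check are
hypothetical names.

```c
#include <assert.h>

/*
 * Count the bits set to one by repeatedly clearing the rightmost set
 * bit: v & (v - 1) drops the lowest 1 bit, so the loop body runs once
 * per set bit rather than once per bit position.
 */
static unsigned int
count_set_bits_sketch(unsigned long v)
{
	unsigned int c;

	for (c = 0; v != 0; c++)
		v &= v - 1;

	return c;
}

/* A few values that are easy to verify by hand. */
static void
count_set_bits_selfcheck(void)
{
	assert(count_set_bits_sketch(0UL) == 0);
	assert(count_set_bits_sketch(1UL) == 1);
	assert(count_set_bits_sketch(0xF0UL) == 4);
	assert(count_set_bits_sketch(0x80000001UL) == 2);
}
```

The cost depends on how many bits are set, which is presumably why sparse
and dense inputs are timed separately in the results.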

Re: getrusage() problems with user vs. system time reporting

2011-10-28 Thread Greg A. Woods
On Thu, Oct 27, 2011 at 07:24:03 +-300, Jukka Ruohonen wrote:
> 
> This is a well-known bug that is over 15 year old. The much simpler tests in
> atf(7) replicate it well. The used tracker PR is kern/30115. Michael van
> Elst suggested therein couple of reasonable (IMO) solutions.

Part of the point of this new discussion is that I am attempting,
perhaps poorly, to show that I think PR#30115 and its historical
counterpart, and similar reports in the PR databases of other *BSDs,
represent a separate, unique, problem.

It is possible that the problem I'm trying to show here shares, or is at
least related to, the same cause as the problem shown in PR#30115.
That's part of what I'm trying to discover here.

However FreeBSD's solution to PR#30115 is not in any way a valid
solution to the problem I'm trying to show here, regardless of whether
the problem I'm trying to show has the same cause or not.  That solution
will prevent the little wobbles that the simplistic tests demonstrate,
but it won't make overall getrusage() timing results any more meaningful
and consistent.  Indeed it may even make them a wee bit more wrong,
though I'm not sure this last part matters so much.

From what I understand currently, especially if the root cause of these
problems is related, then David's proposed solution would be on the
right track:

On Fri, 28 Oct 2011 08:48:19 +0100, David Laight wrote:
> 
> If you are willing to take the cost of getting the timestamp (in
> some units) on every kernel entry/exit (as well as the process switch)
> then the time in usr/sys can be added to the clock tick counts and
> used when the actual execution time is split.
> (Doing it that way means the units don't have to be THAT accurate)

Hmmm, if we could save the current time on every kernel entry, and
then increment a new "l_systime" variable with the elapsed time on every
return to user mode, and of course use the same clock as is used for
l_rtime (i.e. binuptime()), then the only wild-card variable left is
interrupt time.

Just how expensive is updatertime() and the associated bookkeeping it
needs?   Hmmm

So, then user time would be the difference between the sum of thread
runtimes and the sum of thread systimes, less some value for interrupt
time.
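
As a rough sketch of that arithmetic (nothing here is existing kernel code:
struct lwp_times_sketch and proc_user_time_ns() are hypothetical, and all
three totals are assumed to come from the same clock, in nanoseconds):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-thread totals, in nanoseconds from one clock. */
struct lwp_times_sketch {
	uint64_t rtime_ns;	/* total run time (cf. l_rtime) */
	uint64_t systime_ns;	/* kernel entry to return to user mode */
	uint64_t intrtime_ns;	/* time charged for interrupts */
};

/*
 * User time for a process: the sum of thread runtimes, minus the sum
 * of thread system times, minus interrupt time.
 */
static uint64_t
proc_user_time_ns(const struct lwp_times_sketch *lwps, size_t n)
{
	uint64_t run = 0, sys = 0, intr = 0;
	size_t i;

	assert(lwps != NULL || n == 0);

	for (i = 0; i < n; i++) {
		run += lwps[i].rtime_ns;
		sys += lwps[i].systime_ns;
		intr += lwps[i].intrtime_ns;
	}

	/* Clamp rather than wrap if the measured parts exceed the total. */
	if (sys + intr >= run)
		return 0;

	return run - sys - intr;
}
```

For two threads with runtimes of 1000 and 2000 ns, system times of 300 and
500 ns, and interrupt times of 50 and 150 ns, this gives 3000 - 800 - 200 =
2000 ns of user time.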

Ideally interrupt time would also be measured similarly (using the same
clock again) by the interrupt dispatcher and accumulated against
whatever thread (kernel or user) was interrupted (e.g. in l_intrtime).
However I don't quite see how this could be possible to do safely,
especially in conjunction with SMP, though I'm not familiar enough with
the details of the locking that might be required to know for sure.  If
I'm wrong and it is possible to do then directly measuring and
accounting for interrupt time would also be a very good thing, (assuming
it wouldn't be so costly as to radically change overall system
performance).


In any case with the current state of affairs I'm beginning to think the
interrupt ticks are the real wild-cards here and I'm wanting to modify
getrusage() to return a new ru_itime value as well (or add a new system
call to return the raw p_rtime and p_*ticks values along with stathz).
After all, how likely is it that the average of time accounted to
p_iticks will actually match the true time used by interrupts?  I'm
guessing average interrupt service times are far less than stathz
intervals.
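
To illustrate the concern, here is a minimal sketch of dividing a measured
runtime in proportion to sampled tick counts.  It is not the code from
kern_resource.c: split_by_ticks() and struct split_us are hypothetical, and
returning zeros when no ticks were recorded is just a choice made for this
example.

```c
#include <assert.h>
#include <stdint.h>

/* Runtime divided between the three modes, in microseconds. */
struct split_us {
	uint64_t user, sys, intr;
};

/*
 * Share rtime_us out in proportion to the number of statistical clock
 * ticks that happened to land in each mode.
 */
static struct split_us
split_by_ticks(uint64_t rtime_us, uint64_t uticks, uint64_t sticks,
    uint64_t iticks)
{
	struct split_us out = { 0, 0, 0 };
	uint64_t total = uticks + sticks + iticks;

	if (total == 0)
		return out;	/* no samples at all: nothing to go on */

	/* Keep the multiplications below from overflowing. */
	assert(rtime_us <= UINT64_MAX / total);

	out.user = rtime_us * uticks / total;
	out.sys = rtime_us * sticks / total;
	out.intr = rtime_us * iticks / total;

	return out;
}
```

With 90 user, 9 system and 1 interrupt tick the split looks plausible, but
a process that ran for 3000 us and happened to be sampled once during an
interrupt gets all 3000 us charged as interrupt time.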

I'm also wondering if I can force "stathz=0" at runtime, perhaps with a
sysctl, so that I can also avoid the perturbations caused by having a
different (and possibly changing) statclock rate.  It's all well and
good to try to reduce the cost of statclock handling by giving it a much
lower rate than hardclock, but in the end that just makes the division
of p_rtime as returned by getrusage() effectively meaningless, and thus
some of the work done by statclock may as well be simply not done at all
in the first place when stathz is non-zero.  It would be much less
misleading, to say the least.

-- 
Greg A. Woods
Planix, Inc.

   +1 250 762-7675   http://www.planix.com/




Re: getrusage() problems with user vs. system time reporting

2011-10-31 Thread Greg A. Woods
At Mon, 31 Oct 2011 21:10:40 +0000, David Laight wrote:
Subject: Re: getrusage() problems with user vs. system time reporting
> 
> There is an kernel option in i386 (and maybe amd64) to do some
> per-syscall stats. One of those counts 'time in syscall' and IIRC
> could easily be used to weight the tick counts so that getrusage
> gives more accurate times.

I had no idea there was a SYSCALL_TIMES_PROCTIMES option as well!  (and
I see it's been sitting there un-documented since 2007, and so it is
already in the netbsd-5 sources I'm experimenting with!)  This is
exciting!  This is what I was looking for!

I had ignored SYSCALL_TIMES because it seemed from the manual pages to
be lacking per-process hooks, though I was getting to the place where I
might have noticed that this would be an appropriate framework in which
to add per-process support. :-)


So, if I understand the way SYSCALL_TIMES_PROCTIMES is implemented it
effectively converts the struct proc p_*ticks fields into cycle counts
instead of stathz tick counts.  (though it seems enabling this does not
disable the additional accumulation of stathz ticks, nor does it adjust
the calculations in kern_resource.c to give expected values)


It looks like SYSCALL_TIMES is indeed on both i386 and amd64 at this
time, which will do fine for me for now, but given how it seems to work
it looks like it could be made to work on alpha, mips, powerpc, sparc64,
and ia64 with relative ease.


> The problem is that getting an accurate timestamp is relatively
> expensive. It has been almost the dominant part of process switch.

Because they do use cpu_counter32(), I'm surprised they would be that
expensive to keep.

If one were to get rid of the big syscall_counts and syscall_times
tables and just use the bits necessary for SYSCALL_TIMES_PROCTIMES,
would that help reduce the overhead to a more acceptable level?


BTW, have you ever built and tested a kernel with appropriate instances
of SYSCALL_TIME_ISR_ENTRY() and SYSCALL_TIME_ISR_EXIT() put into place?
If so, do you have suggestions as to where I could try putting those
macros, especially in a netbsd-5 kernel?

In my estimation it's useless to try to make getrusage() show more
accurate user time without also firmly accounting for ISR times as well.


Relatively speaking I don't mind at all taking a small, equitable, hit
in context switching if as a result I can get relatively accurate user
and system (and ISR) times per process.

-- 
    Greg A. Woods
Planix, Inc.

   +1 250 762-7675   http://www.planix.com/




Re: getrusage() problems with user vs. system time reporting

2011-11-01 Thread Greg A. Woods
At Mon, 31 Oct 2011 23:28:49 +0000, David Laight wrote:
Subject: Re: getrusage() problems with user vs. system time reporting
> 
> On Mon, Oct 31, 2011 at 04:08:13PM -0700, Greg A. Woods wrote:
> > 
> > So, if I understand the way SYSCALL_TIMES_PROCTIMES is implemented it
> > effectively converts the struct proc p_*ticks fields into cycle counts
> > instead of stathz tick counts.  (though it seems enabling this does not
> > disable the additional accumulation of stathz ticks, nor does it adjust
> > the calculations in kern_resource.c to give expected values)
> 
> It doesn't matter. With the cycle counter values, the stathz ticks
> are noise. The counts are then a bit like doing the stathz count
> on every tick of the cycle counter!

Ah, yes, of course!  I realized that shortly after posting while I was
adding #ifdef's to turn off the counting in statclock().  :-)

> > Because they do use cpu_counter32(), I'm surprised they would be that
> > expensive to keep.
> 
> If a cpu's TSC rate changes (eg with power saving) they they'll give
> different results. So you'd really need a nanotime function.
> OTOH using the valuse to split the total execution time is probably
> always better than the current code.

Wikipedia's entry on "Time Stamp Counter" (and Intel's app-note about
using RDTSC for performance monitoring) also mention that on any
processor since the Pentium Pro with out-of-order execution an accurate
cycle count can only be obtained by preceding the RDTSC instruction with
something like CPUID, or on CPUs which support one, the RDTSCP
instruction.

It is also mentioned that some processors run the time-stamp cycle
counter at a constant rate (not the actual current CPU clock rate)
(though apparently this quirk can be identified 

And of course there's the issue of multiple processors, since as I
understand it the TSCs on different cores are not synchronised.

Finally though I'm still learning more about virtual TSCs on VMware and
VirtualBox, I'm not so sure the TSC will be at all useful in such a
virtual machine environment.

Indeed with all this doom and gloom about TSC it seems it might be
better just to use binuptime() -- that probably won't be as fast
though.  Perhaps if I'm inspired tomorrow I'll try to re-do the
syscall_stats.h macros to use it instead of cpu_counter32(), and then use
this real measure of system time in calcru() instead of pretending the
stathz ticks mean anything.


> Getting the TSC is (IIRC) 30-40 clocks on i386 - because it is a
> synchronising instruction. But it might be the delays only really
> affect back to back reads. ad@ knows more - it will be in the
> archives somewhere.

I found some references saying that it could be 150-200 cycles, and
another saying that it was closer to 80 cycles.


BTW, I don't seem to have any luck identifying the CPU in the VirtualBox
VM that I'm running NetBSD-5 in:

# cpuctl identify 0
cpuctl: cpuset_create: Cannot allocate memory

ktrace seems to say the error is coming from sysctl(), not calloc():

   354  1 cpuctl   CALL  __sysctl(0xbfbfe954,2,0x80758e0,0xbfbfe95c,0,0)
   354  1 cpuctl   RET   __sysctl 0
   354  1 cpuctl   CALL  open(0x806b261,2,0x51)
   354  1 cpuctl   NAMI  "/dev/cpuctl"
   354  1 cpuctl   RET   open 3
   354  1 cpuctl   CALL  __sysctl(0xbfbfe888,2,0xbfbfe8e0,0xbfbfe8e4,0,0)
   354  1 cpuctl   RET   __sysctl 0
   354  1 cpuctl   CALL  __sysctl(0xbfbfe858,2,0xbfbfe8b0,0xbfbfe8b4,0,0)
   354  1 cpuctl   RET   __sysctl 0
   354  1 cpuctl   CALL  __sysctl(0x8072130,2,0xbfbfe8e0,0xbfbfe8e4,0,0)
   354  1 cpuctl   RET   __sysctl -1 errno 12 Cannot allocate memory


I was about to try to copy over a sysctl.debug symbol file from my build
machine, after turning on the network, and I got a crash as I started
the rcp, and it's the first time I've seen such a crash and the only
difference is that I've turned on SYSCALL_TIMES_PROCTIMES et al.

Mutex error: lockdebug_barrier: spin lock held

lock address : 0xc0d4de54 type :   spin
initialized  : 0xc04e7086
shared holds :  0 exclusive:  1
shares wanted:  0 exclusive:  0
current cpu  :  0 last held:  0
current lwp  : 0xd3d02840 last held: 0xd3d02840
last locked  : 0xc04e622c unlocked : 0xc04e624b
owner field  : 0x00010700 wait/spin:0/1

panic: LOCKDEBUG
Begin traceback...
copyright(c0d50643,0,0,c0c36d90,d2a69c40,d2a69bd8,c0d4de54,c0c33a24,d3d02840,c04e622c)
 at 0xc0b8d29d
End traceback...

dumping to dev 0,1 offset 2000263
dump 51 50 49 48 47 46 45 44 43 42 41 40 39 38 37 36 35 34 33 32 31 30 29 28 27 
26 25 24 23 22 21 20 19 18 17

questions about clocks, and statclock() in particular....

2011-11-02 Thread Greg A. Woods
  * time it can be highly inaccurate, especially
+* for interrupt service routines which may
+* routinely be much shorter-running than
+* stathz.
+*
+* Statistically though the likelihood of a
+* statclock() call coming while another ISR
+* is already executing is therefore relatively
+* low, unless perhaps the interrupt rate is
+* really high, in which case taking a whole
+* stathz tick ``hit'' for ISRs might, on
+* average, be OK.
+*/
p->p_iticks++;
+#endif
}
spc->spc_cp_time[CP_INTR]++;
} else if (p != NULL) {
+#ifndef SYSCALL_PROCTIMES
+   /*
+* Unfortunately since this allocates a whole stathz
+* interval to the process as system time it can be
+* highly inaccurate, especially for some system calls
+* which may take far less than a stathz interval to
+* finish.
+*
+* Unlike interrupts though, some system calls may
+* actually run fairly long.  Does that make this fair?
+*/
p->p_sticks++;
+#endif
spc->spc_cp_time[CP_SYS]++;
} else {
spc->spc_cp_time[CP_IDLE]++;
}
}
-   spc->spc_pscnt = psdiv;
 
if (p != NULL) {
+   struct vmspace *vm = p->p_vmspace;
+   long rss;
+
+   /*
+* If the CPU is currently scheduled to a non-idle process,
+* then charge that process with the appropriate VM resource
+* utilization for a tick.
+*
+* Assume that the current process has been running the entire
+* last tick, and account for VM use regardless of whether in
+* user mode or system mode (XXX or interrupt mode?).
+*
+* rusage VM stats are expressed in kilobytes *
+* ticks-of-execution.
+*/
+   /* based on code from 4.3BSD kern_clock.c and from FreeBSD ... */
+
+# define pg2kb(n)  (((n) * PAGE_SIZE) / 1024)
+
+   p->p_stats->p_ru.ru_idrss += pg2kb(vm->vm_dsize); /* unshared data */
+   p->p_stats->p_ru.ru_isrss += pg2kb(vm->vm_ssize); /* unshared stack */
+   p->p_stats->p_ru.ru_ixrss += pg2kb(vm->vm_tsize); /* "shared" text? */
+
+   rss = pg2kb(vm_resident_count(vm));
+   if (rss > p->p_stats->p_ru.ru_maxrss)
+   p->p_stats->p_ru.ru_maxrss = rss;
+
+   /* finally account overall for one stathz tick */
atomic_inc_uint(&l->l_cpticks);
+
+   /* we're done with mucking with proc stats fields */
mutex_spin_exit(&p->p_stmutex);
}
+
+       /*
+* reset the profhz divisor counter
+*
+* Note we must use the global variable here, not spc->spc_psdiv, since
+* the statclock rate may already have been lowered by another CPU
+*/
+   spc->spc_pscnt = psdiv;
 }

-- 
Greg A. Woods
Planix, Inc.

   +1 250 762-7675   http://www.planix.com/




Re: getrusage() problems with user vs. system time reporting

2011-11-02 Thread Greg A. Woods
At Tue, 01 Nov 2011 01:43:23 -0700, "Greg A. Woods"  wrote:
Subject: Re: getrusage() problems with user vs. system time reporting
> 
> Indeed with all this doom and gloom about TSC it seems it might be
> better just to use binuptime() -- that probably won't be as fast
> though.  Perhaps if I'm inspired tomorrow I'll try to re-do the
> syscall_stats.h macros to use it instead of cpu_counter32(), and then use
> this real measure of system time in calcru() instead of pretending the
> stathz ticks mean anything.

So, I've done that now.  (including removing dependency on
SYSCALL_COUNTS and SYSCALL_TIMES, etc.; all except for figuring out how
to hook in the ISR hooks...)

Seems binuptime() is indeed way too expensive to run at every
mi_switch(), syscall(), etc. (it more than doubles the time it takes for
gettimeofday() to run), but getbinuptime() seems to be sufficiently
low-cost to use in these situations.

Unfortunately getbinuptime() isn't immediately looking a whole lot
better than the statistical sampling in statclock(), though perhaps,
with enough benchmark runtime, it is, as expected, being _much_ more
fair at splitting between user and system time.

At least this is the case on a VirtualBox VM.  I need to see it on real
hardware next.

With some further analysis, and with addition of new time values to
struct proc so that statclock() ticks can also be accounted (right now I
re-use the p_*ticks storage for 64-bit nanosecond values), it may be
possible to come up with a simple algorithm so that calcru() can
balance out the difference between the getbinuptime() values and the
true binuptime() stored in {l,p}_rtime.  Storing the raw getbinuptime()
values would also avoid having to do 64-bit adds in mi_switch(),
syscall(), et al.

If anyone's interested in more details I can post some of my results, as
well as the changes I've made.

Any comments about this would be appreciated!


One thing that's confusing me is that though normally for short-running
processes I'm seeing the getbinuptime() values be either zero, or
somewhat less than the binuptime() value from p_rtime, on rare occasions
I also see "vastly" larger getbinuptime() values.

For example (this from calcru(), as it is called from kern_exit.c):

exit|tty: atrun[377]: rt=2216 ms, u+s=0 ms (u=0 ns / s=0 ns), it=0 ticks
exit|tty: atrun[361]: rt=2066 ms, u+s=0 ms (u=0 ns / s=0 ns), it=0 ticks
exit|tty: atrun[409]: rt=2207 ms, u+s=0 ms (u=0 ns / s=0 ns), it=0 ticks
exit|tty: atrun[130]: rt=2272 ms, u+s=1694 ms (u=0 ns / s=1694070 ns), it=0 ticks
exit|tty: atrun[434]: rt=4048 ms, u+s=10151 ms (u=10151849 ns / s=0 ns), it=0 ticks
exit|tty: atrun[162]: rt=3209 ms, u+s=0 ms (u=0 ns / s=0 ns), it=0 ticks
exit|tty: atrun[458]: rt=2576 ms, u+s=0 ms (u=0 ns / s=0 ns), it=0 ticks


"rt" is the real-time (p_rtime) value calculated by calcru(), in ms
"u" is the accumulated user time from getbinuptime() calls, in ns
"s" is the accumulated system time from getbinuptime() calls, in ns
"it" is the old-style statistically gathered p_iticks value
"u+s" is of course "u" + "s", converted to ms


(the longest running sample above was when the VM was idle except for
cron, but the VirtualBox host, my desktop, may have been quite busy)


-- 
Greg A. Woods
Planix, Inc.

   +1 250 762-7675   http://www.planix.com/




Re: getrusage() problems with user vs. system time reporting

2011-11-07 Thread Greg A. Woods
BTW, there's a bug in my code -- min() should be max(), I think


At Sun, 6 Nov 2011 14:26:02 +0000 (UTC), mlel...@serpens.de (Michael van Elst) wrote:
Subject: Re: getrusage() problems with user vs. system time reporting
> 
> wo...@planix.ca ("Greg A. Woods") writes:
> 
> >I tried using hardclock_ticks in sys/syscall_stats.h, but even with
> >HZ=1000, the resolution was too fine compared to the time taken by the
> >some system calls, and even the average time slice for user mode.
> >It is _way_ better than statclock ticks though, and "almost" free.
> 
> Isn't statclock there to have measurements that are _not_ synchronized
> with hardclock?

Yes, sort of.

If statistical sampling of the program counter is being done by the
clock interrupt(s) (i.e. SYSCALL_* options are not enabled), then yes
definitely.  Using hardclock() alone can allow a process to accidentally
or purposefully become synchronised to the system clock leading to
either inaccurate resource utilisation or even deliberate resource
hiding.  Note that on i386 so far as I can tell stathz is always zero so
hardclock() is always used to collect usage samples anyway.

With SYSCALL_* and using hardclock_ticks as the timer, I'm not sure.
Perhaps an evil process could call a cheap system call such as getpid()
just before a hardclock_tick would be incremented, which would
potentially give it an extra tick of user-mode CPU time.

That's why I would like to use a higher-resolution timer with SYSCALL_*.

Using the CPU time-stamp counter is much higher resolution of course,
and in theory with it then it would be impossible for any process to
avoid resource usage detection.  However since the timer is so fast that
I am forced to somehow divide that counter down to a reasonable rate
such that the calculations using it don't overflow, there may still be
room for problems.

It's really too bad that the timecounter infrastructure seems only
geared to providing one timer and there's no easy/obvious way (that I
can see) for any other kernel subsystem to reach in and access any other
potentially usable monotonic timer in a more machine independent way
(but without having to use all of binuptime() or similar).  Obviously I
could do this in a platform specific way where possible, but

With a true monotonic timer, one that gives known units of time, then
the SYSCALL_* feature could account for the actual time used in each
mode, using context switches to mark the divisions instead of having to
do any statistical sampling, and instead of having to use some timer
with an unknown rate (such as the CPU TSC) as a ticker that's then used
to divide out p_rtime between system, user, and interrupt time.
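
A rough sketch of that kind of accounting follows.  Everything in it is
hypothetical (struct mode_acct, mode_switch(), the mode names); timestamps
are passed in as plain nanosecond values so that the example does not
depend on any particular clock.

```c
#include <assert.h>
#include <stdint.h>

/* The modes being accounted for; MODE_COUNT is just the array size. */
enum mode_sketch { MODE_USER, MODE_SYS, MODE_INTR, MODE_COUNT };

struct mode_acct {
	uint64_t total_ns[MODE_COUNT];	/* accumulated time per mode */
	uint64_t last_ns;		/* timestamp of the last transition */
	enum mode_sketch cur;		/* mode we have been in since then */
};

/* Start accounting in the given mode at time now_ns. */
static void
mode_acct_init(struct mode_acct *a, enum mode_sketch start, uint64_t now_ns)
{
	int i;

	for (i = 0; i < MODE_COUNT; i++)
		a->total_ns[i] = 0;
	a->last_ns = now_ns;
	a->cur = start;
}

/*
 * Called at every mode transition: charge the time elapsed since the
 * previous transition to the mode being left, then enter the next one.
 */
static void
mode_switch(struct mode_acct *a, enum mode_sketch next, uint64_t now_ns)
{
	assert(now_ns >= a->last_ns);	/* the clock must be monotonic */

	a->total_ns[a->cur] += now_ns - a->last_ns;
	a->last_ns = now_ns;
	a->cur = next;
}
```

Starting in user mode at t=100, entering the kernel at 250, taking an
interrupt from 300 to 310, and returning to user mode at 400 leaves 150 ns
of user time, 140 ns of system time and 10 ns of interrupt time, with no
sampling involved.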

I wonder if I could get by using the frequency discovered by the
timecounter code for the CPU TSC and then from that calculate an
estimate for the amount of real time spent in each mode?  The issues
with CPU timestamp counters would still remain, but perhaps that's
better than trying to use only some bits of the TSC sums.
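
The conversion itself would be simple enough.  Here is a sketch, assuming a
constant counter frequency in Hz is available from somewhere (both
cycles_to_ns() and the freq_hz parameter are hypothetical names, not an
existing interface):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Convert a cycle count to nanoseconds for a counter running at
 * freq_hz.  Whole seconds and the remainder are converted separately
 * so that cycles * 1000000000 is never computed directly, since that
 * product would overflow 64 bits after a few seconds' worth of cycles
 * at GHz rates.
 */
static uint64_t
cycles_to_ns(uint64_t cycles, uint64_t freq_hz)
{
	const uint64_t ns_per_sec = 1000000000ULL;
	uint64_t secs, rem;

	assert(freq_hz > 0);
	/* rem < freq_hz, so this bound keeps rem * ns_per_sec in range. */
	assert(freq_hz <= UINT64_MAX / ns_per_sec);

	secs = cycles / freq_hz;
	rem = cycles % freq_hz;

	return secs * ns_per_sec + rem * ns_per_sec / freq_hz;
}
```

At an assumed 2 GHz, 3000000000 cycles come out as 1500000000 ns.  None of
this helps if the counter rate changes underneath, so the caveats about
timestamp counters still apply.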

-- 
Greg A. Woods
Planix, Inc.

   +1 250 762-7675   http://www.planix.com/




clues to getrusage() problems with user vs. system time reporting

2011-11-08 Thread Greg A. Woods
 count_bits() = 0.0515 us/c user,  9.6292 us/c sys, 0.0128 us/c wait,  
9.6935 us/c wall
   count_ul_bits() = 0.0561 us/c user,  9.8984 us/c sys, 0.0280 us/c wait,  
9.9826 us/c wall
   68.87 real         0.37 user        68.47 sys
 20356  maximum resident set size
 7  average shared memory size
   408  average unshared data size
 7  average unshared stack size
59  page reclaims
 0  page faults
 0  swaps
 0  block input operations
 0  block output operations
 0  messages sent
 0  messages received
 0  signals received
 0  voluntary context switches
45  involuntary context switches


Would anyone who bothered to read this far, and who knows a thing or two
about context switching and such care to try to help see if the hooks in
 can be better placed or otherwise improved?



FYI, with random() my test program is still showing wonky wobbly results
on my iMac, and though they are not terrible, they're not really usable
at this resolution either:

$ /usr/bin/time -l ./tcountbits -t -r -i 100

tcountbits: now running each algorithm for 1 iterations
  random() = 0.0056 us/c user,  0.0000 us/c sys, 0.0002 us/c wait,
0.0058 us/c wall
nulltime() = 0.0113 us/c user,  0.0000 us/c sys, 0.0006 us/c wait,
0.0119 us/c wall
countbits_sparse() = 0.0616 us/c user,  0.0001 us/c sys, 0.0028 us/c wait,
0.0645 us/c wall
 countbits_dense() = 0.1372 us/c user,  0.0004 us/c sys, 0.0102 us/c wait,
0.1479 us/c wall
  COUNT_BITS() = 0.0152 us/c user,  0.0000 us/c sys, 0.0006 us/c wait,
0.0158 us/c wall
  count_bits() = 0.0205 us/c user,  0.0000 us/c sys, 0.0007 us/c wait,
0.0212 us/c wall
   count_ul_bits() = 0.0949 us/c user,  0.0003 us/c sys, 0.0044 us/c wait,
0.0996 us/c wall
   36.71 real        34.62 user         0.09 sys
368640  maximum resident set size
 0  average shared memory size
 0  average unshared data size
 0  average unshared stack size
   110  page reclaims
 0  page faults
 0  swaps
 0  block input operations
 0  block output operations
 0  messages sent
 0  messages received
 0  signals received
 0  voluntary context switches
 25945  involuntary context switches



-- 
Greg A. Woods
Planix, Inc.

   +1 250 762-7675   http://www.planix.com/




Re: language bindings (fs-independent quotas)

2011-11-21 Thread Greg A. Woods
At Fri, 18 Nov 2011 16:27:53 +0200, Alan Barrett  wrote:
Subject: Re: language bindings (fs-independent quotas)
> 
> On Fri, 18 Nov 2011, Manuel Bouyer wrote:
> >> Assuming that there's no need to handle fields with embedded
> >> spaces, perl's split() function will DTRT.
> >
> > No, it does not because there are fields that can be empty.
> 
> The common way of dealing with that is to have a placehloder like "-"
> for empty fields.

I dunno (and don't want to know :-)) about perl, but it's easy enough to
insert proper field separators into fixed-width columnar input with Awk
and then go about using split() or whatever uses FS normally.

-- 
Greg A. Woods
Planix, Inc.

   +1 250 762-7675   http://www.planix.com/




Re: Lost file-system story

2011-12-09 Thread Greg A. Woods
At Tue, 6 Dec 2011 12:44:16 -0500, Donald Allen  wrote:
Subject: Re: Lost file-system story
> 
> much more clear. When I read this before the fun started, I took it to
> mean, perhaps unjustifiably, what I know to be true -- there is some
> non-zero probability that fsck of an async file-system will not be
> able to verify and/or restore the filesystem to correctness  after a
> crash. You are saying that the probability, in the case of NetBSD, is
> 1. If that's true, that there's no periodic sync, I would say that's
> *really* a mistake. It should be there with a knob the administrator
> can turn to adjust the sync frequency.

Just to be clear:  There is such a knob, or rather binary switch.  It's
called umount(2).

sync(2) might work too, but I seem to vaguely remember something about
it not working for async-mounted filesystems, and some obscure reason
why it wouldn't/couldn't work for them, though that doesn't seem logical
to me any more.  sync(2) should, IMHO, even go so far as to cause the
dirty flag to be cleared on the disk once all the writes to flush all
necessary updates have completed (and assuming of course that no further
changes of any kind are made to the filesystem after sync(2) scheduled
all the writes, and assuming of course that writes cached in the storage
interface controller or in the drive controller will be written out in
order).

In theory "mount -u -r" should work too, but then there's PR#30525.

Steve Bellovin asked a question some time ago on netbsd-users about why
umount(2) works, but "mount -u -r" doesn't, and to the best of my
understanding it hasn't been answered yet (though mention was made of a
possible fix to be found in FreeBSD, followed by some musings on how
hard it is to find and use such fixes in the diverging code bases of
FreeBSD and NetBSD).

Perhaps sync(2) will fail for async-mounted filesystems, or even without
MNT_ASYNC, for the same reason that "mount -u -r" fails, though that's
pure speculation based on my vague ideas, and is not based on anything
in the code.  The question was asked in PR#30525 about "mount -u -r"
vs. filesystems mounted with MNT_SYNC, but nobody knew if that would
make any significant difference or not (and I would naively suspect not).

Perhaps the superblock should also record when a filesystem has been
mounted with MNT_ASYNC so that fsck(8) can print a warning such as:

"FS is dirty and was mounted async.  Demons will fly out of your nose"

-- 
Greg A. Woods
Planix, Inc.

   +1 250 762-7675   http://www.planix.com/




Re: Lost file-system story

2011-12-09 Thread Greg A. Woods
At Fri, 9 Dec 2011 15:50:35 -0500, Donald Allen  wrote:
Subject: Re: Lost file-system story
> 
> "does not guarantee to keep a consistent file system structure on the
> disk" is what I expected from NetBSD. From what I've been told in this
> discussion, NetBSD pretty much guarantees that if you use async and
> the system crashes, you *will* lose the filesystem if there's been any
> writing to it for an arbitrarily long period of time, since apparently
> meta-data for async filesystems doesn't get written as a matter of
> course.

I'm not sure what the difference is.  You seem to be quibbling over
minor differences and perhaps one-off experiences.  Both OpenBSD and
NetBSD also say that you should not use the "async" flag unless you are
prepared to recreate the file system from scratch if your system
crashes.  That means use newfs(8) [and, by implication, something like
restore(8)], not fsck(8), to recover after a crash.  You got lucky with
your test on OpenBSD.


> And then there's the matter of NetBSD fsck apparently not
> really being designed to cope with the mess left on the disk after
> such a crash. Please correct me if I've misinterpreted what's been
> said here (there have been a few different stories told, so I'm trying
> to compute the mean).

That's been true of Unix (and many unix-like) filesystems and their
fsck(8) commands since the beginning of Unix.

fsck(8) is designed to rely on the possible states of on-disk filesystem
metadata because that's how Unix-based filesystems have been guaranteed
to work (barring use of MNT_ASYNC, obviously).

And that's why by default, and by very strong recommendation, filesystem
metadata for Unix-based filesystems (sans WAPBL) should always be
written synchronously to the disk if you ever hope to even try to use
fsck(8).


> I am not telling the OpenBSD story to rub NetBSD peoples' noses in it.
> I'm simply pointing out that that system appears to be an example of
> ffs doing what I thought it did and what I know ext2 and journal-less
> ext4 do -- do a very good job of putting the world into operating
> order (without offering an impossible guarantee to do so) after a
> crash when async is used, after having been told that ffs and its fsck
> were not designed to do this.

You seem to be very confused about what MNT_ASYNC is and is not.  :-)

Unix filesystems, including the Berkeley Fast File System variant, have
never made any guarantees about the recoverability of an async-mounted
filesystem after a crash.

You seem to have inferred some impossible capability based on your
experience with other non-Unix filesystems that have a completely
different internal structure and implementation from the Unix-based
filesystems in NetBSD.

Perhaps the BSD manuals have assumed some knowledge of Unix history, but
even the NetBSD-1.6 mount(8) manual, from 2002, is _extremely_ clear
about the dangers of the "async" flag, with strong emphasis in the
formatted text on the relevant warning:

     async   All I/O to the file system should be done asynchronously.
             In the event of a crash, _it_is_impossible_for_the_system_
             _to_verify_the_integrity_of_data_on_a_file_system_mounted_
             _with_this_option._  You should only use this option if you
             have an application-specific data recovery mechanism, or
             are willing to recreate the file system from scratch.

According to CVS that wording has not changed since October 1, 2002, and
the emphasised text has been there unchanged since September 16, 1998.
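
As a rough illustration of what an "application-specific data recovery
mechanism" could look like (this sketch is mine, not anything from the
manual page; the function names are hypothetical, and Python is used
only for brevity): keep a checksum manifest somewhere other than the
async-mounted filesystem, and check the data against it after a crash,
since the system itself cannot do that for you.

```python
import hashlib
import os


def file_digest(path):
    """Return the SHA-256 hex digest of one file's contents."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()


def build_manifest(root):
    """Map each file under root (by relative path) to its digest."""
    manifest = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            manifest[os.path.relpath(full, root)] = file_digest(full)
    return manifest


def verify_manifest(root, manifest):
    """Return the sorted relative paths that are missing or differ."""
    bad = []
    for rel, digest in manifest.items():
        full = os.path.join(root, rel)
        if not os.path.isfile(full) or file_digest(full) != digest:
            bad.append(rel)
    return sorted(bad)
```

Note that this only detects damage.  The manifest has to live on a
filesystem that is not mounted async, and actual recovery still means
"newfs" and "restore".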

> So I'd love it if my experience encourages someone to improve NetBSD
> ffs and fsck to make use of async practical

As others have already said, this has already been done.  It's called
WAPBL.  See wapbl(4) for more information.  Use "mount -o log" to enable
it.

(BTW, I personally don't think you would want to use softdep -- it can
suffer almost as badly as async after a crash, though perhaps without
totally invalidating fsck(8)'s ability to at least recover files and
directories which were static since mount; and it does also offer vastly
improved performance in many use cases, but as the manual says, it
should still be used with care, i.e. with recognition of the risks of
less-tested, much more complex code, and vastly changed internal
implementation semantics implying radically different recovery modes.)

-- 
Greg A. Woods
Planix, Inc.

   +1 250 762-7675    http://www.planix.com/




Re: Lost file-system story

2011-12-12 Thread Greg A. Woods
At Fri, 9 Dec 2011 22:12:25 -0500, Donald Allen  wrote:
Subject: Re: Lost file-system story
> 
> On Fri, Dec 9, 2011 at 8:43 PM, Greg A. Woods  wrote:
> > At Fri, 9 Dec 2011 15:50:35 -0500, Donald Allen  
> > wrote:
> > Subject: Re: Lost file-system story
> > >
> > > "does not guarantee to keep a consistent file system structure on the
> > > disk" is what I expected from NetBSD. From what I've been told in this
> > > discussion, NetBSD pretty much guarantees that if you use async and
> > > the system crashes, you *will* lose the filesystem if there's been any
> > > writing to it for an arbitrarily long period of time, since apparently
> > > meta-data for async filesystems doesn't get written as a matter of
> > > course.
> >
> > I'm not sure what the difference is.
> 
> You would be sure if you'd read my posts carefully. The difference is
> whether the probability of an async-mounted filesystem is near zero or
> near one.

I think perhaps the misunderstanding between you and everyone else is
because you haven't fully appreciated what everyone has been trying to
tell you about the true meaning of "async" in Unix-based filesystems,
and in particular about NetBSD's current implementation of Unix-based
filesystems, and what that all means to implementing algorithms that can
reliably repair the on-disk image of a filesystem after a crash.

I would have thought the warning given in the description of "async" in
mount(8) would be sufficient, but apparently you haven't read it that
way.

Perhaps the problem is that the last occurrence of the word "or" in the
last sentence of that warning should be changed to "and".  To me that
would at least make the warning a bit stronger.


> > And that's why by default, and by very strong recommendation, filesystem
> > metadata for Unix-based filesystems (sans WABPL) should always be
> > written synchronously to the disk if you ever hope to even try to use
> > fsck(8).
> 
> That's simply not true. Have you ever used Linux in all the years that
>  ext2 was the predominant filesystem? ext2 filesystems were routinely
> mounted async for many years; everything -- data, meta-data -- was
> written asynchronously with no regard to ordering. 

DO NOT confuse any Linux-based filesystem with any Unix-based
filesystem.  They may have nearly identical semantics from the user
programming perspective (i.e. POSIX), but they're all entirely different
under the hood.

Unix-based filesystems (sans WAPBL, and ignoring the BSD-only LFS) have
never ever Ever EVER given any guarantee about the repairability of the
filesystem after a crash if it has been mounted with MNT_ASYNC.

Indeed it is more or less _impossible_ by design for the system to make
any such guarantee given what MNT_ASYNC actually means for Unix-based
filesystems, and especially what it means in the NetBSD implementation.


> > Unix filesystems, including Berkeley Fast File System variant, have
> > never made any guarantees about the recoverability of an async-mounted
> > filesystem after a crash.
> 
> I never thought or asserted otherwise.

Well, from my perspective, especially after carefully reading your
posts, you do indeed seem to think that async-mounted Unix-based
filesystems should be able to be repaired, at least some of the time,
despite the documentation, and all the collected wisdom of those who've
replied to your posts so far, saying otherwise.


> > You seem to have inferred some impossible capability based on your
> > experience with other non-Unix filesystems that have a completely
> > different internal structure and implementation from the Unix-based
> > filesystems in NetBSD.
> 
> Nonsense -- I have inferred no such thing. Instead of referring you to
> previous posts for a re-read, I'll give you a little summary. I am
> speaking about probabilities. I completely understand that no
> filesystem mounted async (or any other way, for that matter), whether
> Linux or NetBSD or OpenBSD, is GUARANTEED to survive a crash.

OK, let's try stating this once more in what I hope are the same terms
you're trying to use:  The probability of any Unix-based filesystem
being repairable after a crash is zero (0) if it has been mounted with
MNT_ASYNC, and if there was _any_ activity that affected its structure
since mount time up to the time of the crash.  It still might survive
after some types of changes, but it _probably_ won't.  There are no
guarantees.  Use "newfs" and "restore" to recover.

Linux ext2 is not a Unix-based filesystem and Linux itself is not a
Unix-based kernel.  The meaning of "async" to ext2 is apparently very
different than it is to any

Re: Lost file-system story

2011-12-12 Thread Greg A. Woods
At Mon, 12 Dec 2011 15:08:40 +, David Holland  
wrote:
Subject: Re: Lost file-system story
> 
> On Sun, Dec 11, 2011 at 06:53:26PM -0800, Greg A. Woods wrote:
> No, as far as I can tell he understands perfectly well; he just
> doesn't consider the behavior acceptable.
> 
> It appears that currently a ffs volume mounted -oasync never writes
> back metadata. I don't think this behavior is acceptable either.

I agree there are conditions and operations which _should_ guarantee
that the on-disk state of the filesystem is identical to what the user
perceives and thus that the filesystem is 100% consistent and secure.

It seems umount(2) works to make this guarantee, for example.

The two other most important of these that come to mind are:

mount -u -r /async-mounted-fs

and

mount -u -o noasync /async-mounted-fs

It is my understanding that neither works at the moment, and that this
is a known and reported and accepted bug, as I outlined in an earlier
post to this thread.

I think sync(2) should probably also work, but _only_ if the filesystem
is made entirely quiescent from before the time sync() is called, and
until after the time all the writes it has scheduled have completed, all
the way to the disk media.  (and of course once activity starts on the
filesystem again, all guarantees are lost again)

It might be nice if sync(2) could schedule all the needed writes to
happen in an order which would ensure consistency and repairability of
the on-disk image at any given time, but I'm guessing this might be too
much to ask, at least without some more significant effort.

However without enforcing the "synchronous" ordering of writes, sync(2)
is effectively useless for the purposes Mr. Allen appears to have,
though perhaps his level of risk tolerance would still make it useful
to him while others of us would still be unable to tolerate its dangers
in any scenarios where we were not prepared to use newfs to recover.

Besides, the only way I know to guarantee a filesystem remains quiescent
is to unmount it, so if you do that first then there's nothing for
sync(2) to do afterwards, so nothing new to implement.  :-)


>  > DO NOT confuse any Linux-based filesystem with any Unix-based
>  > filesystem.  They may have nearly identical semantics from the user
>  > programming perspective (i.e. POSIX), but they're all entirely different
>  > under the hood.
>  > 
>  > Unix-based filesystems (sans WABPL, and ignoring the BSD-only LFS) have
>  > never ever Ever EVER given any guarantee about the repariability of the
>  > filesystem after a crash if it has been mounted with MNT_ASYNC.
> 
> What on earth do you mean by "Unix-based filesystems" such that this
> statement is true?

I mean exactly what it sounds like -- nothing more.

Having almost no knowledge about ext2 or any other non-Unix-based
filesystems, I'm trying to be careful to avoid making any claims about
those non-Unix-based filesystems.

I included FFS as a Unix-based filesystem because I know for sure that
it shares many of the attributes of the original Unix filesystems with
respect to the issues surrounding MNT_ASYNC.

>  > Perhaps this sentence from McKusick's memo about fsck will help you to
>  > understand:  "fsck is able to repair corrupted file systems using
>  > procedures based upon the order in which UNIX honors these file system
>  > update requests."  This is true for all Unix-based filesystems.
> 
> No, it is true for ffs, and possibly for our ext2 implementation
> (which shares a lot of code with ffs) but nothing else.

Well, if you follow what I mean by Unix-based filesystems, and you
ignore LFS and options like WAPBL, as I've said, then I believe it is
entirely true since within my definition that leaves just FFS, and.

V7, though it didn't have MNT_ASYNC, would suffer the same as if
MNT_ASYNC were implemented for it -- indeed I'm guessing that NetBSD's
reimplementation of v7fs will have the same problems with MNT_ASYNC.

As I say, I don't know enough about the non-Unix-based filesystems in
NetBSD, such as those compatible with AmigaDOS, Acorn, Windows NT, or
even MS-DOS, to know if they would be adversely affected by MNT_ASYNC.
Indeed I'm not even sure if they all have reasonable filesystem repair
tools (NetBSD has none, except maybe for ext2fs and msdos, though in my
experience NetBSD's MS-DOS filesystem implementation is very fragile and
it does not have a truly useful fsck_msdos, even without trying to use
MNT_ASYNC with it).  SysVbfs may suffer too, but I don't know enough
about it either despite it being by definition Unix-based, and we don't
have an fsck for it in any case.

I'd also be guessing about EFS, and I'm not sure I'd categorize it as
Unix-based any more than I do LFS.

-- 
Greg A. Woods
Planix, Inc.

   +1 250 762-7675    http://www.planix.com/




Re: Lost file-system story

2011-12-12 Thread Greg A. Woods
At Mon, 12 Dec 2011 11:09:44 -0500 (EST), Mouse  
wrote:
Subject: Re: Lost file-system story
>
> They _can_ be repaired...some of the time.  When they can, it is
> because, by coincidence, it just so happens that the stuff that got
> written produces a filesystem fsck can repair.

That's totally irrelevant.

Possibilities other than zero or one are not useful in manual pages, and
they are only useful to an end user as a very last resort -- equivalent
to calling out the army to put Humpty Dumpty back together again.

For all useful intents and purposes any probability of irreparable damage
greater than zero is, for the end user, and for all planning purposes,
as good as a probability of one.  Plan to use "newfs" and "restore" after
every crash and you'll be OK.  Plan otherwise and you will eventually be
disappointed.
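
To put a number on "eventually" (a back-of-the-envelope sketch only: it
assumes each crash is independent and that the per-crash chance of the
filesystem surviving is some fixed p, neither of which anyone has
measured; the function name is made up and 0.9 is an arbitrary value
picked for illustration): the chance of at least one unrecoverable loss
over n crashes is 1 - p^n, which only heads toward one as n grows
whenever p is less than one.

```python
def p_any_loss(p_survive, crashes):
    """Probability of at least one unrecoverable loss in `crashes` crashes,
    assuming independent crashes with a fixed per-crash survival chance."""
    if not 0.0 <= p_survive <= 1.0:
        raise ValueError("p_survive must be between 0 and 1")
    if crashes < 0:
        raise ValueError("crashes must not be negative")
    return 1.0 - p_survive ** crashes


if __name__ == "__main__":
    # 0.9 is an arbitrary illustrative survival chance, not a measurement.
    for n in (1, 10, 100, 1000):
        print(n, round(p_any_loss(0.9, n), 4))
```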

> That's not how I feel about it when I've lost a filesystem.  I'll take
> a filesystem with a nonzero probability of recovering something useful
> from over one that guarantees to trash everything any day (other things
> being equal, of course).

Heh.  Yup, there are those of us who will find it a challenge to see
just how much we can recover from a damaged file system no matter how
useful the outcome may be.

You don't put that in the manual page though, and you never give the end
user that expectation (unless it's already too late for them and they've
got yolk all over their face).

-- 
Greg A. Woods
Planix, Inc.

   +1 250 762-7675    http://www.planix.com/




Re: Lost file-system story

2011-12-12 Thread Greg A. Woods
At Sun, 11 Dec 2011 23:23:33 -0500, Donald Allen  wrote:
Subject: Re: Lost file-system story
> 
> How can you possibly say such a thing and hope to be taken seriously?
> What you just said means that P(survival) = .999 is the same as
> P(survival) = 0.
> 
> There are a LOT of situations (e.g., mine) where P(survival) = .999
> would be very acceptable and P(survival) = 0 would not.

The manual page must not give probabilities or even speak of
possibilities.

So, as-is you have been warned properly by the manual page.

For planning purposes you _must_ expect that your filesystem will be
damaged beyond repair after a crash and that you will have to use
"newfs" and restore to recover.  Learn these expectations well and you
will be happier in the long run.  Fail to learn them and you have no
recourse but to wallow in your own sorrows.  I.e. you can't come to the
mailing list and say that you expected something better just because you
say you can get something better from something else entirely different.
You have false expectations based on your experiences with entirely
foreign environments.

Maybe Humpty Dumpty can be put back together again, sometimes, but even
if you have all the King's horses and all the King's men on call to
respond to a disaster at a moment's notice, you must not expect that you
can have the egg put back together successfully, even just once, even if
it does look like just a minor crack this time.

-- 
        Greg A. Woods
Planix, Inc.

   +1 250 762-7675    http://www.planix.com/




Re: Lost file-system story

2011-12-12 Thread Greg A. Woods
At Mon, 12 Dec 2011 14:23:40 -0600, Eric Haszlakiewicz  
wrote:
Subject: Re: Lost file-system story
> 
> Donald, don't listen to Greg.  Just in case it needs to be repeated, you're
> not the only one that thinks it is reasonable to expect a non-0 probability
> that things will be recovereable, even if something goes wrong.

Eric, what part of MNT_ASYNC don't you understand?

-- 
        Greg A. Woods
Planix, Inc.

   +1 250 762-7675    http://www.planix.com/




Re: Lost file-system story

2011-12-12 Thread Greg A. Woods
At Mon, 12 Dec 2011 14:17:35 -0600, Eric Haszlakiewicz  
wrote:
Subject: Re: Lost file-system story
> 
> On Mon, Dec 12, 2011 at 11:39:38AM -0800, Greg A. Woods wrote:
> > Having almost no knowledge about ext2 or any other non-Unix-based
> > filesystems, I'm trying to be careful to avoid making any claims about
> > those non-Unix-based filesystems.
> 
> hmm.. so then how can you claim that it is "entirely different" (as you did
> in an earlier email)?  It sounds like you're talking our of your, ahem.. 
> depth.

As I said, I'm trying to be careful to avoid making claims one way or
another about non-Unix-based filesystems.

I'm also trying to keep in mind that MNT_ASYNC can be an attribute of
the OS implementation well above the filesystems and I'm also trying to
avoid making claims about non-Unix filesystem structures which may be
faced with this "feature" for the first time.

Once upon a time I was quite familiar with the use of the tools that
came before fsck.  I have a great deal of experience with the on-disk
structure of V7fs, SysVfs, and many of the minor variants of these
filesystems.  I'm experienced with many of the things that can go wrong
with these filesystems and I'm moderately experienced with how they can
be repaired as best as is humanly possible with low-level bit
manipulating tools when bugs in either the kernel or fsck cause
unexpected failures (not unlike what can happen when MNT_ASYNC is used).
I'm moderately experienced with more modern filesystems such as with
SysVr4's native FS and Berkeley FFS, though less experienced with
low-level on-disk repair of those filesystems (since on these modern
Unix-based filesystems the standard repair tools, especially fsck, have
been vastly improved; and kernel bugs which destroy the ordered writing
of metadata have effectively been eliminated).

> > I included FFS as a Unix-based filesystem because I know for sure that
> > it shares many of the attributes of the original Unix filesystems with
> > respect to the issues surrouding MNT_ASYNC.
> 
> Have you tried actually comparing the current NetBSD ffs sources against
> whatever "Unix" sources you are talking about?  While I'm sure that there
> are many "attributes" that are shared, if you even compare the current NetBSD
> sources with those from, say, 1994, you will find a ton of differences.

This has nothing to do with any given pile of source code per se.  The
issues that affect repairability of a Unix-based filesystem are higher
level design considerations that are common to the implementations of
fsck and the filesystems they can repair from the v7 addenda tape all
the way through to the implementation of modern day NetBSD's
fsck_ffs(8).

You might find McKusick and Kowalski's paper about BSD FFS fsck
enlightening.  (I can supply a copy if you can't find it elsewhere.  It
would be nice if it could be included in the NetBSD distribution, even
if not cleaned up to reflect the current implementation.  It was in
4.4BSD-Lite2, after all.)


Like I said earlier:

Perhaps the superblock(s) should also record when a filesystem has been
mounted with MNT_ASYNC so that fsck(8) can print a warning such as:

"FS is dirty and was mounted async.  Demons will fly out of your nose"
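
The check itself would be tiny.  The sketch below is Python only to
keep it short; the flag names and bit values are invented for
illustration and do not correspond to anything in the real FFS
superblock or in fsck_ffs(8):

```python
# Hypothetical flag bits, invented for this sketch; they are NOT the real
# FFS superblock flags.
FS_CLEAN = 0x01      # filesystem was unmounted cleanly
FS_WASASYNC = 0x02   # filesystem was last mounted with MNT_ASYNC


def async_warning(fs_flags):
    """Return the warning fsck would print, or None if none is needed."""
    dirty = not (fs_flags & FS_CLEAN)
    if dirty and (fs_flags & FS_WASASYNC):
        return ("FS is dirty and was mounted async.  "
                "Demons will fly out of your nose")
    return None
```

The kernel would have to set the (hypothetical) async bit at mount time
for any of this to work, which is the real cost of the idea.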


-- 
Greg A. Woods
Planix, Inc.

   +1 250 762-7675    http://www.planix.com/




Re: Lost file-system story

2011-12-12 Thread Greg A. Woods
At Mon, 12 Dec 2011 16:15:56 -0500, Greg Troxel  wrote:
Subject: Re: Lost file-system story
> 
> Donald came here not complaining, just surprised that things were
> somewhat worse than one would have expected.  And he's right - "async"
> doesn't mean "and data might never be written indefinitely", just that
> there are no ordering or completion guarantees.

Are you sure?

I see nothing which says MNT_ASYNC will write anything at all out at any
time before a umount(2) call.  Personally I think that's a good thing --
it is, perhaps, an indication that MNT_ASYNC is being as efficient as it
can possibly be, though of course it may also just be accidental that
NetBSD's implementation doesn't behave more as some folks seem to expect
it to do.

In any case I'm not so sure it matters in the long run.

The relative damage to the filesystem is all a matter of circumstance.

The fact is that use of MNT_ASYNC means the filesystem is very easily
damaged beyond the ability of fsck's algorithms to repair it in a useful
manner.

One can concoct special circumstances where NetBSD's implementation
fares worse than others, but equally it is possible to concoct
circumstances where no true fully async implementation can ever do very
well.


>  I'm not 100% clear what
> is wrong, but it seems likely that this discussion has surfaced a bug or
> two

The only real bug I see is that "mount -u -o noasync" might not work
(just as "mount -u -r" is known not to work).  But I seem to be the only
one really focusing on that side of the issue here.

Indeed it might be nice if an otherwise idle system would _eventually_
clean all its dirty buffers one way or another even if they are part of
a filesystem that has been mounted with MNT_ASYNC, but I don't see that
as a requirement of MNT_ASYNC, and I certainly wouldn't want that to
give allowance for the manual to be less foreboding than it already is.
Indeed I would still want to see fsck spit out the warning I suggested,
or at least one with as much or even more force in setting the user's
expectations for failure.

Perhaps it is as simple as this: fixing the known and accepted bug(s),
i.e. making "mount -u -r" and "mount -u -o noasync" work, may have the
side effect of making MNT_ASYNC-mounted filesystems eventually gain
better consistency on idle systems too.

:-)

(I am waffling though on whether I think sync(2) should have any
beneficial effect on the consistency of MNT_ASYNC-mounted filesystems.)

-- 
Greg A. Woods
Planix, Inc.

   +1 250 762-7675    http://www.planix.com/




Re: Lost file-system story

2011-12-13 Thread Greg A. Woods
At Wed, 14 Dec 2011 09:06:23 +1030, Brett Lymn  
wrote:
Subject: Re: Lost file-system story
> 
> On Tue, Dec 13, 2011 at 01:38:57PM +0100, Joerg Sonnenberger wrote:
> > 
> > fsck is supposed to handle *all* corruptions to the file system that can
> > occur as part of normal file system operation in the kernel. It is doing
> > best effort for others. It's a bug if it doesn't do the former and a
> > potential missing feature for the latter.
> > 
> 
> There are a lot of slips twixt cup and lip.  If you are really unlucky
> you can get an outage at just the wrong time that will cause the
> filesystem to be hosed so badly that fsck cannot recover it.  Sure, fsck
> can run to completion but all you have is most of your FS in lost+found
> which you have to be really really desperate to sort through.  I have
> been working with UNIX for over 20years now and I have only seen this
> happen once and it was with a commercial UNIX.

I've seen that happen more than once unfortunately.  SunOS-4 once I think.

I agree 100% with Joerg here though.

I'm pretty sure at least some of the times I've seen fsck do more damage
than good it was due to one or more kernel bugs breaking assumptions
about ordered operations.

There have of course also been some pretty serious bugs in various fsck
implementations across the years and vendors.

-- 
Greg A. Woods
Planix, Inc.

   +1 250 762-7675    http://www.planix.com/




Re: Lost file-system story

2011-12-13 Thread Greg A. Woods
At Mon, 12 Dec 2011 18:49:31 -0500 (EST), "Matt W. Benjamin" 
 wrote:
Subject: Re: Lost file-system story
> 
> Why would sync not be effective under MNT_ASYNC?  Use of sync is not
> required to lead to consistency expect with respect to an arbitrary
> point in time, but I don't think anyone ever believed otherwise.
> However, there should be no question of metadata never being written
> out if sync was run?

Well sync(2) _could_ be effective even in the face of MNT_ASYNC, though
I'm not sure it will, or indeed even should be required to, have a
guaranteed ongoing beneficial effect on the on-disk consistency of a
filesystem that was mounted with MNT_ASYNC while activity continues to
proceed on the filesystem.

I.e. I don't expect sync(2) to suddenly enforce order on the writes that
it schedules to a MNT_ASYNC-mounted filesystem.  The ordering _may_ be a
natural result of the implementation, but if it's not then I wouldn't
consider that to be a bug, and I certainly wouldn't write any
documentation that suggested it might be a possible outcome.  MNT_ASYNC
means, to me at least, that even sync(2) can get away with doing writes
to a filesystem mounted with that flag in an order other than one which
would guarantee on-disk consistency to a level where fsck could repair
it.

I.e. sync(2) could possibly make things worse for MNT_ASYNC mounted
filesystems before it makes them better, and I don't see how that could
be considered to be a bug.

I do agree that IFF the filesystem is made quiescent, AND all writes
necessary and scheduled by sync(2) are allowed to come to completion,
THEN the on-disk state of an MNT_ASYNC-mounted filesystem must be
consistent (and all data blocks must be flushed to the disk too).

However if you're going to go to that trouble (i.e. close all files open
on the MNT_ASYNC-mounted filesystem and somehow prevent any other file
operations of any kind on that filesystem until such time that you think
the sync(2) scheduled writes are all done), then it should be just as
easy, if not even easier, to do a "mount -u -r" (or "mount -u -o
noasync", or even "umount"), in which case you'll not only be sure that
the filesystem is consistent and secure, but you'll know when it reaches
this state (i.e. you won't have to guess about when sync(2)'s scheduled
work completes).

-- 
Greg A. Woods
Planix, Inc.

   +1 250 762-7675    http://www.planix.com/




Re: Lost file-system story

2011-12-14 Thread Greg A. Woods
At Wed, 14 Dec 2011 07:50:37 + (UTC), mlel...@serpens.de (Michael van Elst) 
wrote:
Subject: Re: Lost file-system story
> 
> wo...@planix.ca ("Greg A. Woods") writes:
> 
> >easy, if not even easier, to do a "mount -u -r"
> 
> Does this work again?

Not that I know of, and PR#30525 concurs, as does the commit mentioned
in that PR to prevent it from falsely appearing to work, a change which
remains in netbsd-5 and -current to date.  See my discussion of this
issue earlier in this thread.

-- 
        Greg A. Woods
Planix, Inc.

   +1 250 762-7675    http://www.planix.com/




retrocomputing NetBSD style

2015-06-01 Thread Greg A. Woods
At Fri, 29 May 2015 10:22:35 +, David Holland  
wrote:
Subject: Re: Removing ARCNET stuffs
> 
> There's one other thing I ought to mention here, which is that I have
> never entirely understood the point of running a modern OS on old
> hardware; if you're going to run a modern OS, you can run it on modern
> hardware and you get the exact same things as on old hardware, except
> faster and smoother. It's always seemed to me that running vintage
> OSes (on old hardware or even new) is more interesting, because that
> way you get a complete vintage environment with its own, substantively
> different, set of things. This does require maintaining the vintage
> OSes, but that's part of the fun... nonetheless, because I don't
> understand this point I may be suggesting something that makes no
> sense to people who do, so take all the above with that grain of
> salt.

You're quite right that it is interesting to run classic software on
classic hardware, to the extent that retrocomputing is about preserving
a bit of history, or living in the past, or whatever, and to the extent
that one might enjoy such a thing.

However there were, and are, a lot of us who want(ed) a modern OS to run
on our old hardware because we want(ed) to re-purpose that fine old
hardware to do something new and exciting with it.  I.e. I am/was not
building a museum, but rather trying to get things done and learn new
things.

For example I started running NetBSD on Sun-3 and early sparc systems
because that's the hardware I had, and it was good and capable hardware.
However the original SunOS-4 was broken and decrepit for the uses I
wanted to put it to, and I didn't have source so I couldn't really fix
it.  NetBSD opened the door to doing modern things without paying
high-end prices for the latest commercial hardware and software.  At
that time the older hardware really was built better too, and it was
more "operational" -- i.e. it had proper serial console support, and
once I got to using Alphas, proper 24x7-Lights-Out support with the
ability to power cycle it and reset it remotely without extra control
hardware.

In many respects I still do the same thing, but because of the things
you were saying about how the pace of hardware change has dropped
significantly in recent years, now the hardware I use is just an older
variant of the same stuff you can buy new -- e.g. my new-to-me servers
are Dell PE2950's -- they're replacing a PE2650, but they're not really
all that much different from a brand new R710 or similar.  The old 2650
is really feeling dated now and its processors are missing a number of
features I want, but with the 2950 I can run the very same binaries on
quite a range of hardware from the latest greatest back to these older
second-hand systems.

Also, w.r.t. supporting older and less-capable systems, I would now
treat them exactly the same as modern embedded systems with similar
limitations.  I don't expect I'll ever do many, if any, full builds on
my RPi or BBB, and hopefully not even build many packages on them
either, but rather I will cross-compile for them on my far more beefy
big build server.  Were I to try to run the latest NetBSD on an old
Micro-VAX, Sun3, etc., I would never expect to actually do self-hosted
builds on such systems.  I really don't understand anyone who has the
desire to try to run build.sh on a VAX-750 to build even just a kernel,
let alone the whole distribution.  I won't even bother trying that on my
Soekris board!

-- 
Greg A. Woods
Planix, Inc.

   +1 250 762-7675    http://www.planix.com/




Re: Removing ARCNET stuffs

2015-06-02 Thread Greg A. Woods
At Tue, 2 Jun 2015 11:47:01 -0700, Dennis Ferguson 
 wrote:
Subject: Re: Removing ARCNET stuffs
> 
> It's too long an argument, but I think any approach to a
> multiprocessor network stack that attempts to get there starting
> with the existing network L2/L3/interface code as a base is likely
> not on the table.  I would offer the rather herculean effort spent
> on FreeBSD to attempt to do exactly that, and the fairly mediocre
> result it produced, as evidence.  The resources to match that
> probably don't exist, and if there were a better, easier way to do
> this it would have been done already.  I think the least cost way to
> produce a better result is actually to make a big change, preserving
> the device drivers and the transport protocol code (which needs to run
> single-threaded per-socket in any case) and any non-IP protocol code
> that still works (running single-threaded) but doing a wholesale
> replacement of the code that moves packets between those things with
> something that can operate without locks.  Doing it this way has some
> risks, not the least of which is that it would leave you with networking
> code unlike anyone else's (though if it were well done I'm not sure this
> would last, everyone has trouble with the network stack), but I think
> this makes the problem tractable and has a good chance of producing
> something that scales quite well even without a lot of Linux-style
> micro-optimization effort.

Dennis, if you are able I wonder if you could comment on how well you
think the NetGraph implementation in FreeBSD fares with respect to being
part of a multiprocessor network stack, and if you think it offers any
advantages (and/or has any disadvantages) in an SMP environment.  I
understand that NetGraph gained some finer-grained SMP support as early
as FreeBSD-5.x.  I also read about some NetGraph locking and performance
issues in the 201309DevSummit notes, but I don't know any of the
details.

What if NetGraph was the _only_ network stack in the kernel?

And what about Luigi Rizzo's netmap?  (which claims to be specifically
targeted at multi-core machines)  (I'm going to try to learn a bit more
about netmap at BSDCan this year.)

And finally, what about the possibilities for a more formal STREAMS-like
implementation, or at least something that would be compatible with
existing STREAMS modules at the API (DDI/DKI) level, w.r.t. SMP?  This
would maybe allow independent maintenance and testing of less widely
used protocol modules (and perhaps even drivers) by third parties.

-- 
Greg A. Woods
Planix, Inc.

   +1 250 762-7675    http://www.planix.com/




Re: retrocomputing NetBSD style

2015-06-03 Thread Greg A. Woods
At Wed, 3 Jun 2015 09:23:37 -0400 (EDT), Mouse wrote:
Subject: Re: retrocomputing NetBSD style
> 
> GAW Wrote:
> > I really don't understand anyone who has the desire to try to run
> > build.sh on a VAX-750 to build even just a kernel, let alone the
> > whole distribution.
> 
> I recall a time where NetBSD/vax was broken for a long time because
> everyone was cross-building; as soon as a native build was attempted,
> the brokenness showed up.
> 
> I native build on _everything_.  If it can't native build, it isn't
> really part of my stable, so to speak.

Yes, there is that issue!

See, for instance, my recent posts comparing assembler output from
kernel compiles done by the same compiler when run on amd64 vs. i386.

However those are the kinds of bugs one might hope can be caught by
decent enough regression tests of the compiler and its toolchain.

Unfortunately these are tests which we don't have now, in part because
in a sense we treat the whole system as the regression test, thus
forcing users to do native compiles to prove there are no noticeable
regressions.

Of course if we did have a proper cross-compiler regression test suite
then we would only have to build and run such tests on those less
capable machines.

In some sense though since I don't intend to use my Soekris board (or
RPi, or BBB, etc.) as development systems, I only really care that the
cross compiler generates working code for them, and we do have an
increasingly useful whole-system regression test suite that I do intend
to run on those smaller systems to prove they work well when their
binaries have been built on my build server.

However this issue does have me wanting to do builds on my RPi and BBB
and to dig my Alpha and another Sparc server out of storage, and find a
couple of MIPS systems of each type, just so I can try cross-compiling
on them all and prove that any future fixes to the compiler will then
result in identical code no matter what host it runs on, including
self-hosted.

So, I guess until/unless we have a good compiler regression test suite
then another awesome use for older and very different hardware from the
current melange of almost-identical i386 derivatives is to serve as a
test base for the toolchain.

-- 
Greg A. Woods
Planix, Inc.

   +1 250 762-7675    http://www.planix.com/




Not Groff! Heirloom Doctools!

2015-06-04 Thread Greg A. Woods
At Thu, 04 Jun 2015 14:53:56 +0200, Johnny Billquist  wrote:
Subject: Re: Groff
> 
> On 2015-06-04 12:44, Robert Swindells wrote:
> > 
> > Johnny Billquist  wrote:
> > 
> > > What happened to the original roff? I mean, groff is just a gnu
> > > replacement for roff. Maybe switch back to the original?
> >
> > The sources to all of DWB are available from AT&T:
> >
> > <http://www2.research.att.com/~astopen/download/>
> >
> > It needs a bit of work to get it to build on NetBSD though.
> 
> Hmm. What about roff from 2.11BSD? That shouldn't be so hard to get
> building on NetBSD...

Have my posts since 2009 about Heirloom Doctools somehow mostly been going
into a black hole or something!?!?!?!  I get responses of "yes, please!"
on the lists, but nothing happens and people still keep posting truly
lame suggestions as if they've never heard of Heirloom Doctools.  I
posted about it in a response to this very thread just three days ago
(though I redirected to tech-userlevel then too)!

Yes, sorry Johnny, but your suggestion really is poor.  Ancient troff
was a poor fit for "modern" use even 25 years ago with psroff to
generate PostScript from its C/A/T output -- it's full of bugs and
missing tons of features (beyond being device independent), and still
written in what's basically PDP11 assembler dressed up as C (i.e. it's
missing all of BWK's extensive rework), never mind that it's not
actually in the original 2.11BSD release, which contains just Berkeley's
bits (and the same small bits are in the 4.4BSD release too).

Heirloom Doctools _is_ the original troff, in its very latest form!
(well, there's a fork on github that's got a bunch more bug fixes)

A better place to get the original troff, in modern form, with an
open-source license would be Plan-9.

However Heirloom Doctools is equivalent to the Plan-9 version, but
without Plan-9 dependencies, and with more fixes and features.
I.e. Heirloom Doctools are the very most up-to-date code from the very
people who wrote and maintained it since the beginning (sans Joe
Ossanna, of course).

Back before 2009 it already produced PDFs and handled UTF-8.

Heirloom Doctools already builds and works on NetBSD just fine, and
has done so since before 2009 (advertised as working on 2.0 in 2007).

Heirloom Doctools is essentially the complete set of tools from the
AT&T Documenter's Work Bench suite -- i.e. it contains all the other
_necessary_ pre-processors like eqn, pic, tbl, grap, refer, and vgrind,
and it contains the back-end drivers and font tables for PostScript and
PDF and other printers.  The only thing it's really missing are the
papers from /usr/{share/}doc, but those are freely available elsewhere,
including from the DWB release.

As I discussed back in 2009, Heirloom Doctools is essentially better
quality and far more feature-full than the last DWB release, and
arguably has a much better license, and of course DWB since 2009 is
probably never going to see another public maintenance release now that
Glenn Fowler has retired.  The only thing DWB has over Heirloom Doctools
is arguably better PostScript support (oh, and 'pm', but it's C++ :-)).

Why do people keep forgetting about it, and WTF are we still waiting for?

(once again re-directing to tech-userlevel where this discussion is
more apropos)

-- 
Greg A. Woods
Planix, Inc.

   +1 250 762-7675    http://www.planix.com/




Re: semaphores options

2019-04-08 Thread Greg A. Woods
At Mon, 8 Apr 2019 21:19:32 +0300, Dima Veselov wrote:
Subject: semaphores options
>
> Greetings!
> Sorry for posting so many questions recently, but my production
> server failed to start PostgreSQL after system upgrade (8-STABLE).
>
> This was caused by semaphores, which I like to set in kernel options,
> which now are not working. Better say some are working, some are
> not.
>
> I solved the problem setting them via sysctl but I wonder what happened
> with options(4)?
> It seems that SEMMNI, SEMMNS, SEMMNU, NOFILE and CHILD_MAX do not
> work anymore, but SHMSEG and NMBCLUSTERS are good. I believe they
> were always working because the system worked long time and had
> sysctl.conf empty. Any recent changes?

Indeed, something seems to have changed, and the problem continues with
-current as of late January (8.99.32).

I think the culprit was this change, which somehow didn't have an
accompanying change to any documentation (most notably options(4) still
documents all the removed settings):

RCS file: /cvs/master/m-NetBSD/main/src/sys/conf/param.c,v

revision 1.65
date: 2015-05-12 19:06:25 -0700;  author: pgoyette;  state: Exp;  lines: +4 -2;  commitid: G8nWAd1qbrsX8ely;
Create a new sysv_ipc module to contain the SYSVSHM, SYSVSEM, and
SYSVMSG options.  Move associated variables out of param.c and into
the module's source file.


This commit adds a great big ugly "#if XXX_PRG" around all the related
SysV IPC settings in sys/conf/param.c, i.e. it entirely removes all
support for "options SEMMNI=NNN" and related.

Perhaps this only affects kernels which have the SysV IPC code baked in,
though I've no idea how the so-called modular world is supposed to work
for pre-set definitions -- I guess it doesn't, though perhaps there's
still some hook for config(1)?

The real underlying problem may be that none of the SysV IPC options
from options(4) were ever properly set up with "defflag" or "defparam"
in the appropriate "files" file (sys/kern/files.kern probably), or as we
used to say, they were never "defopt'ed" for config.  See config(5).

An "options FOO=1234" line used to work without "defparam" if the use was
in sys/conf/param.c, but it doesn't seem to work with the new regime.
Maybe it would work again if "defparam" lines were added to the right
place.
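
For illustration only, here is a minimal C sketch of the compile-time
default pattern that let un-defopt'ed options work.  It assumes that
config(1) hands an option without a "defparam" to the compiler as a -D
flag (e.g. -DSEMMNI=200), so that only a source file which supplies its
own #ifndef default ever sees the override.  The struct and variable
names are hypothetical, not the kernel's; the default values are the
ones sysctl reports on the 8.99.32 kernel in this message.

```c
/*
 * Sketch only: compile with e.g. "cc -DSEMMNI=200 ..." to mimic what an
 * "options SEMMNI=200" line is assumed to do for an option that has no
 * "defparam".  Without any -D flag the defaults below apply.
 */
#ifndef SEMMNI
#define SEMMNI	10	/* max number of semaphore identifiers */
#endif
#ifndef SEMMNS
#define SEMMNS	60	/* max number of semaphores in system */
#endif
#ifndef SEMMNU
#define SEMMNU	30	/* number of undo structures in system */
#endif

/* Hypothetical stand-in for the kernel's table of semaphore limits. */
struct sem_limits {
	int semmni;
	int semmns;
	int semmnu;
};

/* Only a file that is compiled with the -D flags picks up the overrides. */
static const struct sem_limits sem_limits_boot = { SEMMNI, SEMMNS, SEMMNU };
```

If the variables move into a file whose build does not see those -D
flags, the #ifndef defaults win silently, which would be consistent
with the sysctl output from the 8.99.32 kernel.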

FYI, I have had the following in my kernel configs (in this particular
case edited into XEN3_DOMU) since a very long time ago (before 1.6), and
they continued to work up to and including 5.2_STABLE:

# System V compatible IPC subsystem.  (msgctl(2), semctl(2), and shmctl(2))
#
# Note: SysV IPC parameters could be changed dynamically, see sysctl(8).
#
options SYSVMSG         # System V-like message queues
#
options MSGMNI=200      # max number of message queue identifiers (default 40)
options MSGMNB=32768    # max size of a message queue (default 2048)
options MSGTQL=512      # max number of messages in the system (default 40)
options MSGSSZ=128      # size of a message segment (must be 2^n, n>4) (default 8)
options MSGSEG=16384    # max number of message segments in the system
                        # (must be less than 32767) (default 2048)
#
options SYSVSEM         # System V-like semaphores
options SEMMNI=200      # max number of semaphore identifiers in system (def=10)
options SEMMNS=600      # max number of semaphores in system (def=60)
options SEMMNU=300      # number of undo structures in system (def=30)
options SEMUME=100      # max number of undo entries per process (def=10)
#
options SYSVSHM         # System V-like memory sharing
options SHMMAXPGS=16384 # Size of shared memory map (def=2048)



But on my 8.99.32 XEN3_DOMU kernel these only give me:

# sysctl kern.ipc
kern.ipc.sysvmsg = 1
kern.ipc.sysvsem = 1
kern.ipc.sysvshm = 1
kern.ipc.shmmax = 2097152000
kern.ipc.shmmni = 128
kern.ipc.shmseg = 128
kern.ipc.shmmaxpgs = 512000
kern.ipc.shm_use_phys = 0
kern.ipc.msgmni = 200
kern.ipc.msgseg = 16384
kern.ipc.semmni = 10
kern.ipc.semmns = 60
kern.ipc.semmnu = 30


FYI, to show it did/does work on an older system:

23:02 [0.185] # uname -a
NetBSD central 5.2_STABLE NetBSD 5.2_STABLE (XEN3_DOMU) #0: Sun Jun  5 16:33:15 PDT 2016  woods@building:/build/woods/building/netbsd-5-amd64-amd64-obj/work/woods/m-NetBSD-5/sys/arch/amd64/compile/XEN3_DOMU amd64
23:02 [0.186] # sysctl kern.ipc
kern.ipc.sysvmsg = 1
kern.ipc.sysvsem = 1
kern.ipc.sysvshm = 1
kern.ipc.shmmax = 67108864
kern.ipc.shmmni = 128
kern.ipc.shmseg = 128
kern.ipc.shmmaxpgs = 16384
kern.ipc.shm_use_phys = 0
kern.ipc.msgmni = 200
kern.ipc.msgseg = 16384
kern.ipc.semmni = 200
kern.ipc.semmns = 600
kern.ipc.semmnu = 300


--
Greg A. Woods 

+1 250 762-7675   RoboHack 
Planix, Inc.  Avoncote Farms 




Re: semaphores options

2019-04-08 Thread Greg A. Woods
At Mon, 08 Apr 2019 20:37:39 -0700, "Greg A. Woods"  wrote:
Subject: Re: semaphores options
>
> RCS file: /cvs/master/m-NetBSD/main/src/sys/conf/param.c,v
> 
> revision 1.65
> date: 2015-05-12 19:06:25 -0700;  author: pgoyette;  state: Exp;  lines: +4 -2;  commitid: G8nWAd1qbrsX8ely;
> Create a new sysv_ipc module to contain the SYSVSHM, SYSVSEM, and
> SYSVMSG options.  Move associated variables out of param.c and into
> the module's source file.
> 
>
> This commit adds a great big ugly "#if XXX_PRG" around all the related
> SysV IPC settings in sys/conf/param.c, i.e. it entirely removes all
> support for "options SEMMNI=NNN" and related.

Note also this change only appears in NetBSD-8.0 in terms of releases.

The netbsd-7 branch and all its releases preserve the original behaviour
where these "options" worked -- or at least that's what I understand
from CVS.  I haven't tested this -- my only currently running 7.x kernel
didn't have my custom config edits.

--
Greg A. Woods 

+1 250 762-7675   RoboHack 
Planix, Inc.  Avoncote Farms 




Re: NULL pointer arithmetic issues

2020-02-24 Thread Greg A. Woods
's new undefined behaviour rules as interpreted by some
  compiler maintainers now allow the compiler to STUPIDLY assume that
  since the programmer has knowingly put a supposed de-reference of a
  pointer on the first line of the function, then any comparisons of
  that pointer with NULL further on are OBVIOUSLY never ever going to be
  true and so it can SILENTLY wipe out the whole damn security check.

  I guess I'm saying that modern compiler maintainers are not sane, and
  at least some of the more recent C Standards Committee are definitely
  NOT sane and/or friendly and considerate.

  C's primitive nature engenders the programmer to think in terms of
  what the target machine is going to do, and as such it is extremely
  sad and disheartening that the standards committee chose to endanger
  users in so many ways.

[[ in modern "Standard C" ]]
  It’s not that evaluating something like (1<<32) might have an
  unpredictable result, but rather that the entire execution of any
  program that evaluates such an expression is ENTIRELY meaningless!
  Indeed according to "Standard C" the execution is not even meaningful
  up to the point where undefined behaviour is encountered.  Undefined
  behaviour trumps ALL other behaviors of the C abstract machine.

  And it is all in the goal of attempting comprehensive maximum possible
  optimization of all code at any expense INCLUDING correct operation of
  the program.
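
  As a hedged illustration of the (1<<32) point: with a 32-bit int that
  shift count is not less than the width of the promoted left operand,
  and that is what makes the expression undefined.  A sketch of a
  defensive wrapper follows.  shl_checked() is a hypothetical helper,
  not something from any standard library, and it assumes unsigned long
  has no padding bits.

```c
#include <limits.h>
#include <stdbool.h>

/*
 * Shift value left by count bits, refusing any count that the C
 * standard leaves undefined.  Returns false and leaves *out untouched
 * when count is out of range.  (Hypothetical helper, sketch only.)
 */
static bool
shl_checked(unsigned long value, unsigned int count, unsigned long *out)
{
	/* A count >= the operand's width is undefined behaviour. */
	if (count >= sizeof(value) * CHAR_BIT)
		return false;

	/* Unsigned shifts simply discard the bits shifted out. */
	*out = value << count;
	return true;
}
```

  This is exactly the kind of extra care the text above complains
  about: the check is cheap, but every caller has to remember it.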

  Not all so-called "undefined behaviours" are quite this bad, yet, but
  in general we would be infinitely better off with a more completely
  defined abstract machine that might force some target architectures to
  jump through hoops instead of forcing EVERY programmer to ALWAYS be
  more careful than EVERY conceivable optimizer.

  As Phil Pennock said:

    If I program in C, I need to defend against the compiler maintainers.
    [[ and future standards committee members!!! ]]
    If I program in Go, the language maintainers defend me from my mistakes.

  And I say:

    Modern "Standard C" is actually "Useless C" and "Unusable C"


Indeed I now say if "Standard C" follows C++ then it will be safe to say
that a good optimizing compiler will soon be able to turn all C programs
into "abort()" calls.

-- 
Greg A. Woods 

Kelowna, BC +1 250 762-7675   RoboHack 
Planix, Inc.  Avoncote Farms 




Re: NULL pointer arithmetic issues

2020-02-25 Thread Greg A. Woods
At Mon, 24 Feb 2020 22:15:22 -0500 (EST), Mouse wrote:
Subject: Re: NULL pointer arithmetic issues
>
> > Greg A. Woods wrote:
> >
> >   NO MORE "undefined behaviour"!!!  Pick something sane and stick to it!
> >
> >   The problem with modern "Standard" C is that instead of refining
> >   the definition of the abstract machine to match the most common
> >   and/or logical behaviour of existing implementations, the standards
> >   committee chose to throw the baby out with the bath water and make
> >   whole swaths of conditions into so-called "undefined behaviour"
> >   conditions.
>
> Unfortunately for your argument, they did this because there are
> "existing implementations" that disagree severely over the points in
> question.

I don't believe that's quite right.

True "Undefined Behaviour" is not usually the explanation for
differences between implementations.  That's normally what the Language
Lawyers call "Implementation Defined" behaviour.

"Undefined behaviour" is used for things like dereferencing a nil
pointer.  There's little disagreement about that being "undefined by
definition" -- even ignoring the Language Lawyers.  We can hopefully
agree upon that even using the original K&R edition's language:

"C guarantees that no pointer that validly points at data will
contain zero"

The problem though is that C gives more rope than you might ever think
possible in some situations, such as for example, the chances of
dereferencing a nil pointer with poorly written code.

The worse problem though is when compiler writers, what I'll call
"Optimization Warrior Lawyers", start abusing any and every possible
instance of "Undefined Behaviour" to their advantage.

This is worse than ignoring Hoare's advice -- this is the very epitome
of premature optimization -- this is pure evil.

This is breaking otherwise readable and usable code.

I give you again my example:

> >   An excellent example are the data-flow optimizations that are now
> >   commonly abused to elide security/safety-sensitive code:
>
> > int
> > foo(struct bar *p)
> > {
> >         char *lp = p->s;
> >
> >         if (p == NULL || lp == NULL) {
> >                 return -1;
> >         }
>
> This code is, and always has been, broken; it is accessing p->s before
> it knows that p isn't nil.

How do you know for sure?  How does the compiler know?  Serious questions.

What if all calls to foo() are written as such:

if (p) foo(p);

I agree this might not be "fail-safe" code, or in any other way
advisable, but it was perfectly fine in the world before UB Optimization
Warriors, however today's "Standard C" gives compilers license to
replace "foo()" with a trap or call to "abort()", etc.

I.e. it takes a real "C Language Lawyer(tm)" to know that past certain
optimization levels the sequence points prevent this from happening.

In the past I could equally assume the optimizer would rewrite the first
bit of foo() as:

if (! p || ! p->s) return -1;

In 35 years of C programming I've never before had to pay such close
attention to such minute details.  I need tools now to audit old code
for such things, and my current experience to date suggests UBSan is not
up to this task -- i.e. runtime reports are useless (perhaps even with
high-code-coverage unit tests).
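
As a sketch of what such an audit would turn foo() into: the
dereference moves below the nil check, so there is no earlier access
from which an optimizer could infer that p is non-nil.  The struct
layout, the name foo_checked() and the 0 return on success are
assumptions, since the rest of the original function isn't shown.

```c
#include <stddef.h>

struct bar {
	char *s;	/* assumed member; only p->s appears in the example */
};

int
foo_checked(struct bar *p)
{
	char *lp;

	if (p == NULL)
		return -1;	/* test the pointer before it is ever used */

	lp = p->s;		/* p is known to be non-nil here */
	if (lp == NULL)
		return -1;

	return 0;		/* placeholder for the rest of the function */
}
```

Written this way the check survives regardless of how aggressively the
compiler reasons about undefined behaviour, at the cost of having to
find and rewrite every such function by hand.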

This is the main point of my original rant.  "Undefined Behaviour" as it
has been interpreted by Optimization Warriors has given us an unusable
language.

--
Greg A. Woods 

Kelowna, BC +1 250 762-7675   RoboHack 
Planix, Inc.  Avoncote Farms 




Re: NULL pointer arithmetic issues

2020-02-25 Thread Greg A. Woods
At Wed, 26 Feb 2020 00:12:49 -0500 (EST), Mouse wrote:
Subject: Re: NULL pointer arithmetic issues
>
> > This is the main point of my original rant.  "Undefined Behaviour" as
> > it has been interpreted by Optimization Warriors has given us an
> > unusable language.
>
> I'd say that it's given you unusable implementations of the language.
> The problem is not the language; it's the compiler(s).  (Well, unless
> you consider the language to be the problem because it's possible to
> implement it badly.  I don't.)

I don't think the C language (in all lower-case, un-quoted, plainly) is
the problem -- I think the problem is the wording of the modern
standard, and the unfortunate choice to use the phrase "undefined
behaviour" for certain things.  This has given "license" to optimization
warriors -- and their over-optimization is the root of the evil I see in
current compilers.  It is this unfortunate choice of describing things
as "undefined" within the language that has made modern "Standard C"
unusable (especially for any and all legacy code, which is most of it,
right?).

Suppose we outlawed the use of the phrase "undefined behaviour" and made
all instances of it into "implementation defined behaviour", with a very
specific caveat that such instances did not, would not, and could not
ever allow optimizers to even think of violating any possible
conceivable principle of least astonishment.

E.g. in the example I gave, the only thing allowed would be for the
implementation to do as it pleases IFF and when the pointer passed was
actually a nil pointer at runtime (and perhaps in this case with a
strong hint that the best and ideal behaviour would be something akin to
calling abort()).

--
Greg A. Woods 

Kelowna, BC +1 250 762-7675   RoboHack 
Planix, Inc.  Avoncote Farms 




Re: NULL pointer arithmetic issues

2020-03-10 Thread Greg A. Woods
At Mon, 9 Mar 2020 17:36:24 +0100, Joerg Sonnenberger  wrote:
Subject: Re: NULL pointer arithmetic issues
>
> I consider it as something even worse. Just like the case of passing
> NULL pointers to memcpy and friends with zero as size, this
> interpretation / restriction in the standard is actively harmful to some
> code for the sake of potential optimisation opportunities in other code.
> It seems to be a poor choice at that. I.e. it requires adding
> conditional branches for something that behaves sanely everywhere but
> maybe the DS9k.

Indeed.
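
The conditional branch Joerg describes looks something like the
following sketch.  memcpy_zero_ok() is a hypothetical wrapper name; the
only assumption is the standard's rule (C11 7.24.1) that pointer
arguments to the string functions must still be valid when the length
is zero.

```c
#include <stddef.h>
#include <string.h>

/*
 * Like memcpy(3), but never hands the pointers to memcpy() when there
 * is nothing to copy, so nil pointers are harmless for n == 0.
 * (Hypothetical wrapper, sketch only.)
 */
static void *
memcpy_zero_ok(void *dst, const void *src, size_t n)
{
	if (n == 0)
		return dst;	/* the extra branch the standard forces on us */

	return memcpy(dst, src, n);
}
```

On every implementation most of us will ever meet the branch changes
nothing observable; it exists only to keep the call within the letter
of the standard.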

I say the very existence of anything called "Undefined Behaviour" and
its exploitation by optimizers is evil.  (by definition, especially if
we accept as valid the claim that "Premature optimization is the root of
all evil in programming" -- this is of course a little bit of a stretch
since my claim could be twisted to say that any and all automatic
optimization by a compiler or toolchain is evil, but of course that's not
exactly my intent -- normal optimization which does not change the
behaviour and intent of the code is, IMO, OK, but defining "intent" is
obviously the problem)

So in Standard C all "Undefined Behaviour" should be changed to
"Implementation Defined" and it should be required that the
implementation is not allowed to abuse any such things for the evil
purpose of premature optimization.  For this kind of thing adding an
integer to a pointer (or the equivalent, e.g. taking the address of a
field in a struct pointed to by a nil pointer) should always do just
that, even if the pointer can be proven to be a nil pointer at compile
time.  It is wrong to do anything else, and absolute insanity to remove
any other code just because the compiler assumes SIGSEGV
would/should/could happen before the other code gets a chance to run.
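
For reference, the "address of a field through a nil pointer"
computation is the classic hand-rolled offsetof.  Here is a sketch of
the spelling that stays within today's standard, assuming a
hypothetical struct bar with two members: offsetof() from <stddef.h>
yields the same number without forming the pointer expression, and a
runtime helper can test for nil before taking the member's address.

```c
#include <stddef.h>

struct bar {		/* hypothetical layout for illustration */
	int id;
	char *s;
};

/* Byte offset of member s, without writing &((struct bar *)0)->s. */
static size_t
bar_s_offset(void)
{
	return offsetof(struct bar, s);
}

/* Address of p->s, or NULL when p itself is nil. */
static char **
bar_s_addr(struct bar *p)
{
	return (p == NULL) ? NULL : &p->s;
}
```

Neither helper does anything the nil-pointer idiom did not do on real
hardware; they only restate it in a form an optimizer has no licence
to reinterpret.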

--
Greg A. Woods 

Kelowna, BC +1 250 762-7675   RoboHack 
Planix, Inc.  Avoncote Farms 




So it seems "umount -f /nfs/mount" still doesn't work.....

2020-06-30 Thread Greg A. Woods
                                                 I    ?       0:00.01 rshd -L
   12 16008   913    0  85  0  7860 1140 kqueue  I    ?       0:00.01 pickup -l -t unix -u
    0 16131  1004    0  85  0  2620  644 select  I    ?       0:00.01 rshd -L
 1000 20090 15270    0  85  0 25056 7028 select  Is   ?       0:00.24 xterm -class UXTerm
 1000  1768   609    0  85  0 11708 1056 ttyraw  Is+  pts/1   0:00.09 -ksh
    0  2940  7584    0 117  0 17264  236 nfscn2  D+   pts/2   0:00.14 umount -f /future/build
 1000  4103  4007    0  85  0  4120 1064 pause   Is   pts/2   0:00.09 -ksh
    0  7584  4103    0  85  0  7656 1088 pause   I    pts/2   0:00.35 ksh
    0  6722 27639    0 127  0 16768  440 tstile  D+   pts/3   0:00.00 fstat /future/build
 1000 21172 14064  489  85  0  3728 1060 pause   Is   pts/3   0:00.09 -ksh
    0 27639 21172    0  85  0  9648 1048 pause   I    pts/3   0:00.08 ksh
 1000 13000 20090  722  85  0  3600 1056 ttyraw  Is+  pts/4   0:00.11 -ksh
    0  3707 19523 1057  85  0 11736 1044 pause   S    pts/5   0:00.08 ksh
    0  4176  3707 1057  43  0 12900  624 -       O+   pts/5   0:00.00 ps -alx
 1000 19523  1002 3550  85  0  3188 1056 pause   Ss   pts/5   0:00.09 -ksh
    0  1013     1  527  85  0  2660  652 ttyraw  Is+  ttyE0   0:00.08 -ksh
    0   822     1  126  85  0  2524  412 ttyraw  Is+  ttyE1   0:00.00 /usr/libexec/getty Ws ttyE1
    0   828     1  126  85  0  2524  416 ttyraw  Is+  ttyE2   0:00.00 /usr/libexec/getty Ws ttyE2
    0   957     1  126  85  0  2528  412 ttyraw  Is+  ttyE3   0:00.00 /usr/libexec/getty Ws ttyE3
    0   862     1  126  85  0  4188  408 ttyraw  Is+  ttyE4   0:00.00 /usr/libexec/getty Ws ttyE4
    0  1023     1  126  85  0  2524  424 ttyraw  Is+  ttyE5   0:00.00 /usr/libexec/getty Ws ttyE5
    0  1050     1  126  85  0  2524  428 ttyraw  Is+  ttyE6   0:00.00 /usr/libexec/getty Ws ttyE6
    0   668     1    0  85  0  2528  416 ttyraw  Is+  xencons 0:00.01 /usr/libexec/getty console constty
12:00 [1.61] # crash
Crash version 8.99.32, image version 8.99.32.
Output from a running system is unreliable.
crash> bt /t 0t2940
trace: pid 2940 lid 1 at 0xaf808a4748f0
sleepq_block() at sleepq_block+0xfd
kpause() at kpause+0xdf
nfs_reconnect() at nfs_reconnect+0x8b
nfs_request() at nfs_request+0xf3a
nfs_getattr() at nfs_getattr+0x175
VOP_GETATTR() at VOP_GETATTR+0x49
vn_stat() at vn_stat+0x3d
do_sys_statat() at do_sys_statat+0x97
sys___lstat50() at sys___lstat50+0x25
syscall() at syscall+0x9c
--- syscall (number 441) ---
43292a:
crash> bt /t 0t6722
trace: pid 6722 lid 1 at 0xaf808a488920
sleepq_block() at sleepq_block+0x99
turnstile_block() at turnstile_block+0x337
rw_vector_enter() at rw_vector_enter+0x169
genfs_lock() at genfs_lock+0x3c
VOP_LOCK() at VOP_LOCK+0x71
vn_lock() at vn_lock+0x90
nfs_root() at nfs_root+0x2b
lookup_once() at lookup_once+0x38e
namei_tryemulroot() at namei_tryemulroot+0x453
namei() at namei+0x29
fd_nameiat.isra.2() at fd_nameiat.isra.2+0x54
do_sys_statat() at do_sys_statat+0x87
sys___stat50() at sys___stat50+0x28
syscall() at syscall+0x9c
--- syscall (number 439) ---
43c94a:
crash> 


So it would seem that even though umount is trying to force an unmount
of an NFS mount, the kernel is first trying to reconnect to the server!


BTW, I have another system running a quite recent i386 build where
crash(8) is unable to do a backtrace:

# ktrace crash
Crash version 9.99.64, image version 9.99.64.
Kernel compiled without options LOCKDEBUG.
Output from a running system is unreliable.
crash> trace /t 0t4003
crash: kvm_read(0x4, 4): kvm_read: Bad address
trace: pid 4003 lid 4003
crash> 


-- 
Greg A. Woods 

Kelowna, BC +1 250 762-7675   RoboHack 
Planix, Inc.  Avoncote Farms 




Re: So it seems "umount -f /nfs/mount" still doesn't work.....

2020-06-30 Thread Greg A. Woods
At Tue, 30 Jun 2020 12:52:37 -0700, "Greg A. Woods"  wrote:
Subject: So it seems "umount -f /nfs/mount" still doesn't work.
> 

Curiously the kernel now does something I didn't quite expect when one
tries to reboot a system with a stuck mount.  I was able to see this as
I was running a kernel that verbosely documents all its shutdown
unmounts and detaches.  In prior times I had reached for the power switch.

At first it just hangs:

lilbit# reboot -q
[ 1131744.8297338] syncing disks... 3 3 done
[ 1131744.9797408] unmounting 0xc1f27000 /more/work (more.local:/work)...
[ 1131744.9907053] ok
[ 1131744.9907053] unmounting 0xc1f24000 /more/archive (more.local:/archive)...
[ 1131745.0004431] ok
[ 1131745.0004431] unmounting 0xc1f21000 /more/home (more.local:/home)...
[ 1131745.0097426] ok
[ 1131745.0097426] unmounting 0xc1f1f000 /once/build (once.local:/build)...
[ 1131745.0097426] ok
[ 1131745.0210854] unmounting 0xc1f1b000 /future/build (future.local:/build)...
[ 1131745.0210854] ok
[ 1131745.0304676] unmounting 0xc1f11000 /building/build (building.local:/build)...

   this is me hitting ^T to try to see what's going on 

[ 1131753.2800902] load: 0.52  cmd: reboot 7414 [fstcnt] 0.00u 0.16s 0% 424k
[ 1132107.6651517] load: 0.48  cmd: reboot 7414 [fstcnt] 0.00u 0.16s 0% 424k
[ 1133247.8436109] load: 0.48  cmd: reboot 7414 [fstcnt] 0.00u 0.16s 0% 424k

    then I hit ^C and immediately it proceeded 

^C[ 1133249.3636755] unmounting 0xc1f0f000 /proc (procfs)...
[ 1133249.3636755] ok
[ 1133249.3636755] unmounting 0xc1f0d000 /dev/pts (ptyfs)...
[ 1133249.3788641] unmounting 0xc1ecb000 /kern (kernfs)...
[ 1133249.3843127] ok
[ 1133249.3843127] unmounting 0xc1ec9000 /cache (/dev/wd1a)...
[ 1133249.7636916] ok
[ 1133249.7636916] unmounting 0xc1ec6000 /home (/dev/wd0g)...
[ 1133249.7736976] unmounting 0xc1dd7000 /usr/pkg (/dev/wd0f)...
[ 1133250.0737098] unmounting 0xc1ab1000 /var (/dev/wd0e)...
[ 1133250.1537121] unmounting 0xc1804000 / (/dev/wd0a)...
[ 1133251.0337515] unmounting 0xc1f11000 /building/build (building.local:/build)...
[ 1133251.0469644] unmounting 0xc1f0d000 /dev/pts (ptyfs)...
[ 1133251.0469644] unmounting 0xc1ec6000 /home (/dev/wd0g)...
[ 1133251.0579007] unmounting 0xc1dd7000 /usr/pkg (/dev/wd0f)...
[ 1133251.0637673] unmounting 0xc1ab1000 /var (/dev/wd0e)...
[ 1133251.0637673] unmounting 0xc1804000 / (/dev/wd0a)...
[ 1133251.0750403] sd0: detached
[ 1133251.0750403] scsibus0: detached
[ 1133251.0750403] gpio1: detached
[ 1133251.0853614] sysbeep0: detached
[ 1133251.0853614] midi0: detached
[ 1133251.0853614] wd1: detached
[ 1133251.0949369] uhub0: detached
[ 1133251.0949369] com1: detached
[ 1133251.0949369] usb0: detached
[ 1133251.1045456] gpio0: detached
[ 1133251.1045456] ohci0: detached
[ 1133251.1045456] pchb0: detached
[ 1133251.1151702] unmounting 0xc1f11000 /building/build (building.local:/build)...
[ 1133251.1151702] unmounting 0xc1f0d000 /dev/pts (ptyfs)...
[ 1133251.1279509] unmounting 0xc1ec6000 /home (/dev/wd0g)...
[ 1133251.1279509] unmounting 0xc1dd7000 /usr/pkg (/dev/wd0f)...
[ 1133251.1393918] unmounting 0xc1ab1000 /var (/dev/wd0e)...
[ 1133251.1448739] unmounting 0xc1804000 / (/dev/wd0a)...
[ 1133251.1448739] forcefully unmounting /building/build (building.local:/build)...
[ 1133251.1587138] forceful unmount of /building/build failed with error -3
[ 1133251.1653872] rebooting...


So it seems there's some contention between the internal attempt to
unmount the stuck NFS filesystem(s), and the reboot system call itself,
but if the reboot command is interrupted, then the kernel can get on
with its shutdown procedures, and eventually it actually forces the
unmount of the stuck NFS filesystem.

Another interesting thing to note is that /future/build was also stuck
as future.local is offline at this time.  However that's the filesystem
I tried to clear first by hand with "umount -f /future/build", but that
was stuck, apparently in the same call to nfs_reconnect().  It seems it
had done enough that when the reboot() triggered unmounting that it
could complete the unmount without problems.  (The other mounts on
more.local and once.local were responding so they unmounted normally.)

-- 
Greg A. Woods 

Kelowna, BC +1 250 762-7675   RoboHack 
Planix, Inc.  Avoncote Farms 




Re: So it seems "umount -f /nfs/mount" still doesn't work.....

2020-07-01 Thread Greg A. Woods
At Tue, 30 Jun 2020 14:28:38 -0700, "Greg A. Woods"  wrote:
Subject: Re: So it seems "umount -f /nfs/mount" still doesn't work.
>
> At Tue, 30 Jun 2020 12:52:37 -0700, "Greg A. Woods"  wrote:
> Subject: So it seems "umount -f /nfs/mount" still doesn't work.
> >
>

So, I should have mentioned that "umount -f nfs.server:/remotefs" does
work (i.e. it does not hang waiting for the server to reconnect, and
provided that there are no processes with cwd or open files on the
remote filesystem, it can unmount the filesystem).

I.e. the problem is in how umount(8) looks up the parameters of the
mount point.  If it looks at the mount point it hangs, but if it looks
through the mount table, it works.
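
To make the distinction concrete, here is a minimal sketch of a lookup
that consults only an in-memory copy of the mount table.  Everything in
it is hypothetical (struct mount_entry, find_mount()); it is not
umount(8)'s code.  The point is that a pure string comparison never
calls stat(2) or lstat(2) on the mount point, so it cannot end up
waiting in nfs_reconnect() the way the backtrace earlier in this thread
did.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified view of one mount table entry. */
struct mount_entry {
	const char *from;	/* e.g. "future.local:/build" */
	const char *on;		/* e.g. "/future/build" */
};

/*
 * Find an entry by either its "from" name or its mount point using
 * only string comparison; the filesystem itself is never touched.
 * Returns NULL when nothing matches.
 */
static const struct mount_entry *
find_mount(const struct mount_entry *tab, size_t n, const char *name)
{
	for (size_t i = 0; i < n; i++) {
		if (strcmp(tab[i].from, name) == 0 ||
		    strcmp(tab[i].on, name) == 0)
			return &tab[i];
	}
	return NULL;
}
```

A real implementation would still have to obtain the table from the
kernel and cope with names that are spelled differently from the
entries, which this sketch ignores.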

--
Greg A. Woods 

Kelowna, BC +1 250 762-7675   RoboHack 
Planix, Inc.  Avoncote Farms 




USB storage transfers halt when usbdevs is run: hardware bug or software bug?

2020-07-05 Thread Greg A. Woods
USB storage device transfers freeze when usbdevs is run:  hardware bug
or software bug?

While I was doing a "gzcat < *.gz > /dev/rsd2d", where sd2 was a USB
memory stick, I happened to run "usbdevs -dv" and the writes to the USB
device froze, and indeed the writing process was stuck in the kernel (I
couldn't even stop it with ^Z).

Luckily yanking the stick out seemed to unfreeze and kill the process
and clean everything up nicely and I was able to re-insert it and re-do
the write to it without incident.

This is on an amd64 server running 9.99.64.

Upon removal and subsequent re-insertion the kernel said the following
(but was silent before this when usbdevs ran):

[ 193334.306434] umass0: BBB reset failed, IOERROR
[ 193334.306434] umass0: BBB bulk-in clear stall failed, IOERROR
[ 193334.318288] umass0: BBB bulk-out clear stall failed, IOERROR
[ 193334.318288] umass0: BBB reset failed, IOERROR
[ 193334.329223] umass0: BBB bulk-in clear stall failed, IOERROR
[ 193334.329223] umass0: BBB bulk-out clear stall failed, IOERROR
[ 193334.341024] umass0: BBB reset failed, IOERROR
[ 193334.341024] umass0: BBB bulk-in clear stall failed, IOERROR
[ 193334.351781] umass0: BBB bulk-out clear stall failed, IOERROR
[ 193334.357775] sd2d: error writing fsbn 4053632 of 4053632-4053759 (sd2 bn 4053632; cn 4021 tn 7 sn 23)
[ 193334.366963] umass0: BBB reset failed, IOERROR
[ 193334.366963] umass0: BBB bulk-in clear stall failed, IOERROR
[ 193334.378283] umass0: BBB bulk-out clear stall failed, IOERROR
[ 193334.378283] umass0: BBB reset failed, IOERROR
[ 193334.389225] umass0: BBB bulk-in clear stall failed, IOERROR
[ 193334.389225] umass0: BBB bulk-out clear stall failed, IOERROR
[ 193334.401026] umass0: BBB reset failed, IOERROR
[ 193334.401026] umass0: BBB bulk-in clear stall failed, IOERROR
[ 193334.411782] umass0: BBB bulk-out clear stall failed, IOERROR
[ 193334.417780] umass0: BBB reset failed, IOERROR
[ 193334.417780] sd2(umass0:0:0:0): generic HBA error
[ 193334.426444] sd2: detached
[ 193334.426444] scsibus1: detached
[ 193334.426444] umass0: detached
[ 193334.436445] umass0: at uhub6 port 2 (addr 5) disconnected

reinsertion:

[ 193341.516925] umass0 at uhub6 port 2 configuration 1 interface 0
[ 193341.516925] umass0: SMI Corporation (0x090c) USB DISK (0x1000), rev 
2.00/11.00, addr 5
[ 193341.526926] umass0: using SCSI over Bulk-Only
[ 193341.526926] scsibus1 at umass0: 2 targets, 1 lun per target
[ 193342.366983] sd2 at scsibus1 target 0 lun 0:  disk 
removable
[ 193342.376985] sd2: 7712 MB, 15744 cyl, 16 head, 63 sec, 512 bytes/sect x 
15794176 sectors
[ 193342.386986] sd2: GPT GUID: d1e3490c-b0e6-42e9-9d9e-3ac286a0f7e0
[ 193342.396989] dk6 at sd2: "EFI system", 262144 blocks at 2048, type: msdos
[ 193342.396989] dk7 at sd2: "d3aa0396-d911-4aac-baa8-f2478557d31a", 7544832 
blocks at 264192, type: ffs


I'm guessing it's a software bug with bad locking order somewhere.

--
Greg A. Woods 

Kelowna, BC +1 250 762-7675   RoboHack 
Planix, Inc.  Avoncote Farms 




Re: style change: explicitly permit braces for single statements

2020-07-12 Thread Greg A. Woods
At Sun, 12 Jul 2020 10:01:36 +1000, Luke Mewburn  wrote:
Subject: style change: explicitly permit braces for single statements
>
> I propose that the NetBSD C style guide in /usr/share/misc/style
> is reworded to more explicitly permit braces around single statements,
> instead of the current discouragement.
>
> IMHO, permitting braces to be consistently used:
> - Adds to clarity of intent.
> - Aids code review.
> - Avoids gotofail: 
> https://en.wikipedia.org/wiki/Unreachable_code#goto_fail_bug

Well, if you s/permit/require/g, I strongly concur (with possibly one
tiny exception allowed in rare cases -- when there's no newline).

Personally I don't think there's any good excuse for not always putting
braces around all single-statement blocks.  The only bad excuse is that
the language doesn't strictly require them.  People are lazy, I get that
(I am too), but in my opinion C is just not really safe without them.
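
For illustration only, here is a minimal sketch of the "goto fail"
pattern referenced in the quoted proposal.  The function names and the
error values are hypothetical, not taken from any real code base.  The
same duplicated line is a silent bug without braces and harmless dead
code with them:

```c
/*
 * Hypothetical two-step check, written without braces.  The second
 * "goto fail" is indented as if it belonged to the if, but it does not:
 * it always runs, with err still 0, so the second check is skipped.
 */
int
verify_unbraced(int step1_err, int step2_err)
{
	int err;

	if ((err = step1_err) != 0)
		goto fail;
		goto fail;		/* stray duplicate: unconditional */
	if ((err = step2_err) != 0)	/* never reached */
		goto fail;
fail:
	return err;
}

/*
 * The same check with braces.  The same stray duplicate now sits inside
 * the block, where it is merely unreachable.
 */
int
verify_braced(int step1_err, int step2_err)
{
	int err;

	if ((err = step1_err) != 0) {
		goto fail;
		goto fail;		/* stray duplicate: harmless here */
	}
	if ((err = step2_err) != 0) {
		goto fail;
	}
fail:
	return err;
}
```

With a failing second step, e.g. verify_unbraced(0, -1), the unbraced
version reports success while the braced version reports the error.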

--
    Greg A. Woods 

Kelowna, BC +1 250 762-7675   RoboHack 
Planix, Inc.  Avoncote Farms 




Re: style change: explicitly permit braces for single statements

2020-07-13 Thread Greg A. Woods
At Mon, 13 Jul 2020 09:48:07 -0400 (EDT), Mouse wrote:
Subject: Re: style change: explicitly permit braces for single statements
>
> Slavishly always
> adding them makes it difficult to keep code from walking into the right
> margin:

These days one really should consider the right margin to be a virtual
concept -- there's really no valid reason not to have and use horizontal
scrolling (any code editor I'll ever use can do it on any display), and
even most any small-ish laptop can have a nice readable font at 50x132,
or even 50x160.  (i.e. that's another style guide rule that should die)

--
        Greg A. Woods 

Kelowna, BC +1 250 762-7675   RoboHack 
Planix, Inc.  Avoncote Farms 




python3.7 rebuild stuck in kernel in "entropy" during an "import" statement

2021-03-30 Thread Greg A. Woods
So I've been running a pkg-rolling_replace and one of the packages being
rebuilt is python3.7, and it has got stuck, apparently on an "entropy"
wait in the kernel, and it's been in this state for over 24hrs as you
can see.

The only things the process has open appear to be its stdio descriptors,
two of which are open on the log file I was directing all output to.

This is on a Xen domU of a machine running:

$ uname -a
NetBSD xentastic 9.99.81 NetBSD 9.99.81 (XEN3_DOM0) #1: Tue Mar 23 14:39:55 PDT 
2021  
woods@xentastic:/build/woods/xentastic/current-amd64-amd64-obj/build/src/sys/arch/amd64/compile/XEN3_DOM0
 amd64


09:51 [504] $ ps -lwwp 19875
UID   PID  PPID CPU PRI NI   VSZ   RSS WCHAN   STAT TTY      TIME COMMAND
  0 19875 11551   0  85  0 55412 11324 entropy I    pts/0 0:00.27 ./python -E 
-Wi 
/var/package-obj/root/lang/python37/work/.destdir/usr/pkg/lib/python3.7/compileall.py
 -d /usr/pkg/lib/python3.7 -f -x 
bad_coding|badsyntax|site-packages|lib2to3/tests/data 
/var/package-obj/root/lang/python37/work/.destdir/usr/pkg/lib/python3.7
09:51 [505] $ ps -uwwp 19875
USER   PID %CPU %MEM   VSZ   RSS TTY   STAT STARTED    TIME COMMAND
root 19875  0.0  0.1 55412 11324 pts/0 I 9:09PM 0:00.27 ./python -E -Wi 
/var/package-obj/root/lang/python37/work/.destdir/usr/pkg/lib/python3.7/compileall.py
 -d /usr/pkg/lib/python3.7 -f -x 
bad_coding|badsyntax|site-packages|lib2to3/tests/data 
/var/package-obj/root/lang/python37/work/.destdir/usr/pkg/lib/python3.7
09:51 [506] $ fstat -p 19875
USER CMD      PID   FD MOUNT        INUM MODE          SZ|DV R/W
root python 19875   wd /build   10645634 drwxr-xr-x     1024 r
root python 19875    0 /dev/pts        3 crw---        pts/0 rw
root python 19875    1 /build    3721223 -rw-r--r-- 28287492 w
root python 19875    2 /build    3721223 -rw-r--r-- 28287492 w
09:51 [507] $ find /build -inum 3721223
/build/packages/root/pkg_roll.out
09:51 [508] $


It was killable -- I sent SIGINT from the tty and it died as expected.


Running "make replace" gets it stuck in the same place again, and the
SIGINT shows the following stack trace:

PYTHONPATH=/var/package-obj/root/lang/python37/work/.destdir/usr/pkg/lib/python3.7
  LD_LIBRARY_PATH=/build/package-obj/root/lang/python37/work/Python-3.7.1  
./python -E -Wi 
/var/package-obj/root/lang/python37/work/.destdir/usr/pkg/lib/python3.7/compileall.py
  -d /usr/pkg/lib/python3.7 -f  -x 
'bad_coding|badsyntax|site-packages|lib2to3/tests/data'  
/var/package-obj/root/lang/python37/work/.destdir/usr/pkg/lib/python3.7
^T
[ 563859.5589422] load: 0.39  cmd: make 15726 [wait] 0.23u 0.07s 0% 9184k
make: Working in: /build/package-obj/root/lang/python37/work/Python-3.7.1
make[1]: Working in: /work/woods/m-NetBSD-pkgsrc-current/lang/python37
make: Working in: /work/woods/m-NetBSD-pkgsrc-current/lang/python37
^T
[ 563866.4606073] load: 0.36  cmd: make 15726 [wait] 0.23u 0.07s 0% 9184k
make: Working in: /work/woods/m-NetBSD-pkgsrc-current/lang/python37
make: Working in: /build/package-obj/root/lang/python37/work/Python-3.7.1
make[1]: Working in: /work/woods/m-NetBSD-pkgsrc-current/lang/python37
^?Traceback (most recent call last):
  File "/var/package-obj/root/lang/python37/work/.destdir/usr/pkg/lib/python3.7/compileall.py", line 20, in <module>
    from concurrent.futures import ProcessPoolExecutor
  File "<frozen importlib._bootstrap>", line 1032, in _handle_fromlist
  File "/build/package-obj/root/lang/python37/work/.destdir/usr/pkg/lib/python3.7/concurrent/futures/__init__.py", line 43, in __getattr__
    from .process import ProcessPoolExecutor as pe
  File "/build/package-obj/root/lang/python37/work/.destdir/usr/pkg/lib/python3.7/concurrent/futures/process.py", line 53, in <module>
    import multiprocessing as mp
  File "/build/package-obj/root/lang/python37/work/.destdir/usr/pkg/lib/python3.7/multiprocessing/__init__.py", line 16, in <module>
    from . import context
  File "/build/package-obj/root/lang/python37/work/.destdir/usr/pkg/lib/python3.7/multiprocessing/context.py", line 5, in <module>
    from . import process
  File "/build/package-obj/root/lang/python37/work/.destdir/usr/pkg/lib/python3.7/multiprocessing/process.py", line 363, in <module>
    _current_process = _MainProcess()
  File "/build/package-obj/root/lang/python37/work/.destdir/usr/pkg/lib/python3.7/multiprocessing/process.py", line 347, in __init__
    self._config = {'authkey': AuthenticationString(os.urandom(32)),
KeyboardInterrupt
*** Error code 1 (ignored)
*** Signal 2
*** Signal 2
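
The traceback above ends in os.urandom(32).  For anyone wanting to
probe for the same condition from C without getting stuck, a minimal
sketch might look like the following.  It assumes a libc that provides
getrandom(2) and GRND_NONBLOCK in <sys/random.h>; the helper name
try_random_bytes() is made up for this example, and I have not checked
which interface Python itself uses here.

```c
#include <sys/types.h>
#include <sys/random.h>

#include <errno.h>
#include <stddef.h>

/*
 * Hypothetical helper: ask for len random bytes without sleeping.
 * Intended for small requests (assumption: len <= 256, for which a
 * short read is not expected once the generator is ready).
 *
 * Returns  1 if buf was filled,
 *          0 if the kernel would have blocked (EAGAIN),
 *         -1 on any other failure, including a short read.
 */
int
try_random_bytes(void *buf, size_t len)
{
	ssize_t n = getrandom(buf, len, GRND_NONBLOCK);

	if (n == (ssize_t)len)
		return 1;
	if (n == -1 && errno == EAGAIN)
		return 0;
	return -1;
}
```

A return of 0 would indicate that the kernel is not yet willing to hand
out random bytes without putting the caller to sleep.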



--
Greg A. Woods 

Kelowna, BC +1 250 762-7675   RoboHack 
Planix, Inc.  Avoncote Farms 




nothing contributing entropy in Xen domUs? (causing python3.7 rebuild to get stuck in kernel in "entropy" during an "import" statement)

2021-03-30 Thread Greg A. Woods
) blocking due to lack of entropy
[ 563844.834413] entropy: pid 7903 (python) blocking due to lack of entropy
[ 566365.511377] entropy: pid 9001 (python) blocking due to lack of entropy
[ 577473.897830] entropy: pid 9350 (python) blocking due to lack of entropy
[ 579179.381600] entropy: pid 25728 (od) blocking due to lack of entropy
[ 579186.994440] entropy: pid 11107 (cat) blocking due to lack of entropy
[ 579202.264290] entropy: pid 7248 (cat) blocking due to lack of entropy
[ 579669.831978] entropy: ready


--
        Greg A. Woods 

Kelowna, BC +1 250 762-7675   RoboHack 
Planix, Inc.  Avoncote Farms 


At Tue, 30 Mar 2021 10:06:19 -0700, "Greg A. Woods"  wrote:
Subject: python3.7 rebuild stuck in kernel in "entropy" during an "import" 
statement
> [...]

Re: nothing contributing entropy in Xen domUs? (causing python3.7 rebuild to get stuck in kernel in "entropy" during an "import" statement)

2021-03-30 Thread Greg A. Woods
At Tue, 30 Mar 2021 23:53:43 +0200, Manuel Bouyer wrote:
Subject: Re: nothing contributing entropy in Xen domUs?  (causing python3.7 
rebuild to get stuck in kernel in "entropy" during an "import" statement)
>
> On Tue, Mar 30, 2021 at 02:40:18PM -0700, Greg A. Woods wrote:
> > [...]
> >
> > Perhaps the answer is that nothing seems to be contributing anything to
> > the entropy pool.  No matter what device I exercise, none of the numbers
> > in the following changes:
>
> yes, it's been this way since the rnd rototill. Virtual devices are
> not trusted.
>
> The only way is to manually seed the pool.

Ah, so that is definitely not what I expected!

Previously wasn't it up to the local admin what to trust?  I guess
throwing bits into /dev/random is one way to play that game, but

I have to trust the dom0 implicitly and utterly anyway, so why not trust
the devices it presents?

This is especially true for xbd block devices.  All my blocks are belong
to dom0.

The network device is in effect no different than if it were real
hardware, so if I want to trust network traffic, then I should be able
to enable it, just as I could if it were real hardware.

The CPUs are also probably the least "virtual" things in Xen, so why not
trust them?  (Though I'm not sure I understand what entropy they can
offer in the first place.)

Finally, if the system isn't actually collecting entropy from a device,
then why the heck does it allow me to think it is (i.e. by allowing me
to enable it and show it as enabled and collecting via "rndctl -l")?

--
Greg A. Woods 

Kelowna, BC +1 250 762-7675   RoboHack 
Planix, Inc.  Avoncote Farms 




Re: nothing contributing entropy in Xen domUs? (causing python3.7 rebuild to get stuck in kernel in "entropy" during an "import" statement)

2021-03-30 Thread Greg A. Woods
[[ sorry I've not been catching up on mailing list discussions as fast
as I had hoped to, and I'm way behind on following the entropy rototill. ]]

At Wed, 31 Mar 2021 00:12:31 +0000, Taylor R Campbell wrote:
Subject: Re: nothing contributing entropy in Xen domUs?  (causing python3.7 
rebuild to get stuck in kernel in "entropy" during an "import" statement)
>
> This is false.  If the VM host provided a viornd(4) device then NetBSD
> would automatically collect, and count, entropy from the host, with no
> manual intervention.

I'll leave that idea to others more up-to-date on Xen PV drivers to
respond to.  Booting a -current GENERIC kernel (which has both Xen PV
and virtio(4) devices configured into it) in a "type='pvh'" domU only
attaches the xenbus PV devices, no virtio devices, so adding virtio
might be a much bigger task that will need further support on
at least the backend, and perhaps on the front-end too, especially to do
it without QEMU.  I haven't tried if virtio devices show up in an HVM
domU precisely because I'm trying to avoid having to run and rely on
QEMU (never mind any performance implications of HVM).

> > Finally, if the system isn't actually collecting entropy from a device,
> > then why the heck does it allow me to think it is (i.e. by allowing me
> > to enable it and show it as enabled and collecting via "rndctl -l")?
>
> The system does collect samples from all those devices.  However, they
> are not designed to be unpredictable and there is no good reliable
> model for just how unpredictable they are, so the system doesn't
> _count_ anything from them.  See https://man.NetBSD.org/entropy.4 for
> a high-level overview.

I'm not sure the word "count" appears in entropy(4) in any context I can
make sense of w.r.t. what it means to "collect" but not "count"
entropy from those devices.

Worse, the "Flags" shown by "rndctl -l" don't seem to be directly
documented (i.e. they're not described in rndctl(8)), and even on a
kernel running on real hardware I don't see the word "count" showing
there.

After looking at the source I'm not sure the descriptions of the
RND_FLAG_* values in rnd(4) help me much either.

Based on my vague understanding of all of this, perhaps you meant to say
"estimate", instead of "count"?  That would make more sense in the
context of what I read in rnd(4) and rndctl(8), though "estimate" still
seems a little vague in meaning to me.

In any case, I don't see why an xbd disk, or a xennet interface, can't
be treated exactly as if they were real hardware (i.e. in terms of
extracting entropy from their behaviour).  This is exactly what
virtualization is all about to me -- even for paravirtualization.  After
all in a threat-free world (i.e. specifically where I also trust other
domUs) their entropy is going to reflect (though maybe not exactly
mirror) the entropy of the underlying hardware and/or network traffic.
So (but maybe not by default) if I as the admin want to trust the
entropy available from an xbd(4) or xennet(4) device, then I should be
able to enable it with rndctl(8) and have it "count".
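
If I understand the intended distinction correctly (and I may not),
"collecting" means every sample is mixed into the pool, while
"counting" means crediting bits toward the amount of entropy the system
believes it has.  Here is a toy sketch of that reading.  It is not the
kernel's code; every name in it is hypothetical, and the mixing step is
a trivial, non-cryptographic stand-in.

```c
#include <stdint.h>

/* Toy stand-in for an entropy pool. */
struct toy_pool {
	uint64_t state;		/* pretend pool contents */
	unsigned credited;	/* bits of entropy actually counted */
	unsigned long samples;	/* number of samples mixed in */
};

/*
 * Enter one sample from a source.  The sample is always collected
 * (mixed into the state), but its claimed bits are only counted when
 * the source is one the administrator has chosen to trust.
 */
void
toy_enter(struct toy_pool *p, uint64_t sample, unsigned claimed_bits,
    int trusted)
{
	/* xor-and-multiply mix, for illustration only */
	p->state = (p->state ^ sample) * 1099511628211ULL;
	p->samples++;
	if (trusted)
		p->credited += claimed_bits;
}

/* Would a reader needing "needed" bits be allowed to proceed? */
int
toy_ready(const struct toy_pool *p, unsigned needed)
{
	return p->credited >= needed;
}
```

In this model a million samples from an untrusted source change the
state a million times yet leave the credited total at zero, which is
the situation "rndctl -l" seems to be describing for my devices.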

More importantly though the system shouldn't mislead me into thinking it
is "counting" entropy from a device when it is actually not.  If I had
seen that there were no sources estimating/counting/whatever entropy,
and I tried to enable one and was given a nice error message about this
not being possible, then I would have looked elsewhere to find out how
to give the system more bits of entropy.  As is in my Xen domU system
the output of "rndctl -l" leads me to believe all of my devices are
collecting both timing and value samples, and using either one or the
other to gather entropy (though with '-v' I don't see that any bits of
entropy have been added from any of those many millions of collected
samples).

--
Greg A. Woods 

Kelowna, BC +1 250 762-7675   RoboHack 
Planix, Inc.  Avoncote Farms 




Re: nothing contributing entropy in Xen domUs? or dom0!!!

2021-03-31 Thread Greg A. Woods
Vendor ID: "GenuineIntel"; CPUID level 11

Intel-specific functions:
Version 000206c2:
Type 0 - Original OEM
Family 6 - Pentium Pro
Model 12 -
Stepping 2
Reserved 8

Extended brand string: "Intel(R) Xeon(R) CPU   E5645  @ 2.40GHz"
CLFLUSH instruction cache line size: 8
Initial APIC ID: 34
Hyper threading siblings: 32

Feature flags 1fc9cbf5:
FPU    Floating Point Unit
DE     Debugging Extensions
TSC    Time Stamp Counter
MSR    Model Specific Registers
PAE    Physical Address Extension
MCE    Machine Check Exception
CX8    COMPXCHG8B Instruction
APIC   On-chip Advanced Programmable Interrupt Controller present and enabled
SEP    Fast System Call
MCA    Machine Check Architecture
CMOV   Conditional Move and Compare Instructions
FGPAT  Page Attribute Table
CLFSH  CFLUSH instruction
ACPI   Thermal Monitor and Clock Ctrl
MMX    MMX instruction set
FXSR   Fast FP/MMX Streaming SIMD Extensions save/restore
SSE    Streaming SIMD Extensions instruction set
SSE2   SSE2 extensions
SS     Self Snoop
HT     Hyper Threading

TLB and cache info:
5a: unknown TLB/cache descriptor
03: Data TLB: 4KB pages, 4-way set assoc, 64 entries
55: unknown TLB/cache descriptor
ff: unknown TLB/cache descriptor
b2: unknown TLB/cache descriptor
f0: unknown TLB/cache descriptor
ca: unknown TLB/cache descriptor
Processor serial: 0002-06C2----


I noted today though that entropy doesn't seem to be accumulating even
in the dom0 despite there being many useful sources configured to both
collect and "estimate" _and_ despite the fact there's a valid-looking
$random_file that was saved and reloaded by /etc/rc.d/random_seed (and
saved again every day by /etc/security):

# /etc/rc.d/random_seed rcvar
# random_seed
random_seed=YES
# ls -l /etc/entropy-file
-rw---  1 root  wheel  536 Mar 31 04:15 /etc/entropy-file
# rndctl -l
Source           Bits Type  Flags
ipmi0-Temp          0 env   estimate, collect, v, t, dv, dt
ipmi0-Temp1         0 env   estimate, collect, v, t, dv, dt
ipmi0-Temp2         0 env   estimate, collect, v, t, dv, dt
ipmi0-Temp3         0 env   estimate, collect, v, t, dv, dt
ipmi0-Ambient-T     0 env   estimate, collect, v, t, dv, dt
ipmi0-Planar-Te     0 env   estimate, collect, v, t, dv, dt
ipmi0-FAN-MOD-1     0 env   estimate, collect, v, t, dv, dt
ipmi0-FAN-MOD-1     0 env   estimate, collect, v, t, dv, dt
ipmi0-FAN-MOD-2     0 env   estimate, collect, v, t, dv, dt
ipmi0-FAN-MOD-2     0 env   estimate, collect, v, t, dv, dt
ipmi0-FAN-MOD-3     0 env   estimate, collect, v, t, dv, dt
ipmi0-FAN-MOD-3     0 env   estimate, collect, v, t, dv, dt
ipmi0-FAN-MOD-4     0 env   estimate, collect, v, t, dv, dt
ipmi0-Status        0 ???   estimate, collect, t, dt
ipmi0-Voltage       0 power estimate, collect, v, t, dv, dt
ipmi0-Voltage1      0 power estimate, collect, v, t, dv, dt
ipmi0-Status1       0 ???   estimate, collect, t, dt
ipmi0-Intrusion     0 ???   estimate, collect, t, dt
ipmi0-Temp4         0 env   estimate, collect, v, t, dv, dt
ipmi0-Temp5         0 env   estimate, collect, v, t, dv, dt
ipmi0-Temp6         0 env   estimate, collect, v, t, dv, dt
ipmi0-FAN-MOD-4     0 env   estimate, collect, v, t, dv, dt
ipmi0-FAN-MOD-5     0 env   estimate, collect, v, t, dv, dt
ipmi0-FAN-MOD-5     0 env   estimate, collect, v, t, dv, dt
ipmi0-Ambient-T     0 env   estimate, collect, v, t, dv, dt
ipmi0-Ambient-T     0 env   estimate, collect, v, t, dv, dt
ums0                0 tty   estimate, collect, v, t, dt
ukbd0               0 tty   estimate, collect, v, t, dt
/dev/random         0 ???   estimate, collect, v
sd2                 0 disk  estimate, collect, v, t, dt
sd1                 0 disk  estimate, collect, v, t, dt
sd0                 0 disk  estimate, collect, v, t, dt
cpu0                0 vm    estimate, collect, v, t, dv
hardclock           0 skew  estimate, collect, t
pckbd0              0 tty   estimate, collect, v, t, dt
system-power        0 power estimate, collect, v, t, dt
autoconf            0 ???   estimate, collect, t
seed                0 ???   estimate, collect, v
# sysctl kern.entropy
kern.entropy.collection = 1
kern.entropy.depletion = 0
kern.entropy.consolidate = -23552
kern.entropy.gather = -23552
kern.entropy.needed = 256
kern.entropy.pending = 0
kern.entropy.epoch = 19

--
Greg A. Woods 

Kelowna, BC +1 250 762-7675   RoboHack 
Planix, Inc.  Avoncote Farms 




Re: nothing contributing entropy in Xen domUs? or dom0!!!

2021-03-31 Thread Greg A. Woods
At Thu, 1 Apr 2021 04:13:59 +0000 (UTC), RVP wrote:
Subject: Re: nothing contributing entropy in Xen domUs?  or dom0!!!
>
> Does this /etc/entropy-file match what's there in your /boot.cfg?
>
> On my laptop $random_file is left at the default which is:
> /var/db/entropy-file

Yes I did change that as well (as /var isn't part of the root partition).

However that's not the problem for the dom0.

"rndseed" isn't currently used (at least not by me or any documentation
I'm aware of) when loading (multibooting) a Xen kernel and a NetBSD dom0
kernel.

/etc/rc.d/random_seed will do this (again) later anyway.

However, since, as I showed, the hardware doesn't seem to be providing
entropy that can be "counted" ("estimated"), there's nothing to save,
and so nothing to load on the next boot either.

I know how to seed it -- but that's not the problem -- the hardware
should be providing plenty of entropy.

--
Greg A. Woods 

Kelowna, BC +1 250 762-7675   RoboHack 
Planix, Inc.  Avoncote Farms 




Re: UVM behavior under memory pressure

2021-04-01 Thread Greg A. Woods
At Thu, 1 Apr 2021 21:03:37 +0200, Manuel Bouyer  wrote:
Subject: UVM behavior under memory pressure
>
> Of course the system is very slow
> Shouldn't UVM choose, in this case, to reclaim pages from the file cache
> for the process data ?
> I'm using the default vm.* sysctl values.

I almost never use the default vm.* values.

I would guess the main problem for your system's memory requirements, at
the time you showed it, is that the default for vm.anonmin is way too
low and so raising vm.anonmin might help.  If vm.anonmin isn't high
enough then the pager won't sacrifice other requirements already in play
for anon pages.

Lowering vm.filemax (and maybe also vm.filemin) might also help since
your system, at that time, appeared to be doing far less I/O on large
numbers of files than, say, a web server or a compile server might be
doing.  However with almost 3G dedicated to the file cache it would seem
your system did recently trawl through a lot of file data, and so with a
lower vm.filemax less of it would have been kept as pressure for other
types of memory increased.

Here are the values I use, with comments about why, from my default
/etc/sysctl.conf.  These have worked reasonably well for me for years,
though I did have a virtual machine struggle to do some builds when I
ran too many make jobs in parallel and then a gargantuan compiler job
came along and needed too much memory.  However there was enough swap
and eventually it thrashed its way through, and more importantly I was
still able to run commands, albeit slowly, and my one large interactive
process (emacs), sometimes took quite a while to wake up and respond.

# N.B.:  On a live system make sure to order changes to these values so that you
# always lower any values from their default first, and then raise any that are
# to be raised above their defaults.  This way, the sum of the minimums will
# stay within the 95% limit.

# the minimum percentage of memory always (made) available for the
# file data cache
#
# The default is 10, which is much too high, even for a large-memory
# system...
#
vm.filemin=5

# the maximum percentage of memory that will be reclaimed from other uses for
# file data cache
#
# The default is 50, which may be too high for small-memory systems but may be
# about right for large-memory systems...
#
#vm.filemax=25

# the minimum percentage of memory always (made) available for anonymous pages
#
# The default is 10, which is way too low...
#
vm.anonmin=40

# the maximum percentage of memory that will be reclaimed from other uses for
# anonymous pages
#
# The default is 80, which seems just about right, but then again it's unlikely
# that the majority of inactive anonymous pages will ever be reactivated so
# maybe this should be lowered?
#
#vm.anonmax=80

# the minimum percentage of memory always (made) available for text pages
#
# The default is 5, which may be far too low on small-RAM systems...
#
vm.execmin=20

# the maximum percentage of memory that will be reclaimed from other uses for
# text pages
#
# The default is 30, which may be too low, esp. for big programs on small-memory
# systems...
#
vm.execmax=40

# It may also be useful to set the bufmem high-water limit to a number which may
# actually be less than 5% (vm.bufcache / options BUFCACHE) on large-memory
# systems (as BUFCACHE cannot be set below 5%).
#
# note this value is given in bytes.
#
#vm.bufmem_hiwater=
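
The N.B. at the top of the excerpt above (lower first, then raise) can
be sketched in C as follows.  This is only an illustration of the
ordering rule: the struct and function names are hypothetical, and
nothing here actually calls sysctl.

```c
#include <stddef.h>

/* One pending change to a vm.*min percentage (hypothetical type). */
struct knob_change {
	const char *name;	/* e.g. "vm.anonmin" */
	int cur;		/* current percentage */
	int want;		/* desired percentage */
};

/*
 * Reorder in place so that every decrease comes before every other
 * change, keeping the relative order otherwise (a stable partition).
 */
void
order_changes(struct knob_change *c, size_t n)
{
	size_t next = 0;	/* where the next decrease belongs */

	for (size_t i = 0; i < n; i++) {
		if (c[i].want < c[i].cur) {
			struct knob_change tmp = c[i];

			for (size_t j = i; j > next; j--)
				c[j] = c[j - 1];
			c[next++] = tmp;
		}
	}
}

/*
 * Apply the changes one at a time, in array order, and return the
 * highest sum of percentages seen at any point, including the start.
 */
int
peak_sum(const struct knob_change *c, size_t n)
{
	int sum = 0, peak;

	for (size_t i = 0; i < n; i++)
		sum += c[i].cur;
	peak = sum;
	for (size_t i = 0; i < n; i++) {
		sum += c[i].want - c[i].cur;
		if (sum > peak)
			peak = sum;
	}
	return peak;
}
```

For example, going from filemin=5, anonmin=70, execmin=20 to
filemin=35, anonmin=40, execmin=20 peaks at 125% if filemin is raised
first, but never exceeds the starting 95% once the changes are ordered
with order_changes().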


--
    Greg A. Woods 

Kelowna, BC +1 250 762-7675   RoboHack 
Planix, Inc.  Avoncote Farms 




Re: UVM behavior under memory pressure

2021-04-01 Thread Greg A. Woods
At Thu, 1 Apr 2021 23:15:42 +0200, Manuel Bouyer  wrote:
Subject: Re: UVM behavior under memory pressure
>
> Yes, I understand this. But, in an emergency situation like this one (there
> is no free ram, swap is full, openscad eventually gets killed),
> I would expect the pager to reclaim pages where it can;
> like file cache (down to vm.filemin, I agree it shouldn't go down to 0).
>
> In my case, vm.anonmax is at 80%, and I suspect it was not reached
> (I tried to increase it to 90% but this didn't change anything).

As I understand things there's no point to increasing any vm.*max value
unless it is already way too low and you want more memory to be used for
that category and there's not already more use in other categories
(i.e. where a competing vm.*max value is too high).

It is the vm.*min value for the desired category that isn't high enough
to allow that category to claim more pages from the less desired
categories.

I.e. if vm.anonmin is too low, and I believe the default of 10% is way
too low, then when file I/O gets busy for whatever reason, (and with the
default rather high vm.filemax value) large processes _will_ get
partially paged out as only 10% of their memory will be kept activated.

Simultaneously decreasing vm.filemax and increasing vm.anonmin should
guarantee more memory can be dedicated to processes needing it as
opposed to allowing file caching to take over.

I think in general the vm.*max limits (except maybe vm.filemax) are only
really interesting on very small memory systems and/or on systems with
very specific types of uses which might demand more pages of one
category or the other.  The default vm.filemax value on the other hand
may be too high for systems that don't _constantly_ do a lot of file I/O
_and_ access many of the same files more than once.

So if you regularly run large processes that don't necessarily do a
whole lot of file I/O then you want to reduce vm.filemax, perhaps quite
a lot, maybe even to just being barely above vm.filemin; and of course
you want to increase vm.anonmin.  One early guide suggested (with my
comments):

vm.execmin=2    # this is too low if your progs are huge code
vm.execmax=4    # but this should probably be as much as 20
vm.filemin=0
vm.filemax=1    # too low for compiling, web serving, etc.
vm.anonmin=70
vm.anonmax=95

Note that increasing vm.anonmin won't dedicate memory to anon pages if
they're not currently needed of course, but it will guarantee at least
that much memory will be made available, and kept available, when and if
pressure for anon pages increases.

So all of these limits are not "hard limits", nor are they dedicated
allocations per-se.  A given category can use more pages than its max
limit, at least until some other category experiences pressure,
i.e. until the page daemon is woken.

(Just keep in mind that one cannot currently exceed 95% as the sum of
the lower (vm.*min) limits.  The total of the upper (vm.*max) limits can
be more than 100%, but there are caveats to such a state.)
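
As a small sanity check of that 95% rule, here is a sketch in C.  The
struct and function names are hypothetical, and the 95 is simply the
limit as I have described it above:

```c
/* The three lower limits, as percentages (hypothetical type). */
struct vm_mins {
	int anonmin;
	int execmin;
	int filemin;
};

/*
 * Return 1 if each minimum is a sane percentage and their sum stays
 * within the 95% ceiling, 0 otherwise.
 */
int
vm_mins_ok(const struct vm_mins *m)
{
	if (m->anonmin < 0 || m->anonmin > 100 ||
	    m->execmin < 0 || m->execmin > 100 ||
	    m->filemin < 0 || m->filemin > 100)
		return 0;
	return m->anonmin + m->execmin + m->filemin <= 95;
}
```

The early guide's values quoted above (70 + 2 + 0 = 72) pass this
check, while something like anonmin=70, execmin=20, filemin=10 would
not.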

Also if you have a really large memory machine and you don't have
processes that wander through huge numbers of files, then you might also
want to lower vm.bufcache so that it's not wasted.

--
Greg A. Woods 

Kelowna, BC +1 250 762-7675   RoboHack 
Planix, Inc.  Avoncote Farms 




regarding the changes to kernel entropy gathering

2021-04-03 Thread Greg A. Woods
So, I'm not sure what to say here.

I'm very surprised, quite confused, more than a little perturbed, and
even somewhat angry.  It's taken me quite some time to write this.

Now temper this with knowing that I do know I'm running -current, not a
release, and that I accept the challenges this might cause (thus see the
patch below).

Updating a system, even on -current, shouldn't cause what I can only
describe as _intentional_ breakage, even for matters so important as
system security and integrity, and especially not without clear mention
in UPDATING, and perhaps also with documented and referenced tools to
assist in undoing said breakage.

Updating a system, even on -current, shouldn't create a long-lived
situation where the system documentation and the behaviour and actions
of system commands are completely out of sync with the behaviour of the
kernel, and in fact lies to the administrator about the abilities of the
system.

In any case, the following patch (and in particular the last hunk) fixes
all my problems and complaints in this domain.  It is fully tested, and
it works A-OK with Xen in both domU and dom0 kernels.  My systems once
again have consistent documentation, and tools that don't lie, and are
able to function as before w.r.t. matters related to /dev/random and
getrandom(2).

Now I'm not proposing this as the final solution -- I think there's some
middle ground to be found, but at least this gets things back to working.


--- sys/kern/kern_entropy.c.~1.30.~ 2021-03-07 17:23:05.000000000 -0800
+++ sys/kern/kern_entropy.c 2021-04-03 11:25:31.667067667 -0700
@@ -1306,7 +1306,7 @@

/* Wait for some entropy to come in and try again.  */
KASSERT(E->stage >= ENTROPY_WARM);
-   printf("entropy: pid %d (%s) blocking due to lack of entropy\n",
+   printf("entropy: pid %d (%s) blocking due to lack of entropy\n", /* xxx uprintf() instead/also? */
   curproc->p_pid, curproc->p_comm);

if (ISSET(flags, ENTROPY_SIG)) {
@@ -1577,6 +1577,16 @@
KASSERT(i == __arraycount(extra));
entropy_enter(extra, sizeof extra, 0);
explicit_memset(extra, 0, sizeof extra);
+
+   aprint_verbose("entropy: %s attached as an entropy source (", rs->name);
+   if (!(flags & RND_FLAG_NO_COLLECT)) {
+           printf("collecting");
+           if (flags & RND_FLAG_NO_ESTIMATE)
+                   printf(" without estimation");
+   }
+   else
+           printf("off");
+   printf(")\n");
 }

 /*
@@ -1610,6 +1620,8 @@

/* Free the per-CPU data.  */
percpu_free(rs->state, sizeof(struct rndsource_cpu));
+
+   aprint_verbose("entropy: %s detached as an entropy source\n", rs->name);
 }

 /*
@@ -1754,21 +1766,21 @@
 rnd_add_uint32(struct krndsource *rs, uint32_t value)
 {

-   rnd_add_data(rs, &value, sizeof value, 0);
+   rnd_add_data(rs, &value, sizeof value, sizeof value * NBBY);
 }

 void
 _rnd_add_uint32(struct krndsource *rs, uint32_t value)
 {

-   rnd_add_data(rs, &value, sizeof value, 0);
+   rnd_add_data(rs, &value, sizeof value, sizeof value * NBBY);
 }

 void
 _rnd_add_uint64(struct krndsource *rs, uint64_t value)
 {

-   rnd_add_data(rs, &value, sizeof value, 0);
+   rnd_add_data(rs, &value, sizeof value, sizeof value * NBBY);
 }

 /*
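
For anyone skimming the last hunk: the final argument to rnd_add_data() is
the number of bits of entropy to credit for the sample, so the change asks
for every bit of the value to be credited (still subject to the source's
estimation flags) where the stock code asked for none.  The arithmetic, as
a trivial shell sketch that assumes NBBY (bits per byte) is 8 and uses a
made-up function name:

```sh
NBBY=8    # assumption: bits per byte

# credited_bits NBYTES: bits of entropy requested for a sample of NBYTES.
# (Hypothetical helper, only here to show the multiplication.)
credited_bits() {
    echo $(( $1 * NBBY ))
}

credited_bits 4    # sizeof(uint32_t)
credited_bits 8    # sizeof(uint64_t)
```

That is 32 bits requested for each 32-bit sample and 64 for each 64-bit
sample, versus 0 before.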

--
Greg A. Woods 

Kelowna, BC +1 250 762-7675   RoboHack 
Planix, Inc.  Avoncote Farms 




Re: regarding the changes to kernel entropy gathering

2021-04-04 Thread Greg A. Woods
At Sun, 4 Apr 2021 09:49:58 +0000, Taylor R Campbell wrote:
Subject: Re: regarding the changes to kernel entropy gathering
>
> > Date: Sat, 03 Apr 2021 12:24:29 -0700
> > From: "Greg A. Woods" 
> >
> > Updating a system, even on -current, shouldn't create a long-lived
> > situation where the system documentation and the behaviour and actions
> > of system commands are completely out of sync with the behaviour of the
> > kernel, and in fact lies to the administrator about the abilities of the
> > system.
>
> It would help if you could identify specifically what you are calling
> a lie.
>
> > @@ -1754,21 +1766,21 @@
> >  rnd_add_uint32(struct krndsource *rs, uint32_t value)
> >  {
> >
> > -   rnd_add_data(rs, &value, sizeof value, 0);
> > +   rnd_add_data(rs, &value, sizeof value, sizeof value * NBBY);
> >  }
>
> The rnd_add_uint32 function is used by drivers to feed in data from
> sources _with no known model for their entropy_.

Indeed -- that's the idea.

> It's how drivers
> toss in data that might be helpful but might be totally predictable, and
> the driver has no way to know.

Yeah, so?  They don't need to know this.  I'm not actually asking random
drivers to decide the amount of physical entropy they can collect.
That is controlled elsewhere.

> Your change _creates_ the lie that every bit of data entered this way
> is drawn from a source with independent uniform distribution.

No, my change _allows_ the administrator to decide which devices can be
used as estimating/counting entropy sources.  For example I know that
many of the devices on almost all of my machines (virtual or otherwise)
are equally good sources of entropy for their uses.

An additional change, one which I would also find totally acceptable,
would be to disable the current default of allowing "estimation" on
devices which are not true hardware RNGs.  I.e. maybe this simple change
would suffice (though I haven't checked beyond a quick grep to see that
this flag is the most commonly used one -- perhaps some real RNG
devices could also be changed to use explicit flags to enable estimation
by default):

--- sys/sys/rndio.h.~1.2.~  2016-07-23 14:36:45.000000000 -0700
+++ sys/sys/rndio.h 2021-04-04 12:39:15.609936311 -0700
@@ -91,8 +91,7 @@
 #define RND_FLAG_ESTIMATE_TIME  0x00004000  /* estimate entropy on time */
 #define RND_FLAG_ESTIMATE_VALUE 0x00008000  /* estimate entropy on value */
 #define RND_FLAG_HASENABLE      0x00010000  /* has enable/disable fns */
-#define RND_FLAG_DEFAULT   (RND_FLAG_COLLECT_VALUE|RND_FLAG_COLLECT_TIME|\
-                            RND_FLAG_ESTIMATE_TIME)
+#define RND_FLAG_DEFAULT   (RND_FLAG_COLLECT_VALUE|RND_FLAG_COLLECT_TIME)

 #define RND_TYPE_UNKNOWN        0   /* unknown source */
 #define RND_TYPE_DISK           1   /* source is physical disk */
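
The intended effect of dropping RND_FLAG_ESTIMATE_TIME from the default can
be sketched with a few lines of shell arithmetic.  The bit values below are
placeholders chosen for the example, not the ones in rndio.h, and the helper
name is hypothetical:

```sh
# Placeholder bit values for illustration only; see sys/sys/rndio.h
# for the real definitions.
COLLECT_TIME=1
COLLECT_VALUE=2
ESTIMATE_TIME=4

old_default=$(( COLLECT_VALUE | COLLECT_TIME | ESTIMATE_TIME ))
new_default=$(( COLLECT_VALUE | COLLECT_TIME ))

# estimates FLAGS: report whether the ESTIMATE_TIME bit is set in FLAGS.
estimates() {
    if [ $(( $1 & ESTIMATE_TIME )) -ne 0 ]; then
        echo yes
    else
        echo no
    fi
}

estimates "$old_default"    # old default: estimation on
estimates "$new_default"    # proposed default: collect only
```

The idea is that a source attached with the proposed default would still
be collected from, but would not be counted until estimation is turned on
for it explicitly.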


There are a vast number of ways this re-tooling of entropy collection
could have been done better.

I'm asking for discussion on what amount to some VERY simple changes
which completely and totally solve many real-world uses of this code
while at the same time not just allowing, but defaulting to, the very
strict and secure operation for special situations.

--
Greg A. Woods 

Kelowna, BC +1 250 762-7675   RoboHack 
Planix, Inc.  Avoncote Farms 




Re: regarding the changes to kernel entropy gathering

2021-04-04 Thread Greg A. Woods
At Sun, 04 Apr 2021 23:47:10 +0700, Robert Elz  wrote:
Subject: Re: regarding the changes to kernel entropy gathering
>
> If we want really good security, I'd submit we need to disable
> the random seed file, and RDRAND (and anything similar) until we
> have proof that they're perfect.

Indeed, I concur.

I trust the randomness and in-observability and isolation of the
behaviour of my system's fans far more than I would trust Intel's RDRAND
or RDSEED instructions.

I even trust the randomness of the timings of the virtual disks in my
Xen domU virtual machines more so, even with multiple sibling guests,
even if some of those other guests can be influenced by untrusted third
parties at critical times.

> Personally, I'm happy with anything that your average high school
> student is unlikely to be able to crack in an hour.   I don't run
> a bank, or a military installation, and I'm not the NSA.   If someone
> is prepared to put in the effort required to break into my systems,
> then let them, it isn't worth the cost to prevent that tiny chance.
> That's the same way that my house has ordinary locks - I'm sure they
> can be picked by someone who knows what they're doing, and better security
> is available, at a price, but a nice happy medium is what fits me best.

Indeed again.

--
Greg A. Woods 

Kelowna, BC +1 250 762-7675   RoboHack 
Planix, Inc.  Avoncote Farms 




Re: regarding the changes to kernel entropy gathering

2021-04-04 Thread Greg A. Woods
At Sun, 04 Apr 2021 21:14:31 +0200 (CEST), Havard Eidnes wrote:
Subject: Re: regarding the changes to kernel entropy gathering
>
> Do note, the existing randomness sources are still being sampled and
> mixed into the pool, so even if the starting state from the saved
> entropy may be known (by violating the security of the storage),
> it's still not possible to predict the complete stream of randomness
> data once the system has seen a bit of uptime (given that there are
> actual other sources of (unverified) entropy which aren't all of too
> low quality).

No amount of uptime and activity was increasing the entropy in my system
before I patched it.  /dev/random remained blocked after days of busy
system activity.  I would argue that most, if not all, of the sources of
entropy identified by rndctl(8) on my systems are high-quality and
secure sources in my circumstances and for my uses.

Perhaps the unpatched implementation isn't doing exactly what you think
it is?

The unpatched implementation completely and entirely prevents the system
from ever using any of those sources, despite showing that they are
enabled for use.

> However, in the new scheme of things, because most of the
> traditional sources have unknown quality, and we have no reliable
> method to estimate how much "actual entropy" those sources
> provide, they no longer count towards the *estimate* of what is
> now a lower bound on the "real" entropy available in the pool.

It really doesn't matter what can be determined in general and from a
distance.

What matters is what a given administrator can determine in particular
for a given application in a given circumstance.

Before my patch the system was not behaving as documented and could not
be made to behave as the documentation said it could be made to behave.

With my patch I can choose which to trust from amongst the available
sources.  Without that patch my choices are ignored and the system lies
to me about using my choices.  I would argue my patch fixes a critical
bug.

> Besides, the implementation has been thoroughly vetted.  E.g. the
> reference [7] from the wikipedia article states in the conclusion on
> page 20
>
>Overall, the Ivy Bridge RNG is a robust design with a large
>margin of safety that ensures good random data is generated even
>if the Entropy Source is not operating as well as predicted.

"design" != implementation

--
Greg A. Woods 

Kelowna, BC +1 250 762-7675   RoboHack 
Planix, Inc.  Avoncote Farms 




Re: regarding the changes to kernel entropy gathering

2021-04-04 Thread Greg A. Woods
At Sun, 4 Apr 2021 16:39:11 -0400 (EDT), Mouse wrote:
Subject: Re: regarding the changes to kernel entropy gathering
>
> > No amount of uptime and activity was increasing the entropy in my
> > system before I patched it.
>
> As I understand it, entropy was being contributed.  What wasn't
> happening was the random driver code recognizing and acknowledging that
> entropy, because it had no way to tell how much of it there really was.

Clearly there was no entropy being contributed in any way shape or form.

It wasn't the driver code at fault.

It was the code I fixed with my patch that was at fault.

I told the system to "count" the entropy being gathered by the
appropriate driver(s), but it was being ignored entirely.

After my fix the system behaved as I told it to.

--
    Greg A. Woods 

Kelowna, BC +1 250 762-7675   RoboHack 
Planix, Inc.  Avoncote Farms 




Re: regarding the changes to kernel entropy gathering

2021-04-04 Thread Greg A. Woods
At Mon, 05 Apr 2021 00:07:49 +0200 (CEST), Havard Eidnes wrote:
Subject: Re: regarding the changes to kernel entropy gathering
>
> Indeed, that's also compatible with what I wrote.  The samples
> from whatever sources you have are still being mixed into the
> pool, but they are not being counted as contributing to the
> entropy estimate, because the quality of the samples is at best
> unknown.

Perhaps we're talking past each other?

Until I made the fix no amount of time or activity or of me telling the
system to make use of the driver inputs was unblocking getrandom(2) or
/dev/random, so it doesn't really matter if anything was being "mixed
into the pool" so to speak as the pool was empty.

> A possible workaround is, once you have some uptime and some bits
> mixed into the pool, you can do:

I don't need a work-around -- I found a fix.  I corrected some code that
was purposefully ignoring my orders for how it should behave.

> I am still of the fairly firm belief that the mistrust in the
> hardware vendors' ability to make a reasonable and robust
> implementation is without foundation.

Well there are still millions of systems out there without the fancy
newer hardware RNGs available to make them more secure than Fort Knox.
At least a small handful of them run NetBSD for me, and I want them to
work for my needs.  I was, and am, quite happy with using entropy that
can be collected from various devices that my systems (virtual and real)
actually have.

--
Greg A. Woods 

Kelowna, BC +1 250 762-7675   RoboHack 
Planix, Inc.  Avoncote Farms 




Re: regarding the changes to kernel entropy gathering

2021-04-04 Thread Greg A. Woods
At Mon, 05 Apr 2021 00:14:30 +0200 (CEST), Havard Eidnes wrote:
Subject: Re: regarding the changes to kernel entropy gathering
>
> > What about architectures that have nothing like RDRAND/RDSEED?  Are
> > they, effectively, totally unsupported now?
>
> Nope, not entirely.  But they have to be seeded once.  If they
> have storage which survives reboots, and entropy is saved and
> restored on reboot, they will be ~fine.

BTW, to me reusing the same entropy on every reboot seems less secure.

> Systems without persistent storage and also without RDRAND/RDSEED
> will however be ... a more challenging problem.

Leaving things like that would be totally silly.

With my patch the old way of gathering entropy from devices works just
fine as it always did, albeit with the second patch it does require a
tiny bit of extra configuration.

--
        Greg A. Woods 

Kelowna, BC +1 250 762-7675   RoboHack 
Planix, Inc.  Avoncote Farms 




Re: regarding the changes to kernel entropy gathering

2021-04-04 Thread Greg A. Woods
At Mon, 5 Apr 2021 01:05:58 +0200, Joerg Sonnenberger  wrote:
Subject: Re: regarding the changes to kernel entropy gathering
>
> Part of the problem here is that most of the non-RNG data sources are
> easily observable either from the local system (e.g. any malicious user)
> or other VMs on the same machine (in case of a hypervisor) or local
> machines on the same network (in case of network interrupts).

It _Just_ _Doesn't_ _Matter_  (i.e. for many of us, most of the time).

Now ideally in the hypervisor scenario we would have a backend device
that read from /dev/random and offered it to the VM guest as a virtual
hardware RNG.  Or maybe it's as simple as passing a few bytes
through a custom Xenstore string and having a script in the VM read them
and inject them into /dev/random.  But that's not been done yet.

BTW, personally, on at least some machines, I don't have any worry
whatsoever at the moment about one VM guest spying on, or influencing
the PRNG, in another.  Zero worry.  They're all _me_.  I don't need some
theoretically perfect level of protection from myself.

--
    Greg A. Woods 

Kelowna, BC +1 250 762-7675   RoboHack 
Planix, Inc.  Avoncote Farms 




Re: regarding the changes to kernel entropy gathering

2021-04-04 Thread Greg A. Woods
At Sun, 4 Apr 2021 23:09:18 +0000, Taylor R Campbell wrote:
Subject: Re: regarding the changes to kernel entropy gathering
>
> If you know this (and this is something I certainly can't confidently
> assert!), you can write 32 bytes to /dev/random, save a seed, and be
> done with it.

I don't have random data easily available at install time.

I don't have random data easily available every time I boot a machine
with non-persistent storage (e.g. a test ISO image).

I _do_ trust well enough the sources of randomness in some device
drivers to provide me with a secure enough amount of entropy, for my
purposes.

And so with my fix(es) I don't need to feed supposedly random data to
every system on every install and/or every reboot.

What's worse?  My fixes, or something like this in /etc/rc.local:

   echo -n "" > /dev/random

> But users who don't go messing around with obscure rndctl settings in
> rc.conf will be proverbially shot in the foot by this change -- except
> they won't notice because there is practically guaranteed to be no
> feedback whatsoever for a security disaster until their systems turn
> up in a paper published at Usenix like <https://factorable.net/>.

You're really stretching your argument thinly if you are assuming
everyone _needs_ perfect entropy here.

Also, that's only if the default RND_FLAG_ESTIMATE_* bits are turned off.

AND only if the system doesn't have some true hardware RNG.

> What your change does is equivalent to going around to every device
> driver that previously said `this provides zero entropy, or I don't
> know how much entropy it provides' and replacing that claim by `this
> is a sample of an independent and perfectly uniform random string of
> bits', which is a much stronger (and falser) claim than even the old
> `entropy estimation' confabulation that NetBSD used to do.

No, only if the default RND_FLAG_ESTIMATE_* bits are ***NOT*** turned off.

AND only if the user is like me and stuck with some poor second-grade
ancient hardware that doesn't have some fancy new true hardware RNG.

In the mean time a more productive approach would be to figure out
what's best for those of us who don't need perfection every time and/or
to fix those device drivers that could feed sufficiently random data to
the entropy pool, and then to recommend a suitable value for
rndctl_flags in /etc/rc.conf.

--
Greg A. Woods 

Kelowna, BC +1 250 762-7675   RoboHack 
Planix, Inc.  Avoncote Farms 




Re: regarding the changes to kernel entropy gathering

2021-04-05 Thread Greg A. Woods
At Sun, 4 Apr 2021 18:47:23 -0700, Brian Buhrow  wrote:
Subject: Re: regarding the changes to kernel entropy gathering
>
> Hello.  As I understand it, Greg ran into this problem on a xen domu.
> In checking my NetBSD-9 system running as a domu under xen-4.14.1,
> there is no rdrand or rdseed feature exposed to domu's by xen.  This
> observation is confirmed by looking at the xen command line reference
> page: https://xenbits.xen.org/docs/unstable/misc/xen-command-line.html

The problem in the domU was really just the very tip of the iceberg.

The dom0 exhibits the exact same problem and for the same reasons.

> and NetBSD doesn't trust the random sources provided by the xennet(4)
> and xbd(4) drivers.  Therefore, the only solution to get randomness
> working for the first time on a newly installed domu is to write 32
> bytes to /dev/random.

It's not that the xbd(4) devices, etc. are not trusted as entropy
sources -- the new entropy system doesn't trust anything, real or
virtual, despite the documentation saying that it can be made to do so.

My patch fixes that bug.  It was very obvious once I understood the root
of the issue.

As a result my patch fixes the bug for Xen dom0 and domU.

Writing randomness to /dev/random is _NOT_ a general solution (though it
could be IFF it can be reliably taken from /dev/urandom AND IFF the rest
of the system and documentation is completely and adequately fixed to
match the new regime).

What perturbs me the most and makes me rather angry is that the rest of
the system, and the system documentation, continued to lie and mislead
me for days (and it didn't help that nobody who knew this was pointing
helpfully and clearly at the root of the problem).  So, my patch ALSO
restores the kernel's behaviour to match the documentation and tools
(specifically rndctl).  That the core of it is just a two-line patch
makes this fix extremely satisfying.

--
Greg A. Woods 

Kelowna, BC +1 250 762-7675   RoboHack 
Planix, Inc.  Avoncote Farms 




Re: regarding the changes to kernel entropy gathering

2021-04-05 Thread Greg A. Woods
At Mon, 5 Apr 2021 10:46:19 +0200, Manuel Bouyer  wrote:
Subject: Re: regarding the changes to kernel entropy gathering
>
> If I understood it properly, there's no need for such a knob.
> echo 0123456789abcdef0123456789abcdef > /dev/random
>
> will get you back to the state we had in netbsd-9, with (pseudo-)randomness
> collected from devices.

Well, no, not quite so much randomness.  Definitely pseudo though!

My patch on the other hand can at least inject some real randomness into
the entropy pool, even if it is observable or influenceable by nefarious
dudes who might be hiding out in my garage.

--
        Greg A. Woods 

Kelowna, BC +1 250 762-7675   RoboHack 
Planix, Inc.  Avoncote Farms 




Re: regarding the changes to kernel entropy gathering

2021-04-05 Thread Greg A. Woods
At Mon, 5 Apr 2021 16:13:55 +1200, Lloyd Parkes wrote:
Subject: Re: regarding the changes to kernel entropy gathering
>
> The current implementation prints out a message whenever it blocks a
> process that wants randomness, which immediately makes this
> implementation superior to all others that I have ever seen. The
> number of times I've logged into systems that have stalled on boot and
> made them finish booting by running "ls -lR /" over the past 20 years
> are too many to count. I don't know if I just needed to wait longer
> for the boot to finish, or if generating entropy was the fix, and I
> will never know. This is nuts.

Indeed!

> We can use the message to point the system administrator to a manual
> page that tells them what to do, and by "tells them what to do", I
> mean in plain simple language, right at the top of the page, without
> scaring them.

Excellent idea!  :-)

However I have been wondering if sending the message just to the
console, and logging it, say in /var/log/kern, is sufficient.

It still took me a very long time to find the existing new message
because I don't hang out on the console -- this is a VM, after all, and
it's running in a city almost exactly 4200km driving distance from me
too!  As-is I feel I hang out on the console more often than the average
admin who doesn't use a physical console, and of course infinitely more
often than any user who doesn't admin his own server.

I have added the following comment to the kernel to remind me to think
more about this, as a uprintf(9) at the same time would pop right up on
the actual user's session too:

--- kern_entropy.c.~1.30.~  2021-03-07 17:23:05.000000000 -0800
+++ kern_entropy.c  2021-04-03 11:25:31.667067667 -0700
@@ -1306,7 +1306,7 @@

/* Wait for some entropy to come in and try again.  */
KASSERT(E->stage >= ENTROPY_WARM);
-   printf("entropy: pid %d (%s) blocking due to lack of entropy\n",
+   printf("entropy: pid %d (%s) blocking due to lack of entropy\n", /* xxx uprintf() instead/also? */
   curproc->p_pid, curproc->p_comm);

if (ISSET(flags, ENTROPY_SIG)) {


--
Greg A. Woods 

Kelowna, BC +1 250 762-7675   RoboHack 
Planix, Inc.  Avoncote Farms 




Re: regarding the changes to kernel entropy gathering

2021-04-05 Thread Greg A. Woods
At Mon, 5 Apr 2021 03:02:42 +0200, Joerg Sonnenberger  wrote:
Subject: Re: regarding the changes to kernel entropy gathering
>
> Except that's not what the system is doing. It removes the seed file on
> boot and creates a new one on shutdown.

That's not exactly what the documentation says it does (from rndctl(8)):

-L  Load saved entropy from file save-file and overwrite it with a
 seed derived by hashing it together with output from /dev/urandom
 so that the new seed has at least as much entropy as either the
 old seed had or the system already has.  If interrupted, either
 the old seed or the new seed will be in place.

The code seems to concur.

Also the system re-saves the $random_file via /etc/security
(unconditionally, i.e. always, but only if $random_file is set).

--
        Greg A. Woods 

Kelowna, BC +1 250 762-7675   RoboHack 
Planix, Inc.  Avoncote Farms 




Re: regarding the changes to kernel entropy gathering

2021-04-05 Thread Greg A. Woods
ources is
needed to "stir" the pot in the first place, then why not just "count"
it as "real" entropy and be done with it -- at least then it is obvious
when enough entropy has been gathered and the currently implemented
algorithms handle things properly and securely and all inside the
kernel.  I.e. the admin doesn't have to put a "sleep 30" or whatever in
front of it and hope that's enough and that it's still not too
predictable.

--
Greg A. Woods 

Kelowna, BC +1 250 762-7675   RoboHack 
Planix, Inc.  Avoncote Farms 




Re: regarding the changes to kernel entropy gathering

2021-04-05 Thread Greg A. Woods
At Mon, 5 Apr 2021 15:37:49 -0400, Thor Lancelot Simon  wrote:
Subject: Re: regarding the changes to kernel entropy gathering
>
> On Sun, Apr 04, 2021 at 03:32:08PM -0700, Greg A. Woods wrote:
> >
> > BTW, to me reusing the same entropy on every reboot seems less secure.
>
> Sure.  But that's not what the code actually does.
>
> Please, read the code in more depth (or in this case, breadth), then argue
> about it.

Sorry, I was alluding to the idea of sticking the following in
/etc/rc.local as the brain-dead way to work around the problem:

echo -n "" > /dev/random

However I have not yet read and understood enough of the code to know
if:

dd if=/dev/urandom of=/dev/random bs=32 count=1

is any more "secure" -- I'm guessing (hoping?) it depends on exactly
when this might be run, and also depends on which, if any, other device
sources are enabled for "collecting".  If in some rare case none were
enabled, or if it were run before any were able to "stir the pool", then
I'm guessing it would be no more secure than writing a fixed string.
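
For experimenting with the dd variant without touching the real devices, a
sketch along these lines may help.  The function name and the scratch files
are made up for the example; on a real system the two arguments would be
/dev/urandom and /dev/random.

```sh
# seed_pool SRC DST: copy a single 32-byte block from SRC to DST.
# (Hypothetical wrapper around the dd command shown above.)
seed_pool() {
    dd if="$1" of="$2" bs=32 count=1 2>/dev/null
}

# Dry run against scratch files instead of /dev/urandom and /dev/random.
scratch=$(mktemp -d)
printf '%032d' 0 > "$scratch/src"     # 32 bytes of stand-in "seed"
seed_pool "$scratch/src" "$scratch/dst"
wc -c < "$scratch/dst"                # byte count of what was written
rm -rf "$scratch"
```

This only shows the mechanics of the copy.  Whether the 32 bytes are
unpredictable still depends on what has fed the pool before the copy is
made.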

--
Greg A. Woods 

Kelowna, BC +1 250 762-7675   RoboHack 
Planix, Inc.  Avoncote Farms 




Re: regarding the changes to kernel entropy gathering

2021-04-06 Thread Greg A. Woods
x70, ver=426
kern.entropy.consolidate (1.1260.1263): CTLTYPE_INT, size 4, flags 0x70, func=0x8083f151, ver=427
kern.entropy.gather (1.1260.1264): CTLTYPE_INT, size 4, flags 0x70, func=0x8083dd4c, ver=428
kern.entropy.needed (1.1260.1265): CTLTYPE_INT, size 4, flags 0x100, ver=429
kern.entropy.pending (1.1260.1266): CTLTYPE_INT, size 4, flags 0x100, ver=430
kern.entropy.epoch (1.1260.1267): CTLTYPE_INT, size 4, flags 0x100, ver=431

Perhaps function pointer values shouldn't be printed as integers?


And there are no text descriptions for some of the kern.entropy values:

17:27 [1.831] # sysctl -d kern.entropy.needed
kern.entropy.needed: (no description)
17:27 [1.832] # sysctl -d kern.entropy.pending
kern.entropy.pending: (no description)
17:27 [1.833] # sysctl -d kern.entropy.epoch
kern.entropy.epoch: (no description)


--
Greg A. Woods 

Kelowna, BC +1 250 762-7675   RoboHack 
Planix, Inc.  Avoncote Farms 




Re: regarding the changes to kernel entropy gathering

2021-04-06 Thread Greg A. Woods
At Tue, 6 Apr 2021 12:08:54 +0000, Taylor R Campbell wrote:
Subject: Re: regarding the changes to kernel entropy gathering
>
> The main issue that hits people is that the traditional mechanism by
> which the OS reports a potential security problem with entropy is for
> it to make applications silently hang -- and the issue is getting
> worse now that getrandom() is more widely used, e.g. in Python when
> you do `import multiprocessing'.

I think adding a uprintf(9) that the user who started the blocked
process (i.e. not just the admin) has a better chance of directly seeing
would be one step closer, and should be extremely easy.

--
        Greg A. Woods 

Kelowna, BC +1 250 762-7675   RoboHack 
Planix, Inc.  Avoncote Farms 




Re: regarding the changes to kernel entropy gathering

2021-04-06 Thread Greg A. Woods
At Tue, 6 Apr 2021 20:21:43 +0200, Martin Husemann  wrote:
Subject: Re: regarding the changes to kernel entropy gathering
>
> On Tue, Apr 06, 2021 at 10:54:51AM -0700, Greg A. Woods wrote:
> >
> > And the stock implementation has no possibility of ever providing an
> > initial seed at all on its own (unlike previous implementations, and of
> > course unlike what my patch _affords_).
>
> Isn't it as simple as:
>
>   dd bs=32 if=/dev/urandom of=/dev/random

No, that still leaves the question of _when_ to run it.  (And, at least
at the moment, where to put it.  /etc/rc.local?)

Isn't something like the following better (assuming you choose your devices
carefully):

echo 'rndctl_flags="-t env;-t disk;-t tty"' >> /etc/rc.conf

That's what my patches fix and allow, and this way you don't have to
guess when you can safely use /dev/urandom as an entropy seed -- the
seeding happens in real time, and only as entropy bits are made
available from those given devices.

That can also be done by sysinst, assuming a reasonably well worded
question can be answered, and that it might only need to be asked if
there are no "rng" type devices already.

Doing this also requires no network access (ever).

It can even be done, ahead of time, for use on immutable systems.

--
Greg A. Woods 

Kelowna, BC +1 250 762-7675   RoboHack 
Planix, Inc.  Avoncote Farms 




Re: regarding the changes to kernel entropy gathering

2021-04-07 Thread Greg A. Woods
At Wed, 7 Apr 2021 09:52:29 +0200, Martin Husemann  wrote:
Subject: Re: regarding the changes to kernel entropy gathering
>
> On Tue, Apr 06, 2021 at 03:12:45PM -0700, Greg A. Woods wrote:
> > > Isn't it as simple as:
> > >
> > >   dd bs=32 if=/dev/urandom of=/dev/random
> >
> > No, that still leaves the question of _when_ to run it.  (And, at least
> > at the moment, where to put it.  /etc/rc.local?)
>
> Of course not!
>
> You run it once. Manually. And never again.

Nope, sorry, that's not a good enough answer.  It doesn't solve the
problem of dealing with a lack of mutable storage.

A system _MUST_ be able to be booted and with no user intervention be
able to (eventually) get to the state where /dev/random and getrandom(2)
WILL NOT block, and it _MUST_ be able to do so without the help of any
hardware RNG, and without the ability to store (and read) a seed from a
file or other storage device.

I.e. we _MUST_ be _ABLE_ to choose to use other devices as sources for
entropy, even if they are not perfect.  We had this, it works fine, we
still need it.

--
Greg A. Woods 

Kelowna, BC +1 250 762-7675   RoboHack 
Planix, Inc.  Avoncote Farms 




Re: regarding the changes to kernel entropy gathering

2021-04-07 Thread Greg A. Woods
At Wed, 7 Apr 2021 22:47:39 +0200, Martin Husemann  wrote:
Subject: Re: regarding the changes to kernel entropy gathering
> 
> When you create a custom setup like that, you will have to replace
> etc/rc.d/entropy with a custom solution (e.g. mounting some flash storage).

No storage means "NO storage".

> Or you ignore the issue and do the dd at each boot - hopefully not generating
> any strong keys on that machine then (but you would have no good storage
> for those anyway).

Or I don't ignore the issue and instead I fix the code so that it's
still possible to get entropy estimates from non-hardware-RNG devices
and then things keep working the way they used to, and there's still
some possibility of _real_ entropy being used to seed the PRNGs.

From what I've seen here so far I'm far from alone in wanting that
ability.

What's most confusing is why there's such animosity and stubborn
unwillingness to even consider that the old way of getting some entropy
from a few less-than-perfect sources was good enough for many, or even
most, of us.

It's better than no entropy when there are no "perfect" sources, and
that's also a situation that includes many of us.

It doesn't have to be the default.

-- 
Greg A. Woods 

Kelowna, BC +1 250 762-7675   RoboHack 
Planix, Inc.  Avoncote Farms 




I think I've found why Xen domUs can't mount some file-backed disk images! (vnd(4) hides labels!)

2021-04-10 Thread Greg A. Woods
 \0  \0  \0  \0  \0  \0  \0  \0  \0
*
0001760   \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0   U 252
0002000


# dd if=/dev/rvnd0d count=17 msgfmt=quiet| od -c
000   \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
*
002   \0  \0  \0  \0  \0  \0  \0  \0  \b  \0  \0  \0 020  \0  \0  \0
0020020  030  \0  \0  \0 230 005  \0  \0  \0  \0  \0  \0 377 377 377 377
0020040  367 360   p   `  \0  \0  \0 007 200 037  \0 027  \0  \0  \0
0020060   \0   @  \0  \0  \0  \b  \0  \0  \b  \0  \0  \0 005  \0  \0  \0
0020100   \0  \0  \0  \0   <  \0  \0  \0  \0 300 377 377  \0 370 377 377
0020120  016  \0  \0  \0 013  \0  \0  \0 004  \0  \0  \0  \0 020  \0  \0
0020140  003  \0  \0  \0 002  \0  \0  \0  \0  \b  \0  \0  \0  \0  \0  \0
0020160   \0  \0  \0  \0  \0 020  \0  \0 200  \0  \0  \0 004  \0  \0  \0
0020200   \0  \0  \0  \0 300 220 005  \0 001  \0  \0  \0  \0  \0  \0  \0
0020220  367 360   p   `   _   `   A   q 230 005  \0  \0  \0  \b  \0  \0
0020240   \0   @  \0  \0  \0  \0  \0  \0 300 220 005  \0 300 220 005  \0
0020260  027  \0  \0  \0 001  \0  \0  \0  \0   X  \0  \0   0   d 001  \0
0020300  001  \0  \0  \0 377 357 003  \0 375 347 007  \0 016  \0  \0  \0
0020320   \0 001  \0 200  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
0020340   \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
*
0021000


In fact the vnd0d device seems to give garbage forever -- it seems to
have been completely confused by trying to access a real disk image!


As a side note unfortunately even though access to this LVM-backed
mini-memstick.img file now seems OK enough to get the install booted and
a shell running, access to other FreeBSD xbd(4) devices is still not
working from FreeBSD (i.e. a fresh newfs'ed FS appears corrupt to an
immediate fsck, without mounting, and even fsck of the mounted root in
this IMG fails enormously).

# df
Filesystem               512-blocks   Used  Avail Capacity  Mounted on
/dev/ufs/FreeBSD_Install     782968 737016 -16680   102%    /
devfs                             2      2      0   100%    /dev
tmpfs                         65536    232  65304     0%    /var
tmpfs                         40960      8  40952     0%    /tmp
# fsck /dev/ufs/FreeBSD_Install
** /dev/ufs/FreeBSD_Install

SAVE DATA TO FIND ALTERNATE SUPERBLOCKS? [yn] n


ADD CYLINDER GROUP CHECK-HASH PROTECTION? [yn] n

** Last Mounted on
** Root file system
** Phase 1 - Check Blocks and Sizes
PARTIALLY TRUNCATED INODE I=28
SALVAGE? [yn] n

PARTIALLY TRUNCATED INODE I=112
SALVAGE? [yn] ^Cda0: disk error cmd=write 8145-8152 status: fffe

#
* FILE SYSTEM MARKED DIRTY *

#


--
Greg A. Woods 

Kelowna, BC +1 250 762-7675   RoboHack 
Planix, Inc.  Avoncote Farms 




a working patch to _allow_ non-hardware-RNG entropy sources

2021-04-10 Thread Greg A. Woods
 RND_FLAG_COLLECT_VALUE|RND_FLAG_HASCB);
 }
Index: sys/rndio.h
===
RCS file: /cvs/master/m-NetBSD/main/src/sys/sys/rndio.h,v
retrieving revision 1.2
diff -u -r1.2 rndio.h
--- sys/rndio.h 6 Sep 2015 06:01:02 -   1.2
+++ sys/rndio.h 9 Apr 2021 18:01:03 -
@@ -91,8 +91,29 @@
 #define RND_FLAG_ESTIMATE_TIME 0x4000  /* estimate entropy on time */
 #define RND_FLAG_ESTIMATE_VALUE 0x8000  /* estimate entropy on value */
 #define RND_FLAG_HASENABLE      0x0001  /* has enable/disable fns */
-#define RND_FLAG_DEFAULT   (RND_FLAG_COLLECT_VALUE|RND_FLAG_COLLECT_TIME|\
-RND_FLAG_ESTIMATE_TIME)
+#define RND_FLAG_DEFAULT   (RND_FLAG_COLLECT_VALUE|RND_FLAG_ESTIMATE_VALUE| \
+RND_FLAG_COLLECT_TIME|RND_FLAG_ESTIMATE_TIME)
+/*
+ * N.B.:  It would appear from the above value that by default all devices using
+ * RND_FLAG_DEFAULT will be enabled directly to collect _and_ estimate(count)
+ * entropy based on both deltas in values they submit, and the time delta
+ * between submissions.  HOWEVER this is moderated by a switch in
+ * kern_entropy.c:rnd_attach_source() which will add either the NO_COLLECT
+ * and/or the NO_ESTIMATE flag depending on what type the device is.
+ *
+ * By default only RND_TYPE_SKEW, RND_TYPE_ENV, RND_TYPE_POWER, and RND_TYPE_RNG
+ * will avoid both of these flags being set.
+ *
+ * Network devices will be entirely disabled (from both collection and
+ * estimating) as they can possibly be easily influenced externally.
+ *
+ * All other devices will be given the NO_ESTIMATE flag such that they are not
+ * used to estimate(count) entropy by default.
+ *
+ * In any case either or both of the RND_FLAG_NO_* flags can be turned off at
+ * runtime by the RNDCTL ioctl on rnd(4), i.e. by rndctl(8) such that entropy
+ * collection and estimation can be enabled on a per-device or per-type basis.
+ */

 #define RND_TYPE_UNKNOWN        0   /* unknown source */
 #define RND_TYPE_DISK           1   /* source is physical disk */
Index: sys/rndsource.h
===
RCS file: /cvs/master/m-NetBSD/main/src/sys/sys/rndsource.h,v
retrieving revision 1.7
diff -u -r1.7 rndsource.h
--- sys/rndsource.h 30 Apr 2020 03:28:19 -  1.7
+++ sys/rndsource.h 8 Apr 2021 18:15:01 -
@@ -45,8 +45,6 @@

 /*
  * struct rnd_delta_estimator
- *
- * Unused.  Preserved for ABI compatibility.
  */
 typedef struct rnd_delta_estimator {
        uint64_t x;
@@ -68,8 +66,8 @@
 struct krndsource {
        LIST_ENTRY(krndsource) list;    /* the linked list */
        char name[16];                  /* device name */
-       rnd_delta_t time_delta;         /* unused */
-       rnd_delta_t value_delta;        /* unused */
+       rnd_delta_t time_delta;         /* */
+       rnd_delta_t value_delta;        /* */
        uint32_t total;                 /* number of bits added while cold */
        uint32_t type;                  /* type, RND_TYPE_* */
        uint32_t flags;                 /* flags, RND_FLAG_* */
@@ -89,8 +87,10 @@
uint32_t);
 void   rnd_detach_source(struct krndsource *);

+#if 0
 void   _rnd_add_uint32(struct krndsource *, uint32_t); /* legacy */
 void   _rnd_add_uint64(struct krndsource *, uint64_t); /* legacy */
+#endif

 void   rnd_add_uint32(struct krndsource *, uint32_t);
 void   rnd_add_data(struct krndsource *, const void *, uint32_t, uint32_t);
Index: uvm/uvm_page.c
===========
RCS file: /cvs/master/m-NetBSD/main/src/sys/uvm/uvm_page.c,v
retrieving revision 1.250
diff -u -r1.250 uvm_page.c
--- uvm/uvm_page.c  20 Dec 2020 11:11:34 -  1.250
+++ uvm/uvm_page.c  8 Apr 2021 21:41:20 -
@@ -983,8 +983,7 @@
 * Attach RNG source for this CPU's VM events
 */
 rnd_attach_source(&ucpu->rs, ci->ci_data.cpu_name, RND_TYPE_VM,
-   RND_FLAG_COLLECT_TIME|RND_FLAG_COLLECT_VALUE|
-   RND_FLAG_ESTIMATE_VALUE);
+   RND_FLAG_DEFAULT);
 }

 /*

--
Greg A. Woods 

Kelowna, BC +1 250 762-7675   RoboHack 
Planix, Inc.  Avoncote Farms 




Re: I think I've found why Xen domUs can't mount some file-backed disk images! (vnd(4) hides labels!)

2021-04-10 Thread Greg A. Woods
At Sat, 10 Apr 2021 18:44:32 -0700, Brian Buhrow  wrote:
Subject: Re: I think I've found why Xen domUs can't mount some file-backed disk images! (vnd(4) hides labels!)
>
>   hello.  This must be some kind of regression that's been around a
> while.  I'm running a xen dom0 with NetBSD-5.2 and xen-3.3.2, very old,
> but vnd(4) does expose the entire file to the domu's including FreeBSD
> 11 and 12 without any corruption or booting issues.  Do you know when
> this trouble began?

I don't know -- I think I've only ever successfully used ISO files, and
I think I gave up on some IMG file(s) previously (possibly not just from
FreeBSD) without trying to understand why they didn't work.

Have you tried specifically with a recent FreeBSD mini-memstick.img file?

I'm thinking (esp. given what I see from "od -c < /dev/rvnd0d") that
what's wrong is the vnd(4) driver is (also?) imposing some
mis-interpreted idea about the number of cylinders and heads or
something like that, especially given that "fdisk vnd0" is so totally
confused about what's in there.

There's a definite pattern of corruption anyway -- I just can't explain
it well enough yet.

--
Greg A. Woods 

Kelowna, BC +1 250 762-7675   RoboHack 
Planix, Inc.  Avoncote Farms 




Re: I think I've found why Xen domUs can't mount some file-backed disk images! (vnd(4) hides labels!)

2021-04-10 Thread Greg A. Woods
On the other hand NetBSD's own .img files work OK.

However interestingly there's a small, but apparently insignificant
(because it works OK) difference between how fdisk sees the disk image
and the vnd0 device:

# fdisk -F images/NetBSD-9.99.81-amd64-live.img
Disk: images/NetBSD-9.99.81-amd64-live.img
NetBSD disklabel disk geometry:
cylinders: 972, heads: 255, sectors/track: 63 (16065 sectors/cylinder)
total sectors: 15624192, bytes/sector: 512

BIOS disk geometry:
cylinders: 973, heads: 255, sectors/track: 63 (16065 sectors/cylinder)
total sectors: 15624192

Partitions aligned to 2048 sector boundaries, offset 2048

Partition table:
0: NetBSD (sysid 169)
start 2048, size 15622144 (7628 MB, Cyls 0-972/143/3), Active
1: 
2: 
3: 
Bootselector disabled.
First active partition: 0
Drive serial number: 0 (0x)
# vndconfig -cv vnd0 images/NetBSD-9.99.81-amd64-live.img
/dev/rvnd0: 7999586304 bytes on images/NetBSD-9.99.81-amd64-live.img
# fdisk vnd0
Disk: /dev/rvnd0
NetBSD disklabel disk geometry:
cylinders: 7629, heads: 64, sectors/track: 32 (2048 sectors/cylinder)
total sectors: 15624192, bytes/sector: 512

BIOS disk geometry:
cylinders: 973, heads: 255, sectors/track: 63 (16065 sectors/cylinder)
total sectors: 15624192

Partitions aligned to 2048 sector boundaries, offset 2048

Partition table:
0: NetBSD (sysid 169)
start 2048, size 15622144 (7628 MB, Cyls 0-972/143/3), Active
1: 
2: 
3: 
Bootselector disabled.
First active partition: 0
Drive serial number: 0 (0x)
21:10 [1.1496] # disklabel vnd0
# /dev/rvnd0:
type: ESDI
disk: image
label: 
flags:
bytes/sector: 512
sectors/track: 32
tracks/cylinder: 64
sectors/cylinder: 2048
cylinders: 7629
total sectors: 15624192
rpm: 3600
interleave: 1
trackskew: 0
cylinderskew: 0
headswitch: 0   # microseconds
track-to-track seek: 0  # microseconds
drivedata: 0 

8 partitions:
#        size    offset     fstype [fsize bsize cpg/sgs]
 a:  15622144      2048     4.2BSD   1024  8192    16  # (Cyl.      1 -   7628)
 c:  15622144      2048     unused      0     0        # (Cyl.      1 -   7628)
 d:  15624192         0     unused      0     0        # (Cyl.      0 -   7628)
# disklabel images/NetBSD-9.99.81-amd64-live.img
# images/NetBSD-9.99.81-amd64-live.img:
type: ESDI
disk: image
label: 
flags:
bytes/sector: 512
sectors/track: 32
tracks/cylinder: 64
sectors/cylinder: 2048
cylinders: 7629
total sectors: 15624192
rpm: 3600
interleave: 1
trackskew: 0
cylinderskew: 0
headswitch: 0   # microseconds
track-to-track seek: 0  # microseconds
drivedata: 0 

8 partitions:
#        size    offset     fstype [fsize bsize cpg/sgs]
 a:  15622144      2048     4.2BSD   1024  8192    16  # (Cyl.      1 -   7628)
 c:  15622144      2048     unused      0     0        # (Cyl.      1 -   7628)
 d:  15624192         0     unused      0     0        # (Cyl.      0 -   7628)



From inside the NetBSD live image:

[   1.4412586] xbd4 at xenbus0 id 4: Xen Virtual Block Device Interface
[   1.4422594] xbd4: using event channel 20
[   1.7112647] entropy: xbd4 attached as an entropy source (collecting without estimation)
[   1.7112647] xbd4: 7629 MB, 512 bytes/sect x 15624192 sectors
[   1.7112647] xbd4: backend features 0x9



# df
Filesystem  1K-blocks     Used    Avail %Cap Mounted on
/dev/xbd4a    7562414  4699114  2485180  65% /
ptyfs               1        1        0 100% /dev/pts
# fdisk xbd4
Disk: /dev/rxbd4
NetBSD disklabel disk geometry:
cylinders: 7629, heads: 1, sectors/track: 2048 (2048 sectors/cylinder)
total sectors: 15624192, bytes/sector: 512

BIOS disk geometry:
cylinders: 973, heads: 255, sectors/track: 63 (16065 sectors/cylinder)
total sectors: 15624192

Partitions aligned to 2048 sector boundaries, offset 2048

Partition table:
0: NetBSD (sysid 169)
start 2048, size 15622144 (7628 MB, Cyls 0-972/143/3), Active
1: 
2: 
3: 
Bootselector disabled.
First active partition: 0
Drive serial number: 0 (0x)



The NetBSD live.img root filesystem seems fine and clean:

# fsck -n /dev/rxbd4a
** /dev/rxbd4a (NO WRITE)
** Last Mounted on /
** Root file system
** Phase 1 - Check Blocks and Sizes
** Phase 2 - Check Pathnames
** Phase 3 - Check Connectivity
** Phase 4 - Check Reference Counts
** Phase 5 - Check Cyl groups
32740 files, 2349557 used, 1431650 free (538 frags, 178889 blocks, 0.0% fragmentation)


-- 
        Greg A. Woods 

Kelowna, BC +1 250 762-7675   RoboHack 
Planix, Inc.  Avoncote Farms 




Re: I think I've found why Xen domUs can't mount some file-backed disk images! (vnd(4) hides labels!)

2021-04-11 Thread Greg A. Woods
At Sun, 11 Apr 2021 16:06:27 - (UTC), mlel...@serpens.de (Michael van Elst) wrote:
Subject: Re: I think I've found why Xen domUs can't mount some file-backed disk images! (vnd(4) hides labels!)
>
> k...@munnari.oz.au (Robert Elz) writes:
>
> >Date:Sun, 11 Apr 2021 14:25:40 - (UTC)
> >From:mlel...@serpens.de (Michael van Elst)
> >Message-ID:  
>
> >  | +   dg->dg_secperunit = vnd->sc_size / DEV_BSIZE;
>
> >While it shouldn't make any difference for any properly created image
> >file, make it be
>
> > (vnd->sc_size + DEV_BSIZE - 1) / DEV_BSIZE;
>
> >so that any trailing partial sector remains in the image.
>
>
> The trailing partial sector is already ignored. Fortunately no disk image
> can even have a partial trailing sector and some magically implicit
> padding would have unexpected side effects.
>
> But the code also needs to be adjusted for different sector sizes.

So since vnd->sc_size is in units of disk blocks

dg->dg_secperunit =
((vnd->sc_size * DEV_BSIZE) + DEV_BSIZE - 1) /
vnd->sc_geom.vng_secsize;

right?

--
Greg A. Woods 

Kelowna, BC +1 250 762-7675   RoboHack 
Planix, Inc.  Avoncote Farms 




Re: I think I've found why Xen domUs can't mount some file-backed disk images! (vnd(4) hides labels!)

2021-04-11 Thread Greg A. Woods
ls 0/50/2-386/18/17), Active
2: 
3: 
First active partition: 1
Drive serial number: 2425393296 (0x90909090)



So, as you can see below I think it's better to round the device out to
a full number of cylinders if we're still going to play the CHS
silliness.

But for vnd(4) in particular I think it does beg the questions I ask
in the new comments below, especially the first one.

--- vnd.c.~1.278.~  2021-03-07 17:18:43.0 -0800
+++ vnd.c   2021-04-11 11:00:52.147530152 -0700
@@ -1480,20 +1480,41 @@
}
} else if (vnd->sc_size >= (32 * 64)) {
/*
-* Size must be at least 2048 DEV_BSIZE blocks
-* (1M) in order to use this geometry.
+* The file's size must be at least 2048 DEV_BSIZE
+* blocks (1M) in order to use this (fake) geometry.
+*
+* XXX why ever use this arbitrary fake setup instead of the next
 */
vnd->sc_geom.vng_secsize = DEV_BSIZE;
vnd->sc_geom.vng_nsectors = 32;
vnd->sc_geom.vng_ntracks = 64;
-   vnd->sc_geom.vng_ncylinders = vnd->sc_size / (64 * 32);
+   vnd->sc_geom.vng_ncylinders = (vnd->sc_size + (64 * 32) - 1) / (64 * 32);
} else {
+   /*
+* XXX is there anything that pretends which is worse:
+* rotational delay, or seeking?  Does it matter for < 1M?
+*/
+#if 1
+   /* else pretend it's just one big platter of single-sector cylinders */
vnd->sc_geom.vng_secsize = DEV_BSIZE;
vnd->sc_geom.vng_nsectors = 1;
vnd->sc_geom.vng_ntracks = 1;
vnd->sc_geom.vng_ncylinders = vnd->sc_size;
+#else
+   /* else pretend it's just one big cylinder */
+   vnd->sc_geom.vng_secsize = DEV_BSIZE;
+   vnd->sc_geom.vng_nsectors = vnd->sc_size;
+   vnd->sc_geom.vng_ntracks = 1;
+   vnd->sc_geom.vng_ncylinders = 1;
+#endif
}

+   /*
+* n.b.:  this will round the disk's size up to an even cylinder
+* amount, but (if it is writeable) writing into the partly
+* empty cylinder, i.e. past current end of the file, will
+* simply extend the file
+*/
vnd_set_geometry(vnd);

if (vio->vnd_flags & VNDIOF_READONLY) {



--
Greg A. Woods 

Kelowna, BC +1 250 762-7675   RoboHack 
Planix, Inc.  Avoncote Farms 




one remaining mystery about the FreeBSD domU failure on NetBSD XEN3_DOM0

2021-04-11 Thread Greg A. Woods
So, with the vnd(4) issue more or less sorted, there seems to be one
major mystery remaining w.r.t. whatever has gone wrong with the ability
of NetBSD-current XEN3_DOM0 to host FreeBSD domUs.

I still can't create a clean filesystem on a writeable disk.  The
"newfs" runs fine, but a subsequent "fsck" finds errors and cannot fix
them (though the first run does change one or two things).

I can't even get a clean fsck of the running system's root FS:
(the "ada0: disk error" after I hit ^C is because the underlying disk
(vnd0d) is exported read-only to the domU)


# fsck -v /dev/ufs/FreeBSD_Install
start / wait fsck_ufs /dev/ufs/FreeBSD_Install
** /dev/ufs/FreeBSD_Install

SAVE DATA TO FIND ALTERNATE SUPERBLOCKS? [yn] n


ADD CYLINDER GROUP CHECK-HASH PROTECTION? [yn] n

** Last Mounted on
** Root file system
** Phase 1 - Check Blocks and Sizes
PARTIALLY TRUNCATED INODE I=28
SALVAGE? [yn] n

PARTIALLY TRUNCATED INODE I=112
SALVAGE? [yn] ^Cada0: disk error cmd=write 8145-8152 status: fffe

* FILE SYSTEM MARKED DIRTY *

#


Most mysteriously this filesystem is in use as the root FS and all the
files in it can be found and read!  Presumably they are all intact too
-- no programs have failed or behaved mysteriously (except fsck) and all
the human readable files I've looked at (e.g. manual pages) all seem
fine.  In fact it only seems to be fsck that complains, possibly along
with any attempt to write to a filesystem, that causes problems.  (I
believe writing to a filesystem appears to corrupt it but that is only
according to fsck.  I do believe there was an eventual crash of a
system that had been running with active filesystems, but I have not got
far enough again since to reproduce this, due to the fsck problem.)

# mount
/dev/ufs/FreeBSD_Install on / (ufs, local, noatime, read-only)
devfs on /dev (devfs, local, multilabel)
tmpfs on /var (tmpfs, local)
tmpfs on /tmp (tmpfs, local)
# df
Filesystem               512-blocks   Used  Avail Capacity  Mounted on
/dev/ufs/FreeBSD_Install     782968 737016 -16680   102%    /
devfs                             2      2      0   100%    /dev
tmpfs                         65536    232  65304     0%    /var
tmpfs                         40960      8  40952     0%    /tmp
# time -l sh -c 'find  / -type f | xargs cat > /dev/null '
   38.58 real 1.36 user 18.30 sys
  4872  maximum resident set size
13  average shared memory size
 5  average unshared data size
   215  average unshared stack size
  1906  page reclaims
 0  page faults
 0  swaps
 14024  block input operations
 0  block output operations
 0  messages sent
 0  messages received
 0  signals received
 12348  voluntary context switches
33  involuntary context switches


In fact I can put a copy of the FreeBSD img file into an LVM LV, attach
it to the running FreeBSD domU, mount it (without an FSCK, since the
FreeBSD_Install filesystem comes clean from the factory), then do
"diff -r -X /mnt -X /dev / /mnt" and find only the expected differences.

So, what could be different about how fsck reads vs. the kernel itself?

If indeed writing to a filesystem corrupts it, how and why?


It seems NetBSD can make sense of the BSD label inside the FreeBSD
mini-memstick.img file, e.g. when accessed through vnd(4), but it can't
seem to make sense of the filesystem(s) inside (which I guess might be
expected?):

# file -s /dev/rvnd0f
/dev/rvnd0f: DOS/MBR boot sector, BSD disklabel

# disklabel vnd0
# /dev/rvnd0:
type: vnd
disk: vnd
label: fictitious
flags:
bytes/sector: 512
sectors/track: 32
tracks/cylinder: 64
sectors/cylinder: 2048
cylinders: 387
total sectors: 791121
rpm: 3600
interleave: 1
trackskew: 0
cylinderskew: 0
headswitch: 0   # microseconds
track-to-track seek: 0  # microseconds
drivedata: 0

6 partitions:
#        size    offset     fstype [fsize bsize cpg/sgs]
 d:    791121         0     unused      0     0        # (Cyl.      0 -    386*)
 e:      1600         1    unknown                     # (Cyl.      0*-      0*)
 f:    789520      1601     4.2BSD      0     0     0  # (Cyl.      0*-    386*)
disklabel: boot block size 0
disklabel: super block size 0


# fsck -n /dev/vnd0f
** /dev/rvnd0f (NO WRITE)
BAD SUPER BLOCK: CAN'T FIND SUPERBLOCK
/dev/rvnd0f: CANNOT FIGURE OUT SECTORS PER CYLINDER


--
Greg A. Woods 

Kelowna, BC +1 250 762-7675   RoboHack 
Planix, Inc.  Avoncote Farms 




Re: one remaining mystery about the FreeBSD domU failure on NetBSD XEN3_DOM0

2021-04-11 Thread Greg A. Woods
At Sun, 11 Apr 2021 13:23:31 -0700, "Greg A. Woods"  wrote:
Subject: one remaining mystery about the FreeBSD domU failure on NetBSD XEN3_DOM0
>
> In fact it only seems to be fsck that complains, possibly along
> with any attempt to write to a filesystem, that causes problems.

Definitely writing to a FreeBSD domU filesystem, i.e. to a FreeBSD
xbd(4) with a new filesystem created on it, is impossible.

I was able to write 500MB of zeros to the LVM LV backed disk,
overwriting the copy of the .img file I had put there, and only see
500MB of zeros back on the NetBSD side, so writing directly to the raw
/dev/da1 on FreeBSD seems to write data without problem.

However then the following happens when I try to use a new FS there:

# newfs /dev/da1
/dev/da1: 30720.0MB (62914560 sectors) block size 32768, fragment size 4096
using 50 cylinder groups of 626.09MB, 20035 blks, 80256 inodes.
super-block backups (for fsck_ffs -b #) at:
 192, 1282432, 2564672, 3846912, 5129152, 6411392, 7693632, 8975872, 10258112, 
11540352, 12822592, 14104832, 15387072, 16669312,
 17951552, 19233792, 20516032, 21798272, 23080512, 24362752, 25644992, 
26927232, 28209472, 29491712, 30773952, 32056192, 8432,
 34620672, 35902912, 37185152, 38467392, 39749632, 41031872, 42314112, 
43596352, 44878592, 46160832, 47443072, 48725312, 50007552,
 51289792, 52572032, 53854272, 55136512, 56418752, 57700992, 58983232, 
60265472, 61547712, 62829952
# mount /dev/da1 /mnt
# mount
/dev/ufs/FreeBSD_Install on / (ufs, local, noatime, read-only)
devfs on /dev (devfs, local, multilabel)
tmpfs on /var (tmpfs, local)
tmpfs on /tmp (tmpfs, local)
/dev/da1 on /mnt (ufs, local)
# df
Filesystem               512-blocks   Used    Avail Capacity  Mounted on
/dev/ufs/FreeBSD_Install     782968 737016   -16680   102%    /
devfs                             2      2        0   100%    /dev
tmpfs                         65536    608    64928     1%    /var
tmpfs                         40960      8    40952     0%    /tmp
/dev/da1                   60901560     16 56029424     0%    /mnt
# cp /COPYRIGHT /mnt
UFS /dev/da1 (/mnt) cylinder checksum failed: cg 0, cgp: 0xe66de1a4 != bp: 0xf433acbc
UFS /dev/da1 (/mnt) cylinder checksum failed: cg 1, cgp: 0x89ba8532 != bp: 0x3491fbd0
UFS /dev/da1 (/mnt) cylinder checksum failed: cg 3, cgp: 0xdeaf87a7 != bp: 0x3a071e86
UFS /dev/da1 (/mnt) cylinder checksum failed: cg 7, cgp: 0x7085828d != bp: 0xaaae0f19
UFS /dev/da1 (/mnt) cylinder checksum failed: cg 15, cgp: 0x293dfe28 != bp: 0xe2f25f8b
UFS /dev/da1 (/mnt) cylinder checksum failed: cg 31, cgp: 0x9a4d0762 != bp: 0x4119c6e
[[  and on and on  ]]
UFS /dev/da1 (/mnt) cylinder checksum failed: cg 49, cgp: 0x931f84e5 != bp: 0xb48687df

/mnt: create/symlink failed, no inodes free
cp: /mnt/COPYRIGHT: No space left on device
# Apr 11 20:37:28  syslogd: last message repeated 4 times
Apr 11 20:37:59  kernel: pid 713 (cp), uid 0 inumber 2 on /mnt: out of inodes
# df -i
Filesystem               512-blocks   Used    Avail Capacity iused   ifree %iused  Mounted on
/dev/ufs/FreeBSD_Install     782968 737016   -16680   102%   12129     285   98%   /
devfs                             2      2        0   100%       0       0  100%   /dev
tmpfs                         65536    608    64928     1%      75  114613    0%   /var
tmpfs                         40960      8    40952     0%       6   71674    0%   /tmp
/dev/da1                   60901560     16 56029424     0%       2 4012796    0%   /mnt




NetBSD can actually make some sense of this FreeBSD filesystem though:

# fsck -n /dev/mapper/rscratch-fbsd--test.0
** /dev/mapper/rscratch-fbsd--test.0 (NO WRITE)
Invalid quota magic number

CONTINUE? yes

** File system is already clean
** Last Mounted on /mnt
** Phase 1 - Check Blocks and Sizes
** Phase 2 - Check Pathnames
** Phase 3 - Check Connectivity
** Phase 4 - Check Reference Counts
** Phase 5 - Check Cyl groups
SUMMARY INFORMATION BAD
SALVAGE? no

BLK(S) MISSING IN BIT MAPS
SALVAGE? no

** Phase 6 - Check Quotas

CLEAR SUPERBLOCK QUOTA FLAG? no

2 files, 2 used, 7612693 free (21 frags, 951584 blocks, 0.0% fragmentation)

* UNRESOLVED INCONSISTENCIES REMAIN *



I'm not sure if those problems are to be expected with a FreeBSD-created
filesystem or not.  Probably the "Invalid quota magic number" is normal,
but I'm not sure about the "BLK(s) MISSING IN BIT MAPS".  Have FreeBSD
and NetBSD FFS diverged this much?  I won't try to mount it, especially
not from the dom0.

Dumpfs shows the following:

file system: /dev/mapper/rscratch-fbsd--test.0
format  FFSv2
endian  little-endian
location 65536  (-b 128)
magic   19540119        time    Sun Apr 11 13:46:15 2021
superblock location     65536   id      [ 60735d32 358197c4 ]
cylgrp  dynamic inodes  FFSv2   sblock  FFSv2   fslevel 5
nbfree  951584  ndir    2       nifree  4012796 nffree  21
ncg     50      size    7864320 blocks  7612695
bsize   32768   sh

Re: one remaining mystery about the FreeBSD domU failure on NetBSD XEN3_DOM0

2021-04-11 Thread Greg A. Woods
nt
# fsck /dev/da1
** /dev/da1
** Last Mounted on /mnt
** Phase 1 - Check Blocks and Sizes
PARTIALLY TRUNCATED INODE I=325128
SALVAGE? [yn] n

PARTIALLY TRUNCATED INODE I=877864
SALVAGE? [yn] n

PARTIALLY TRUNCATED INODE I=877866
SALVAGE? [yn] n

PARTIALLY TRUNCATED INODE I=877879
SALVAGE? [yn] ^C

* FILE SYSTEM MARKED DIRTY *


Back on the NetBSD side:


 # xl block-detach fbsd-test  2064
 # fsck /dev/mapper/rscratch-fbsd--test.0
** /dev/mapper/rscratch-fbsd--test.0
** Last Mounted on /mnt
** Phase 1 - Check Blocks and Sizes
** Phase 2 - Check Pathnames
** Phase 3 - Check Connectivity
** Phase 4 - Check Reference Counts
** Phase 5 - Check Cyl groups
FREE BLK COUNT(S) WRONG IN SUPERBLK
SALVAGE? [yn] n

SUMMARY INFORMATION BAD
SALVAGE? [yn] n

BLK(S) MISSING IN BIT MAPS
SALVAGE? [yn] n

12076 files, 91642 used, 7647797 free (293 frags, 955938 blocks, 0.0% fragmentation)

* UNRESOLVED INCONSISTENCIES REMAIN *****



--
Greg A. Woods 

Kelowna, BC +1 250 762-7675   RoboHack 
Planix, Inc.  Avoncote Farms 




Re: one remaining mystery about the FreeBSD domU failure on NetBSD XEN3_DOM0

2021-04-11 Thread Greg A. Woods
At Sun, 11 Apr 2021 23:04:29 - (UTC), mlel...@serpens.de (Michael van Elst) wrote:
Subject: Re: one remaining mystery about the FreeBSD domU failure on NetBSD XEN3_DOM0
>
> wo...@planix.ca ("Greg A. Woods") writes:
>
> >SALVAGE? [yn] ^Cada0: disk error cmd=write 8145-8152 status: fffe
>
> That seems to be a message from the disk driver:

Yes, exactly, that's from the FreeBSD kernel as fsck was trying to
update the superblock and mark the filesystem as dirty (their fsck_ffs
always opens the device for write, even with '-n'); and the error is of
course because the backend has attached the disk as a read-only device.


> The latter case should log a message on Dom0 about DIOCCACHESYNC
> failing.

I haven't seen anything like that yet.


> But if you have sectors of DEV_BSIZE like here there is no difference
> and no conflict.

Yes as far as I've seen the FreeBSD domU reports a sector size of 512
bytes in every xbd(4) device and for every GEOM partition it creates or
finds on those devices.

FreeBSD newfs seems to concur that sectors are 512 bytes even when
writing to a raw (i.e. un-labeled) /dev/da1 (which has a 30GB LVM LV
backing it):


# newfs /dev/da1
/dev/da1: 30720.0MB (62914560 sectors) block size 32768, fragment size 4096
using 50 cylinder groups of 626.09MB, 20035 blks, 80256 inodes.

$ echo 62914560 \* 512 / 1024 / 1024 | bc -l
30720.


The NetBSD dom0 reported the attachment of this device with a matching
number of (512-byte) sectors:

xbd backend: attach device scratch-fbsd--t (size 62914560) for domain 2



> The FreeBSD-12.2-RELEASE-amd64-mini-memstick.img I just fetched
> has two MBR partitions:
>
> Partition table:
> 0: EFI system partition (sysid 239)
> start 1, size 1600 (1 MB, Cyls 0/0/2-0/50/1)
> 1: FreeBSD or 386BSD or old NetBSD (sysid 165)
> start 1601, size 789520 (386 MB, Cyls 0/50/2-386/18/17), Active
>
> Making our disklabel program read the FreeBSD disklabel was a bit
> tricky, there is a bug that makes it segfault, but:
>
> type: unknown
> disk:
> label:
> flags:
> bytes/sector: 512
> sectors/track: 1
> tracks/cylinder: 1
> sectors/cylinder: 1
> cylinders: 789520
> total sectors: 789520
> rpm: 3600
> interleave: 0
> trackskew: 0
> cylinderskew: 0
> headswitch: 0   # microseconds
> track-to-track seek: 0  # microseconds
> drivedata: 0
>
> 8 partitions:
> #        size    offset     fstype [fsize bsize cpg/sgs]
>  a:    789504        16     4.2BSD      0     0     0  # (Cyl.     16 - 789519)
>  c:    789520         0     unused      0     0        # (Cyl.      0 - 789519)
>
>
> Apparently the MBR partition 1 starting at sector 1601 is a disk
> image itself and the disklabel is in sector 1 of that image.

Well I think in FreeBSD parlance it just is an MBR partition that has a
BSD label confined within its limits, and that BSD label further divides
its MBR partition into more disk partitions.  That's just the FreeBSD
way -- if I understand correctly their BSD labels are restricted to the
confines of the MBR partition where they sit.

And yes, FreeBSD's disklabel output matches:

# disklabel da0s2
# /dev/da0s2:
8 partitions:
#          size     offset    fstype   [fsize bsize bps/cpg]
  a:     789504         16    4.2BSD        0     0     0
  c:     789520          0    unused        0     0     # "raw" part, don't edit


So in FreeBSD the filesystem there is at "/dev/da0s2a" -- where "da0" is
the "device", "s2" is the second MBR partition, and "a" is of course the
BSD label's "a" partition.  They use more or less the same naming for
GPT entries as well.


> Adding a wedge to access the partition at offset 16 (+1601) gives:
>
> # dkctl vnd0 addwedge freebsd 1617 789504 ffs
> dk6 created successfully.

I had not thought to try that yet.  It's good to see it works!

Now that I can get vnd0d to export the .img file to FreeBSD I think I've
effectively eliminated worries about vnd(4) causing the bigger problems.



Speaking of which, I think this might be evidence that the FreeBSD
system was suffering the effects of accessing the corrupted filesystem I
was experimenting with.  Note the SIGSEGV's from processes apparently
after the kernel has gone into its halt-spin loop (this is the first
time I've seen this particular misbehaviour):


# halt -pq
Waiting (max 60 seconds) for system process `vnlru' to stop... done
Waiting (max 60 seconds) for system process `syncer' to stop...
Syncing disks, vnodes remaining... 0 0 done
Waiting (max 60 seconds) for system thread `bufdaemon' to stop... done
Waiting (max 60 seconds) for system thread `bufspacedaemon-0' to stop... done
Waiting (max 60 seconds) for sys

Re: one remaining mystery about the FreeBSD domU failure on NetBSD XEN3_DOM0

2021-04-13 Thread Greg A. Woods
At Sun, 11 Apr 2021 13:55:36 -0700, "Greg A. Woods"  wrote:
Subject: Re: one remaining mystery about the FreeBSD domU failure on NetBSD XEN3_DOM0
>
> Definitely writing to a FreeBSD domU filesystem, i.e. to a FreeBSD
> xbd(4) with a new filesystem created on it, is impossible.


So, having run out of "easy" ideas, and working under the assumption
that this must be a problem in NetBSD-current dom0 (i.e. not likely in
Xen or Xen tools) I've been scanning through changes and this one, so
far, is one that would seem to me to have at least some tiny possibility
of being the root cause.


RCS file: /cvs/master/m-NetBSD/main/src/sys/arch/xen/xen/xbdback_xenbus.c,v

revision 1.86
date: 2020-04-21 06:56:18 -0700;  author: jdolecek;  state: Exp;  lines: +175 -47;  commitid: 26JkIx2V3sGnZf5C;
add support for indirect segments, which makes it possible to pass
up to MAXPHYS (implementation limit, interface allows more) using
single request

request using indirect segment requires 1 extra copy hypercall per
request, but saves 2 shared memory hypercalls (map_grant/unmap_grant),
so should be net performance boost due to less TLB flushing

this also effectively doubles disk queue size for xbd(4)


I don't see anything obviously glaringly wrong, and of course this is
working A-OK on my same machines with NetBSD-5 and a NetBSD-current (and
originally somewhat older NetBSD-8.99) domUs.

However I'm really not very familiar with this code and the specs for
what it should be doing so I'm unlikely to be able to spot anything
that's missing.  I did read the following, which mostly reminded me to
look in xenstore's db to see what feature-max-indirect-segments is set
to by default:

https://xenproject.org/2013/08/07/indirect-descriptors-for-xen-pv-disks/


Here's what is stored for a file-backed device:

backend = ""
 vbd = ""
  3 = ""
   768 = ""
    frontend = "/local/domain/3/device/vbd/768"
    params = "/build/images/FreeBSD-12.2-RELEASE-amd64-mini-memstick.img"
    script = "/etc/xen/scripts/block"
    frontend-id = "3"
    online = "1"
    removable = "0"
    bootable = "1"
    state = "4"
    dev = "hda"
    type = "phy"
    mode = "r"
    device-type = "disk"
    discard-enable = "0"
    vnd = "/dev/vnd0d"
    physical-device = "3587"
    hotplug-status = "connected"
    sectors = "792576"
    info = "4"
    sector-size = "512"
    feature-flush-cache = "1"
    feature-max-indirect-segments = "17"


Here's what's stored for an LVM-LV backed vbd:

162 = ""
 2048 = ""
  frontend = "/local/domain/162/device/vbd/2048"
  params = "/dev/mapper/vg1-fbsd--test.0"
  script = "/etc/xen/scripts/block"
  frontend-id = "162"
  online = "1"
  removable = "0"
  bootable = "1"
  state = "4"
  dev = "sda"
  type = "phy"
  mode = "r"
  device-type = "disk"
  discard-enable = "0"
  physical-device = "43285"
  hotplug-status = "connected"
  sectors = "83886080"
  info = "4"
  sector-size = "512"
  feature-flush-cache = "1"
  feature-max-indirect-segments = "17"


So "17" seems an odd number, but it is apparently because of "Need to
alloc one extra page to account for possible mapping offset".  It is
currently the maximum for indirect-segments, and it's hard-coded.
(Linux apparently has a max of 256, and the Linux blkfront defaults to
only using 32.)  Maybe it should be "16", so matching max_request_size?
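
To make the arithmetic behind "17" concrete, here is a small sketch.  It
assumes 4 KiB pages and a 64 KiB maximum transfer (matching the
max_request_size of 65536 reported by the FreeBSD front-end); the
function name is made up for illustration and is not taken from
xbdback_xenbus.c.  A transfer that does not start on a page boundary can
touch one more page than its length alone would suggest:

```python
PAGE_SIZE = 4096         # assumed page size
MAXPHYS = 64 * 1024      # assumed maximum transfer size, in bytes


def pages_spanned(offset_in_page: int, length: int) -> int:
    """Count the pages touched by `length` bytes that start
    `offset_in_page` bytes into the first page."""
    if length == 0:
        return 0
    first = offset_in_page // PAGE_SIZE
    last = (offset_in_page + length - 1) // PAGE_SIZE
    return last - first + 1


aligned = pages_spanned(0, MAXPHYS)      # buffer starts on a page boundary
unaligned = pages_spanned(512, MAXPHYS)  # buffer starts 512 bytes into a page
print(aligned, unaligned)
```

Under those assumptions a page-aligned 64 KiB transfer needs 16 segments
and an unaligned one needs 17, which would explain the extra page.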



I did take a quick gander at the related code in FreeBSD (both the domU
code that's talking to this code in NetBSD, and the dom0 code that would
be used if dom0 was running FreeBSD), and besides seeing that it is
quite different, I also don't see anything obviously wrong or
incompatible there either.  (I do note that the FreeBSD equivalent to
xbdback(4) has a major advantage of being able to directly access files,
i.e. without the need for vnd(4).  Not quite as exciting as maybe full
9pfs mounts through to domUs would be, but still pretty neat!)

FreeBSD's equivalent to xbdback(4) (i.e. sys/dev/xen/blkback/blkback.c)
doesn't seem to mention "feature-max-indirect-segments", so apparently
they don't offer it yet, though it does mention "feature-flush-cache".

However their front-end code does detect it and seems to make use of it,
and has done for some 6 years now according to "git blame" (with no
recent fixes beyond fixing a memory leak on their end).  Here we see it
live from FreeBSD's sysctl output, thus my concern that this feature may
be the source of the problem:

hw.xbd.xbd_enable_indirect: 1
dev.xbd.0.max_request_size: 65536
dev.xbd.0.max_request_segments: 17
dev.xbd.0.max_requests: 32

--
Greg A. Woods 

Kelowna, BC +1 250 762-7675   RoboHack 
Planix, Inc.  Avoncote Farms 




Re: one remaining mystery about the FreeBSD domU failure on NetBSD XEN3_DOM0

2021-04-13 Thread Greg A. Woods
At Tue, 13 Apr 2021 18:20:39 -0700, "Greg A. Woods"  wrote:
Subject: Re: one remaining mystery about the FreeBSD domU failure on NetBSD 
XEN3_DOM0
>
> So "17" seems an odd number, but it is apparently because of "Need to
> alloc one extra page to account for possible mapping offset".

Nope, changing that to 16 didn't make any difference.

--
        Greg A. Woods 

Kelowna, BC +1 250 762-7675   RoboHack 
Planix, Inc.  Avoncote Farms 




Re: one remaining mystery about the FreeBSD domU failure on NetBSD XEN3_DOM0

2021-04-14 Thread Greg A. Woods
At Wed, 14 Apr 2021 19:53:47 +0200, Jaromír Doleček  
wrote:
Subject: Re: one remaining mystery about the FreeBSD domU failure on NetBSD 
XEN3_DOM0
> 
> You can test if this is the problem by disabling the feature in
> negotiation in NetBSD xbdback.c - comment out the code which sets
> feature-max-indirect-segments in xbdback_backend_changed(). With the
> feature disabled, FreeBSD DomU should not use indirect segments.

Ah, yes, thanks!  I should have thought of that.  That's especially
useful since on the client side it's a read-only flag:

# sysctl -w hw.xbd.xbd_enable_indirect=0
sysctl: oid 'hw.xbd.xbd_enable_indirect' is a read only tunable
sysctl: Tunable values are set in /boot/loader.conf

Apparently in the Linux implementation the number of indirect segments
used by a domU can be tuned at boot time, and that appears to be done by
setting a driver option on the guest kernel command line.  When I first
read that it didn't make so much sense to me to be giving this kind of
control to the domU.  Perhaps it would be better to make this a tuneable
in xl.cfg(5) such that it can be tuned on a per-guest basis.  Then
setting it to zero for a given guest would not advertise the feature at
all.

I've some other things to do before I can reboot -- I'll report as soon
as that's done....

-- 
Greg A. Woods 

Kelowna, BC +1 250 762-7675   RoboHack 
Planix, Inc.  Avoncote Farms 




Re: one remaining mystery about the FreeBSD domU failure on NetBSD XEN3_DOM0

2021-04-14 Thread Greg A. Woods
48
dev.xbd.0.features: flush
dev.xbd.0.ring_pages: 1
dev.xbd.0.max_request_size: 65536
dev.xbd.0.max_request_segments: 17
dev.xbd.0.max_requests: 32
dev.xbd.0.%parent: xenbusb_front0
dev.xbd.0.%pnpinfo: 
dev.xbd.0.%location: 
dev.xbd.0.%driver: xbd
dev.xbd.0.%desc: Virtual Block Device




For reference the bug behaviour remains the same (at least for this
simplest quick and easy test):

# newfs /dev/da0
/dev/da0: 30720.0MB (62914560 sectors) block size 32768, fragment size 4096
using 50 cylinder groups of 626.09MB, 20035 blks, 80256 inodes.
super-block backups (for fsck_ffs -b #) at:
 192, 1282432, 2564672, 3846912, 5129152, 6411392, 7693632, 8975872, 10258112,
 11540352, 12822592, 14104832, 15387072, 16669312, 17951552, 19233792,
 20516032, 21798272, 23080512, 24362752, 25644992, 26927232, 28209472,
 29491712, 30773952, 32056192, 33338432, 34620672, 35902912, 37185152,
 38467392, 39749632, 41031872, 42314112, 43596352, 44878592, 46160832,
 47443072, 48725312, 50007552, 51289792, 52572032, 53854272, 55136512,
 56418752, 57700992, 58983232, 60265472, 61547712, 62829952
# fsck /dev/da0
** /dev/da0
** Last Mounted on 
** Phase 1 - Check Blocks and Sizes
** Phase 2 - Check Pathnames
** Phase 3 - Check Connectivity
** Phase 4 - Check Reference Counts
** Phase 5 - Check Cyl groups
CG 0: BAD CHECK-HASH 0x49168424 vs 0xe610ac1b
SUMMARY INFORMATION BAD
SALVAGE? [yn] n

BLK(S) MISSING IN BIT MAPS
SALVAGE? [yn] n

CG 1: BAD CHECK-HASH 0xfa76fceb vs 0xb9e90a55
CG 2: BAD CHECK-HASH 0x41f444c vs 0x5efb290e
CG 3: BAD CHECK-HASH 0xad63fe7e vs 0x7ab3861f
CG 4: BAD CHECK-HASH 0xfd2043f3 vs 0xadb781f4
CG 5: BAD CHECK-HASH 0x545cf9c1 vs 0xcec5661e
CG 6: BAD CHECK-HASH 0xaa354166 vs 0x7dd269d3
CG 7: BAD CHECK-HASH 0x349fb54 vs 0x3078e065
CG 8: BAD CHECK-HASH 0xab23a7c vs 0xc8aa7e98
CG 9: BAD CHECK-HASH 0xa3ce804e vs 0x205a6b0d
CG 10: BAD CHECK-HASH 0x5da738e9 vs 0x604d5ecf
CG 11: BAD CHECK-HASH 0xf4db82db vs 0xfef11ffc
CG 12: BAD CHECK-HASH 0xa4983f56 vs 0xc7e701c8
CG 13: BAD CHECK-HASH 0xde48564 vs 0x42072fba
CG 14: BAD CHECK-HASH 0xf38d3dc3 vs 0xad98cf7b
CG 15: BAD CHECK-HASH 0x5af187f1 vs 0xbacadeb1
CG 16: BAD CHECK-HASH 0xe07abf93 vs 0xe4ca225
CG 17: BAD CHECK-HASH 0x490605a1 vs 0xe2917802
CG 18: BAD CHECK-HASH 0xb76fbd06 vs 0xa895abc
CG 19: BAD CHECK-HASH 0x1e130734 vs 0x6a8bc135
CG 20: BAD CHECK-HASH 0x4e50bab9 vs 0x44719a4a
CG 21: BAD CHECK-HASH 0xe72c008b vs 0xadb0c6e9
CG 22: BAD CHECK-HASH 0x1945b82c vs 0x3aeca102
CG 23: BAD CHECK-HASH 0xb039021e vs 0xb99f957d
CG 24: BAD CHECK-HASH 0xb9c2c336 vs 0xd384be85
CG 25: BAD CHECK-HASH 0x10be7904 vs 0x649e2abf
CG 26: BAD CHECK-HASH 0xeed7c1a3 vs 0x95f7
CG 27: BAD CHECK-HASH 0x47ab7b91 vs 0x3fb02d8b
CG 28: BAD CHECK-HASH 0x17e8c61c vs 0xa2b4ca67
CG 29: BAD CHECK-HASH 0xbe947c2e vs 0x65972e04
CG 30: BAD CHECK-HASH 0x40fdc489 vs 0x4219223f
CG 31: BAD CHECK-HASH 0xe9817ebb vs 0x36eb9a37
CG 32: BAD CHECK-HASH 0x3007c2bc vs 0xd1916e1d
CG 33: BAD CHECK-HASH 0x997b788e vs 0x5204f64d
CG 34: BAD CHECK-HASH 0x6712c029 vs 0xe291bcf0
CG 35: BAD CHECK-HASH 0xce6e7a1b vs 0x136ff032
CG 36: BAD CHECK-HASH 0x9e2dc796 vs 0x78ea85c8
CG 37: BAD CHECK-HASH 0x37517da4 vs 0x40c2cf31
CG 38: BAD CHECK-HASH 0xc938c503 vs 0x9b844ab6
CG 39: BAD CHECK-HASH 0x60447f31 vs 0x23129481
CG 40: BAD CHECK-HASH 0x69bfbe19 vs 0xa81f5e9
CG 41: BAD CHECK-HASH 0xc0c3042b vs 0xbd37ebd1
CG 42: BAD CHECK-HASH 0x3eaabc8c vs 0xfadfd8d1
CG 43: BAD CHECK-HASH 0x97d606be vs 0xf41513bc
CG 44: BAD CHECK-HASH 0xc795bb33 vs 0xad4e6069
CG 45: BAD CHECK-HASH 0x6ee90101 vs 0xbeab94a9
CG 46: BAD CHECK-HASH 0x9080b9a6 vs 0x2688acd1
CG 47: BAD CHECK-HASH 0x39fc0394 vs 0xb5a37e85
CG 48: BAD CHECK-HASH 0x83773bf6 vs 0xd779cc90
CG 49: BAD CHECK-HASH 0xe0d3fd3c vs 0xb8083ca
2 files, 2 used, 7612693 free (21 frags, 951584 blocks, 0.0% fragmentation)

***** FILE SYSTEM MARKED DIRTY *****

***** PLEASE RERUN FSCK *****

-- 
Greg A. Woods 

Kelowna, BC +1 250 762-7675   RoboHack 
Planix, Inc.  Avoncote Farms 




Re: one remaining mystery about the FreeBSD domU failure on NetBSD XEN3_DOM0

2021-04-16 Thread Greg A. Woods
So I wrote a little awk script so that I could write 512-byte blocks
with varying values of bytes.  (Awk is the only decent programming
language on the FreeBSD mini-memstick.img which I could think of that
would do something close to what I wanted it to do.  I could have
combined awk+sh+dd and done things faster, but I had all day to let it
run while I worked on some small engine repairs.)

https://github.com/robohack/experiments/blob/master/tblocks.awk

and then I used it to write 30GB to two different LVM LVs, each of
identical size, and each exported to the domU, one written on the dom0
and the other written on the domU.

Then I ran a cmp of both drives on each of the dom0 and the domU.
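
For readers who want to repeat this kind of test, here is a rough Python
sketch of the same idea.  It is not the tblocks.awk script linked above;
the block pattern (block number in the first 8 bytes, then a fill byte
derived from the block number) and the function names are assumptions
chosen only for illustration.  The point is that every 512-byte block is
distinguishable, so a misplaced or corrupted block shows up in a
block-by-block comparison:

```python
BLOCK_SIZE = 512


def make_block(block_no: int) -> bytes:
    """Build one 512-byte block whose contents depend on its number
    (hypothetical pattern, not the one tblocks.awk uses)."""
    header = block_no.to_bytes(8, "big")
    fill = bytes([block_no % 256]) * (BLOCK_SIZE - len(header))
    return header + fill


def write_pattern(path: str, nblocks: int) -> None:
    """Write `nblocks` pattern blocks sequentially to `path`."""
    with open(path, "wb") as f:
        for n in range(nblocks):
            f.write(make_block(n))


def first_difference(path_a: str, path_b: str):
    """Return the byte offset of the first differing block,
    or None if both files are identical."""
    with open(path_a, "rb") as a, open(path_b, "rb") as b:
        offset = 0
        while True:
            block_a = a.read(BLOCK_SIZE)
            block_b = b.read(BLOCK_SIZE)
            if block_a != block_b:
                return offset
            if not block_a:
                return None
            offset += BLOCK_SIZE
```

Pointing write_pattern() at one device from the dom0 and at the other
from the domU, then running first_difference() on both sides, would
mirror the write-then-cmp procedure described here.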

On the dom0 side there were no differences.  All 30GB of what was written
directly in the dom0 to one of the LVs was identical to what was written
in the FreeBSD domU to the other LV.  I.e. the FreeBSD domU side seems
to be writing reliably through to the disk.

The FreeBSD domU though is _really_ slow at reading with cmp (perhaps
not unexpectedly given that it is using stdio to do the read and only
managing 4KB requests, at a rate of just under 500 requests per second
on each disk).

I'm going to send this and go to bed before it finishes, but I'm
guessing it's about 2/3 of the way through (it has run for nearly
11,000 seconds), and thus so far there are no differences from the
FreeBSD domU's point of view either.
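
As a sanity check on that "2/3" guess, the numbers above can be plugged
into a quick estimate.  This sketch assumes "30GB" means 30 GiB and
rounds the request rate up to 500 per second (the observed rate was just
under that), so the real run would take somewhat longer than this
predicts:

```python
REQUEST_SIZE = 4 * 1024       # bytes per read request, as observed
REQUESTS_PER_SEC = 500        # upper bound: the observed rate was "just under 500"
DISK_SIZE = 30 * 1024 ** 3    # assumes 30GB means 30 GiB
ELAPSED = 11_000              # seconds run so far


def seconds_to_read(disk_bytes: int, request_bytes: int, requests_per_sec: float) -> float:
    """Time to read the whole disk at a fixed request size and rate."""
    return disk_bytes / (request_bytes * requests_per_sec)


total = seconds_to_read(DISK_SIZE, REQUEST_SIZE, REQUESTS_PER_SEC)
fraction_done = ELAPSED / total
print(f"estimated total: {total:.0f} s, fraction done: {fraction_done:.2f}")
```

With those figures the full comparison comes out at a bit over 15,700
seconds, so 11,000 seconds is roughly 70% of the way through, which is
consistent with the guess.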

Anyway, what the heck is FreeBSD newfs and/or fsck doing differently!?!?!??

They're both writing and reading the very same raw device(s) that I
wrote and read to/from with awk and cmp.

These awk/cmp tests did very sequential operations, and the data are
quite uniform and regular; whereas newfs/fsck write/read a much more
complex data structure using operations scattered about in the disk.

These tests are also writing then reading enough data to flush through
the buffer caches in each dom0 and domU several times over.  The dom0
has only 4GB and the domU has 8GB, but Xen says it's only using under 2GB.

What else is different?  What am I missing?  What could be different in
NetBSD current that could cause a FreeBSD domU to (mis)behave this way?
Could the fault still be in the FreeBSD drivers -- I don't see how as
the same root problem caused corruption in both HVM and PVH domUs.

--
        Greg A. Woods 

Kelowna, BC +1 250 762-7675   RoboHack 
Planix, Inc.  Avoncote Farms 




Re: one remaining mystery about the FreeBSD domU failure on NetBSD XEN3_DOM0

2021-04-16 Thread Greg A. Woods
At Fri, 16 Apr 2021 11:44:08 +0100, David Brownlee  wrote:
Subject: Re: one remaining mystery about the FreeBSD domU failure on NetBSD 
XEN3_DOM0
>
> On Fri, 16 Apr 2021 at 08:41, Greg A. Woods  wrote:
>
> > What else is different?  What am I missing?  What could be different in
> > NetBSD current that could cause a FreeBSD domU to (mis)behave this way?
> > Could the fault still be in the FreeBSD drivers -- I don't see how as
> > the same root problem caused corruption in both HVM and PVH domUs.
>
> Random data collection thoughts:
>
> - Can you reproduce it on tiny partitions (to speed up testing)
> - If you newfs, shutdown the DOMU, then copy off the data from the
> DOM0 does it pass FreeBSD fsck on a native boot
> - Alternatively if you newfs an image on a native FreeBSD box and copy
> to the DOM0 does the DOMU fsck fail
> - Potentially based on results above - does it still happen with a
> reboot between the newfs and fsck
> - Can you ktrace whichever of newfs or fsck to see exactly what its
> writing (tiny *tiny* filesystem for the win here :)

So, the root filesystem is clean (from the factory, and verified by at
least NetBSD's fsck as OK), but when '-f' is used it is found to be
corrupt.

Unfortunately I don't have any real FreeBSD machines available (though I
could possibly get it installed on my MacBookPro again, but that's
probably a multi-day effort at this point).

However I've just found a way to reproduce the problem reliably and with
a working comparison with a matching-sized memory disk.

First off attach a tiny 4MB LVM LV to FreeBSD -- that's the smallest LV
possible apparently:

dom0 # lvm lvs
  LV          VG      Attr   LSize   Origin Snap%  Move Log Copy%  Convert
  build       scratch -wi-a- 250.00g
  fbsd-test.0 scratch -wi-a-  30.00g
  fbsd-test.1 scratch -wi-a-  30.00g
  nbtest.pkg  vg0     -wi-a-  30.00g
  nbtest.root vg0     -wi-a-  30.00g
  nbtest.swap vg0     -wi-a-   8.00g
  nbtest.var  vg0     -wi-a-  10.00g
  tinytest    vg0     -wi-a-   4.00m
dom0 # xl block-attach fbsd-test format=raw, vdev=sdc, access=rw, target=/dev/mapper/vg0-tinytest


Now a run of the test on the FreeBSD domU (first showing the kernel
seeing the device attachment):


# xbd3: 4MB <Virtual Block Device> at device/vbd/2080 on xenbusb_front0
xbd3: attaching as da2
xbd3: features: flush
xbd3: synchronize cache commands enabled.
GEOM: new disk da2

# dd if=/dev/zero of=tinytest.fs count=8192
8192+0 records in
8192+0 records out
4194304 bytes transferred in 0.081106 secs (51713998 bytes/sec)
# mdconfig -a -t vnode -f tinytest.fs
md0
# newfs -o space -n md0
/dev/md0: 4.0MB (8192 sectors) block size 32768, fragment size 4096
using 4 cylinder groups of 1.03MB, 33 blks, 256 inodes.
super-block backups (for fsck_ffs -b #) at:
 192, 2304, 4416, 6528
# newfs -o space -n da2
/dev/da2: 4.0MB (8192 sectors) block size 32768, fragment size 4096
using 4 cylinder groups of 1.03MB, 33 blks, 256 inodes.
super-block backups (for fsck_ffs -b #) at:
 192, 2304, 4416, 6528
# dumpfs da2 >da2.dumpfs
# dumpfs md0 >md0.dumpfs
# diff md0.dumpfs da2.dumpfs
1,2c1,2
< magic 19540119 (UFS2) time    Fri Apr 16 18:48:55 2021
< superblock location   65536   id  [ 6079dc17 1006b3b4 ]
---
> magic 19540119 (UFS2) time    Fri Apr 16 18:49:57 2021
> superblock location   65536   id  [ 6079dc55 348e5947 ]
27c27
< magic 90255   tell    20000   time    Fri Apr 16 18:48:55 2021
---
> magic 90255   tell    20000   time    Fri Apr 16 18:49:57 2021
40c40
< magic 90255   tell    128000  time    Fri Apr 16 18:48:55 2021
---
> magic 90255   tell    128000  time    Fri Apr 16 18:49:57 2021
53c53
< magic 90255   tell    230000  time    Fri Apr 16 18:48:55 2021
---
> magic 90255   tell    230000  time    Fri Apr 16 18:49:57 2021
66c66
< magic 90255   tell    338000  time    Fri Apr 16 18:48:55 2021
---
> magic 90255   tell    338000  time    Fri Apr 16 18:49:57 2021
# fsck md0
** /dev/md0
** Last Mounted on
** Phase 1 - Check Blocks and Sizes
** Phase 2 - Check Pathnames
** Phase 3 - Check Connectivity
** Phase 4 - Check Reference Counts
** Phase 5 - Check Cyl groups
1 files, 1 used, 870 free (14 frags, 107 blocks, 1.6% fragmentation)

***** FILE SYSTEM IS CLEAN *****
# fsck da2
** /dev/da2
** Last Mounted on
** Phase 1 - Check Blocks and Sizes
** Phase 2 - Check Pathnames
ROOT INODE UNALLOCATED
ALLOCATE? [yn] n


***** FILE SYSTEM MARKED DIRTY *****
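
The dumpfs diff above shows only the creation time and the filesystem id
differing between md0 and da2, i.e. newfs appears to have computed the
same layout for both devices.  A sketch of a filter that makes that
check mechanical follows; the regular expressions are a guess at the
field layout based solely on the output shown here, and the function
names are made up for illustration:

```python
import re
from itertools import zip_longest

# Fields expected to differ between two otherwise identical filesystems.
TIME_RE = re.compile(r"time\s*\w{3} \w{3} [ \d]\d \d\d:\d\d:\d\d \d{4}")
ID_RE = re.compile(r"id\s*\[ [0-9a-f]+ [0-9a-f]+ \]")


def normalize(line: str) -> str:
    """Replace the timestamp and filesystem id with fixed placeholders."""
    line = TIME_RE.sub("time <T>", line)
    return ID_RE.sub("id <ID>", line)


def significant_diffs(a_lines, b_lines):
    """Return (line number, a, b) for every pair of lines that still
    differs once timestamps and ids are ignored."""
    diffs = []
    pairs = zip_longest(a_lines, b_lines, fillvalue="")
    for lineno, (a, b) in enumerate(pairs, start=1):
        if normalize(a) != normalize(b):
            diffs.append((lineno, a, b))
    return diffs
```

Feeding it the lines of md0.dumpfs and da2.dumpfs should return an empty
list if the two filesystems really were laid out identically.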


So I ktraced the fsck_ufs run, and though I haven't looked at it with a
fine-tooth comb and the source open, the only thing that seems a wee bit
different about what fsck does is that it opens the device twice, with
O_RDONLY, then shortly before it prints the first "** /dev/da2" line it
reopens it O_RDWR a third time, closes the second one, and then closes
the second one and calls dup() on the third one so that it has the same
FD# as the

Xen FreeBSD domU block I/O problem begins somewhere between 8.99.32 (2020-06-09) and 9.99.81 (2021-03-10)

2021-04-16 Thread Greg A. Woods
So I was just reminded that I do still have a Xen server that's still
running the 8.99.32 kernel and Xen-4.11.  I had not been testing on it
because it still of course has the vnd(4) CHS size bug (and because it's
also hosting my $HOME and /usr/src and I don't want to crash it), and I
had not remembered until just now that I can work around that by simply
padding out the mini-memstick.img file!
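
Padding an image file can be done by extending it with zeros to the next
multiple of some unit.  The sketch below is only an illustration: the
unit that the vnd(4) CHS calculation actually needs is not stated here,
so `unit` is a placeholder parameter and the function name is
hypothetical.

```python
import os


def pad_to_multiple(path: str, unit: int) -> int:
    """Extend the file at `path` so its size is a whole multiple of
    `unit` bytes; return the resulting size.  Never shrinks the file."""
    size = os.path.getsize(path)
    padded = -(-size // unit) * unit   # round up to the next multiple
    if padded != size:
        with open(path, "r+b") as f:
            f.truncate(padded)         # on most systems the new area reads as zeros
    return padded
```

Working on a copy of the image rather than the original is the safer
choice, since the padding changes the file in place.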

And, so

It works, A-OK, with all other things remaining the same:

# ls -l /dev/xbd0
crw-r-----  1 root  operator  0x3a Apr 17 04:31 /dev/xbd0
# newfs /dev/xbd0
/dev/xbd0: 20480.0MB (41943040 sectors) block size 32768, fragment size 4096
using 33 cylinder groups of 626.09MB, 20035 blks, 80256 inodes.
super-block backups (for fsck_ffs -b #) at:
 192, 1282432, 2564672, 3846912, 5129152, 6411392, 7693632, 8975872, 10258112,
 11540352, 12822592, 14104832, 15387072, 16669312, 17951552, 19233792,
 20516032, 21798272, 23080512, 24362752, 25644992, 26927232, 28209472,
 29491712, 30773952, 32056192, 33338432, 34620672, 35902912, 37185152,
 38467392, 39749632, 41031872
# fsck /dev/xbd0
** /dev/xbd0
** Last Mounted on
** Phase 1 - Check Blocks and Sizes
** Phase 2 - Check Pathnames
** Phase 3 - Check Connectivity
** Phase 4 - Check Reference Counts
** Phase 5 - Check Cyl groups
2 files, 2 used, 5076797 free (21 frags, 634597 blocks, 0.0% fragmentation)

***** FILE SYSTEM IS CLEAN *****
#


So the problem is almost certainly in NetBSD-current itself, and
somewhere in the vast gulf between 8.99.32 (2020-06-09) and 9.99.81
(2021-03-10).

Unfortunately I don't have enough hardware that's Xen-capable and up and
running well enough to allow me to do any brute-force bisecting.

--
        Greg A. Woods 

Kelowna, BC +1 250 762-7675   RoboHack 
Planix, Inc.  Avoncote Farms 



