Re: How does PCIe appear to the host?
>>> It's supposed to negotiate down to x1?
>> Yes.
> Okay.  I've dropped the manufacturer an email asking whether this
> device is supposed to work in a mechanically x16 slot which has only
> one lane available to it.

It's not.  Vantec wrote back, saying

	I'm sorry, but this card won't work with Any PCIe slot with
	only 1 lane.  Yes, technically it should be able to run in 1
	lane, but we designed it not to support using 1 lane for 5
	devices, it will be slow.

I offered my opinion that "slow" is better than "not working at all",
but, obviously, that's not relevant to the existing hardware.

/~\ The ASCII Mouse
\ / Ribbon Campaign
 X  Against HTML	mo...@rodents-montreal.org
/ \ Email!		7D C8 61 52 5D E7 2D 39  4E F1 31 3E E8 B3 27 4B
Re: How does PCIe appear to the host?
>> Here's what ACPI has to say.  As I mentioned, dmesg is identical
>> with or without the card.
> There are devices out there that require a relatively recent host
> ( '3rd generation PCIE' or somesuch ).

I am inclined to doubt that's what's up here.  The Q1900M fails; the
one that works is an Asus M3A78-CM, which still has PCI slots.
Certainly I've owned the Asus for years, whereas the Q1900M was bought
this summer, not that the time I've possessed it necessarily goes with
the time it's existed (or been designed).  Certainly the Q1900M
*looks* like a newer design.
Re: How does PCIe appear to the host?
>> It's supposed to negotiate down to x1?
> Yes.

Okay.  I've dropped the manufacturer an email asking whether this
device is supposed to work in a mechanically x16 slot which has only
one lane available to it.

The primary chip on the board is indeed marked with the JMicron logo
and "JMB585" (it's also marked "2143 QHBA1 A" and "E771C0011"), but
there are numerous other components, including at least two other
chips, there.

>> Then either Vantec or ASRock has done something odd or my particular
>> Q1900M has a duff "x16" slot, because it doesn't work.
> I once had a PCIe network card in a x16 slot that didn't work
> reliably and wasn't recognized now and then.  Reason was that the
> edge connector wasn't correctly aligned and I had to shape it with a
> file.  Some things are just too cheap.

I may have to do some such.  I'll see what Vantec has to say.  (In
particular, I don't want to do anything permanent to this card while
there's still some chance I may need to RMA it.)
Re: How does PCIe appear to the host?
>> ahcisata0 at pci2 dev 0 function 0: vendor 0x197b product 0x0585
> That's a JMicron JMB585 which has a PCIe Gen3 x2 interface and
> provides five 6Gbps SATA ports.

That sounds right; the card has five SATA connectors and its ppb
reports "link is x2 @ 5.0GT/s".

> If your board has eight SATA ports,

No, the card has only five ports, both physically (only five SATA
connectors) and digitally (five atabus instances appear in autoconf).

> A JMB585 should have no problems to work in a x1 slot.

It's supposed to negotiate down to x1?  Then either Vantec or ASRock
has done something odd or my particular Q1900M has a duff "x16" slot,
because it doesn't work.

I think my quad wm is a x1 card; if I can find the silly thing I'll
try it in the ASRock to see if the slot itself works.
Re: How does PCIe appear to the host?
	0xd0: 0xc000 0x0842 0xc9118000 0x
	0xe0: 0x 0x 0x0004 0x
	0xf0: 0x0050 0x00c0 0x010e0f1a 0x0100

When I put the card into the other machine, into the x16 slot that
actually _is_ x16, the one it works in, it shows up as ahcisata0,
attached via

ppb1 at pci0 dev 2 function 0: vendor 0x1022 product 0x9603 (rev. 0x00)
ppb1: PCI Express capability version 2 x16 @ 5.0GT/s
ppb1: link is x2 @ 5.0GT/s
pci2 at ppb1 bus 2
pci2: i/o space, memory space enabled, rd/line, wr/inv ok
ahcisata0 at pci2 dev 0 function 0: vendor 0x197b product 0x0585
ahcisata0: interrupting at ioapic0 pin 18
ahcisata0: AHCI revision 0x10301, 5 ports, 32 command slots, features 0xef33e080

and pcictl dump /dev/pci2 -d 0 shows (114 lines)

PCI configuration registers:
  Common header:
    0x00: 0x0585197b 0x00100107 0x01060100 0x0010
    Vendor Name: JMicron Technology (0x197b)
    Device ID: 0x0585
    Command register: 0x0107
      I/O space accesses: on
      Memory space accesses: on
      Bus mastering: on
      Special cycles: off
      MWI transactions: off
      Palette snooping: off
      Parity error checking: off
      Address/data stepping: off
      System error (SERR): on
      Fast back-to-back transactions: off
      Interrupt disable: off
    Status register: 0x0010
      Capability List support: on
      66 MHz capable: off
      User Definable Features (UDF) support: off
      Fast back-to-back capable: off
      Data parity error detected: off
      DEVSEL timing: fast (0x0)
      Slave signaled Target Abort: off
      Master received Target Abort: off
      Master received Master Abort: off
      Asserted System Error (SERR): off
      Parity error detected: off
    Class Name: mass storage (0x01)
    Subclass Name: SATA (0x06)
    Interface: 0x01
    Revision ID: 0x00
    BIST: 0x00
    Header Type: 0x00 (0x00)
    Latency Timer: 0x00
    Cache Line Size: 0x10
  Type 0 ("normal" device) header:
    0x10: 0xdc01 0xd881 0xd801 0xd481
    0x20: 0xd401 0xfbefe000 0x 0x197b
    0x30: 0xfbee 0x0080 0x 0x010a
    Base address register at 0x10
      type: i/o
      base: 0xdc00, not sized
    Base address register at 0x14
      type: i/o
      base: 0xd880, not sized
    Base address register at 0x18
      type: i/o
      base: 0xd800, not sized
    Base address register at 0x1c
      type: i/o
      base: 0xd480, not sized
    Base address register at 0x20
      type: i/o
      base: 0xd400, not sized
    Base address register at 0x24
      type: 32-bit nonprefetchable memory
      base: 0xfbefe000, not sized
    Cardbus CIS Pointer: 0x
    Subsystem vendor ID: 0x197b
    Subsystem ID: 0x
    Expansion ROM Base Address: 0xfbee
    Capability list pointer: 0x80
    Reserved @ 0x38: 0x
    Maximum Latency: 0x00
    Minimum Grant: 0x00
    Interrupt pin: 0x01 (pin A)
    Interrupt line: 0x0a
  Capability register at 0x80
    type: 0x01 (Power Management, rev. 1.0)
  Capability register at 0x90
    type: 0x05 (MSI)
  Capability register at 0xc0
    type: 0x10 (PCI Express)
  PCI Power Management Capabilities Register
    Capabilities register: 0x4003
      Version: 1.2
      PME# clock: off
      Device specific initialization: off
      3.3V auxiliary current: self-powered
      D1 power management state support: off
      D2 power management state support: off
      PME# support: 0x08
    Control/status register: 0x0008
      Power state: D0
      PCI Express reserved: off
      No soft reset: on
      PME# assertion disabled
      PME# status: off
  PCI Express Capabilities Register
    Capability version: 2
    Device type: Legacy PCI Express Endpoint device
    Interrupt Message Number: 0
  Device-dependent header:
    0x40: 0x 0x 0x 0x
    0x50: 0x 0x 0x 0x
    0x60: 0x 0x 0x 0x
    0x70: 0x 0x 0x 0x
    0x80: 0x40039001 0x0008 0x 0x
    0x90: 0x0086c005 0x 0x 0x
    0xa0: 0x 0x 0x 0x
    0xb0: 0xc011 0x 0x0008 0x
    0xc0: 0x00120010 0x10008102 0x00092810 0x0041a023
    0xd0: 0x00220040 0x 0x 0x
    0xe0: 0x 0x00140392 0x 0x000e
    0xf0: 0x00010003 0x 0x 0x

I hope the above includes whatever you'd be looking for!
Re: How does PCIe appear to the host?
> [...].  I just today picked up a 5-port PCIe SATA card and tried it.

In case it matters to anyone, the card is identified, on the box and
on a sticker on the card itself, as a UGT-ST655, and it comes from
Vantec.  As one of my messages quoted autoconf as saying, it shows up
as vendor 0x197b product 0x0585.
Re: How does PCIe appear to the host?
>> If you're plugging in a card and it isn't seen, I'd check the BIOS
>> and look for any pcie settings it might have.
> I suspect it's worse than that in this case; see mlelstv's mail,
> explaining that there are only four lanes available total, so my
> "x16" slot is [actually x1 electrically]

I tried the card in another machine, in a "x16" slot.  It appears
_that_ marking is honest in that respect:

ppb1 at pci0 dev 2 function 0: vendor 0x1022 product 0x9603 (rev. 0x00)
ppb1: PCI Express capability version 2 x16 @ 5.0GT/s
ppb1: link is x2 @ 5.0GT/s
pci2 at ppb1 bus 2
pci2: i/o space, memory space enabled, rd/line, wr/inv ok
ahcisata0 at pci2 dev 0 function 0: vendor 0x197b product 0x0585
ahcisata0: interrupting at ioapic0 pin 18
ahcisata0: AHCI revision 0x10301, 5 ports, 32 command slots, features 0xef33e080
atabus0 at ahcisata0 channel 0
atabus1 at ahcisata0 channel 1
atabus2 at ahcisata0 channel 2
atabus3 at ahcisata0 channel 3
atabus4 at ahcisata0 channel 4
...
ahcisata0 port 2: device present, speed: 6.0Gb/s
wd0 at atabus2 drive 0:
wd0: drive supports 16-sector PIO transfers, LBA48 addressing
wd0: HPA enabled, protected area present
wd0: limit not raised (not enabled in configuration)
wd0: 1863 GB, 3876021 cyl, 16 head, 63 sec, 512 bytes/sect x 3907029168 sectors
wd0: drive supports PIO mode 4, DMA mode 2, Ultra-DMA mode 6 (Ultra/133)
wd0: block sizes: medium 4096, interface 512, alignment 0
wd0(ahcisata0:2:0): using PIO mode 4, DMA mode 2, Ultra-DMA mode 6 (Ultra/133) (using DMA)

Note (a) the ppb1: line says x16 (though it also says "link is x2"; if
I'm reading the code right this might mean the card actually requires
only x2, not x4, though it's x4 mechanically) and (b) ahcisata0 is
found, including the drive I had connected as a test.

So the card appears fine.  I think mlelstv's explanation is (not
surprisingly, in view of whom it's from) right about why it doesn't
work on the ASRock.

I also tried it in a "x8" slot, in a Dell PowerEdge 840.  This one is
more confusing, at least to me:

pchb0 at pci0 dev 0 function 0
pchb0: vendor 0x8086 product 0x2778 (rev. 0x00)
ppb0 at pci0 dev 1 function 0: vendor 0x8086 product 0x2779 (rev. 0x00)
ppb0: PCI Express capability version 1 x8 @ 2.5GT/s
pci1 at ppb0 bus 1
pci1: i/o space, memory space enabled, rd/line, wr/inv ok
vendor 0x197b product 0x0585 (SATA mass storage, interface 0x01) at pci1 dev 0 function 0 not configured

It says it actually is x8, but then...not configured?  This is using
the same kernel as the above (each one was using my PXE boot setup,
just on different hardware), so I really want to figure out why the
same hardware - reported as same vendor/product, even recognized as
"SATA mass storage" - would fail to match on the second machine.  I
took that machine out of service, but I can't now recall why; maybe
it's very slightly broken, returning the right vendor/product numbers
but trash in other fields?  I'll have to investigate.
Re: How does PCIe appear to the host?
First, my thanks to everyone; while the news is not great from the
perspective of getting things working, it has greatly improved my
understanding of PCIe, so in that sense it was a success.  You people
are a marvelous resource!

> It's been my experience that pcie busses show up as pci busses from
> the software perspective

My experience is _very_ limited - I've used PCIe in all of one other
case - but that's how it worked in that case.

> If you're plugging in a card and it isn't seen, I'd check the BIOS
> and look for any pcie settings it might have.

I suspect it's worse than that in this case; see mlelstv's mail,
explaining that there are only four lanes available total, so my "x16"
slot is x16 only mechanically - it seems to me it would have been more
honest of ASRock to use a x1 socket that's open at the end, so any
size card can fit mechanically but the slot doesn't appear to be more
than it is.

> You don't say which version of NetBSD you're running, but I've used
> pcie as early as NetBSD-3 and quite extensively on NetBSD-5.

I didn't?  *check*  Ah, yes, just "relatively old".  It's my (somewhat
mutant) 5.2, and, yes, I have pcictl.

> You might experiment by booting without ACPI to see how the PCI
> busses probe in that case.

I may try that.  But I suspect the PCI(e) busses are doing the best
they can.  The ppbs are visible with pcictl list

000:28:0: Intel product 0x0f48 (PCI bridge, revision 0x0e)
000:28:1: Intel product 0x0f4a (PCI bridge, revision 0x0e)
000:28:2: Intel product 0x0f4c (PCI bridge, revision 0x0e)
000:28:3: Intel product 0x0f4e (PCI bridge, revision 0x0e)

which matches what autoconf reports at boot time.  pcictl dump on the
ppbs reports a lot of stuff, but nothing interesting that the
backported dump of the XCAP/LCAP values doesn't, and, indeed, doesn't
report the max width value from PCIE_LCAP as far as I can see.

Here's what ACPI has to say.  As I mentioned, dmesg is identical with
or without the card.

ACPI Warning (tbfadt-0327): FADT (revision 5) is longer than ACPI 2.0 version, truncating length 0x10C to 0xF4 [20080321]
[...]
acpi0 at mainbus0: Intel ACPICA 20080321
acpi0: X/RSDT: OemId , AslId
acpi0: SCI interrupting at int 9
acpi0: fixed-feature power button present
timecounter: Timecounter "ACPI-Safe" frequency 3579545 Hz quality 900
ACPI-Safe 24-bit timer
hpet0 at acpi0 (HPET, PNP0103-0): mem 0xfed0-0xfed003ff
timecounter: Timecounter "hpet0" frequency 14318179 Hz quality 2000
FWHD (INT0800) at acpi0 not configured
attimer1 at acpi0 (TIMR, PNP0100): io 0x40-0x43,0x50-0x53 irq 0
LPTE (PNP0400) at acpi0 not configured
UAR1 (PNP0501) at acpi0 not configured
ADMA (DMA0F28) at acpi0 not configured
acpibut0 at acpi0 (PWRB, PNP0C0C): ACPI Power Button
acpibut1 at acpi0 (SLPB, PNP0C0E): ACPI Sleep Button
BTH0 (BCM2E1A) at acpi0 not configured
GPS0 (BCM4752) at acpi0 not configured
CAM0 (INTCF0B) at acpi0 not configured
CAM1 (INTCF1A) at acpi0 not configured
STRA (INTCF1C) at acpi0 not configured
SHUB (SMO91D0) at acpi0 not configured
FAN0 (PNP0C0B) at acpi0 not configured
acpitz0 at acpi0 (TZ01): active cooling level 0: 50.0C critical 90.0C hot 85.0C passive 26.8C

That first line looks possibly worrisome; if I knew ACPI I'd have a
better idea whether it's anything to be concerned over.

> If the PCIE card you're using is working properly, it should be seen
> by the BIOS as additional SATA ports.  If you have a drive plugged
> into it when you boot the BIOS, you might even be rewarded with a
> listing of the make and model of the drive you have connected.  If
> you see that, you then know it's not hardware trouble.

Hard to tell.  The BIOS setup facility is horrible; it was designed by
someone who thinks GUI glitz is more important than functionality.
But I *think* I didn't get any additional drives listed (not
surprising in view of the above).  I'll experiment a bit more.

I think I have another machine with a PCIe slot (I once had a quad wm
in there, but I can't recall whether it was x1 or x4 or what); if I
can find it I may try the SATA card there.  And if I can find _it_, I
may also try the quad wm in the ASRock.
How does PCIe appear to the host?
How does PCIe differ from PCI from the CPU's point of view?  I'm
running into an issue and it seems to me this is important.

I've been having hardware (partially) fail on me recently, which is
breaking my backups.  In particular, I'm having trouble finding a
machine to connect the backup disks to - they're SATA, and I don't
have very many machines with SATA, and some of those SATA ports appear
to be broken (one machine, for example, has six, of which I've been
able to make only two work).

One of these machines is an ASRock Q1900M.  It has only two SATA ports
onboard; it has two PCIe x1 slots and a PCIe x16 slot.  I just today
picked up a 5-port PCIe SATA card and tried it.

The reason I'm asking about PCIe is that, as far as I can tell from
the host, it isn't there at all.  While this is a relatively old
kernel, I'd expect at least a "not configured" line - but dmesg is
identical between a boot with it and a boot without it.  So now I'm
wondering whether I've got a DOA card, or a duff slot, or I need to
backport something, or whether this is somehow expected, or what.

I've already backported the printout of PCIe capability in ppb.c.  The
ppbs report as

ppb0: PCI Express capability version 2 x1 @ 5.0GT/s
pci1 at ppb0 bus 1
pci1: i/o space, memory space enabled, rd/line, wr/inv ok
ppb1 at pci0 dev 28 function 1: vendor 0x8086 product 0x0f4a (rev. 0x0e)
ppb1: PCI Express capability version 2 x1 @ 5.0GT/s
pci2 at ppb1 bus 2
pci2: i/o space, memory space enabled, rd/line, wr/inv ok
ppb2 at pci0 dev 28 function 2: vendor 0x8086 product 0x0f4c (rev. 0x0e)
ppb2: PCI Express capability version 2 x1 @ 5.0GT/s
pci3 at ppb2 bus 3
pci3: i/o space, memory space enabled, rd/line, wr/inv ok
ppb3 at pci0 dev 28 function 3: vendor 0x8086 product 0x0f4e (rev. 0x0e)
ppb3: PCI Express capability version 2 x1 @ 5.0GT/s

I note a possible conflict between the "x1" and the presence of a x16
slot; that 1 is coming from the PCIE_LCAP_MAX_WIDTH bits in PCIE_LCAP,
which makes me wonder whether something needs configuring to run the
x16 slot at more than x1.  The card does say it needs a x4 or higher
slot to work, so if the x16 slot is running x1 (is that even
possible?) that might be responsible.

Any thoughts?  Any pointers to where I might usefully look?
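For reference, the PCIE_LCAP fields mentioned above can be decoded by
hand; a minimal sketch, with bit positions per the PCIe spec's Link
Capabilities register (maximum link speed in bits 3:0, where 1 means
2.5GT/s and 2 means 5.0GT/s; maximum link width in bits 9:4).  The
example register value is hypothetical, not taken from the dumps here.

```python
# Sketch: pull the "max width" and "max speed" fields out of a PCIe
# Link Capabilities (LCAP) register value, as discussed above.
def lcap_max_speed(lcap):
    return lcap & 0xF          # 1 = 2.5GT/s, 2 = 5.0GT/s, 3 = 8.0GT/s

def lcap_max_width(lcap):
    return (lcap >> 4) & 0x3F  # 1 = x1, 2 = x2, 4 = x4, ...

lcap = 0x0012                  # hypothetical: a bridge capable of x1 @ 5.0GT/s
speed = lcap_max_speed(lcap)   # -> 2
width = lcap_max_width(lcap)   # -> 1
```

A ppb reporting "x1 @ 5.0GT/s" corresponds to exactly such a value.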
Re: vio9p vs. GENERIC.local vs. XEN3_DOM[0U]
>> If virtio were declared normally in the kernels that provide it and
>> declared as valid but specifically absent in XEN3_DOM* kernels?
>> Then I think that's what I'd want (to my limited understanding,
>> this is close to what "no virtio" does at present).
> A fair point, but are you suggesting that every bus that could ever
> exist be declared and all other kernels have "no", as a general
> approach?

Well, every bus that you'd want to support this feature for.

Alternatively, perhaps add some declaration that says "this is a valid
bus name" without declaring any instances of it, then do the
silent-drop thing for any attachment which is at a bus declared as
valid but without instances.  Then all - well, most - configs would
include all-buses.conf or some such to get those declarations, then
declare normally the buses they want to actually have.

Of course, then you have people wondering why the new device
attachment line they added isn't working.  But you'll have that
potential for _any_ design with a "silently drop this line" semantic.
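The "valid but instance-less" idea above might look something like the
following sketch.  The bus-declaration keyword is invented purely for
illustration (no such statement exists in config(1) today); the
attachment lines are ordinary config syntax.

```
# all-buses.conf: names each bus as valid without instantiating any.
known-bus virtio		# hypothetical declaration

# A kernel that really has virtio declares instances as usual:
virtio* at pci? dev ? function ?
vio9p*  at virtio?

# A kernel that includes only all-buses.conf never instantiates
# virtio, so a "vio9p* at virtio?" line in GENERIC.local would be
# silently dropped rather than rejected.
```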
Re: vio9p vs. GENERIC.local vs. XEN3_DOM[0U]
> The right level of abstraction is to do something that says
>	if there is a virtio bus, add viop* at virtio*
> and this is true of pretty much anything that attaches to a bus that
> may or may not be present.
> I wonder if there are good reasons to avoid "just skip lines that
> refer to a bus that don't exist".

My answer is, error-checking.  If I, say, typo "pci" as "cpi" in

mydev* at cpi?

I'd want an error rather than having the line silently ignored.  (That
particular typo is not all that plausible.  It's just an example.)

Now, if virtio were specifically declared as "this name is valid but
may or may not be present"?  I'm on the fence.

If virtio were declared normally in the kernels that provide it and
declared as valid but specifically absent in XEN3_DOM* kernels?  Then
I think that's what I'd want (to my limited understanding, this is
close to what "no virtio" does at present).
Re: Moving memory in VM?
>> Is there any way to move memory within a process?  That is, [...]
> mremap?  Or what am I missing?

No, thank you; it was I who was missing something (specifically, the
existence of mremap - not something I've run into before).  I was
hoping it would be that simple!
Moving memory in VM?
Is there any way to move memory within a process?  That is, I have a
bunch of valid pages at addresses [A..B) and I want to move those
pages to [C..C+(B-A)).

The principal use case I have at the moment has the initial mapping
mapped MAP_ANON, with C fixed and mmap()-style semantics for anything
that was in the target range to start with.  It's OK if the [A..B)
range is unmapped in the process, and it's OK to require that the old
and new address ranges not overlap.

(I'm trying to implement an alternative memory allocator and want
realloc() for multi-page blocks to be faster than copying all the
bytes.)
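The mremap() answer that followed can be sketched like this.  Caveat:
this uses Linux's mremap(2) semantics and constants (MREMAP_MAYMOVE);
NetBSD's mremap(2) takes a different argument list (oldp, oldsize,
newp, newsize, flags).  The idea is the same: the kernel remaps the
pages rather than anyone copying the bytes.

```python
import ctypes
import ctypes.util
import mmap

# Linux-semantics sketch: grow (and possibly move) an anonymous
# mapping without copying its contents.
libc = ctypes.CDLL(ctypes.util.find_library("c") or None, use_errno=True)
libc.mmap.restype = ctypes.c_void_p
libc.mmap.argtypes = [ctypes.c_void_p, ctypes.c_size_t, ctypes.c_int,
                      ctypes.c_int, ctypes.c_int, ctypes.c_long]
libc.mremap.restype = ctypes.c_void_p
libc.mremap.argtypes = [ctypes.c_void_p, ctypes.c_size_t,
                        ctypes.c_size_t, ctypes.c_int]

MREMAP_MAYMOVE = 1                      # Linux-specific value
page = mmap.PAGESIZE

old = libc.mmap(None, 4 * page, mmap.PROT_READ | mmap.PROT_WRITE,
                mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS, -1, 0)
ctypes.memset(old, ord("x"), 4 * page)  # fill the original pages

# Quadruple the mapping; the kernel may move the pages to a new range.
new = libc.mremap(old, 4 * page, 16 * page, MREMAP_MAYMOVE)
assert new not in (None, ctypes.c_void_p(-1).value), "mremap failed"
moved = ctypes.string_at(new, 4 * page) # the old contents came along
```

This is exactly the multi-page-realloc trick: [A..B) may end up at a
new C, with [A..B) unmapped, and no byte copying done by the caller.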
Re: strange zfs size allocation data
>> Is bup zfs-specific?
> No, it is a general backup program.  I just happen to have sources
> for it on zfs.  Which people tell me is a great filesystem and is
> now not odd on NetBSD

Well, great for some use cases.  I have, for example, seen it said
that it's ludicrously RAM-hungry (as in, you need multiple gigs of RAM
or you shouldn't even think about using it).  This is fine if you have
a machine so overspecced that you have that much RAM to burn; it's
less great if you're looking for something more general-purpose.  (To
be fair, it also has upsides; for example, I think I've seen it
described as having the ability to add and remove partitions live, and
as keeping integrity checksums.)

>> Because, if you're not doing something filesystem-specific, I
>> actually think you will have trouble even _defining_ what "100%
>> right" is for this test, since everything about sparseness, right
>> down to whether it's even a thing, is filesystem-dependent.
> True.  The point is to try to verify that the backup program, when
> restoring a sparse file, writes it in such a way that the normal
> implementation of sparse files works, meaning results in a file
> without blocks storing all the zeros.

Fair point.  You might want to have the test verify that the
filesystem in use does support sparseness in the form you're looking
for before doing the rest of the testing.

> What you are missing, and everybody else too, is that the fact that
> this is theoretically impossible is irrelevant to it being useful in
> the real world, to detect regressions, even if it also occasionally
> detects bizarre behavior.

I, at least, haven't been missing that.  When you talk about getting a
test "100% right", though, I read that as including "...even in
relatively unlikely circumstances".  Running on msdosfs strikes me as
unlikely enough to not care about.  tmpfs, though, is relatively
plausible as a filesystem for tests.

> A better test would be 'fuse-sparsetest' that makes metadata
> available for inspection later about the writes it sees.  But that's
> hard to write.

You could get much of that information by ktracing and looking for the
relevant calls - {,p}write{,v} and lseek seem to me to be the most
likely candidates.
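The "verify the filesystem supports sparseness first" suggestion above
could look something like this sketch: create a file that is one big
hole and compare allocated storage (st_blocks, in 512-byte units)
against the logical size.  The helper name is made up for
illustration, and the check assumes the usual POSIX stat semantics.

```python
import os
import tempfile

# Sketch: does the filesystem under test actually store holes sparsely?
def looks_sparse(path):
    st = os.stat(path)
    # A genuinely sparse file allocates far fewer blocks than its
    # logical size implies.
    return st.st_blocks * 512 < st.st_size

fd, path = tempfile.mkstemp()
os.ftruncate(fd, 1 << 20)   # 1 MiB logical size, no data ever written
os.close(fd)
supported = looks_sparse(path)
os.unlink(path)
```

A restore test could run this probe first and skip itself (rather than
fail) on filesystems where sparseness isn't a thing.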
Re: strange zfs size allocation data
> This is a test case, to see if backing up and restoring a sparse
> file results in a sparse file.  I realize that this probably
> requires a logging fuse driver and a lot of complexity to do 100%
> right.

Is bup zfs-specific?  Because, if you're not doing something
filesystem-specific, I actually think you will have trouble even
_defining_ what "100% right" is for this test, since everything about
sparseness, right down to whether it's even a thing, is
filesystem-dependent.
Re: NetBSD-10.0/i386 spurious SIGSEGV
>> pid 853.853 (nagios): signal 11 code=1 (trap 6) @eip 0xbb61057a addr
>> 0x7513f5a9 error=14
> Anyone has ideas of things to investigate?

I don't have anything specific to suggest; I don't work with either
10.* or Xen myself.  But...

> I am about to upgrade the offending domU to amd64, in order to work
> around the problem.  If that works (and I hope it will), I will have
> no way left to test for this bug.

...if you have the space and can set up a test environment you don't
mind giving a copy of to others, it might be useful to help someone
else track it down.  Ideally, that would be a copy of the whole dom0
and domU in question, but I recognize that might involve a prohibitive
amount of space - though if you can strip the test setup down to
essentials, it'll both save on space and reduce what people have to
look at.

But, of course, that depends on you having the resources (disk space,
time, and energy/motivation) to do that.
Re: NetBSD-10.0/i386 spurious SIGSEGV
>> I have seen many crashes on system call returns.  [...]
> I would suggest printing the siginfo, but apparently our gdb doesn't
> support it (so I filed PR 58325): [...]

Okay, this is strictly a debugging workaround: how about building a
kernel with code added so that, whenever a SIGSEGV is delivered, the
siginfo is printed on the console?  It would at least let you get the
information, and I suspect SEGVs are rare enough you wouldn't have to
sift through too many false positives.

It does, though, assume you're comfortable adding code to your kernel
and rebuilding it.  (If trying to build a new kernel SEGVs, maybe
cross-build it?)
Re: NetBSD-10.0/i386 spurious SIGSEGV
> After upgrading i386 XEN3PAE_DOMU to NetBSD 10.0, various daemons on
> multiple machines get SIGSEGV at places I could not figure any
> reason why it happens.  [...]
> Program terminated with signal SIGSEGV, Segmentation fault.
> #0  0xbb610579 in __gettimeofday50 () from /lib/libc.so.12
> (gdb) bt
> #0  0xbb610579 in __gettimeofday50 () from /lib/libc.so.12
> #1  0xbb60ca82 in __time50 (t=t@entry=0xbf7fde88)
>     at /usr/src/lib/libc/gen/time.c:52
> #2  0x0808afdd in update_check_stats (check_type=3, check_time=1717878817)
>     at utils.c:3015

First thing I'd look at is the userland instruction(s) around the
crash point; maybe look at instructions starting at 0xbb610480 or
something and then disassemble forwards looking for 0xbb610579.  In
particular, I'd be interested in whether it's a store instruction that
failed or whether this happened during a syscall trap.

Are all the failures in __gettimeofday50?  All in trap-to-the-kernel
calls?

You say "multiple machines"; are those multiple domUs on a single
dom0, or are they spread across multiple underlying hardware machines?
If the latter, how similar are those underlying machines?  I'm
wondering if perhaps something is broken in a subtle way such that it
manifests on only certain hardware (I'm talking about something along
the lines of "this tickles erratum #2188 in stepping 478 of Intel CPUs
from the Forest Lawn family").
Re: TCP vs urgent data [was Re: poll(): IN/OUT vs {RD,WR}NORM]
>> I might rip out the OOB stuff just to find and fix anything trying
>> to use it, though.
> I think ripping it out would be the right thing.  And I suspect very
> little would notice.

I just did a first pass, doing

find . -type f -print0 | xargs -0 mcgrep -H -l MSG_OOB

in /usr/src.  Most of it is no surprise, or is irrelevant here: telnet
and ftp, documentation, the kernel support that backends it,
sys/compat, those are expected.  But I got one surprise: rlogin{,d}.
And I had a quick look at the code - it actually _uses_ it.  It is not
a case where completely ignoring URG will work.  (It actually uses it
as out-of-band data, too.  You'd almost think it came from the same
people who tried to turn the urgent pointer into out-of-band data in
the first place.)

Fortunately or unfortunately, I don't care about rlogin.  I would
ditch it when eliminating MSG_OOB.

In theory, eliminating MSG_OOB is wrong, because TCP may not be the
only protocol that uses it.  My sweep found sys/netiso/tp_usrreq.c,
and searching for SS_RCVATMARK under sys/ finds hits in netiso as well
as netinet.  But I care about netiso about as much as I care about
rlogin; I certainly don't mind losing it for long enough to find
everything using TCP "OOB".
Re: TCP vs urgent data [was Re: poll(): IN/OUT vs {RD,WR}NORM]
> But reading RFC 959, there is no mention of using urgent data under
> any circumstances.

No _explicit_ mention.  It's there by reference.

> What that RFC says about aborting is:
> [...]
> 2. User system sends the Telnet "Synch" signal.

This involves setting URG.  Read the telnet spec's description of the
SYNCH operation (the top of page 9 of RFC 854 is a good starting
point).  See also SO_OOBINLINE, SIOCATMARK, and SS_RCVATMARK.

I am sorely tempted to try to rip out the OOB botch and design a
socket interface to the urgent pointer that isn't so badly broken, but
I doubt I'd actually find any use for the latter.  I might rip out the
OOB stuff just to find and fix anything trying to use it, though.
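For anyone who hasn't met the sockets "out-of-band" machinery being
criticized here, a minimal sketch: one loopback TCP connection in a
single process, with the sender marking a byte urgent via MSG_OOB and
the receiver pulling it out of band (SO_OOBINLINE and SIOCATMARK are
the related knobs for receiving it in-line instead).  The brief sleep
is just to let both segments arrive before reading.

```python
import socket
import time

# Sketch of the BSD sockets view of the TCP urgent pointer: MSG_OOB.
srv = socket.socket()
srv.bind(("127.0.0.1", 0))
srv.listen(1)
cli = socket.create_connection(srv.getsockname())
conn, _ = srv.accept()

cli.sendall(b"abc")              # ordinary data
cli.send(b"!", socket.MSG_OOB)   # one "urgent" byte
time.sleep(0.2)                  # let both segments arrive

normal = conn.recv(16)               # a normal read stops at the mark
urgent = conn.recv(1, socket.MSG_OOB)  # the urgent byte, out of band
for s in (cli, conn, srv):
    s.close()
```

Note how this turns the urgent *pointer* (a marker in the stream) into
a one-byte side channel - which is exactly the botch being discussed.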
TCP vs urgent data [was Re: poll(): IN/OUT vs {RD,WR}NORM]
Should we maybe move this to tech-net?  It's no longer about poll().

>> I question whether it actually works except by accident; see RFC
>> 6093.
> I hadn't seen that one before,

Neither had I until Johnny Billquist mentioned it upthread.  (I tend
to share your reaction to the modern IETF, though I have additional
reasons.)

>> But the facility it provides is of little-to-no use.  I can't
>> recall anything other than TELNET that actually uses it,
> TELNET and those protocols based upon it (SMTP and FTP command at
> least).

FTP command, yes.  SMTP I'm moderately sure doesn't do TELNET in any
meaningful sense; for example, I'm fairly sure octet 0xff is not
special.  I find no mention of TELNET in 5321.

> SMTP has no actual use for urgent data, and never sends any, but FTP
> can in some circumstances I believe (very ancient unreliable
> memory).

Yes.  It should, according to the spec, be done when sending an ABOR
to abort a transfer in progress.  But, unlike TELNET's specification
that data is to be dropped while looking for IAC DM, the urgent bit
can be completely ignored by an FTP server which is capable of paying
attention to the control channel while a data transfer is in progress.

>> then botched it further by pointing the urgent sequence number to
>> the wrong place,
> In fairness, when that was done, it wasn't clear it was wrong - that
> all long predated anyone even being aware that there were two
> different meanings in the TCP spec, people just used whichever of
> them was most convenient (in terms of how it was expressed, not
> which is easier to implement) and ignored the other completely.
> That's why it took decades to get fixed - no-one knew that the spec
> was broken for a long time.

So...I guess next to nothing depended on it even then, or someone
would have noticed the interoperability fail sooner than decades.

> Further, if used properly, it really doesn't matter much, the
> application is intended to recognise the urgent data by its content
> in the data stream, all the U bit (& urgent pointer) should be doing
> is giving it a boot up the read stream to suggest that it should
> consume more quickly than it otherwise would.

Right.  But...

> Whether that indication stops one byte earlier or later should not
> really matter.

That depends.  Consider TELNET, which is defined to drop data while
searching for IAC DM.  If the sender considers the urgent pointer to
point _after_ the last urgent octet but the receiver considers it to
point _to_ the last urgent octet, the receiver will get the IAC DM and
notice the urgent pointer points past it and continue reading and
dropping, looking for another IAC DM, dropping at least one data octet
the sender didn't expect.

> The text in that RFC about multiple urgent sequences also misses
> that I think -

I thought that was probably there for clarity, clarifying what
logically follows from the rest.

> all that matters is that as long as there is urgent data coming, the
> application should be aware of that and modify its behaviour to read
> more rapidly than it otherwise might (if it never delays reading
> from the network, always receives & processes packets as soon as
> they arrive, which for example, systems which do remote end echo
> need to do) then it doesn't need to pay attention to the U bit at
> all).

Well, there are correctness issues in some cases.  For example, in
TELNET's case, it is defined to drop data while searching for the IAC
DM that makes up part of a synch; ignoring the urgent bit means that
dropping won't happen.  (Does that matter in practice?  Probably not,
especially given how little TELNET is used outside walled gardens.
But it still is a correctness issue.)
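The Synch handling referred to above can be sketched in a few lines:
on a Synch, the receiver discards input until it finds IAC DM (octets
255, 242) and resumes with whatever follows.  The function name and
sample bytes are made up for illustration.

```python
# Sketch of TELNET Synch processing: drop data until IAC DM is seen.
IAC, DM = 255, 242

def resume_after_dm(buf):
    """Discard everything up to and including the first IAC DM."""
    i = buf.find(bytes([IAC, DM]))
    return buf[i + 2:] if i >= 0 else b""   # no DM yet: drop it all

flushed = resume_after_dm(b"stale output\xff\xf2fresh input")
```

This is also why the off-by-one matters for TELNET specifically: a
receiver that believes the urgent data extends one octet past the DM
will keep dropping, looking for a second IAC DM that never comes.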
Re: poll(): IN/OUT vs {RD,WR}NORM
>>> However, the urgent pointer is close to useless in today's network, >>> in that there are few-to-no use cases that it is actually useful >>> for. > That's probably correct too. It is however still used (and still > works) in telnet - though that is not a frequently used application > any more. I question whether it actually works except by accident; see RFC 6093. > [That's where the off by one occurred, there were two references to > it, one suggested that the urgent pointer would reference the final > byte of what is considered urgent, the other that it would reference > one beyond that, that is, the first byte beyond the urgent data. > This was corrected in the Hosts Requirements RFCs, somewhere in the > mid 80's if I remember them, roughly.] But only a few implementors paid any attention, it appears. > That is all very simple, and works very well, particularly on high > latency or lossy networks, as long as you're not expecting "urgent" > to mean "out of band" or "arrive quickly" or anything else like that. But the facility it provides is of little-to-no use. I can't recall anything other than TELNET that actually uses it, though I am by no stretch familiar with more than some of the commonest protocols out there. Furthermore, given that probably the most popular API to TCP, sockets, botched it by trying to turn it into an out-of-band data stream, then botched it further by pointing the urgent sequence number to the wrong place, I'd say it is questionable whether it is good for _anything_ any longer. > If an application needs a mechanism like this, it works well. That's a bit like saying that car hand crank starter handles are useful if you need them: strictly true, but to a first and even second approximation both the statement and the thing stated about are irrelevant to everyone. 
Also, it's true only provided you don't use sockets for your API (or fixed sockets - has anyone done a TCP socket interface that exposes the urgent pointer _properly_?), and provided your and the peer's implementations agree on which sequence number goes in the urgent field. /~\ The ASCII Mouse \ / Ribbon Campaign X Against HTMLmo...@rodents-montreal.org / \ Email! 7D C8 61 52 5D E7 2D 39 4E F1 31 3E E8 B3 27 4B
Re: poll(): IN/OUT vs {RD,WR}NORM
>> Where do we attach 3 priority levels to data? > [I]n the context of poll itself, it's undefined. But it's easy to > think that the TCP urgent data would be something usable in this > context. But as you note, the urgent data is a somewhat broken thing > that noone ever really figured out how it was meant to be used or > anything about it at all. TCP's urgent pointer is well defined. It is not, however, an out-of-band data stream, nor, despite Berkeley's attempts, can it really be twisted and bent into one, unless you are on a network which is high bandwidth, low latency, and low loss (as compared to the "out-of-band" data rate). Even then, the receiving process has to handle data promptly. Which, probably not coincidentally, describes Berkeley's network and most of their network programs at the time. However, the urgent pointer is close to useless in today's network, in that there are few-to-no use cases that it is actually useful for. > It's not really even oob. No, and it never has been, despite Berkeley's hallucinations. > But poll isn't really getting into any details about this. Just that > if you have some sort of [data stream], where you can assign multiple > priorities to the data, then poll can make use of it. For a few priorities, yes. It's a poor design in that it doesn't provide any good way to handle more than two or maybe three different priorities. > I have no idea if any such device or file ever have had such a > distinction. Possibly nobody except System V (or possibly its direct ancestors) ever did. My impression is that it's a STREAMS thing, but that's a fuzzy impression, mostly coming from manpage notes ("The distinction between some of the fields in the events and revents bitmasks is really not useful without STREAMS."). /~\ The ASCII Mouse \ / Ribbon Campaign X Against HTMLmo...@rodents-montreal.org / \ Email! 7D C8 61 52 5D E7 2D 39 4E F1 31 3E E8 B3 27 4B
Re: poll(): IN/OUT vs {RD,WR}NORM
>> POLLPRI     High priority data may be read without blocking.
>> POLLRDBAND  Priority data may be read without blocking.
>> POLLRDNORM  Normal data may be read without blocking.
> Is this related to the "oob data" scheme in TCP (which is a hack that > doesn't work)? I really wish BSD hadn't tried to turn the urgent pointer into an out-of-band data stream, because, as you say, it doesn't work for that. It doesn't really work very well even as an API to TCP's urgent pointer. > Where do we attach 3 priority levels to data? That's part of what I was wondering. /~\ The ASCII Mouse \ / Ribbon Campaign X Against HTMLmo...@rodents-montreal.org / \ Email! 7D C8 61 52 5D E7 2D 39 4E F1 31 3E E8 B3 27 4B
Re: poll(): IN/OUT vs {RD,WR}NORM
> Also, I suspect mouse was thinking of the TCP URG concept, and not > PUSH when he wrote what he did, but I don't know for sure. Ouch. Yes, you are entirely correct in that regard. Total braino on my part. My apologies. /~\ The ASCII Mouse \ / Ribbon Campaign X Against HTMLmo...@rodents-montreal.org / \ Email! 7D C8 61 52 5D E7 2D 39 4E F1 31 3E E8 B3 27 4B
Re: poll(): IN/OUT vs {RD,WR}NORM
>> I can understand treating POLLWRNORM as identical to POLLOUT. But >> why the distinction between POLLRDNORM and POLLIN? Might anyone >> know the reason and be willing to explain? > I'd hazard a guess that you'll likely not get an explanation. Quite possibly not, but I figured it was still worth asking. >> ... I'd still be curious where it came from. > Those answers are CVS: Not where it came from in the sense of who committed it or what it looked like at the time, but rather where that person got the various distinctions from. /~\ The ASCII Mouse \ / Ribbon Campaign X Against HTMLmo...@rodents-montreal.org / \ Email! 7D C8 61 52 5D E7 2D 39 4E F1 31 3E E8 B3 27 4B
poll(): IN/OUT vs {RD,WR}NORM
In sys/sys/poll.h, I see (in 1.4T, -current according to cvsweb, and everything I've checked in between):

#define POLLIN      0x0001
#define POLLPRI     0x0002
#define POLLOUT     0x0004
#define POLLRDNORM  0x0040
#define POLLWRNORM  POLLOUT
#define POLLRDBAND  0x0080
#define POLLWRBAND  0x0100

I can understand treating POLLWRNORM as identical to POLLOUT. But why the distinction between POLLRDNORM and POLLIN? Might anyone know the reason and be willing to explain? Not that it matters tremendously. But I'm curious, because it indicates there's something I don't understand somewhere there. The wording is a bit confusing. In -current's poll(2), POLLIN is said to be synonymous with POLLRDNORM (and POLLOUT with POLLWRNORM); in 1.4T and 5.2, there is confusing wording about "high priority data" and "data with a non-zero priority", which may or may not be talking about the same thing and may or may not map onto other concepts such as TCP's push semantics, and the wording is different in the two directions. -current's manpage's BUGS entry implies, to me, that this distinction is something of a historical accident that should be fixed, but, even if that reading is correct, I'd still be curious where it came from. /~\ The ASCII Mouse \ / Ribbon Campaign X Against HTMLmo...@rodents-montreal.org / \ Email! 7D C8 61 52 5D E7 2D 39 4E F1 31 3E E8 B3 27 4B
Re: Using mmap(2) in sort(1) instead of temp files
Why is this on tech-kern? It seems to me it belongs on tech-userlevel. > I'm trying to speed up sort(1) by using mmap(2) instead of temp > files. If you're going to sort in RAM, why bother with temporaries at all? Just slurp it all in and sort it in core. But. Part of the point of using temp files, it seems to me, is to be able to sort datasets larger than will fit in memory. Unless NetBSD is prepared to completely desupport small machines like the MicroVAX-II or shark, I think this might be a misguided thing to do. Given the way swap works, I suspect it will work better to use temp files instead of mmap()ed memory to sort datasets larger than will fit in RAM, even if VM is available. Furthermore, VM can be limited; sorting input bigger than 3G on i386 shouldn't break (and, from a usability standpoint, shouldn't even require any special options). Even on 64-bit, VM can be comparatively small; on a 9.1 amd64 machine at work, proc.$$.rlimit.datasize.hard is only 8 gigs. At the very least, I would strongly recommend adding an option to disable this, to continue to use real files for temporaries. > ftmp() (see code below) is called in the sort functions to create and > return a temp file. mkstemp() is used to create the temp file, then > the file pointer (returned by fdopen) is returned to the sort > functions for use. I'm trying to understand where and how mmap > should come into the picture here, and how to implement this feature. I think the biggest issue you'll have (aside the ones raised above) is that an mmap()ed memory block has a fixed size, set at map time. Files are sized much more dynamically. I suspect you'll end up (re)implementing a ramfs (a simplified one, because the application needs are relatively simple, but still.) /~\ The ASCII Mouse \ / Ribbon Campaign X Against HTMLmo...@rodents-montreal.org / \ Email! 7D C8 61 52 5D E7 2D 39 4E F1 31 3E E8 B3 27 4B
Re: Fullfs file system
> Thanks Mouse and Martin. I got past that error. But now I'm running > into another problem - unable to determine when write op occurs (to > be able to return ENOSPC error). > I want to return ENOSPC error whenever write occurs. Which variable > contains this info? I'm confused which structs contain what > information. I don't know. What I'd do in this case is to trace down through what you call (VCALL in this case) until you get to a point where it determines what operation to call, then see what it uses to select the vnode operation to be performed. /~\ The ASCII Mouse \ / Ribbon Campaign X Against HTMLmo...@rodents-montreal.org / \ Email! 7D C8 61 52 5D E7 2D 39 4E F1 31 3E E8 B3 27 4B
Re: Fullfs file system
> Hi, when I try to run `mount_full /root/xyz/a /root/xyz/b` > I get the following error: > `mount_full: /root/xyz/a on /root/xyz/b: Operation > not supported by device` > Any tips for debugging this? Add printfs in the kernel codepaths? That's what I'd start with. (Well, actually, I'd start by rereading the code, but I assume you've already done that.) /~\ The ASCII Mouse \ / Ribbon Campaign X Against HTMLmo...@rodents-montreal.org / \ Email! 7D C8 61 52 5D E7 2D 39 4E F1 31 3E E8 B3 27 4B
Perceivable time differences [was Re: PSA: Clock drift and pkgin]
> ? If I remember right, anything less than 200ms is immediate response > for a human brain. "Response"? For some purposes, it is. But under the right conditions humans can easily discern time deltas in the sub-200ms range. I just did a little psychoacoustics experiment on myself. First, I generated (44.1kHz) soundfiles containing two single-sample ticks separated by N samples for N being 1, 101, 201, 401, 801, and going up by 800 from there to 6401, with a second of silence before and after (see notes below for the commands used):

for d in 0 100 200 400 800 1600 2400 3200 4000 4800 5600 6400
do
  ( count from 0 to 44100 | sed -e "s/.*/0 0 0 0/"
    echo 0 128 0 128
    count from 0 to $d | sed -e "s/.*/0 0 0 0/"
    echo 0 128 0 128
    count from 0 to 44100 | sed -e "s/.*/0 0 0 0/"
  ) | code-to-char > zz.$d
done

I don't know stock NetBSD analogs for count and code-to-char. count, as used here, just counts as the command line indicates; given what count's output is piped into, the details don't matter much. code-to-char converts numbers 0..255 into single bytes with the same values, with non-digits ignored except that they serve to separate numbers. (The time delta between the beginnings of the two ticks is of course one more than the number of samples between the two ticks.) After listening to them, I picked the 800 and 1600 files and did the test. I grabbed 128 bits from /dev/urandom and used them to play, randomly, either one file or the other, letting me guess which one it was in each case:

dd if=/dev/urandom bs=1 count=16 | char-to-code | cvtbase -m8 d b |
  sed -e 's/./& /g' -e 's/ $//' -e 's/0/800/g' -e 's/1/1600/g' |
  tr \  \\n |
  ( exec 3>zz.list 4>zz.guess 5</dev/tty
    while read n
    do
      echo $n 1>&3
      audioplay -f -c 2 -e slinear_le -P 16 -s 44100 < zz.$n
      skipcat 0 1 0<&5 1>&4
    done )

char-to-code is the inverse of code-to-char: for each byte of input, it produces one line of output containing the ASCII decimal for that byte's value, 0..255. 
cvtbase -m8 d b converts decimal to binary, generating a minimum of 8 "digits" (bits) of output for each input number. skipcat, as used here, has the I/O behaviour of "dd bs=1 count=1" but without the blather on stderr: it skips no bytes and copies one byte, then exits. (The use of /dev/urandom is to ensure that I have no a priori hint which file is being played which time.) I then typed "s" when I thought it was a short-gap file and "l" when I thought it was a long-gap file. I got tired of it after 83 data samples and killed it. I then postprocessed zz.guess and compared it to zz.list:

< zz.guess sed -e 's/s/800 /g' -e 's/l/1600 /g' | tr \  \\n | diff -u zz.list -

I got exactly two wrong out of 83 (and the stats are about evenly balanced, 39 short files played and 44 long). So I think it's fair to say that, in the right context (an important caveat!), a time difference as short as (1602-802)/44.1=18.14+ milliseconds is clearly discernible to me. This is, of course, a situation designed to perceive a very small difference. I'm sure there are plenty of contexts in which I would fail to notice even 200ms of delay. /~\ The ASCII Mouse \ / Ribbon Campaign X Against HTMLmo...@rodents-montreal.org / \ Email! 7D C8 61 52 5D E7 2D 39 4E F1 31 3E E8 B3 27 4B
Re: PSA: Clock drift and pkgin
> Better than 100Hz is possible and still precise. Something around > 1000Hz is necessary for human interaction. That doesn't sound right. I've had good HCI experiences with HZ=100. Why do you see a higher HZ as necessary for human interaction? > Modern hardware could easily do 100kHz. Not with curren^Wat least one moderately recent NetBSD version! At work, I had occasion to run 9.1/amd64 with HZ=8000. This was to get 8-bit data pushed out a parallel port at 8kHz; I added special-case hooks between the relevant driver and the clock (I forget whether softclock or hardclock). It worked for its intended use fairly nicely...but when I tried one of my SIGALRM testers on it, instead of the 100Hz it asked for, I got signals at, IIRC, about 77Hz. I never investigated. I think I still have access to the work machine in question if anyone wants me to try any other quick tests, but trying to figure out an issue on a version I don't use except at work is something I am unmotivated to do on my own time, and using work time to dig after an issue that doesn't affect work's use case isn't an appropriate use of work resources. /~\ The ASCII Mouse \ / Ribbon Campaign X Against HTMLmo...@rodents-montreal.org / \ Email! 7D C8 61 52 5D E7 2D 39 4E F1 31 3E E8 B3 27 4B
Re: PSA: Clock drift and pkgin
> So even though we added one tick, you can still get two timer events > in much closer proximity than a single tick as far as the process is > concerned. Certainly. I think that's unavoidable without resetting the timer inside the signal handler, or hard realtime guarantees (which are Not Easy). > And we probably do need to talk about the timer expiration and > rearming as separate from signal deliver and process scheduling. There are plenty of reasons user code running the signal handler may be delayed from the time the timer is supposed to tick. But without the timer ticking as requested, I don't think the rest matters nearly as much. When even an _unloaded_ machine can't get the ticks it asks for, something is wrong. A machine which gets that overloaded just from delivering 100 signals to a mostly-trivial signal handler per second, well, I doubt NetBSD runs on anything that weak. > And from a program point of view, that is what really matters in the > end. If the program really wants a minimum amount of time before the > next timeout, it needs to do the request for the next time event at > the processing point, not something kernel internal which happend > very disconnected from the process. Agreed. ITIMER_REAL in the form I've been arguing for is of little help to a process that wants timer ticks separated by a hard minimum interval as seen by the signal handler. At least when using it_interval to get repeated signals. But then, so is every other ITIMER_REAL I've ever used. /~\ The ASCII Mouse \ / Ribbon Campaign X Against HTMLmo...@rodents-montreal.org / \ Email! 7D C8 61 52 5D E7 2D 39 4E F1 31 3E E8 B3 27 4B
Re: PSA: Clock drift and pkgin
> The attached (untested) patch reverts to the old algorithm > specifically for the case of rearming a periodic timer, leaving the > new algorithm with +1 in place for all other uses. > Now, it's unclear to me whether this is correct, because it can have > the following effect. Suppose ticks happen on intervals of time T. > Then, owing to unpredictable and uncontrollable scheduling delays, > the following sequence of events may happen: > Time 0*T: timer_settime(.it_value = T, .it_interval = T), arms timer at > 1*T > Time 1*T + 0.9*T: timer expires, rearms timer at 2*T > Time 2*T + 0.1*T: timer expires, rearms timer at 3*T > The duration between these consecutive expirations of the timer is > 0.2*T, even though we asked for an interval of T. True. In my opinion that is the correct behaviour; userland requested timer ticks at multiples of T, so there is a conceptually infinite stream of (conceptual) ticks generated at those times. Those then get turned into real events when the kernel can manage it. But a delay for one of them should not affect any other, except for the case where one is delayed long enough to occur after another's ideal time, in which case I would consider it acceptable (though not required) to drop one of the two. > [...POSIX...] IMO if POSIX forbids the above, POSIX is broken and should, in this respect, be ignored. One reason for using facilities taking structs itimerval is for ticks to _not_ be delayed by delay of previous ticks. If POSIX cannot be ignored for whatever reason, I would argue that a new facility that _does_ provide undelayed ticks should be provided. (I'm partial to timer sockets, but I am hardly unbiased. :-) > On the other hand, if it's not correct to do that, I'm not sure > correct POSIX periodic timers can attain a target _average_ interval > between expirations [...] I would argue that it's misleading, to the point I would call it incorrect, to call such a thing "periodic". 
/~\ The ASCII Mouse \ / Ribbon Campaign X Against HTMLmo...@rodents-montreal.org / \ Email! 7D C8 61 52 5D E7 2D 39 4E F1 31 3E E8 B3 27 4B
Re: PSA: Clock drift and pkgin
>> Specifically, under a kernel built with HZ=100, requesting signals >> at 100Hz actually delivers them at 50Hz. [...] > This is the well-known problem that we don't have timers with > sub-tick resolution, PR kern/43997: https://gnats.netbsd.org/43997 It doesn't need sub-tick resolution; one-tick resolution would fix the problem. The problem appears to be that an ITIMER_REAL timer can't deliver signals more often than every second tick. > In particular, there is no way to reliably sleep for a duration below > 1/hz sec. Nor does there need to be to fix this. 1.4T/sparc and /i386 get it right, even when running with HZ=100 and requesting 100Hz SIGALRMs. (My timer sockets get it right too, but their codepath is completely different, depending on only timeout(...,0,1) (1.4T) or callout_schedule(...,1) (4.0.1 and 5.2). > Fixing this requires adopting an MI API and MD clock driver support > for wakeups with sub-tick resolution, You must be talking about something different from what I'm talking about. What I want fixed does not involve sub-tick-resolution timers at any level. If using setitimer(ITIMER_REAL,...) to request SIGALRMs every tick actually delivered a SIGALRM every tick, I'd be fine. But, instead, doing that delivers a SIGALRM every second tick. > which nobody's done yet -- Nobody's done the sub-tick resolution you're talking about, maybe. But 1.4T long ago did what I'm looking for. Something between 1.4T and 4.0.1 broke it, and it's stayed broken until at least 9.1, probably 9.3 based on someone else's report on port-vax. (Okay, strictly, I don't know that it's stayed broken. It could have been fixed and then re-broken.)

>> } else if (sec <= (LONG_MAX / 100))
>>         ticks = (((sec * 100) + (unsigned long)usec + (tick - 1))
>>             / tick) + 1;

> Whether this is a bug depends on whether: [...] I think that code is not a bug per se; for sleeps, that codepath is...well, "reasonable" at the very least. 
The bug is that it is broken for timer reloads, but timer reloads are using it anyway; whether you think of this as a bug in timer reloads or a bug in tvtohz is a question of which way you prefer to squint your mind. Always adding an extra tick may be fine for sleeps (though that's arguable for short sleeps on a system with a high-res wallclock), but not for timer reloads. /~\ The ASCII Mouse \ / Ribbon Campaign X Against HTMLmo...@rodents-montreal.org / \ Email! 7D C8 61 52 5D E7 2D 39 4E F1 31 3E E8 B3 27 4B
Re: PSA: Clock drift and pkgin
> [...], but we are in fact rounding it up to the double amount of time > between alarms/interrupts. Not what I think anyone would have > expected. Quite so. Whatever the internals behind it, the overall effect is "ask for 100Hz, get 50Hz", which - at least for me - violates POLA hard. /~\ The ASCII Mouse \ / Ribbon Campaign X Against HTMLmo...@rodents-montreal.org / \ Email! 7D C8 61 52 5D E7 2D 39 4E F1 31 3E E8 B3 27 4B
Re: PSA: Clock drift and pkgin
>>> } else if (sec <= (LONG_MAX / 100))
>>>         ticks = (((sec * 100) + (unsigned long)usec + (tick - 1))
>>>             / tick) + 1;
>> The delay is always rounded up to the resolution of the clock, so >> waiting for 1 microsecond waits at least 10ms. But it is increased by 1 tick when it is an exact multiple of the clock resolution, too. For sleeps, that makes some sense. For timer reloads, it doesn't. I could of course be wrong about that code being responsible, but reading realtimerexpire() makes me think not; it uses tshzto, which calls tstohz, which calls tvtohz, which is where the code quoted above comes from. Maybe realtimerexpire should be using other code? > Look at the wording sleep(3), nanosleep(2), etc. They all use > wording like "... the number of time units have elapsed ..." True. And, if the misbehaviour involved sleep, nanosleep, etc, that would be relevant. The symptom I'm seeing has nothing to do with them (except that both are related to time); what I'm talking about is the timing of SIGALRMs generated by setitimer(ITIMER_REAL,...) when it_interval is set to 1/HZ (which in my test cases is exact). setitimer(2) does say that "[t]ime values smaller than the resolution of the system clock are rounded up to this resolution (typically 10 milliseconds)", but it does _not_ have language similar to what you quote for sleep() and relatives. Nor, IMO, should it. The signals should be delivered on schedule, though of course process scheduling means the target process may not run the handler on schedule. Under interrupt load sufficient that softclock isn't running when it should, I'd consider this excusable. That does not describe my test systems. 1.4T does not have this bug. As I mentioned, it works fine on sparc. Even on i386, I see:

$ date; test-alrm > test-alrm.out; date
Sat Dec 23 07:57:45 EST 2023
Sat Dec 23 07:58:46 EST 2023
$ sed -n -e 1p -e \$p < test-alrm.out
1703336265.921251
1703336325.916413
$

Linux, at least on x86_64, gets this right too. 
On a work machine:

$ date; ./test-alrm > test-alrm.out; date
Sat Dec 23 08:18:15 EST 2023
Sat Dec 23 08:19:15 EST 2023
$ sed -n -e 1p -e \$p < test-alrm.out
1703337495.219734
1703337555.209737
$ uname -a
Linux mouchine 5.15.0-86-generic #96-Ubuntu SMP Wed Sep 20 08:23:49 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
$

> Two options are to increase HZ on the host as suggested, or halve HZ > on the guest. I suppose actually fixing the bug isn't an option? I don't know whether that would mean using different code for timer reloads and sleeps or what. But 1.4T is demonstration-by-example that it is entirely possible to get this right, even in a tickful system. (I don't know whether 1.4T sleeps may be slightly too short; I haven't tested that. But, even if so, fixing that should not involve breaking timer reloads.) /~\ The ASCII Mouse \ / Ribbon Campaign X Against HTMLmo...@rodents-montreal.org / \ Email! 7D C8 61 52 5D E7 2D 39 4E F1 31 3E E8 B3 27 4B
Re: PSA: Clock drift and pkgin
In a discussion of timekeeping on emulated VAXen, over on port-vax@, I mentioned that I've found that, on 4.0.1, 5.2, and 9.1, and, based on a report on port-vax@, apparently 9.3 as well, there's a bug in ITIMER_REAL signals (possibly on only some hardware - I've seen it on amd64 and i386, and, if my guess below is correct, it should manifest on various others as well). Specifically, under a kernel built with HZ=100, requesting signals at 100Hz actually delivers them at 50Hz. This is behind the clock running at half speed on my VAX emulator, and quite likely behind similar behaviour from simh (which emulates VAXen, among other things) on 9.3. I suspect it will happen on any port when requesting signals one timer tick apart (ie, at HZ Hz). In case anyone wants it, I wrote a small test program. It requests 100Hz signals, then, in the signal handler, takes a gettimeofday() timestamp. After taking 6000 timestamps (which ideally should take 60.00 seconds), it then prints out all the timestamps, thus indicating the actual rate signals were delivered at. It's on ftp.rodents-montreal.org (which supports HTTP fetches as well as FTP) in /mouse/misc/test-alrm.c for anyone interested. On machines with the half-speed bug, it takes two minutes rather than one, and the timestamps average about 20ms apart, instead of 10ms. ("About" because in most of my tests there is usually at least one interval that is slightly longer than it should be.) I don't _know_ what's behind it. But today I went looking, and, in 5.2, there is code which looks suspicious. I don't know where the analogous code is in 9.x, but presumably plenty of people here do. Speaking of 5.2, then: in kern/subr_time.c, there is tvtohz(), which has code

} else if (sec <= (LONG_MAX / 100))
        ticks = (((sec * 100) + (unsigned long)usec + (tick - 1))
            / tick) + 1;

which looks suspicious. If sec is zero and usec is tick, that expression will return 2 instead of the 1 I suspect it needs to return. 
I haven't yet actually tried changing that. Holiday preparations and observations are likely to occupy much of my time for the next week or so, but I'll try to fit in the time to change that and see if it helps any. /~\ The ASCII Mouse \ / Ribbon Campaign X Against HTMLmo...@rodents-montreal.org / \ Email! 7D C8 61 52 5D E7 2D 39 4E F1 31 3E E8 B3 27 4B
Re: DRM-KMS: add a devclass DV_DRMKMS and allow userconf to deal with classes
>>> [...DV_DRMKMS...userconf...] >> [...devices in multiple classes...maybe use a separate namespace, >> used by only config(1) and userconf?...] > This is precisely why I ask for comment ;-) :-) > I have two requirements: > - that the solution is not ad hoc i.e. that it can provide, in > userconf, facilities not limited to drmkms (I don't want to implement > a special case to recognize "drmkms" and to expand to all the STARred > driver names implied); I agree with this; special-casing drmkms would be...suboptimal. > - that it will not imply to have to maintain special data for > userconf to recognize some "magic" strings. You already need that, in that userconf has to have some way to recognize the string "drmkms" as a device category (hinted by the "class =" syntax, but it still needs error-checking) and map it into the corresponding DV_ value. I don't see it as significantly worse for config(1) to generate some data structure mapping device class names into whatever userconf would need to affect all devices of that class. Though it occurs to me that there are too many things called "class" here. "Group"? "Category"? "Collection"? /~\ The ASCII Mouse \ / Ribbon Campaign X Against HTMLmo...@rodents-montreal.org / \ Email! 7D C8 61 52 5D E7 2D 39 4E F1 31 3E E8 B3 27 4B
Re: DRM-KMS: add a devclass DV_DRMKMS and allow userconf to deal with classes
> I propose to add a DV_DRMKMS class to sys/device.h:enum_devclass; to > augment cfdata with a devclass member [...] > Comments? This is not intended as criticism; I am just trying to examine all sides of this question. Why use the sys/sys/device.h kind of device class for userconf? Is there some reason to think it will be useful to userconf other device classes, or do you expect other device-class machinery to have a use for DV_DRMKMS, or is it a question of just reusing the existing device class rather than creating a new kind of device class, or what? I'm also thinking it could be useful for a device to fall into multiple classes for userconf, but I _think_ DV_* classes don't support a device being in multiple classes. It also could be useful for custom kernels to have custom modifications to device classification. So I'm wondering if it would be better for this to be a namespace specific to config(1) and userconf rather than having anything to do with DV_* values. Or is that getting into "the best is the enemy of the good" territory? /~\ The ASCII Mouse \ / Ribbon Campaign X Against HTMLmo...@rodents-montreal.org / \ Email! 7D C8 61 52 5D E7 2D 39 4E F1 31 3E E8 B3 27 4B
Re: drm.4 man page and import of X11 drm-kms.7 and al.
>> So to clarify: I'm proposing to convert the rst doc pages to man >> pages [...] > I wouldn't bother installing man pages for this, someone working on > the kernel already has the source tree. With vanishingly few exceptions (notable by their rarity), source code is not documentation. There's a reason NetBSD has section 9. For that matter, most manpages are depressingly poor, but they are usually better than telling people "UTSL". (I don't know anything about the quality of the manpages in question here, except that someone invented yet another markup language for them, which is a mild negative to me; the xsrc trees I have on hand don't have any *.rst files.) /~\ The ASCII Mouse \ / Ribbon Campaign X Against HTMLmo...@rodents-montreal.org / \ Email! 7D C8 61 52 5D E7 2D 39 4E F1 31 3E E8 B3 27 4B
Re: drm.4 man page and import of X11 drm-kms.7 and al.
> There is no man page for drmkms (the kernel part), but there are man > pages in the X sources, in the rst format [...] Would it maybe be worth creating a tiny manpage that just says "go look over there for docs"? I don't _like_ having manpages in yet _another_ markup language, but I prefer that to no doc - and, without any manpage, I have no idea how someone looking for doc would know to look for .rst (!) files buried in xsrc, for doc on something in the kernel. /~\ The ASCII Mouse \ / Ribbon Campaign X Against HTMLmo...@rodents-montreal.org / \ Email! 7D C8 61 52 5D E7 2D 39 4E F1 31 3E E8 B3 27 4B
Re: kern.boottime drift after boot?
>> On embedded systems without an RTC, you _could_ get the same UUID in >> rare cases. But I think this would be a bug, because on Linux, any >> kind of jitter-source (interrupt timing, for instance) would perturb >> the generated UUID. > Hopefully! Even though this jitter might not be high entropy, it > should (in theory) be enough to give a different UUID each boot. Probably. Depending on how the entropy and RNG are handled, it is likely to be different only with some (relatively high) probability. What probability of collision (ie, of a boot giving the same value as the previous boot) is acceptable here? One in 256? One in 64k? One in 4G? If you're doing "random" generation, it's hard to avoid some chance of collision. If the system can tolerate repeated writes to a piece of its mass storage (disk etc - "disk" for brevity), you could set up something with (say) a 32-bit value saved in a fixed place. Each boot, you read it, save the value somewhere for this boot, and write it back to disk incremented before doing anything else. The value saved is your boot ID. But this depends on having a small piece of disk you can afford to write to once per boot. It also demands custom kernel code, unless NetBSD wants to adopt something of the sort or it's acceptable to have duplicate boot IDs if a boot attempt crashes "too early". If the latter is acceptable (which, based on the fragments of the original post I saw quoted, sounds likely), you could even do it entirely in userland, immediately upon having a writable persistent filesystem available. /~\ The ASCII Mouse \ / Ribbon Campaign X Against HTMLmo...@rodents-montreal.org / \ Email! 7D C8 61 52 5D E7 2D 39 4E F1 31 3E E8 B3 27 4B
Re: panic options
>> There is a call to panic() if the kernel detects that there is no >> console device found, I would like to make this call to it just >> reboot without dropping into ddb. > Well. Not completely what you ask for, but... > sysctl -w ddb.onpanic=0 (or even -1) Well, notice that the original post writes of panicking if "there is no console device found", which sounds to me like something that happens during boot, much earlier than /etc/sysctl.conf can affect. I'd say, rather, DDB_ONPANIC=0 in the kernel config and then ddb.onpanic=1 during boot. /~\ The ASCII Mouse \ / Ribbon Campaign X Against HTMLmo...@rodents-montreal.org / \ Email! 7D C8 61 52 5D E7 2D 39 4E F1 31 3E E8 B3 27 4B
Re: kcmp(2)
>>> Don't cloner instances differ in minor number? [...] >> Not that I'm aware of. [...] > Well, as noted in this thread, traditionally you can tell when two > files are the same by examining stat results. Maybe that should be a hint. See below. > And the cloner mechanism replaced an older scheme where you had to > pick the number of instances you wanted, and unless I'm > misremembering badly in that world each had to have its own minor > number. That's my understanding as well. > That said, it almost certainly isn't important... Well, if it means that with a minor tweak NetBSD could have kcmp(3) instead of kcmp(2), it could be. It occurs to me that, if you give each device_t a unique-per-boot serial number and return that in structs stat, writing a kcmp(3) would border on trivial. (The _implementation_ would be NetBSD-specific, in that it would depend on st_serial or whatever you call it, but the _interface_ wouldn't.) Except, hmm. The above covers only DTYPE_VNODE. I'm not sure what could be done about other DTYPEs. If you really want to support Linuxisms - IMO that way lies madness, but it's been somewhere between a long time and forever since NetBSD cared about MO - it might have to be kcmp(2) in order to DTRT for all DTYPEs. Or else each DTYPE would need its own analog of st_serial. Perhaps st_serial could be done in a way that's common across all DTYPEs? It'd probably need to be 64-bit to avoid running out in the face of extreme use cases, but that's hardly impossible. /~\ The ASCII Mouse \ / Ribbon Campaign X Against HTMLmo...@rodents-montreal.org / \ Email! 7D C8 61 52 5D E7 2D 39 4E F1 31 3E E8 B3 27 4B
Re: kcmp(2)
>> What does Mesa use kcmp(2) for? > Working out whether two device handles are to the same DRM device. >> Is there another way to accomplish what Mesa is trying to use it for? > I don't know of one. Is fstat() plus checking st_rdev (and possibly other fields) insufficient? /~\ The ASCII Mouse \ / Ribbon Campaign X Against HTMLmo...@rodents-montreal.org / \ Email! 7D C8 61 52 5D E7 2D 39 4E F1 31 3E E8 B3 27 4B
Re: [PATCH] style(5): No struct typedefs
>> [...] as I see it the divide you're sketching here ([...]) is the >> divide between interface and implementation, and in some cases the >> interface is more than just the typedefs.

> Sort of.
> // contains the "vnode_t" opaque type definition
> // contains the guts of "struct vnode" and the other implementation details
> // Contains some of the file system interfaces, some of which use vnode_t
> // Contains the vnode interfaces, which definitely use vnode_t
> The latter of the two would each include .

You're right, I hadn't fully understood you. Hmm. What value is provided by separating the vnode_t type from the rest of the vnode interface (in )? If taken to its logical extreme (which of course ends up at a satirical position, like most logical extremes), this leads to

// vnode_t
// enum vtype
// enum vtagtype
// #define VNODE_TAGS
// struct vnlock
// #define IO_UNIT
// #define IO_APPEND

which I hope you agree is madness. What makes it worth singling out vnode_t for special treatment here? I would prefer to draw include-file boundaries more or less matching conceptual-API boundaries, which I thought was more or less where we started: defines the API to the vnode subsystem, including types, #defines, externs, etc. But I'm not sure how that would differ from what we have now. /~\ The ASCII Mouse \ / Ribbon Campaign X Against HTMLmo...@rodents-montreal.org / \ Email! 7D C8 61 52 5D E7 2D 39 4E F1 31 3E E8 B3 27 4B
Re: [PATCH] style(5): No struct typedefs
> I'm tempted to say that an opaque struct declaration in a .c file
> ought be treated suspiciously -

Depending on what you mean by "an opaque struct declaration", I may agree or disagree. If you mean the "struct foo;" without a body, then I think I agree. But the "struct foo { ... };" that completes the incomplete type corresponding to the include file's "struct foo;", that I think is the whole point of opaque structs: the completion is available only under the implementation hood. While that may be in a foo-impl.h file, if only a single file needs it I see no harm in the completion being in the .c (and indeed I've done it that way often enough).

> (and it would be kind of nice if C had a way to say "all functions
> defined beyond this point are static").

Personally, I'd prefer "all functions defined before this point are static", since I prefer calls to move textually backward (or, to put it another way, I prefer to avoid forward declarations when possible). But I doubt either of those will appear anytime soon.

> And to return briefly, to the issue way up the top of simplifying the
> leader files, there is one change I've wanted to make for ages, but
> just haven't been brave enough to do,
> That is to rip the definition of __NetBSD_Version__ [...] into a new
> header file all of its own [...] with the rule that *no* other header
> file is permitted to include it.

I'm...not sure I agree with that. I once built a kernel facility which I wanted to be completely drop-in to multiple kernel versions (as widely disparate as 1.4T and 5.2). The design I came up with was (names probably changed; I'm not digging up the code to see what names I actually used) a "version.h" file which looked like

#include 	// to get __NetBSD_Version__
#if __NetBSD_Version__ == whatever value 1.4T had
#include 	// where 1.4T keeps it
#define THINGY() do { ...1.4T code for THINGY()... } while (0)
typedef int COMMON_TYPE;	// just an int on 1.4T
#endif
#if __NetBSD_Version__ == whatever value 5.2 had
#include 	// on 5.2 we need two
#include 	// include files
#define THINGY() thingy()	// 5.2 has THINGY() nicely encapsulated
typedef struct something COMMON_TYPE;	// 5.2 uses a struct
#endif

etc. It sounds as though you would prefer the #include that pulls in __NetBSD_Version__ be in every C file that wants to include "version.h". This seems counterintuitive, even counterproductive, to me (and see also below, about #include order). Or perhaps you'd prefer that I'd designed those interfaces some other way, rather than with a version-switch include file? My own pet include-file peeve is rather different: I strongly believe that, with very few exceptions ( is the only one that comes to mind), you should be able to re-order your #include lines arbitrarily without producing any semantic change, and that, if this is not so, at least one of the include files involved is broken. I've been making small steps towards fixing this in my own trees, but it's still a major mess. /~\ The ASCII Mouse \ / Ribbon Campaign X Against HTMLmo...@rodents-montreal.org / \ Email! 7D C8 61 52 5D E7 2D 39 4E F1 31 3E E8 B3 27 4B
Re: [PATCH] style(5): No struct typedefs
[paragraph-length-line damage repaired manually] > What about something like and > ? The type definitions go into the former > header file, [...] Well, I don't like the "typedefs" name, because as I see it the divide you're sketching here (which I support, in general) is the divide between interface and implementation, and in some cases the interface is more than just the typedefs. Some structs have their struct definition, or some of it (regex_t is an example), as part of their advertised interface, and many have #defines as well. But I'd rather see the division done with (what I see as) an inaccurate name than see the division not done. /~\ The ASCII Mouse \ / Ribbon Campaign X Against HTMLmo...@rodents-montreal.org / \ Email! 7D C8 61 52 5D E7 2D 39 4E F1 31 3E E8 B3 27 4B
Re: [PATCH] style(5): No struct typedefs
> Sometimes I even wonder why typedef exists in C. Feels like I could > accomplish the same with a #define For `typedef struct foo Something', yes, you could (well, except for their different scopes, and possibly some other corner cases). For `typedef void (*callback_t)(void *, status_t, int)', not so much. /~\ The ASCII Mouse \ / Ribbon Campaign X Against HTMLmo...@rodents-montreal.org / \ Email! 7D C8 61 52 5D E7 2D 39 4E F1 31 3E E8 B3 27 4B
Re: [PATCH] style(5): No struct typedefs
> I don't get it. Why the "void *" stuff? That is where I think the > real badness lies, and I agree we should not have that. > But defining something like > typedef struct bus_dma_tag *bus_dma_tag_t; > would mean we could easily change what bus_dma_tag_t actually is, > keeping it opaque, while at the same time keeping the type checking. Um, no, you get the type checking only as long as "what [it] actually is" is a tagged type - a struct, union, or (I think; I'd have to check) enum. Make it (for example) a char *, or an unsigned int, and you lose much of the typechecking. /~\ The ASCII Mouse \ / Ribbon Campaign X Against HTMLmo...@rodents-montreal.org / \ Email! 7D C8 61 52 5D E7 2D 39 4E F1 31 3E E8 B3 27 4B
Re: [PATCH] style(5): No struct typedefs
[riastradh@] > I propose the attached change to KNF style(5) to advise against > typedefs for structs or unions, or pointers to them. Pointers to them, I agree. I don't like typedefs for pointers to structs, except possibly for a few special cases. I think it should be pellucid from the declaration whether you're dealing with a pointer. But most - all, I think - of the benefits you cite are still available when using typedefs for the structs themselves. Indeed, different files do not have to agree on whether to use typedefs, and external references, such as your > struct vnode; > int frobnitz(struct vnode *); can do exactly that, even if other code does "typedef struct vnode VNODE;" and then uses VNODE (or vnode_t, or whatever name you prefer; personally, I like all-caps). > There isn't even any need to define `struct bus_dma_tag' anywhere; > the pointer can be cast in sys/arch/x86/x86/bus_dma.c to `struct > x86_bus_dma_tag', for instance (at some risk in type safety, but a > much smaller risk than having void * as the public interface), But at a risk in formal nonportability, unless the struct bus_dma_tag * was obtained by casting _from_ a struct x86_bus_dma_tag * to begin with (which in this case it probably would have been). I'd have to look up the details to tell whether it's possible for casting a pointer to a completed struct type to a pointer to a never-completed struct type to lose information, fall afoul of alignment requirements, or the like. [uwe@] > Typedefs make sense when the type is *really* opaque and can, behind > the scenes, be an integer type, a pointer or a struct. Agreed. > [Ab]using typedefs to save 8 bytes of "struct " + "*" just adds > cognitive load (and whatever logistical complications that you have > enumerated in the elided part of the quote). Personally, I find that *in*cluding the "struct" adds cognitive load. 
Perhaps it's just a question of what I'm used to, but having a noise word present - and the "struct" is close to that, from a conceptual point of view - means more noise to ignore. Especially when the type is referred to multiple times; I haven't seen it often, but I have seen statements that look as though half the alphanumerics are "struct" (I doubt any actually make it to the point of half, since each "struct" needs a tag to be useful, and at least a few other identifiers to make a useful statement, but it sure feels like it occasionally). /~\ The ASCII Mouse \ / Ribbon Campaign X Against HTMLmo...@rodents-montreal.org / \ Email! 7D C8 61 52 5D E7 2D 39 4E F1 31 3E E8 B3 27 4B
Re: Anonymous vnodes?
>> Is it possible to create a vnode for a regular file in a file system >> without linking the vnode to any directory, so that it disappears >> when all open file descriptors to it are closed? (As far as I can >> tell, this isn't possible with any of the vn_* or VOP_* functions?) > That's completely normal. It's a normal state to be in. But, as I read it, the post was asking for a way to reach that state _without_ passing through a "has a name in some directory" state; it's not clear to me whether that's possible in general (ie, without doing something filesystem-specific). /~\ The ASCII Mouse \ / Ribbon Campaign X Against HTMLmo...@rodents-montreal.org / \ Email! 7D C8 61 52 5D E7 2D 39 4E F1 31 3E E8 B3 27 4B
Re: RFC: Native epoll syscalls
> It is definitely a real problem that people write linuxy code that > seems unaware of POSIX and portability. While I feel a bit uncomfortable appearing to defend the practice (and, to be sure, it definitely can be a problem), it's also one of the ways advancements happen: add an extension, use it, it turns out to be useful, it gets popular. I've done it myself (well, except for the "gets popular" part, which no one person can do alone): labeled control structure, AF_TIMER sockets, pidconn, validusershell, the list goes on. /~\ The ASCII Mouse \ / Ribbon Campaign X Against HTMLmo...@rodents-montreal.org / \ Email! 7D C8 61 52 5D E7 2D 39 4E F1 31 3E E8 B3 27 4B
Re: PROPOSAL: Split uiomove into uiopeek, uioskip
>>> To the extent that it's impossible, it's impossible only because >>> the APIs provided by the kernel to userland don't have any way to >>> represent such a thing. This would border on trivial to fix, >>> except that it would be difficult to get much userland code to use >>> the resulting APIs because of their lack of portability to other >>> UNIX variants. > Since write(2) is one of the oldest interfaces in Unix the chances of > any change taking hold are vanishingly small... Oh, I'm not suggesting that write(2) change. What I'm suggesting is the creation of some alternative interface, write_extended(2) let's call it for the sake of concreteness[%], which is just like write(2) except that it _does_ provide some way to unambiguously return "wrote N, then error E". (Exactly how is pretty much irrelevant; I'm sure practically everyone here can imagine plenty of possible alternatives. If it comes to arguing choices for that, I'd paint the bikeshed a dark forest green.) But write(2) would continue to exist, with more or less its traditional semantics. Only if - and only when - write_extended becomes so popular that nobody uses plain write(2) any longer would I propose removing it. [%] If anything like this happens I certainly hope someone will invent a better name. But userland uptake for write_extended will be minimal, especially initially; that's the portability issue I was talking about. > All of this is not _independent_ of fixing uiomove callers, [...], > but it's largely orthogonal to the original problem of incorrectly > rolling back partial uiomoves. :-( Agreed. /~\ The ASCII Mouse \ / Ribbon Campaign X Against HTMLmo...@rodents-montreal.org / \ Email! 7D C8 61 52 5D E7 2D 39 4E F1 31 3E E8 B3 27 4B
Re: PROPOSAL: Split uiomove into uiopeek, uioskip
>> - uiopeek leaves uio itself untouched ([...]). > Hm... I'm having second thoughts about uiopeek(), as well. It implies a d$ That is a good point. But would it be a problem to have uiopeek (works only to move from uio to buffer) and uiopoke (the other way)? I've never liked the way uiomove can move data one direction or the other depending on how the uio is set up. (I'd rather have uioread and uiowrite except that I'd be constantly trying to remember whether uioread reads from the uio or moves data in the direction a read() operation does.) Maybe uioget and uioput? Is there any significant amount of code that calls uiomove without knowing which direction the bits are moving? As for uiocopy versus uiomove, those are similar enough to memcpy and memmove that the implication feels to me more like "buffers can overlap" (for all that that is a highly unlikely use case) or some such. Finding good names is a mess. /~\ The ASCII Mouse \ / Ribbon Campaign X Against HTMLmo...@rodents-montreal.org / \ Email! 7D C8 61 52 5D E7 2D 39 4E F1 31 3E E8 B3 27 4B
Re: PROPOSAL: Split uiomove into uiopeek, uioskip
> (In general, erroring in I/O is a whole additional can of worms; it's > wrong to not report back however much work happened before the error > when it can't be undone, but also impossible to both report work and > raise an error. [...]) To the extent that it's impossible, it's impossible only because the APIs provided by the kernel to userland don't have any way to represent such a thing. This would border on trivial to fix, except that it would be difficult to get much userland code to use the resulting APIs because of their lack of portability to other UNIX variants. /~\ The ASCII Mouse \ / Ribbon Campaign X Against HTMLmo...@rodents-montreal.org / \ Email! 7D C8 61 52 5D E7 2D 39 4E F1 31 3E E8 B3 27 4B
Re: PROPOSAL: Split uiomove into uiopeek, uioskip
>> I'm not a fan of uioskip() as a name - [...] > I agree. "skip" seem to have wrong connotations (cf. dd(1)). I'm not sure I agree. (The dd analogy is weak; dd has no peek operation, and uioskip - under whatever name - would border on useless without uiopeek.) For uioskip, it _is_ skip semantics. The bytes skipped are not copied anywhere, not by uioskip. (They may have been copied earlier with uiopeek, but that doesn't affect what uioskip does.) If you want skip-*with*-copy, well, uiomove is still there. /~\ The ASCII Mouse \ / Ribbon Campaign X Against HTMLmo...@rodents-montreal.org / \ Email! 7D C8 61 52 5D E7 2D 39 4E F1 31 3E E8 B3 27 4B
Re: PROPOSAL: Split uiomove into uiopeek, uioskip
> I propose adding uiopeek(buf, n, uio) and uioskip(n, uio) which > together are equivalent to successful uiomove(buf, n, uio). For what it may be worth, I like this. I don't _think_ I've ever run into issues caused by this issue before, but I have trouble seeing it as anything other than a bug waiting to happen. /~\ The ASCII Mouse \ / Ribbon Campaign X Against HTMLmo...@rodents-montreal.org / \ Email! 7D C8 61 52 5D E7 2D 39 4E F1 31 3E E8 B3 27 4B
Re: dkwedge: checksum before swapping?
>> But that comment clearly indicates that _someone_ thought it >> reasonable to checksum before swapping, so I can't help wondering >> what use case that's appropriate for. > It's a checksum over the 16bit words in native byte order. So when > you access the words in opposite byte order, you need to swap the > checksum too. Yes. But, to me, that would mean byteswap, then checksum. In the case at hand, the checksum is also bit-order-independent (in C terms, it's ^, not +), which means that byteswapping is almost irrelevant. But there are 8-bit members involved too (p_fstype and p_frag) and strings (d_typename and d_packname), which don't get swapped, so swapping and checksumming do not commute. As a toy example, consider:

struct toy {
	uint16_t a;
	uint16_t b;
	uint8_t c;
	uint8_t d;
	uint16_t checksum;
};

Let's set a=0xfeed, b=0xface, c=0xf0, d=0x0d. On a big-endian machine, the resulting octet stream is fe ed fa ce f0 0d, checksum f4 2e. On little-endian, ed fe ce fa f0 0d, checksum d3 09 - not 2e f4. > Unlike the regular disklabel code (which ignores other-endian > disklabels) the wedge autodiscover code will accept either. Is that actually known to work, or is it more "is intended to accept either"? Because it looks to me as though it will not accept labels checksummed and written natively by the other endianness, unless the strings and fstype/frag values happen to be such that they checksum to a multiple of 0x0101 (which is possible but unlikely). And the comment indicates that someone thought about the issue and came to what looks to me like an incorrect conclusion. > As for padding, the structure is nowadays defined with fixed size > types and explicit padding fields, so we may still assume that the > compiler won't add any further padding by itself. Currently true, though still disturbingly fragile. ("Nowadays"? It was so even as far back as 1.4T. 
Well, as far as the fixed-size types thing goes, at least; there are no explicit padding fields in any version I have, but it is (presumably carefully) defined so there is no need for them, either. At least assuming a power-of-two octet-addressed machine, a char * that's no bigger than 8 bytes, and a non-malicious compiler.) /~\ The ASCII Mouse \ / Ribbon Campaign X Against HTMLmo...@rodents-montreal.org / \ Email! 7D C8 61 52 5D E7 2D 39 4E F1 31 3E E8 B3 27 4B
dkwedge: checksum before swapping?
In sys/dev/dkwedge/dkwedge_bsdlabel.c, I find (and I see more or less the same code in 5.2 and what cvsweb.n.o shows me)

static int
validate_label(mbr_args_t *a, daddr_t label_sector, size_t label_offset)
{
	...
	/*
	 * We have validated the partition count.  Checksum it.
	 * Note that we purposefully checksum before swapping
	 * the byte order.
	 */
	...code that does indeed checksum before swapping...
}

Does anyone know what the intent of this is? The only reason I would expect to see a byteswapped label is when it's written by a machine of the other endianness, and in that case the checksum will be wrong unless the label is swapped before checksumming (and even then only if the compiler doesn't insert shims in the struct, not that it's likely to given the present layout). But that comment clearly indicates that _someone_ thought it reasonable to checksum before swapping, so I can't help wondering what use case that's appropriate for. /~\ The ASCII Mouse \ / Ribbon Campaign X Against HTMLmo...@rodents-montreal.org / \ Email! 7D C8 61 52 5D E7 2D 39 4E F1 31 3E E8 B3 27 4B
Re: Per-descriptor state
>>> But I kind of think it'd be preferable to make a way to clone a >>> second independent struct file for the same socket than to start >>> mucking with per-descriptor state. >> [...] it should be easy to extend dup3() to make that possible [...] > If we were to add such a thing it should be called something else, > not dup, because it's a fundamentally different operation from dup > and we don't need people confusing them. Different? Some. But not very. It _is_ closely related to dup(). I don't think dup3() would be a horrible way to do it - not nearly as horrible as, say, the hack I once implemented where calling wait4(0x456d756c,(int *)0x61746f72,0x4d616769,(struct rusage *)0x633a2d29) would produce suitably magic effects. (This was on a 32-bit machine.) But, honestly, when I saw the idea my reaction was to make it a new operation to fcntl. F_REOPEN, say, since it's creating new per-open state. Or, if you want to be all overloady, how about open(0,O_REOPEN,existingfd)? It _is_ creating new per-open state, so open is in _some_ sense right. My choice, for what it's worth, would be fcntl, dup3 second, with O_REOPEN a distant third. /~\ The ASCII Mouse \ / Ribbon Campaign X Against HTMLmo...@rodents-montreal.org / \ Email! 7D C8 61 52 5D E7 2D 39 4E F1 31 3E E8 B3 27 4B
Re: Per-descriptor state
>>> I should probably add [close-on-fork] here, then, though use cases >>> will likely be rare. I can think of only one program I wrote where >>> it'd be useful; I created a "close these fds post-fork" data >>> structure internally. >> I can't think of any at all; to begin with it's limited to forks >> that don't exec, and unless just using it for convenience as you're >> probably suggesting, Yes. If the application does all the forking (ie, except for forks inside libraries, for which see below), it is just convenience, freeing the application from keeping track of which fds need closing. Well, except for libraries that open fds internally, without exposing them to the calling code. Depending on the use case, they may want them closed if the application forks. >> it only applies when also using threads, and if one's using threads >> why is one also using forks? Because one wants to exec a child process, maybe? >> So it seems like it's limited to badly designed libraries that want >> to fork behind the caller's back instead of setting up their forks >> at initialization time. Or something. What about libraries that fork _not_ behind the caller's back? (system(3) being, perhaps, the poster child.) > Or it is needed for a little used application called Firefox. What part of "badly designed" does that not fit? (Okay, admittedly, I don't know what Firefox looks like internally. But the UI design is bad enough I would expect the internals to be little better.) /~\ The ASCII Mouse \ / Ribbon Campaign X Against HTMLmo...@rodents-montreal.org / \ Email! 7D C8 61 52 5D E7 2D 39 4E F1 31 3E E8 B3 27 4B
Re: Per-descriptor state
>> Close-on-fork is apparently either coming or already here, [...] > We don't have it, but it will be in Posix-8. [...] not sure if > anyone has thought of a way to add it to socket() - It's looking to me as though more and more syscalls are having to grow flags arguments to do things right. Auto-close-on-fork for socket(), accept(), and socketpair(); per-operation non-blocking for at least some half-dozen calls. I don't see how to avoid it for socket(). For accept(), it could be shoehorned into a SOL_SOCKET sockopt (SO_AUTO_CLOFORK_ON_ACCEPT, say, better name welcomed). > that doesn't look to be trivial, though it might be possible to abuse > one of the params [socket()] has - probably domain - and add flags in > upper bits ... Possible? Probably. Good? No, IMO. > while having it able to be set/reset via fcntl is useful, to work, it > needs to be able to be set atomically with the operation that creates > the fd, Well, to work for one particularly important use case. It can work just fine for various other use cases without that. > and having it default "on", which would work, would break almost all > existing non-trivial code). What about having it default to a per-process (or per-thread) settable state? Mouse
Re: Per-descriptor state
> Close-on-fork is apparently either coming or already here, not sure > which, but it's also per-descriptor. I should probably add that here, then, though use cases will likely be rare. I can think of only one program I wrote where it'd be useful; I created a "close these fds post-fork" data structure internally. > The thing is, per-descriptor state is a mess and it shouldn't be > allowed to proliferate. The reason: the descriptor is an array > index. There's plenty of room to store stuff in the object it looks > up (that is, struct file) but storing stuff on the key side of the > array is a problem. (References to -current here really mean "filedesc.h 1.70 according to cvsweb.netbsd.org".) Looking at the include files, it looks to me as though descriptors are indices into an array of structs (struct fdfile) in -current or 5.2, or an index into two parallel arrays of pointers and flags in 1.4T. Those then point to the structs file (the per-open state). It's true the flags fields are chars (two bits used in 1.4T, two separate chars storing one bit each in 5.2 or -current). But it's a far cry from being as bad as you outline. There are multiple bits free, and, even if they run out, growing them from chars is a low (memory) cost on 1.4T and probably zero on 5.2 and -current on most ports. > For a couple bits you can mask them off from the pointer (though even > that's abusive); If that were what were being done, I would agree it's abusive. > more than that and suddenly you need to double the size of the > fdtable so it contains an extra machine word for random state bits as > well as the pointer to the struct file. That is quite possibly why 1.4T uses parallel arrays rather than an array of structs. In 5.2 and -current, there is enough additional state that someone (rightly, IMO) decided it wasn't worth the code complexity of keeping parallel arrays. (See struct fdfile in sys/filedesc.h for the additional state I'm talking about.) 
> (Then there's also another issue, which is that in nearly all cases > nonblocking I/O is a workaround for interface bugs, e.g. nonblocking > accept or open of dialout modem devices, or for structural problems > in software that also needs to use the same file handle, like your > original problem with curses. In the long run it's probably better to > kill off the reasons to want to use nonblocking I/O at all.) And replace nbio with...what? Multiple threads doing blocking calls? Or do you think everything should be nonblocking by default, with blocking behaviour implemented userland-side? Or am I completely misinterpreting somehow? > (also, "mirabile visu") What did I write? *checks* Oops. Thanks. /~\ The ASCII Mouse \ / Ribbon Campaign X Against HTMLmo...@rodents-montreal.org / \ Email! 7D C8 61 52 5D E7 2D 39 4E F1 31 3E E8 B3 27 4B
Per-descriptor state
Back in late March, I wrote here (under the Subject: Re: flock(2): locking against itself?) about a locking issue, which drifted into discussing per-descriptor state versus per-open state (thank you dholland for that phrasing!) versus backing-object state. In particular, someone (I forget who) said that non-blocking really should be per-operation rather than being state anywhere. That's correct, but, thinking about it since then, that is not as easy as one might wish, because there are quite a number of calls that can be affected. (Offhand: {,p}{read,write}{,v}, connect, accept, {send,recv}{,msg}, sendto, recvfrom. There are probably others, but this is sixteen already - though I'm not sure p{read,write}{,v} need the option; are there any seekable objects on which non-blocking is significant?) Some of these, such as send*, already have a flags argument that could grow a new bit to indicate nonblocking, but the rest - more than half - would need to have an alternative version created with a flags field, or some such. While hardly impossible, this gets annoying, and indeed might be best addressed by (to use read() as an example) making the real syscall backing read() always take a flags argument, with the libc stub supplying a fixed value for the flags when the flagless API is called. This is pushing towards making it per-descriptor state. At present, the versions I have don't have anything but close-on-exec as true per-descriptor state. A quick look at 9.1 and cvsweb.netbsd.org (which, mirabilu visu, actually works for me, unlike www.netbsd.org and mail-index.netbsd.org) sys/sys/filedesc.h makes me think that that's true of them as well. For backward compatibility, I would be inclined to leave the existing mechanisms in place, theoretically to be removed eventually. This also means divorcing "non-blocking open" from "non-blocking I/O after open". 
So: does anyone have any comments on the above analysis, or thoughts on good, bad, or even just notable effects making it real per-descriptor state might have? /~\ The ASCII Mouse \ / Ribbon Campaign X Against HTMLmo...@rodents-montreal.org / \ Email! 7D C8 61 52 5D E7 2D 39 4E F1 31 3E E8 B3 27 4B
Re: flock(2): locking against itself?
>> Another option is to use newterm() in curses with a funopen()ed >> stream for output which queues the output to be written >> (nonblockingly) to the real stdout. > Would toggling O_NONBLOCK using fcntl() work for you? A bit tedious, > but it can be done "per operation". ...ish. I hadn't thought of that. But, the way the code is structured, that actually isn't a crazy suggestion at all. I'm not sure it would work, but it's a very good thought. Thank you! Mouse
Re: flock(2): locking against itself?
> You probably already researched this, but it looks like newterm() is > in the curses library in NetBSD-5, It is. (I wouldn't've even discovered it if it weren't. I started looking at libcurses with an eye to providing some way to output data via a callback instead of a file descriptor and discovered newterm().) It's in 4.0.1's libcurses, even - or, at least, it's in my 4.0.1 derivative, and diff -u -r between that and 5.2's libcurses shows enough differences I doubt I ported it between them, so it probably is in the base OS I started with. But we have now definitely drifted off-topic for this list. Moving non-blocking I/O out of the object towards userland, that's on-topic here. Working around the issue in userland, not so much. > so getting it to work in NetBSD-1.4T shouldn't be that difficult. That's what I'm hoping. I'll see what I can get working in my Copious Spare Time Mouse
Re: flock(2): locking against itself?
>> [...non-blocking stdin vs stdout, with curses...] > The only way I've thought of to work around this is to fork a helper > process [...] I just realized this is not quite true. Another option is to use newterm() in curses with a funopen()ed stream for output which queues the output to be written (nonblockingly) to the real stdout. That, however, would mean backporting libcurses, because I'd like this to run on my 1.4T as well as 4.0.1 or 5.2, and 1.4T's libcurses has no newterm(). I've started looking at that backport, but it's going to take a while. /~\ The ASCII Mouse \ / Ribbon Campaign X Against HTMLmo...@rodents-montreal.org / \ Email! 7D C8 61 52 5D E7 2D 39 4E F1 31 3E E8 B3 27 4B
Re: flock(2): locking against itself?
> I may be missing something in your curses non-blocking case, but
> can't you work around the issue by setting up an independent file
> descriptor, and hence tty structure, by performing a dup2(2) on stdin
> and then closing the original stdin file descriptor?

No. dup2, like dup, creates another file descriptor reference to the original open file table entry. (Something very much like that is how stdin and stdout got set up to begin with.) Furthermore, in the case of non-blocking I/O, it is the underlying object (the tty, here) that is nonblocking, not the open file table entry and definitely not the file descriptor. As Taylor R Campbell said, nonblocking really _should_ be a property of the operation, not of the descriptor, not of the open file table entry, and _definitely_ not of the object.

The only way I've thought of to work around this is to fork a helper process which reads - blockingly - from stdin and writes the data to a pipe; that pipe is then set nonblocking in the main process and is independent of everything else. I might need a third process to make the reader process die reliably.

/~\ The ASCII Mouse \ / Ribbon Campaign X Against HTMLmo...@rodents-montreal.org / \ Email! 7D C8 61 52 5D E7 2D 39 4E F1 31 3E E8 B3 27 4B
Re: flock(2): locking against itself?
>> I'm not sure whether I think it'd be better for O_NONBLOCK to apply >> to the descriptor - [...] > O_NONBLOCK should really be part of the _operation_, not the > _object_. [...] Agreed - or, at least, I agree that it should be possible to make it part of the operation rather than of the object. Even aside from installed-base arguments, I'm not sure whether I think non-blocking mode on the object should go away. I'd have to think about it more. /~\ The ASCII Mouse \ / Ribbon Campaign X Against HTMLmo...@rodents-montreal.org / \ Email! 7D C8 61 52 5D E7 2D 39 4E F1 31 3E E8 B3 27 4B
Re: flock(2): locking against itself?
>> "They're per-open"

> That's not bad for this level of description.

Agreed!

>> ...which is not actually difficult to understand since it's the same
>> as the seek pointer behavior; that is, seek pointers are per-open
>> too.

> and almost all the other transient attributes, that is distinct from
> stable attributes like owner, group, permissions, which are inode
> attributes. In our current system I think just close on exec is a
> per fd (transient) attribute though if we follow linux (I think) and
> soon to be POSIX, and add close on fork, that would be another.

I actually ran into a case where this distinction caused trouble. I have a program that uses curses for output but wants to do non-blocking input. So I set stdin nonblocking and threw fd 0 into the poll() loop. But, in normal use, stdin and stdout come from the same open on the session's tty, so setting stdin nonblocking also sets stdout nonblocking, which curses is not prepared to handle, leading to large screen updates getting partially lost.

I'm not sure whether I think it'd be better for O_NONBLOCK to apply to the descriptor - if that could even be done; the way things are currently designed, in a lot of cases it needs to get pushed all the way to the underlying object, in my case the tty (which then is responsible for making that state non-permanent). I may need to find some other approach.

I've also wished for a way to suppress SIGPIPE akin to MSG_NOSIGNAL to send(). This is relevant here because it would be useful to have it as a per-descriptor (or for my immediate use case even a per-open-file) option, though it would also be nice to have it as a per-call option (akin to Linux's pwritev2, though the pwritev2 manpage I checked doesn't have any such flag - it merely has room for it).

/~\ The ASCII Mouse \ / Ribbon Campaign X Against HTMLmo...@rodents-montreal.org / \ Email! 7D C8 61 52 5D E7 2D 39 4E F1 31 3E E8 B3 27 4B
Re: flock(2): locking against itself?
> Locks are on files, not file descriptors.

Except they aren't. They're on open file table entries, something remarkably difficult to describe in a way that doesn't just refer to the kernel-internal mechanism behind it (which for _this_ list isn't a big deal, but...). If they were truly on files, rather than open file table entries, then it wouldn't matter whether my test program opened the file once or twice, since it's the same file either way.

> Applying flock() to an already locked (of this kernel file*) file is
> an attempt to upgrade, or downgrade (including unlock) the file,

Hm, okay, I can see how the second flock call in my test was taken as an attempt to equalgrade (neither upgrade nor downgrade) the exclusive lock to another exclusive lock. I'll have to think more about my locking paradigm.

/~\ The ASCII Mouse \ / Ribbon Campaign X Against HTMLmo...@rodents-montreal.org / \ Email! 7D C8 61 52 5D E7 2D 39 4E F1 31 3E E8 B3 27 4B
flock(2): locking against itself?
I ran into an issue, which turned out to be, loosely put, that an open file table entry cannot lock against itself. Here's a small test program (most error checking omitted for brevity):

#include <stdio.h>
#include <string.h>
#include <errno.h>
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/wait.h>

static int fd;

int main(void);
int main(void)
{
 fd = open(".lock",O_RDWR|O_CREAT,0666);
 if (fork() == 0)
  { if (flock(fd,LOCK_EX|LOCK_NB) < 0)
     { printf("lock child: %s\n",strerror(errno));
       exit(1);
     }
    sleep(5);
    exit(0);
  }
 if (flock(fd,LOCK_EX|LOCK_NB) < 0)
  { printf("lock parent: %s\n",strerror(errno));
    exit(1);
  }
 sleep(5);
 wait(0);
 return(0);
}

An earlier version skipped the fork, doing the two flock calls in succession from the same process, without the sleeps. Neither version produces EWOULDBLOCK from either flock call on any of the systems I tried it on (my mutant 1.4T, my mutant 5.2, and a guest account on a stock (as far as I know) 9.0). This is not what I was expecting. On examination, the manpages available to me (including the one at http://man.netbsd.org/flock.2) turn out to say nothing to clarify this.

Moving the open after the fork, so the parent and child open separately, I do, of course, get the expected EWOULDBLOCK from one process.

Is this expected behaviour, or is it a bug?

/~\ The ASCII Mouse \ / Ribbon Campaign X Against HTMLmo...@rodents-montreal.org / \ Email! 7D C8 61 52 5D E7 2D 39 4E F1 31 3E E8 B3 27 4B
Re: [GSoC] Emulating missing Linux syscalls project questions
> (2) Is there a better binary-finding strategy than trying Linux > binaries on NetBSD, and if they fail (have a script) compare the > output of strace from a Linux run of the program with the table in > sys/compat/linux/arch/*/linux_syscalls.c? Better? Maybe, maybe not. But what I did in a similar case (an emulator that emulated just userland, handling directly anything that traps to the kernel on real hardware) was to `implement' any unimplemented syscalls with code that just prints a message and terminates. Here, that would map into unimplemented syscalls printing/logging something and killing the process. Obviously, that code would not survive into the end result, but something like it might be a useful intermediate step. /~\ The ASCII Mouse \ / Ribbon Campaign X Against HTMLmo...@rodents-montreal.org / \ Email! 7D C8 61 52 5D E7 2D 39 4E F1 31 3E E8 B3 27 4B
Re: building 9.1 kernel with /usr/src elsewhere?
>> Does it make a difference if you set >> NETBSDSRCDIR=/home/abcxyz/netbsd-9.1/usr/src when you run make? > Yes, that appears to make the symptom go away. Also, I can reproduce the problem by setting NETBSDSRCDIR=/infibulated/gonkulator when running make depend even with a source tree in /usr/src. Is this enough of a bug that it's worth sending-pr? Or is this a case of me expecting something that's no longer supported to work? /~\ The ASCII Mouse \ / Ribbon Campaign X Against HTMLmo...@rodents-montreal.org / \ Email! 7D C8 61 52 5D E7 2D 39 4E F1 31 3E E8 B3 27 4B
Re: building 9.1 kernel with /usr/src elsewhere?
Omnibus reply here. Thank you, everyone; I have a better understanding of the actual problem (admittedly that's a low bar, given how little I understood it before) and two different workarounds.

[Brian Buhrow]

> hello. I regularly build kernels outside of the /usr/src location.

But does /usr/src exist at the time?

> My technique is to install the source in some location:
> /usr/local/netbsd/src-91, for example, then put my configuration file
> in: /usr/local/netbsd/src-91/sys/arch//conf/
> Then
> cd /usr/local/netbsd/src-91/sys/arch//conf
> config
> cd ../compile/
> make -j 4 >& make.log

I'd prefer to avoid assuming the user who wants to build the kernel can write into the source directory tree. You may note the source tree was (admittedly only by implication) owned by abcxyz but I was doing the build as mouse.

That said, this does appear to work. (Well, I didn't use -j, but then, I practically never do, and the machine I'm doing this on is single-core.) I didn't wait for the whole build, but it doesn't fail fast the way the run that prompted my mail did.

[Taylor R Campbell]

> Does it make a difference if you set
> NETBSDSRCDIR=/home/abcxyz/netbsd-9.1/usr/src when you run make?

Yes, that appears to make the symptom go away. (I probably would not have stumbled across that; /usr/src/BUILDING mentions NETBSDSRCDIR only twice, neither time documenting it, only mentioning it in examples. It likely would have taken enough digging to locate the actual culprit for me to discover it. But still, it does seem to work.)

> I always build out of my home directory, never /usr/src, but I also
> always use build.sh and the make wrapper it creates [...]

Ugh, I hate using build.sh for small things like individual kernels. It always (well, far too often, at least) insists on rebuilding make, which takes significant time on some machines, like my shark, and requires extra writable filesystem space. If there's a reasonably easy way to avoid it, I prefer to. 
That said, if NetBSD wants to desupport building even kernels without using build.sh, that's its choice; what I think of it doesn't matter. (But I do think that, in that case, config(1) should be documented as an internal tool, not intended for use other than by build.sh.)

[Johnny Billquist]

> You should build the kernel using build.sh, [...]

See above.

> Don't try to make things complicated by doing all that stuff by hand.
> :-)

build.sh _is_ the complicated way, to me. It's a large, complex, and slow tool I find even less understandable than config(1) and make(1). It also has way too much "when we want your opinion we'll give it to you" for my taste. Which I suppose is just another way of saying that NetBSD, having lost interest in users like me, is moving even farther in that same direction. Sic transit gloria mundi.

I don't squawk much about build.sh because it does bring benefits; the biggest one I notice is probably painless cross-compiles. But I'd never run into this price before. 5.2 doesn't exhibit the misbehaviour at all, so I couldn't've noticed it except at work, and I think I've never tried to build a kernel without /usr/src in place before (at work or not).

[matthew green]

>> make[1]: don't know how to make absvdi2.c. Stop

> what happens if you run "make USETOOLS=no"?

Fast failure, superficially looking like the same one.

[Valery Ushakov]

> Mail-Followup-To: matthew green ,
> Mouse , tech-kern@netbsd.org

Um, why would you think I'd want people to mail followups to _me_? I would prefer - though admittedly it's a weak preference, weak enough I practically never mention it - that people _not_ mail me when they're already sending to the list.

> The problem is that NETBSDSRCDIR cannot be inferred for a randomly
> located kernel builddir and sys/lib/libkern/Makefile.compiler-rt uses
> it.

In that case, maybe config(1) should write a suitable setting of NETBSDSRCDIR into the Makefile it generates? At least when -s is given with an absolute path? 
> Our makefile spaghetti is a bit out of control. I've felt so often enough myself. /~\ The ASCII Mouse \ / Ribbon Campaign X Against HTMLmo...@rodents-montreal.org / \ Email! 7D C8 61 52 5D E7 2D 39 4E F1 31 3E E8 B3 27 4B
building 9.1 kernel with /usr/src elsewhere?
Okay, I'm trying to help someone with a NetBSD 9.1 machine at work. Today's issue is one I had trying to build a kernel. We needed to do this because that project has a few customizations in the system needed to build the application layer, and more needed to run it.

He installed stock 9.1, but did not install any source; /usr/src and /usr/xsrc did not exist. We then set up the customized source trees in his homedir, which I will here call /home/abcxyz, under a directory 9.1. Thus, the path to kernel source, for example, was /home/abcxyz/netbsd-9.1/usr/src/sys.

Then I copied a kernel config (GEN91) into ~/kconf/GEN91, from back when I was working on that project. I then ran

% config -b ~/kbuild/GEN91 -s /home/abcxyz/netbsd-9.1/usr/src/sys ~/kconf/GEN91

This completed apparently normally, reporting the build directory and telling me to remember to make depend. I then went to ~/kbuild/GEN91 and ran make depend && make. It failed fast - no more than a second or two - with

make[1]: don't know how to make absvdi2.c. Stop

(full log below).

I then moved /home/abcxyz/netbsd-9.1/usr/{src,xsrc} to /usr/{src,xsrc}, chowned them -R to 0, destroyed ~/kbuild/GEN91, and repeated, only this time I passed /usr/src/sys to config's -s flag. This time the kernel built fine (at least apparently - we haven't tried booting it yet, but I've built enough kernels to be confident there are no obvious red flags in the log; it certainly did not fail a second or two in with a cryptic message about absvdi2.c). Note in particular that the source tree content was identical; only the path and ownership differed.

Is this a bug? Or am I doing something wrong?

/~\ The ASCII Mouse \ / Ribbon Campaign X Against HTMLmo...@rodents-montreal.org / \ Email! 7D C8 61 52 5D E7 2D 39 4E F1 31 3E E8 B3 27 4B

The logfile includes one line that's over 3300 characters long, guaranteeing that it'd get mangled by email (soft limit 78 characters, hard limit 998). 
So, I ran the logfile through bzip2 and then base64-encoded the output. Here's the result: QlpoOTFBWSZTWfzLRGEAA6xfgGAQSZP/cj/v36qwUANjwoKAAGSNUbUn6mpv U1NPU8hAZB6mQA9EMymmagOaMmJgAmIwI0wIMRgmTAIw5oyYmACYjAjTAgxGCZMA jDmjJiYAJiMCNMCDEYJkwCMEUQkwmQJoBR6jR5QA02ppo0Gm2qaekaI6tlUVe/S1 HAv0T332gt7Q3uZ7SnDMNrMXxRm2qdT965avS81vYtWJqOVqFHMMG2lXjZAzLhEs 27kbS8RZOB1YXdRZGFfB+Vv64SXOnJbdALlWMEEIQABCjGnQhAOAkGYImXVAYFSV gDpIIgIk4iJU661xD4IOouYQv7DytQCYMTqXUIYqMGgCfbG40vx0e6+cYsRMxWbG Ry7dS5oRQAwI3XHxETDWFMrFa1+URPAsCmnoETwfSvQInSHI418MxAY/xIzvVCzA h8iP4i6OcEEPxhJRZKmkZ6dtYVdUDx1F0h7z+ryOBD/FDb5RQsf2pwBlyggpIeVR 92uNxc3aZYSUPe2TV3aWC5CNOUBELwKW0ESTM5thvQmeUmlz7thjtk0ZKaUlm7gU OASVllCpleWIQpx0y3morldhNMw1tTjtzvoRkc3rx23xrveRHQbDZTLTOChswIDD Hd1nzGHP6A6yTEnMgL84Z9SCSxgYH37YdFFZdV2mdEHoHkd4UEEKgyDwOZDxQfyI LY3SF3TJCfWg+wRPXIPqESyDXzK+OW6lP1KEJkIlSAsAbcBcD4Z3iJiVKGIiUIVP 2N8kaG0HbAXu37Pbdt2AG2OyBH02gG/nzEShoSSVebJPP+Tvi5LgcVHE0ETWo+og lPYdofIamwKb8sqHQKEUOcsc6XWAN4TwoSAdclLBN1QIjmJqi9pYEKRoAYHCIhB2 JbpbzqBS+ki6xaG47NWFzrJmUkSRYUZQK3CJKcZTX/htKF+MY7OgFIFyd+gvIvDY 9NSBvxpfE85ThkVMyOwRJdkiJYLaw0Bb1Qzg77rchEyOLkoB1ApmIlADnLcxULoE TYnnCmYiQ6/ao6pPnJ+4sPC3GA6D4mvuANDYUESE1XGG9JKlu+DVtUdYidGAibhE 4iJ/4u5IpwoSH5lojCA=
Re: crash in timerfd building pandoc / ghc94 related
>> It seems so far, from not really paying attention, that there is
>> nothing wrong with ghc but that there is a bug in the kernel.

> Yes of course no userland code should be able to crash the kernel :D

I used to think so. Then it occurred to me that there are various ways for userland to crash the kernel which are perfectly reasonable, where of course "reasonable" is a vague term, meaning maybe something like "I don't think they indicate anything in need of fixing". Perhaps the simplest is

dd if=/dev/urandom bs=65536 of=/dev/mem

but there are others. Yet I can't help feeling that there is some sense in which it *is* fair to say that userland should never be able to crash the kernel. I have been mulling over this paradox for some time but have not come up with an alternative phrasing that avoids the reasonable crashes while still capturing a significant fraction of the useful meaning.

/~\ The ASCII Mouse \ / Ribbon Campaign X Against HTMLmo...@rodents-montreal.org / \ Email! 7D C8 61 52 5D E7 2D 39 4E F1 31 3E E8 B3 27 4B
Re: Finding the slot in the ioconf table a module attaches to?
> Usually your user program would attempt to open xyz0 which would find
> the major/minor in the devsw tables. You're relying on a hard coded
> major. That's the difference.

Okay, I'm probably exposing my ignorance of something here, but, what's the difference? You still have to get the major number, whether chosen at module load time or chosen at kernel build time, into the /dev/xyz0 special file in the filesystem. That requires exfiltrating it from the kernel *somehow*.

/~\ The ASCII Mouse \ / Ribbon Campaign X Against HTMLmo...@rodents-montreal.org / \ Email! 7D C8 61 52 5D E7 2D 39 4E F1 31 3E E8 B3 27 4B
Re: Add five new escape sequences to wscons
>> Technically the wscons terminal type is wsvt25, an extended ANSI >> compatible terminal, already supporting more sequences than vt100. Well, it calls itself "vt100" >> Having it also support a useful subset of xterm [...] seems like a >> useful addition, > 100% agree. Oh, certainly. It just seems to me that the name "vt100" for that emulation type is becoming more and more misleading. I have nothing at all against adding the sequences in question, except the mismatch between the implications of the name and the actual emulation (and even that is relatively weak, given how many "vt100" emulations are at least as far from VT-100s as wscons is). /~\ The ASCII Mouse \ / Ribbon Campaign X Against HTMLmo...@rodents-montreal.org / \ Email! 7D C8 61 52 5D E7 2D 39 4E F1 31 3E E8 B3 27 4B
Re: Add five new escape sequences to wscons
>> [wscons's "vt100" is] not a very good VT-100 emulation, handling a >> bunch of sequences differently from VT-100s (mostly things wscons >> implements but VT-100s don't) and having a handful of other >> mismatches (such as supporting sizes other than 80x24 and 132x24). > A lot of the sequences that it supports are from later VT-series > terminals. Sure, and other things from X3.64. When I did the X3.64 (and, if turned on, ISO 6429 colour SGR values) mode for my terminal emulator, I called the emulation types "ansi" and "decansi" for basically this reason. (The difference between ansi and decansi is support for various DEC extensions, such as scrolling regions or ?-flagged arguments to CSI h and CSI l.) /~\ The ASCII Mouse \ / Ribbon Campaign X Against HTMLmo...@rodents-montreal.org / \ Email! 7D C8 61 52 5D E7 2D 39 4E F1 31 3E E8 B3 27 4B
Re: Add five new escape sequences to wscons
> Here is a patch to implement five additional escape sequences in the > wscons vt100 emulation. Not to pick on this particular addition...but is it really still appropriate to call it "vt100"? It's not a very good VT-100 emulation, handling a bunch of sequences differently from VT-100s (mostly things wscons implements but VT-100s don't) and having a handful of other mismatches (such as supporting sizes other than 80x24 and 132x24). /~\ The ASCII Mouse \ / Ribbon Campaign X Against HTMLmo...@rodents-montreal.org / \ Email! 7D C8 61 52 5D E7 2D 39 4E F1 31 3E E8 B3 27 4B
Re: ATA TRIM?
>> I find it far more plausible that I'm doing something wrong. > Or maybe the drive just doesn't obey the spec? That's possible, I suppose. But it's a brand new Kingston SSD, which I would expect would support TRIM. And it self-identifies as supporting TRIM. The packaging promises free technical support. I suppose I should try to chase down a contact (the packaging gives no hint whom to contact for that promised support) and ask. At worst I'll be told nothing useful. /~\ The ASCII Mouse \ / Ribbon Campaign X Against HTMLmo...@rodents-montreal.org / \ Email! 7D C8 61 52 5D E7 2D 39 4E F1 31 3E E8 B3 27 4B
Re: ATA TRIM?
>> According to that PDF, dholland is wrong. > I fail to see a behaviour that would be allowed due to dholland@'s > definition, but not according to the one you cited, nor the other way > round. A read returning the pre-TRIM contents. Two of the options specifically state "independent of the previously written value"; the third is simply zero, which is also independent of the previously written value. dholland wrote > The state of the data after TRIM is unspecified; you might read the > old data, you might read zeros or ones, you might (I think) even read > something else. and, as I read that PDF, "you might read the old data" is specifically disallowed. You may read zeros or ones, or something else, but the only way you'll read the old data is if the old data matches what the drive's algorithm happens to return for those sectors (for example, if the drive returns zeros but zeros were what you had written). It is theoretically possible that the data I wrote happens to match what the drive returns for trimmed sectors. Given the data, I find that extremely unlikely. (I may try again with different data, just in case, but I still don't like the way the command is timing out. I find it far more plausible that I'm doing something wrong.) /~\ The ASCII Mouse \ / Ribbon Campaign X Against HTMLmo...@rodents-montreal.org / \ Email! 7D C8 61 52 5D E7 2D 39 4E F1 31 3E E8 B3 27 4B
Re: ATA TRIM?
[dholland]
> The state of the data after TRIM is unspecified; you might read the
> old data, you might read zeros or ones, you might (I think) even read
> something else.

[RVP]
> OK, I've now actually looked at what the spec[1] says instead of
> relying on my faulty recall of stuff I read on lwn.net years ago.
> [1] [...]
> https://web.archive.org/web/20200616054353if_/http://t13.org/Documents/UploadedDocuments/docs2017/di529r18-ATAATAPI_Command_Set_-_4.pdf

According to that PDF, dholland is wrong. PDF page 150, page-number page 113, includes examples of "properties associated with trimmed logical sectors" including

a) no storage resources; and
b) read commands return:
   A) a nondeterministic value that is independent of the previously written value;
   B) a deterministic value that is independent of the previously written value; or
   C) zero.

though it seems to me (b)(C) is actually a special case of (b)(B). See table 33, later on that page, for more.

/~\ The ASCII Mouse \ / Ribbon Campaign X Against HTMLmo...@rodents-montreal.org / \ Email! 7D C8 61 52 5D E7 2D 39 4E F1 31 3E E8 B3 27 4B
Re: ATA TRIM?
> are you trying to trim a really large section at once? i think
> that's what i see:
>> [ - root] 3> date; ./trim /dev/rwd1d 4 2; date

That means "first six bytes contain 4, LE; second two bytes contain 2, LE". I thought that in turn meant "2 sectors at offset 4". Apparently it actually means "2 * max_dsm_blocks at offset 4", but max_dsm_blocks is 8 for this device, so that's still only 8K.

> at least in my experience, the problem is that most devices take a
> while to handle a TRIM request, longer than the 30s timeout typically
> used.

That's...odd. How can it be useful if it takes that long? Is the intent that it be useful only for very occasional "erase this whole filesystem" use, or what? I thought it was intended for routine filesystem use upon deleting files.

> this is why blkdiscard(8) defaults to 32MiB chunks.

I once did what I thought was trying to trim 16M, but my current understanding says that attempt would have been 128M. That didn't work any better.

I just tried increasing the timeout to 30 (ie, five minutes) and trimming offset 0 size 8, which I now think for this device (with max_dsm_blocks 8) should mean 64 (interface) sectors, ie, 32k. It still timed out, with the same followup timeouts. Note the date output here; it took five minutes for the TRIM to time out, then thirty seconds for wd_flushcache. 
[ - root] 4> date; trim /dev/rwd1d 0 8; date
Mon Dec 12 08:22:29 EST 2022
TRIM wd1: arg 00 00 00 00 00 00 08 00
Version 2040.283, max DSM blocks 8
TRIM wd1: calling exec
piixide1:0:1: lost interrupt
	type: ata tc_bcount: 512 tc_skip: 0
TRIM wd1: returned 1
ATAIOCTRIM workd
wd1: wd_flushcache: status=128
Mon Dec 12 08:27:59 EST 2022
[ - root] 5> dd if=/dev/rwd1d of=/dev/null count=8
piixide1:0:1: wait timed out
wd1d: device timeout reading fsbn 0 (wd1 bn 0; cn 0 tn 0 sn 0), retrying
wd1: soft error (corrected)
8+0 records in
8+0 records out
4096 bytes transferred in 0.005 secs (819200 bytes/sec)
[ - root] 6>

> maybe port that tool back,

I'll try to have a look at it. I haven't been trying to match the -9 userland API, though, so I'm not sure how useful it will actually be. It may point me in a useful direction, though.

> it's also supposed to match the linux command of the same name. it's
> not in netbsd-9, but last i tried, the interfaces the -current tool
> uses are available in -9 kernels.

I did bring over the 9.2 syssrc set, so I should be able to figure _something_ out.

/~\ The ASCII Mouse \ / Ribbon Campaign X Against HTMLmo...@rodents-montreal.org / \ Email! 7D C8 61 52 5D E7 2D 39 4E F1 31 3E E8 B3 27 4B
Re: ATA TRIM?
>> OK, so any requests >4K will have to be packaged into further range
>> requests [...]

> This isn't right. Bytes 7 & 8 of a TRIM range request form a
> counter. So, a counter of 1 = (1 x max_dsm_blocks); 2 = (2 x
> max_dsm_blocks) up to 0xffff counts.

So is max_dsm_blocks misnamed, or is it just being abused as a dsm_granularity value by TRIM, whereas other DSM commands do use it as a maximum? If the former, I'd like to rename it in my tree.

> And you can have 64 range requests (contiguous or disjoint) in a 512
> byte DSM payload.

You clearly know a lot more about the relevant commands than I do, though admittedly at the moment that's a very very low bar.

> Start with a `count' of 1 after you set the LBA48 flag.

Once I figure out how to get some analog to LBA48, at least. :) Yes, my code sets r_count to 1 because the code I started with does analogously. Until I saw your email, I had no idea there was even any way to _represent_ multiple ranges in a single request.

/~\ The ASCII Mouse \ / Ribbon Campaign X Against HTMLmo...@rodents-montreal.org / \ Email! 7D C8 61 52 5D E7 2D 39 4E F1 31 3E E8 B3 27 4B
Re: ATA TRIM?
>> I tried trimming 8 at 0. Still the same syndrome: TRIM timeout, >> flush timeout, device timeout reading [...] > You may have to set the AT_LBA48 flag (not sure if this is present on > 5.2) It is not. 5.2 has an ATA_LBA48 flag, going with the flags field of struct ata_bio, but no LBA48 flag for ata_command.flags. My evolution of 5.2 has AT_READREG48, which I added as part of my attempt to support HPAs. But that's the closest thing I see, and that's not really close enough to be useful here. > so that `wdccommandext' gets called rather than `wdccommand' for the > ATA_DATA_SET_MANAGEMENT command. All this from [FreeBSD] And presumably the NetBSD wdccommand/wdccommandext difference matches the FreeBSD one closely enough for that to be relevant? I shall have to read wdccommand* over in more detail. Mouse
Re: ATA TRIM?
>> Okay, that now seems unlikely. I tried to TRIM 32M at zero.

(Actually, 16M - 32K blocks is 16M.)

> What is the value of `max_dsm_blocks' that your drive reports?
> Unfortunately, atactl(8) doesn't show this currently.

I added that - and the version numbers - to my printf. atap_ata_major is 2040, 0x7f8. atap_ata_minor is 283, 0x11b. max_dsm_blocks is 8.

I tried trimming 8 at 0. Still the same syndrome: TRIM timeout, cache flush timeout, device timeout reading - and I just now noticed that the last timeout is a timeout reading wd*0*. This leads me to suspect that it's the host hardware, not the drive, that's falling over here (presumably trying to load dd to read wd1 with). Is that plausible?

I did another test. I tried to trim 8 at 0, but, first, I started a loop that reads successive blocks of wd0, the OS's disk, one per second, printing timestamps as it goes. wd0 access locks up during the TRIM attempt. One read got through between that and the cache flush; it locked up again during that. It then came back. But when I tried to read wd1 it locked up again during that. Dunno what all this means.

/~\ The ASCII Mouse \ / Ribbon Campaign X Against HTMLmo...@rodents-montreal.org / \ Email! 7D C8 61 52 5D E7 2D 39 4E F1 31 3E E8 B3 27 4B
Re: ATA TRIM?
[Replying to two messages at once here, both from the same person]

[First message]

>> printf("TRIM %s: calling exec\n",device_xname(wd->sc_dev));
>> rv = wd->atabus->ata_exec_command(wd->drvp,&cmd);
>> printf("TRIM %s: returned %d\n",device_xname(wd->sc_dev),rv);
>> return(0);

> ata_exec_command() will start the command, but, the completion of it
> is usually signalled by an interrupt. Presumably, the 9.2
> ATA-related code takes care of this as ata_exec_command() takes a
> `xfer' parameter rather than a bare command struct. How does 5.2
> wait for ATA command completion?

I will have to dig into that more. It does seem to be waiting, in that the call does not return until the thirty seconds specified in the timeout field have elapsed. (It then takes about another 30s before printing the cache-flush timeout message and returning to userland.) Since the data on the device is still there afterwards, I don't think it's just a question of not correctly handling completion. If it were, I'd expect the operation to work in the sense of dropping the blocks described by the argument values.

[Other message]

>> case ATAIOCTRIM:
>>  { unsigned char rq[512];
>>    struct ata_command cmd;
...
>>    rv = wd->atabus->ata_exec_command(wd->drvp,&cmd);
>>    printf("TRIM %s: returned %d\n",device_xname(wd->sc_dev),rv);
>>    return(0);
>>  }

> Ah, shouldn't `cmd' be allocated memory rather than being a
> locally-scoped variable?

Why? cmd.flags specifies AT_WAIT, and as I remarked above it is indeed waiting, so cmd, on the kernel stack, should outlive the I/O attempt.

/~\ The ASCII Mouse \ / Ribbon Campaign X Against HTMLmo...@rodents-montreal.org / \ Email! 7D C8 61 52 5D E7 2D 39 4E F1 31 3E E8 B3 27 4B
Re: ATA TRIM?
>> [...TRIM...]

> It could perhaps be that the area you're trying to trim is too small,
> or badly aligned?

Okay, that now seems unlikely. I tried to TRIM 32M at zero. (Much more than that seems implausible, since the request has only 16 bits of size, so the maximum representable size is 65535 blocks, or a smidgen under 64M. And zero certainly ought to be aligned.) The behaviour is basically the same. Except for the details, like the argument area, it looks the same:

[ - root] 3> trim /dev/rwd1d 0 32768
TRIM wd1: arg 00 00 00 00 00 00 00 80
TRIM wd1: calling exec
piixide1:0:1: lost interrupt
	type: ata tc_bcount: 512 tc_skip: 0
TRIM wd1: returned 1
ATAIOCTRIM workd
wd1: wd_flushcache: status=128
[ - root] 4>
[ - root] 4> dd if=/dev/rwd1d of=/dev/null count=64
piixide1:0:1: wait timed out
wd1d: device timeout reading fsbn 0 (wd1 bn 0; cn 0 tn 0 sn 0), retrying
wd1: soft error (corrected)
64+0 records in
64+0 records out
32768 bytes transferred in 0.008 secs (4096000 bytes/sec)
[ - root] 5>

That is, the request starts and nothing happens until the 30-second timeout expires, at which point it reports "lost interrupt" and says it worked. It then reports another timeout on cache flush. Attempting to read gives _another_ timeout, from which it recovers and then works. And, as before, reading the beginning of the drive indicates that the first hundred sectors, at least, still retain the test data I wrote to them before I started all this.

Hm, the device packaging promises free technical support. As cynical as I may be about vendor support, I suppose I really ought to call them up and see if they can put me in touch with someone who actually knows how TRIM works. I don't really have anything to lose except some time.

/~\ The ASCII Mouse \ / Ribbon Campaign X Against HTMLmo...@rodents-montreal.org / \ Email! 7D C8 61 52 5D E7 2D 39 4E F1 31 3E E8 B3 27 4B
Re: ATA TRIM?
>>> I'm trying to understand TRIM, such as is used on SSDs.  [...]
>> [...]

> It could perhaps be that the area you're trying to trim is too small,
> or badly aligned?

Entirely possible.  What are the restrictions?  Are they
device-specific, or generic?

(While wedging seems like a rather broken response to such issues,
I've seen brokener.)

					Mouse
Re: ATA TRIM?
I wrote

> I'm trying to understand TRIM, such as is used on SSDs.  [...]

I forgot to ask: does anyone know whether TRIM is known to work?  It
occurs to me that I don't actually know whether the code I'm trying to
backport works.  The code looks more or less identical in current,
according to cvsweb, but that still doesn't tell me whether anyone is
_using_ it.

/~\ The ASCII Mouse
\ / Ribbon Campaign
 X  Against HTML	mo...@rodents-montreal.org
/ \ Email!		7D C8 61 52 5D E7 2D 39 4E F1 31 3E E8 B3 27 4B
ATA TRIM?
I'm trying to understand TRIM, such as is used on SSDs.  As a first
step towards this, I'm trying to do a rudimentary backport to a 5.2
derivative I'm using - nothing teaches a thing like implementing it.

I found wd_trim() in 9.2's wd.c and had a stab at integrating a form
of it into my kernel.  It doesn't work, as you can probably infer from
my writing this mail.  Userland issues the ioctl I'm using as an
early-stage API, printfs indicate that my kernel code is running,
and...it times out.

I'm writing to ask if there's anyone who knows TRIM well enough to
have a stab at telling what's wrong and is willing to try.  Anyone not
interested in details can stop reading now without loss; the rest of
this mail details what I've done and what I got.  I've presumably made
some mistake somewhere, but it's not clear to me.

Here's the code I dropped into wdioctl(), adapted from 9.2's
wd_trim().  I also lifted ataparams fields 129 through 208 from 9.2,
including things such as ATA_SUPPORT_DSM_TRIM.  (5.2 has all those
fields as a reserved [80] array.)

	case ATAIOCTRIM:
		{	unsigned char rq[512];
			struct ata_command cmd;
			int rv;
			if (! (flag & FWRITE)) return(EBADF);
			if (! (wd->sc_params.atap_ata_major & WDC_VER_ATA7)) {
				printf("ATAIOCTRIM: %s: not ATA-7\n",device_xname(wd->sc_dev));
				return(EINVAL);
			}
			if (! (wd->sc_params.support_dsm & ATA_SUPPORT_DSM_TRIM)) {
				printf("ATAIOCTRIM: %s: has no TRIM support\n",device_xname(wd->sc_dev));
				return(EINVAL);
			}
			bcopy(addr,&rq[0],8);
			printf("TRIM %s: arg %02x %02x %02x %02x %02x %02x %02x %02x\n",
				device_xname(wd->sc_dev),
				rq[0], rq[1], rq[2], rq[3],
				rq[4], rq[5], rq[6], rq[7]);
			bzero(&rq[8],512-8);
			bzero(&cmd,sizeof(cmd)); // XXX API botch
			cmd.r_command = ATA_DATA_SET_MANAGEMENT;
			cmd.r_count = 1;
			cmd.r_features = ATA_SUPPORT_DSM_TRIM;
			cmd.r_st_bmask = WDCS_DRDY;
			cmd.r_st_pmask = WDCS_DRDY;
			cmd.timeout = 3;
			cmd.data = &rq[0];
			cmd.bcount = 512;
			cmd.flags |= AT_WRITE | AT_WAIT;
			printf("TRIM %s: calling exec\n",device_xname(wd->sc_dev));
			rv = wd->atabus->ata_exec_command(wd->drvp,&cmd);
			printf("TRIM %s: returned %d\n",device_xname(wd->sc_dev),rv);
			return(0);
		}
		break;

When I run my userland program, I get

[ - root] 3> date; ./trim /dev/rwd1d 4 2; date
Wed Dec 7 11:46:43 EST 2022
TRIM wd1: arg 04 00 00 00 00 00 02 00
TRIM wd1: calling exec
piixide1:0:1: lost interrupt
	type: ata tc_bcount: 512 tc_skip: 0
TRIM wd1: returned 1
ATAIOCTRIM workd
wd1: wd_flushcache: status=128
Wed Dec 7 11:47:43 EST 2022
[ - root] 4> 

1 is ATACMD_COMPLETE.  (The "ATAIOCTRIM workd" message is coming from
the userland program.)  Then attempting to read the drive times out
but recovers:

[ - root] 4> dd if=/dev/rwd1d of=/dev/null bs=512 count=64
piixide1:0:1: wait timed out
wd1d: device timeout reading fsbn 0 (wd1 bn 0; cn 0 tn 0 sn 0), retrying
wd1: soft error (corrected)
64+0 records in
64+0 records out
32768 bytes transferred in 0.008 secs (4096000 bytes/sec)
[ - root] 5> 

Reading the device after that, I find the original contents are still
accessible up through (at least) sector 17, so the TRIM did not
actually work.

wd1 is a Kingston SATA SSD:

wd1 at atabus1 drive 1:
wd1: drive supports 1-sector PIO transfers, LBA48 addressing
wd1: HPA enabled, no protected area
wd1: 111 GB, 232581 cyl, 16 head, 63 sec, 512 bytes/sect x 234441648 sectors
wd1: 32-bit data port
wd1: drive supports PIO mode 4, DMA mode 2, Ultra-DMA mode 6 (Ultra/133)
wd1: non-rotational device
wd1(piixide1:0:1): using PIO mode 4, Ultra-DMA mode 6 (Ultra/133) (using DMA)

/~\ The ASCII Mouse
\ / Ribbon Campaign
 X  Against HTML	mo...@rodents-montreal.org
/ \ Email!		7D C8 61 52 5D E7 2D 39 4E F1 31 3E E8 B3 27 4B
Re: #pragma once
> Traditionally to avoid problems with repeated inclusion of a header
> file, you put #include guards around it, say in sys/dev/foo.h:
> [...]
> With newer compilers this can be replaced by a single line in the
> header file:
> #pragma once

Some newer compilers, perhaps.  Unless and until it is standardized,
there's no telling what #pragma once might mean to the next compiler
to come along - except that, for Eliza reasons, it presumably will be
related to doing something only once, but there are a lot of such
possibilities.

Furthermore, even when implementors agree on the basic meaning, unless
and until it is precisely specified and standardized, implementations
will differ in corner cases.

foo.h
	#define FOO(x) _Pragma(x)
bar.h
	#define BAR() FOO("once")
hdr.h
	#include "bar.h"
	#include "foo.h"
	BAR()

Which file gets the include-once semantic?  Why or why not?  I could
make an argument for each of the three (some of the arguments will be
stronger than others...but which ones are which will vary by person).

> It's nonstandard, but using #pragma once is maybe a bit less
> error-prone -- don't have to pollute the namespace with
> have-I-been-included macros, and I've made mistakes with copying &
> pasting the per-file have-I-been-included macro into the wrong file.

I'm not sure.  I see arguments each way.  The biggest problems I see
with using it in NetBSD-provided include files:

(1) Developers may see it and think it's more portable than it is as
a result.  Developers are already way too ready to assume that
anything that works on their development machines is suitable for
release.

(2) Unless and until the functionality is standardized, it makes the
system gratuitously nonportable.  ("Portable between what I think are
the currently most popular two compilers" is awfully weak, even if
"what I think" is correct.)

/~\ The ASCII Mouse
\ / Ribbon Campaign
 X  Against HTML	mo...@rodents-montreal.org
/ \ Email!		7D C8 61 52 5D E7 2D 39 4E F1 31 3E E8 B3 27 4B
Re: MP-safe /dev/console and /dev/constty
>> Your suggestion of pushing it into a separate function (which
>> presumably would just mean using return instead of break to
>> terminate the code block) strikes me as worth considering in general
>> but a bad idea in this case; there are too many things that would
>> have to be passed down to the function in question.

> Of course, GCC offers nested functions for exactly this, but...

Yes.  I would not expect gcc-specific code to be accepted.

In this case, I see no benefit to using a nested function over one of
the constructs that supports a break-out to a well-defined point:
do{...}while(0), switch(0){case 0:...}, while(1){...;break;}, or the
like.  (I would say do-while(0) is the closest to a canonical version
of those.)  In some cases there may be a benefit, if you want to break
out of multiple nested constructs.  (In that case I'd actually use
labeled control structure, but that's even less well supported than
gccisms like nested functions.)

However, this is all armchair quarterbacking when we don't know what
mrg disliked about the code as given.  I still think all it really
needs is to be reformatted so the do and the while(0) don't visually
disappear into the containing if-elseif construct.

/~\ The ASCII Mouse
\ / Ribbon Campaign
 X  Against HTML	mo...@rodents-montreal.org
/ \ Email!		7D C8 61 52 5D E7 2D 39 4E F1 31 3E E8 B3 27 4B
Re: MP-safe /dev/console and /dev/constty
> i really like this except for the if () do { ... } while (0); else
> abuse portion.  please rework that part.  it looks easiest to push
> into a separate function, perhaps.

You don't say what you don't like about it.  There are only two things
I don't like about it, and one of them (indentation) is shared with
almost all of /usr/src.  (The other I'll get to below.)

Given the lack of information about what you don't like about it, I'm
going to guess that you don't like using an un-braced do-while as the
consequent of an if.  Or, perhaps, you don't like that use of do-while
at all?

Using do { ... } while (0); to provide a context in which break can be
used to skip the rest of a well-defined block of code is, IMO, far
preferable to using a goto, which latter seems to be the historically
usual approach to such things.

Your suggestion of pushing it into a separate function (which
presumably would just mean using return instead of break to terminate
the code block) strikes me as worth considering in general but a bad
idea in this case; there are too many things that would have to be
passed down to the function in question.  And the only benefit I see
is avoiding the do-while, which I have trouble seeing anything wrong
with, except the second of the two things I mentioned above.

Would you feel better if it were wrapped in switch (0) { case 0: ... }
instead?  Worse?  Why or why not?

I would prefer to see braces around the do-while, with a corresponding
indentation level, but that's the only change I would say needs making
there.  With the current formatting, the do and while(0) tend to
visually disappear into the if control structure, making the contained
breaks too easy to misread.

/~\ The ASCII Mouse
\ / Ribbon Campaign
 X  Against HTML	mo...@rodents-montreal.org
/ \ Email!		7D C8 61 52 5D E7 2D 39 4E F1 31 3E E8 B3 27 4B
Re: fallocate for FFS
>>> I will try to figure it out, since its not yet implemented the
>>> syscall is a good way to start dev in BSD kernel.

>> I'm not sure about that; many people have started looking at it and
>> not got anywhere.

> It is true that adding a system call is an easy entry point to learn
> about the kernel.  But here the syscall is the easy part, the real
> work is modifying FFS code to support it, and that is a steep
> learning curve.

Not just a steep learning curve.  Some of the fallocate operations
(mode zero in particular) would be fairly easy.  But, as someone who
knows somewhat of FFS, I would say some of the operations, in
particular anything that involves allocating space after EOF, would be
rather difficult to implement even for someone who's past the learning
curve.

(Some others, while easy, would be tedious and expensive;
FALLOC_FL_COLLAPSE_RANGE is an example.  While it would technically be
possible to make it fast and simple by taking advantage of the
granularity requirement leeway, such an implementation would be too
restrictive to be worth doing.)

/~\ The ASCII Mouse
\ / Ribbon Campaign
 X  Against HTML	mo...@rodents-montreal.org
/ \ Email!		7D C8 61 52 5D E7 2D 39 4E F1 31 3E E8 B3 27 4B
Re: Lock of NetBSD-current with ifconfig down / up
> I've ordered some PS/2 keyboards, because I take it that's the only
> way to reliably get in to the kernel debugger on amd64, unless
> someone knows a trick to make USB keyboards usable.

This is true - to the extent it _is_ true - only if you insist on
video console.  I find a break condition on serial console works well
too.  But I think that used to, and may still, depend on having a real
serial port, which I gather recent machines may not, even if they have
the connector for it.  (I've heard it said they tend to have a
USB-to-serial chip on an internal USB hub, though I have very limited
experience with machines that recent.)

As for USB keyboards, if you tell the BIOS to fake a real keyboard and
then arrange for the OS to ignore the keyboard's USB existence, you
may find yourself with a USB-hardware keyboard that looks like a PS/2
keyboard to the OS.  But this may require disabling all OS knowledge
of USB or something comparably drastic, which may or may not be an
option for your use case.

/~\ The ASCII Mouse
\ / Ribbon Campaign
 X  Against HTML	mo...@rodents-montreal.org
/ \ Email!		7D C8 61 52 5D E7 2D 39 4E F1 31 3E E8 B3 27 4B