Re: Data Independent Timing on arm64
> From: "Theo de Raadt"
> Date: Sat, 01 Oct 2022 09:37:01 -0600
>
> Mark Kettenis wrote:
>
> > Armv8.4 introduced a feature that provides data independent timing for
> > data processing instructions. This feature is obviously introduced to
> > mitigate timing side-channel attacks. Presumably enabling the feature
> > has some impact on performance as it would disable certain
> > optimizations to guarantee constant-time execution of instructions.
>
> But what impact does it have on all regular code? I cannot find that
> answer in the quoted email thread. And I don't even see the question
> being asked, because those people aren't talking about changing the
> cpu to this mode permanently & completely.
>
> > The only hardware that implements this feature that I've seen so far
> > is Apple's Icestorm/Firestorm (M1) and Avalanche/Blizzard (M2) cores.
> > I ran some benchmarks on an M2 MacBook Air. In particular, I ran
> > "eopenssl30 speed" bound to a "performance" core.
>
> That is testing the performance of a program which uses a very narrow
> set of cpu behaviours. For example, it has almost no crossings in & out
> of the kernel: system calls and page faults. The class of operations
> being tested are mostly pure compute against the register set rather
> than memory, and when it does perform memory loads, it does so in quite
> linear fashion. It also does relatively few memory writes.

I also tested kernel builds. I don't see any evidence of any
significant impact on those.

> It is a program that would be slower if they implemented the feature
> poorly, but using such a small subset of system behaviours, I don't
> think it can identify things which might be slower in part, and thus
> have an effect on whole system performance.

The ARM ARM is rather explicit about the instructions that might be
affected by those flags. That list makes me believe any performance
impact would show up most prominently in code that uses the ARMv8
crypto instructions.
> > I could not detect a significant slowdown with this feature enabled.
>
> Then why did they make it a chicken bit? Why did the cpu makers not
> simply make the cpus always act this way? There must be a reason,
> probably undisclosed. They have been conservative for some reasons.

I wouldn't call it a chicken bit. But obviously those in charge of the
architecture spec anticipated some significant speedup from having
instructions that have data-dependent timings. It appears that Apple's
implementation doesn't though.

What might be going on here is that Apple is just ticking boxes to
make their implementation spec compliant. The feature is required for
ARMv8.4 and above. But if none of their instructions have
data-dependent timing, they could implement the "chicken bit" without
it actually having an effect.

> Is there an impact on the performance of building a full snapshot? That
> at least has a richer use of all code flows, with lots of kernel crossings,
> as opposed to the openssl speed command.

I didn't test full snapshot builds. I think the kernel builds I did
should be representative enough. But I certainly can do more
benchmarking if you think that would be desirable.

> > Therefore I think we should enable this feature by default on OpenBSD.
>
> Maybe. But I am a bit unconvinced.
>
> > The use of this feature is still being discussed in the wider
> > community. See for example the following thread:
> >
> > https://lore.kernel.org/linux-arm-kernel/ywgcrqutxmx0w...@gmail.com/T/#mfcba14511c69461bd8921fef758baae120d090dc
>
> That discussion is talking about providing the ability for programs to
> request that mode of operation. They are not talking about switching
> into that mode for the kernel and all processes.
>
> That seems like a big difference.
>
> From a security perspective it might make some sense, but this isn't
> like some x86 catastrophe level speculation. I'm not sure there is
> enough evidence yet that this is required for all modes of operation.

In principle this should be only necessary for code that is sensitive
to timing side-channel attacks. But the way I see it, the ecosystem
will need (a lot of) time to figure out how to enable this mode around
the bits of code where it *does* matter.

> > On arm64, the feature can be controlled from userland, so even if we
> > turn it on by default, userland code can still make its own decisions
> > on whether it wants the feature enabled or disabled. We may have to
> > expose the ID_AA64PFR0_EL1 register to userland when code shows up
> > that attempts to do that.
>
> I suspect that is how this will go. Programs with specific libraries
> would then request the feature on behalf of their entire runtime.
> Something like a constructor (or startup function) in libcrypto would
> enable it. Meaning this program "might" care about timingsafe
> behaviour, so let's enable it, for the remainder of the life of that
> program.
Re: problem with interrupts on machines with many cores and multiqueue nics
> Date: Sat, 1 Oct 2022 20:43:31 +0200
> From: Hrvoje Popovski
>
> Hi all,
>
> I have 3 machines where I can reproduce the problem in the subject.
>
> Dell R6515, 16 cores, with 4 mcx (64 queues). One mcx has 16 queues.
> If I lower the number of cpus to 12, which lowers the number of queues
> per mcx to 12, everything is fine.
>
> Supermicro, 24 cores, with 2 bnxt, 2 mcx and 4 ix (112 queues). bnxt
> has 8 queues, mcx and ix 16 queues.
>
> IBM, 12 cores, with 2 ix and 4 em (56 queues). On this box I have
> jmatthew@'s multiqueue diff for em(4); em has 8 queues and ix 12
> queues.
>
> On all those machines, if I enable all interfaces the machines freeze
> or panic.
>
> In the attachment you can find vmstat -iz and dmesg from those
> machines.
>
> I would like to utilize all 16 cores for my new dell firewalls :)

At least on some of these machines, you're simply running out of
kernel malloc space. The machines "hang" because the M_WAITOK flag is
used for the allocations, and malloc(9) waits in the hope someone else
gives up the memory.

Maybe we need to allow for more malloc space. But to be sure this
isn't the drivers being completely stupid, vmstat -m output would be
helpful.

As for interrupts, the x86 architecture only has 256 interrupt vectors
(of which some are actually reserved for exception vectors). And the
way we prioritize interrupts means we (probably) can't use all of the
remaining vectors. We currently treat them as a global resource, but
probably need to treat them more as a per-CPU resource so that we can
re-use vectors for different purposes on different CPUs. Fixing that
isn't an easy task though.

> Dell
> When frozen I can't drop to ddb over ipmi console, only over idrac vga.
> Here a link to the ddb output
>
> https://kosjenka.srce.hr/~hrvoje/openbsd/dell_mcx1.jpg
> https://kosjenka.srce.hr/~hrvoje/openbsd/dell_mcx2.jpg
> https://kosjenka.srce.hr/~hrvoje/openbsd/dell_mcx3.jpg
>
>
> Supermicro
>
> if the last enabled interface is ix then i'm getting the log below
> smc24# ifconfig ix3 up
> ix3: Unable to create TX DMA map
> ix3: Could not setup transmit structures
>
> if the last enabled interface is mcx then the box freezes
> smc24# ifconfig mcx1 up
>
> ~B [send break]
> Stopped at db_enter+0x10: popq %rbp
> ddb{0}> trace
> db_enter() at db_enter+0x10
> comintr(80278000) at comintr+0x2de
> intr_handler(800026d1a270,80267980) at intr_handler+0x6e
> Xintr_ioapic_edge17_untramp() at Xintr_ioapic_edge17_untramp+0x18f
> acpicpu_idle() at acpicpu_idle+0x11f
> sched_idle(822cbff0) at sched_idle+0x280
> end trace frame: 0x0, count: -6
>
> ddb{0}> ps
>   PID    TID   PPID  UID  S  FLAGS  WAIT    COMMAND
> 33034 218257  24243    0  3    0x3  devbuf  ifconfig
>
> ddb{0}> trace /t 0t218257
> sleep_finish(800026e40880,1) at sleep_finish+0xfe
> msleep(82376190,822be2e0,2,81f42147,0) at msleep+0xc7
> malloc(1d8,2,9) at malloc+0x323
> _bus_dmamap_create(822d87b8,251c,d,251c,0,2002,f2c29ad4e18b508f)
>     at _bus_dmamap_create+0x53
> mcx_up(80323000) at mcx_up+0x55c
> mcx_ioctl(80323048,80206910,800026e40fb0) at mcx_ioctl+0x5f4
> ifioctl(fd904d7b0ab8,80206910,800026e40fb0,800026e2f510) at ifioctl+0x7bf
> soo_ioctl(fd8e5bd46440,80206910,800026e40fb0,800026e2f510)
>     at soo_ioctl+0x171
> sys_ioctl(800026e2f510,800026e410c0,800026e41120) at sys_ioctl+0x2c4
> syscall(800026e41190) at syscall+0x384
> Xsyscall() at Xsyscall+0x128
> end of kernel
> end trace frame: 0x7f7de0b0, count: -11
>
> an interesting log in dmesg is:
>
> xhci3 at pci23 dev 0 function 3 vendor "AMD", unknown product 0x148c rev
> 0x00 failed to allocate interrupt slot for PIC msi pin -2143091968
> : couldn't establish interrupt at msi
> ahci2 at pci24 dev 0 function 0 "AMD FCH AHCI" rev 0x51: msi, failed to
> allocate interrupt slot for PIC msi pin -2143027200
> ahci2: unable to map interrupt
> ahci3 at pci25 dev 0 function 0 "AMD FCH AHCI" rev 0x51: msi, failed to
> allocate interrupt slot for PIC msi pin -2142961664
> ahci3: unable to map interrupt
>
>
> IBM
>
> if the last enabled interface is ix then the box panics
> x3550m4# ifconfig ix0 up
> ix0: Unable to create Pack DMA map
> uvm_fault(0xfd887f12cbb0, 0xc, 0, 1) -> e
> kernel: page fault trap, code=0
> Stopped at _bus_dmamap_destroy+0x9: movl 0xc(%rsi),%eax
>     TID    PID  UID  PRFLAGS  PFLAGS  CPU  COMMAND
> *474869  73471    0      0x3       0   1K  ifconfig
> _bus_dmamap_destroy(822d5a90,0) at _bus_dmamap_destroy+0x9
> ixgbe_free_receive_buffers(800e9910) at ixgbe_free_receive_buffers+0xb2
> ixgbe_init(800e7000) at ixgbe_init+0x778
> ixgbe_ioctl(800e7048,80206910,800021c356e0) at ixgbe_ioctl+0x327
> ifioctl(fd887c9d3008,80206910,800021c356e0,800021c50a88) at ifioctl+0x7bf
> soo_ioctl(fd87851d0c40,80206910,800021c356e0,800021c50a88) at
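The vector pressure described in the reply above is easy to put in
numbers. A rough illustration only (this is not how the kernel
actually assigns vectors; the queue counts come from the report, and
the 256-entry IDT with 32 reserved exception vectors is an x86
architectural fact):

```python
# Back-of-the-envelope check of x86 interrupt-vector pressure on the
# three machines reported in this thread.
IDT_VECTORS = 256
EXCEPTION_VECTORS = 32   # vectors 0-31 are reserved for exceptions

machines = {
    "Dell R6515":  4 * 16,                   # 4 mcx @ 16 queues = 64
    "Supermicro":  2 * 8 + 2 * 16 + 4 * 16,  # 2 bnxt, 2 mcx, 4 ix = 112
    "IBM x3550m4": 2 * 12 + 4 * 8,           # 2 ix @ 12, 4 em @ 8 = 56
}

usable = IDT_VECTORS - EXCEPTION_VECTORS
for name, queues in machines.items():
    # Each queue wants its own MSI-X vector; per-device misc/link
    # vectors, legacy devices and the priority scheme eat further
    # into the "usable" budget.
    print(f"{name}: {queues} queue vectors out of {usable} usable")
```

With the vectors treated as a global resource, the Supermicro's 112
queue vectors alone consume half of the usable range before any other
device gets a slot, which is consistent with the "failed to allocate
interrupt slot" messages above.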
Re: wc(1): add -L flag to write length of longest line
Theo de Raadt (dera...@openbsd.org) on 2022.09.30 11:11:42 -0600:
> I'm sure there are other people who have other desirable features which
> I haven't listed. For instance, could wc.c be the scaffold to use for
> the long-desired web browser to be included in OpenBSD?

Oh, it's clearly incomplete until it can send its result by mail.
Re: HiFive Unmatched clean poweroff using the power button
On Thu, Aug 18 2022, Jeremie Courreges-Anglas wrote:
> Some time ago I wanted to get a clean poweroff from the power button on
> my Unmatched, so that I don't get fsck at reboot the morning after
> someone sleeps in the room where the machine lives. kettenis kindly
> provided sfgpio(4) to get interrupts working on dapmic(4) instead of my
> initial hack that used polling.
>
> One issue I struggled with for a bit is that masking irqs also masks the
> wake-up events, particularly the events we use for dapmic_reset().
> With the diff below most interrupt events are masked off during runtime,
> until we're shutting down/rebooting. Maybe this is too paranoid and
> I should let more events go through the intr handler, eg for wake-up
> events that I don't envision yet? (I would love to get wake on lan
> support in cad(4) but my attempts led nowhere so far.)
>
> Also interesting, the fault register needs to be cleared at boot, else
> the interrupt will keep triggering after eg a hard button-driven poweroff.
> We could log the faults found in FAULT_LOG at boot time to know why the
> machine has stopped. But I'm more concerned about what to do if we get
> a fault at runtime (see the XXX). When I tried to disestablish the
> interrupt handler with fdt_intr_disestablish(sc->sc_ih) from
> dapmic_reset(), I got a fault. Maybe something worth investigating.

The FAULT_LOG register handling seems appropriate after reading entry
6.2 in:

https://www.renesas.com/eu/en/document/apn/shared-irq-line-considerations-pm-059

I wish I had read this document earlier. :)

> The code below is based off da9063_datasheet_2v2.pdf. I don't know of
> other machines we run on that use this controller, the only entry in
> dmesglog is matthieu's Unmatched machine.
>
> Tests, input and oks welcome.

In the case where my hifive unmatched lives, I have wired both the
shutdown and the reset buttons.
Yesterday I tried to see whether the reset button could be handled as
a clean reboot, but looking at the hifive unmatched board schematics
it seems that the PMIC has no control over it (even if tweaking
nRES_MODE).

So the diff stands for scrutiny. I would like input and maybe oks to
get this in. ;)


Index: dev/fdt/dapmic.c
===================================================================
RCS file: /cvs/src/sys/dev/fdt/dapmic.c,v
retrieving revision 1.2
diff -u -p -r1.2 dapmic.c
--- dev/fdt/dapmic.c	6 Apr 2022 18:59:28 -0000	1.2
+++ dev/fdt/dapmic.c	17 Aug 2022 21:59:57 -0000
@@ -19,6 +19,9 @@
 #include
 #include
 #include
+#include
+#include
+#include
 
 #include
 #include
@@ -28,11 +31,31 @@
 #include
 
+#include
+
 extern void (*cpuresetfn)(void);
 extern void (*powerdownfn)(void);
 
 /* Registers */
+#define FAULT_LOG		0x05
 #define EVENT_A			0x06
+#define EVENT_A_EVENTS_D	(1 << 7)
+#define EVENT_A_EVENTS_C	(1 << 6)
+#define EVENT_A_EVENTS_B	(1 << 5)
+#define EVENT_A_E_nONKEY	(1 << 0)
+#define EVENT_B			0x07
+#define EVENT_C			0x08
+#define EVENT_D			0x09
+#define IRQ_MASK_A		0x0a
+#define IRQ_MASK_A_M_RESERVED	((1 << 7) | (1 << 6) | (1 << 5))
+#define IRQ_MASK_A_M_SEQ_RDY	(1 << 4)
+#define IRQ_MASK_A_M_ADC_RDY	(1 << 3)
+#define IRQ_MASK_A_M_TICK	(1 << 2)
+#define IRQ_MASK_A_M_ALARM	(1 << 1)
+#define IRQ_MASK_A_M_nONKEY	(1 << 0)
+#define IRQ_MASK_B		0x0b
+#define IRQ_MASK_C		0x0c
+#define IRQ_MASK_D		0x0d
 #define CONTROL_F		0x13
 #define CONTROL_F_WAKE_UP	(1 << 2)
 #define CONTROL_F_SHUTDOWN	(1 << 1)
@@ -55,11 +78,20 @@ extern void (*powerdownfn)(void);
 #define ALARM_Y			0x4b
 #define ALARM_Y_TICK_ON		(1 << 7)
 
+#ifdef DAPMIC_DEBUG
+# define DPRINTF(args)	do { printf args; } while (0)
+#else
+# define DPRINTF(args)	do {} while (0)
+#endif
+
 struct dapmic_softc {
 	struct device sc_dev;
 	i2c_tag_t sc_tag;
 	i2c_addr_t sc_addr;
 
+	int (*sc_ih)(void *);
+	struct task sc_task;
+
 	struct todr_chip_handle sc_todr;
 };
 
@@ -80,8 +112,11 @@ int dapmic_clock_read(struct dapmic_soft
 int	dapmic_clock_write(struct dapmic_softc *, struct clock_ymdhms *);
 int	dapmic_gettime(struct todr_chip_handle *, struct timeval *);
 int	dapmic_settime(struct todr_chip_handle *, struct timeval *);
+void	dapmic_reset_irq_mask(struct dapmic_softc *);
 void	dapmic_reset(void);
 void	dapmic_powerdown(void);
+int	dapmic_intr(void *);
+void	dapmic_shutdown_task(void *);
 
 int
 dapmic_match(struct device *parent, void *match, void *aux)
@@ -96,6 +131,7 @@ dapmic_attach(struct device *parent, str
 {
 	struct dapmic_softc *sc = (struct dapmic_softc *)self;
 	struct i2c_attach_args *ia = aux;
+	int node = *(int *)ia->ia_cookie;
Re: Data Independent Timing on arm64
Mark Kettenis wrote:
> Armv8.4 introduced a feature that provides data independent timing for
> data processing instructions. This feature is obviously introduced to
> mitigate timing side-channel attacks. Presumably enabling the feature
> has some impact on performance as it would disable certain
> optimizations to guarantee constant-time execution of instructions.

But what impact does it have on all regular code? I cannot find that
answer in the quoted email thread. And I don't even see the question
being asked, because those people aren't talking about changing the
cpu to this mode permanently & completely.

> The only hardware that implements this feature that I've seen so far
> is Apple's Icestorm/Firestorm (M1) and Avalanche/Blizzard (M2) cores.
> I ran some benchmarks on an M2 MacBook Air. In particular, I ran
> "eopenssl30 speed" bound to a "performance" core.

That is testing the performance of a program which uses a very narrow
set of cpu behaviours. For example, it has almost no crossings in & out
of the kernel: system calls and page faults. The class of operations
being tested are mostly pure compute against the register set rather
than memory, and when it does perform memory loads, it does so in quite
linear fashion. It also does relatively few memory writes.

It is a program that would be slower if they implemented the feature
poorly, but using such a small subset of system behaviours, I don't
think it can identify things which might be slower in part, and thus
have an effect on whole system performance.

> I could not detect a significant slowdown with this feature enabled.

Then why did they make it a chicken bit? Why did the cpu makers not
simply make the cpus always act this way? There must be a reason,
probably undisclosed. They have been conservative for some reasons.

Is there an impact on the performance of building a full snapshot? That
at least has a richer use of all code flows, with lots of kernel
crossings, as opposed to the openssl speed command.

> Therefore I think we should enable this feature by default on OpenBSD.

Maybe. But I am a bit unconvinced.

> The use of this feature is still being discussed in the wider
> community. See for example the following thread:
>
> https://lore.kernel.org/linux-arm-kernel/ywgcrqutxmx0w...@gmail.com/T/#mfcba14511c69461bd8921fef758baae120d090dc

That discussion is talking about providing the ability for programs to
request that mode of operation. They are not talking about switching
into that mode for the kernel and all processes.

That seems like a big difference.

From a security perspective it might make some sense, but this isn't
like some x86 catastrophe level speculation. I'm not sure there is
enough evidence yet that this is required for all modes of operation.

> On arm64, the feature can be controlled from userland, so even if we
> turn it on by default, userland code can still make its own decisions
> on whether it wants the feature enabled or disabled. We may have to
> expose the ID_AA64PFR0_EL1 register to userland when code shows up
> that attempts to do that.

I suspect that is how this will go. Programs with specific libraries
would then request the feature on behalf of their entire runtime.
Something like a constructor (or startup function) in libcrypto would
enable it. Meaning this program "might" care about timingsafe
behaviour, so let's enable it, for the remainder of the life of that
program.
Data Independent Timing on arm64
Armv8.4 introduced a feature that provides data independent timing for
data processing instructions. This feature is obviously introduced to
mitigate timing side-channel attacks. Presumably enabling the feature
has some impact on performance as it would disable certain
optimizations to guarantee constant-time execution of instructions.

The only hardware that implements this feature that I've seen so far
is Apple's Icestorm/Firestorm (M1) and Avalanche/Blizzard (M2) cores.
I ran some benchmarks on an M2 MacBook Air. In particular, I ran
"eopenssl30 speed" bound to a "performance" core. I could not detect a
significant slowdown with this feature enabled.

Therefore I think we should enable this feature by default on OpenBSD.
If at some point a processor core appears where this feature does have
a significant performance impact, we can reconsider.

The use of this feature is still being discussed in the wider
community. See for example the following thread:

https://lore.kernel.org/linux-arm-kernel/ywgcrqutxmx0w...@gmail.com/T/#mfcba14511c69461bd8921fef758baae120d090dc

On arm64, the feature can be controlled from userland, so even if we
turn it on by default, userland code can still make its own decisions
on whether it wants the feature enabled or disabled. We may have to
expose the ID_AA64PFR0_EL1 register to userland when code shows up
that attempts to do that.

The diff below enables the feature for both the kernel and for
userland processes.

ok?


Index: arch/arm64/arm64/cpu.c
===================================================================
RCS file: /cvs/src/sys/arch/arm64/arm64/cpu.c,v
retrieving revision 1.70
diff -u -p -r1.70 cpu.c
--- arch/arm64/arm64/cpu.c	15 Sep 2022 01:57:52 -0000	1.70
+++ arch/arm64/arm64/cpu.c	1 Oct 2022 10:05:15 -0000
@@ -756,6 +756,7 @@ void
 cpu_init(void)
 {
 	uint64_t id_aa64mmfr1, sctlr;
+	uint64_t id_aa64pfr0;
 	uint64_t tcr;
 
 	WRITE_SPECIALREG(ttbr0_el1, pmap_kernel()->pm_pt0pa);
@@ -774,6 +775,11 @@ cpu_init(void)
 		sctlr &= ~SCTLR_SPAN;
 		WRITE_SPECIALREG(sctlr_el1, sctlr);
 	}
+
+	/* Enable DIT. */
+	id_aa64pfr0 = READ_SPECIALREG(id_aa64pfr0_el1);
+	if (ID_AA64PFR0_DIT(id_aa64pfr0) >= ID_AA64PFR0_DIT_IMPL)
+		__asm volatile (".arch armv8.4-a; msr dit, #1");
 
 	/* Initialize debug registers. */
 	WRITE_SPECIALREG(mdscr_el1, DBG_MDSCR_TDCC);
Index: arch/arm64/arm64/machdep.c
===================================================================
RCS file: /cvs/src/sys/arch/arm64/arm64/machdep.c,v
retrieving revision 1.72
diff -u -p -r1.72 machdep.c
--- arch/arm64/arm64/machdep.c	5 Sep 2022 19:18:56 -0000	1.72
+++ arch/arm64/arm64/machdep.c	1 Oct 2022 10:05:15 -0000
@@ -433,7 +433,7 @@ setregs(struct proc *p, struct exec_pack
 	tf->tf_sp = stack;
 	tf->tf_lr = pack->ep_entry;
 	tf->tf_elr = pack->ep_entry; /* ??? */
-	tf->tf_spsr = PSR_M_EL0t;
+	tf->tf_spsr = PSR_M_EL0t | PSR_DIT;
 	retval[1] = 0;
 }
Index: arch/arm64/include/armreg.h
===================================================================
RCS file: /cvs/src/sys/arch/arm64/include/armreg.h,v
retrieving revision 1.21
diff -u -p -r1.21 armreg.h
--- arch/arm64/include/armreg.h	29 Aug 2022 02:01:18 -0000	1.21
+++ arch/arm64/include/armreg.h	1 Oct 2022 10:05:15 -0000
@@ -602,9 +602,13 @@
 #define	PSR_I		0x00000080
 #define	PSR_A		0x00000100
 #define	PSR_D		0x00000200
+#define	PSR_SSBS	0x00001000
 #define	PSR_IL		0x00100000
 #define	PSR_SS		0x00200000
 #define	PSR_PAN		0x00400000
+#define	PSR_UAO		0x00800000
+#define	PSR_DIT		0x01000000
+#define	PSR_TCO		0x02000000
 #define	PSR_V		0x10000000
 #define	PSR_C		0x20000000
 #define	PSR_Z		0x40000000
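For reference, the feature test in the cpu.c hunk decodes the DIT
field of ID_AA64PFR0_EL1, which the ARM ARM places in bits [51:48]; a
value of 1 or greater means FEAT_DIT is implemented. A small sketch of
that decoding (the example register value is hypothetical):

```python
# Decode the DIT field of ID_AA64PFR0_EL1, mirroring what the
# ID_AA64PFR0_DIT() macro in armreg.h does: bits [51:48], where a
# nonzero value means data independent timing is implemented.
def aa64pfr0_dit(reg):
    return (reg >> 48) & 0xf

# Hypothetical register value with DIT = 1 and all other fields zero.
example = 0x1 << 48
assert aa64pfr0_dit(example) == 1   # DIT implemented
assert aa64pfr0_dit(0) == 0         # DIT not implemented
```

Exposing the register value (or just this field) to userland, as the
message suggests, would let libraries perform the same check before
issuing `msr dit, #1` themselves.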
Re: wc(1): add -L flag to write length of longest line
On Fri, Sep 30, 2022 at 02:22:34AM +0200, Joerg Sonnenberger wrote:
> On Thu, Sep 29, 2022 at 08:39:16PM +1000, Jonathan Gray wrote:
> > wc counts items in files. Finding the longest item indeed sounds
> > like a task better suited to awk.
>
> Finding outliers, means and counting are all parts of the same basic
> class of operations. A good implementation of all of them requires
> constant space and a fixed time per input character. An implementation
> in awk will generally not have that property.

Why? Is awk really that bad at managing memory?

Anyway, we have perl in base, and it will definitely fit the bill.
Heck, I can do it in shell with constant space and a fixed time per
input character.

Real world computing 101: complexity is not all that matters; the
constant in O(1) is often what will actually kill you.

Real world computing 102: any utility left unfettered for long enough
will grow a lisp interpreter and go into a mud-fight with emacs.
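For what it's worth, the awk version under discussion is a one-liner
(space usage is bounded by the longest input line, since awk buffers a
full record at a time, which is part of Joerg's point):

```shell
# Length of the longest line, i.e. what the proposed `wc -L` would print.
printf 'ab\nabcd\nabc\n' |
    awk '{ if (length > max) max = length } END { print max }'
# prints: 4
```

In POSIX awk, bare `length` is the length of the current record and an
unset `max` compares as 0, so no initialization is needed.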