Re: Data Independent Timing on arm64

2022-10-01 Thread Mark Kettenis
> From: "Theo de Raadt" 
> Date: Sat, 01 Oct 2022 09:37:01 -0600
> 
> Mark Kettenis  wrote:
> 
> > Armv8.4 introduced a feature that provides data independent timing for
> > data processing instructions.  This feature is obviously introduced to
> > mitigate timing side-channel attacks.  Presumably enabling the feature
> > has some impact on performance as it would disable certain
> > optimizations to guarantee constant-time execution of instructions.
> 
> But what impact does it have on all regular code?  I cannot find that
> answer in the quoted email thread.  And I don't even see the question
> being asked, because those people aren't talking about changing the
> cpu to this mode permanently & completely.
> 
> > The only hardware that implements this feature that I've seen so far
> > is Apple's Icestorm/Firestorm (M1) and Avalanche/Blizzard (M2) cores.
> > I ran some benchmarks on an M2 Macbook Air.  In particular, I ran
> > "eopenssl30 speed" bound to a "performance" core.
> 
> That is testing the performance of a program which uses a very narrow
> set of cpu behaviours.  For example, it has almost no crossings in & out
> of the kernel: system calls and page faults.  The class of operations
> being tested are mostly pure compute against the register set rather
> than memory, and when it does perform memory loads, it does so in quite
> linear fashion.  It also does relatively few memory writes.

I also tested kernel builds.  I don't see any evidence of any
significant impact on those.  

> It is a program that would be slower if they implemented the feature
> poorly, but using such a small subset of system behaviours, I don't
> think it can identify things which might be slower in part, and thus
> have an effect on whole system performance.

The ARM ARM is rather explicit on the instructions that might be
affected by those flags.  That list makes me believe any performance
impact would show up most prominently in code that uses the ARMv8
crypto instructions.

> > I could not detect a significant slowdown with this feature enabled.
> 
> Then why did they make it a chicken bit?  Why did the cpu makers not
> simply make the cpus always act this way?  There must be a reason,
> probably undisclosed.  They have been conservative for some reason.

I wouldn't call it a chicken bit.  But obviously those in charge of
the architecture spec anticipated some significant speedup from having
instructions that have data-dependent timings.  It appears that
Apple's implementation doesn't, though.

What might be going on here is that Apple is just ticking boxes to
make their implementation spec compliant.  The feature is required for
ARMv8.4 and above.  But if none of their instructions have
data-dependent timing, they could implement the "chicken bit" without
it actually having an effect.

> Is there an impact on the performance of building a full snapshot?  That
> at least has a richer use of all code flows, with lots of kernel crossings,
> as opposed to the openssl speed command.

I didn't test full snapshot builds.  I think the kernel builds I did
should be representative enough.  But I certainly can do more
benchmarking if you think that would be desirable.

> > Therefore I think we should enable this feature by default on OpenBSD.
> 
> Maybe.  But I am a bit unconvinced.
> 
> > The use of this feature is still being discussed in the wider
> > community.  See for example the following thread:
> > 
> >   
> > https://lore.kernel.org/linux-arm-kernel/ywgcrqutxmx0w...@gmail.com/T/#mfcba14511c69461bd8921fef758baae120d090dc
> > 
> 
> That discussion is talking about providing the ability for programs to
> request that mode of operation.  They are not talking about switching
> into that mode for the kernel and all processes.
> 
> That seems like a big difference.
> 
> From a security perspective it might make some sense, but this isn't
> like some x86 catastrophe-level speculation.  I'm not sure there is
> enough evidence yet that this is required for all modes of operation.

In principle this should be only necessary for code that is sensitive
to timing side-channel attacks.  But the way I see it, the ecosystem
will need (a lot of) time to figure out how to enable this mode around
the bits of code where it *does* matter.

> > On arm64, the feature can be controlled from userland, so even if we
> > turn it on by default, userland code can still make its own decisions
> > on whether it wants the feature enabled or disabled.  We may have to
> > expose the ID_AA64PFR0_EL1 register to userland when code shows up
> > that attempts to do that.
> 
> I suspect that is how this will go.  Programs with specific libraries would
> then request the feature on behalf of their entire runtime.  Something
> like a constructor (or startup function) in libcrypto would enable it.
> Meaning this program "might" care about timingsafe behaviour, so let's
> enable it, for the remainder of the life of that program.



Re: problem with interrupts on machines with many cores and multiqueue nics

2022-10-01 Thread Mark Kettenis
> Date: Sat, 1 Oct 2022 20:43:31 +0200
> From: Hrvoje Popovski 
> 
> Hi all,
> 
> I have 3 machines where I can reproduce the problem in the subject.
> Dell R6515 16 core with 4 mcx (64 queues). One mcx has 16 queues. If I
> lower the number of cpus to 12, which lowers the number of queues per
> mcx to 12, everything is fine.
> Supermicro 24 core with 2 bnxt, 2 mcx and 4 ix (112 queues). bnxt has 8
> queues, mcx and ix 16 queues.
> IBM 12 core with 2 ix and 4 em (56 queues). On this box I have jmatthew@'s
> multiqueue diff for em(4), and em has 8 queues and ix 12 queues.
> 
> On all those machines if I enable all interfaces machines freeze or panic.
> 
> In attachment you can find vmstat -iz and dmesg from those machines.
> 
> I would like to utilize all 16 cores for my new dell firewalls :)

At least on some of these machines, you're simply running out of
kernel malloc space.  The machines "hang" because the M_WAITOK flag is
used for the allocations, and malloc(9) waits in the hope someone else
gives up the memory.  Maybe we need to allow for more malloc space.
But to be sure this isn't the drivers being completely stupid, vmstat
-m output would be helpful.

As for interrupts, the x86 architecture only has 256 interrupt vectors
(of which some are actually reserved for exception vectors).  And the
way we prioritize interrupts means we (probably) can't use all of the
remaining vectors.  We currently treat them as a global resource, but
probably need to treat them more as a per-CPU resource such that we
can re-use vectors for different purposes on different CPUs.  Fixing
that isn't an easy task though.


> Dell
> When frozen I can't drop into ddb over the ipmi console, only over the
> idrac vga.  Here is a link to the ddb output:
> 
> https://kosjenka.srce.hr/~hrvoje/openbsd/dell_mcx1.jpg
> https://kosjenka.srce.hr/~hrvoje/openbsd/dell_mcx2.jpg
> https://kosjenka.srce.hr/~hrvoje/openbsd/dell_mcx3.jpg
> 
> 
> 
> 
> Supermicro
> 
> If the last enabled interface is ix, I get the log below:
> smc24# ifconfig ix3 up
> ix3: Unable to create TX DMA map
> ix3: Could not setup transmit structures
> 
> 
> If the last enabled interface is mcx, the box freezes:
> smc24# ifconfig mcx1 up
> 
> ~B [send break]
>Stopped at  db_enter+0x10:  popq%rbp
> ddb{0}> trace
> db_enter() at db_enter+0x10
> comintr(80278000) at comintr+0x2de
> intr_handler(800026d1a270,80267980) at intr_handler+0x6e
> Xintr_ioapic_edge17_untramp() at Xintr_ioapic_edge17_untramp+0x18f
> acpicpu_idle() at acpicpu_idle+0x11f
> sched_idle(822cbff0) at sched_idle+0x280
> end trace frame: 0x0, count: -6
> 
> ddb{0}> ps
>PID TID   PPIDUID  S   FLAGS  WAIT  COMMAND
>  33034  218257  24243  0  3 0x3  devbufifconfig
> 
> ddb{0}> trace /t 0t218257
> sleep_finish(800026e40880,1) at sleep_finish+0xfe
> msleep(82376190,822be2e0,2,81f42147,0) at
> msleep+0xc7
> malloc(1d8,2,9) at malloc+0x323
> _bus_dmamap_create(822d87b8,251c,d,251c,0,2002,f2c29ad4e18b508f)
> at _bus_dmamap_create+0x53
> mcx_up(80323000) at mcx_up+0x55c
> mcx_ioctl(80323048,80206910,800026e40fb0) at mcx_ioctl+0x5f4
> ifioctl(fd904d7b0ab8,80206910,800026e40fb0,800026e2f510) at
> ifioctl+0x7bf
> soo_ioctl(fd8e5bd46440,80206910,800026e40fb0,800026e2f510)
> at soo_ioctl+0x171
> sys_ioctl(800026e2f510,800026e410c0,800026e41120) at
> sys_ioctl+0x2c4
> syscall(800026e41190) at syscall+0x384
> Xsyscall() at Xsyscall+0x128
> end of kernel
> end trace frame: 0x7f7de0b0, count: -11
> 
> 
> The interesting log in dmesg is:
> 
> xhci3 at pci23 dev 0 function 3 vendor "AMD", unknown product 0x148c rev
> 0x00failed to allocate interrupt slot for PIC msi pin -2143091968
> : couldn't establish interrupt at msi
> ahci2 at pci24 dev 0 function 0 "AMD FCH AHCI" rev 0x51: msi,failed to
> allocate interrupt slot for PIC msi pin -2143027200
> ahci2: unable to map interrupt
> ahci3 at pci25 dev 0 function 0 "AMD FCH AHCI" rev 0x51: msi,failed to
> allocate interrupt slot for PIC msi pin -2142961664
> ahci3: unable to map interrupt
> 
> 
> 
> 
> IBM
> 
> If the last enabled interface is ix, the box panics:
> x3550m4# ifconfig ix0 up
> ix0: Unable to create Pack DMA map
> uvm_fault(0xfd887f12cbb0, 0xc, 0, 1) -> e
> kernel: page fault trap, code=0
> Stopped at  _bus_dmamap_destroy+0x9:movl0xc(%rsi),%eax
> TIDPIDUID PRFLAGS PFLAGS  CPU  COMMAND
> *474869  73471  0 0x3  01K ifconfig
> _bus_dmamap_destroy(822d5a90,0) at _bus_dmamap_destroy+0x9
> ixgbe_free_receive_buffers(800e9910) at
> ixgbe_free_receive_buffers+0xb2
> ixgbe_init(800e7000) at ixgbe_init+0x778
> ixgbe_ioctl(800e7048,80206910,800021c356e0) at ixgbe_ioctl+0x327
> ifioctl(fd887c9d3008,80206910,800021c356e0,800021c50a88) at
> ifioctl+0x7bf
> soo_ioctl(fd87851d0c40,80206910,800021c356e0,800021c50a88)
> at 

Re: wc(1): add -L flag to write length of longest line

2022-10-01 Thread Sebastian Benoit
Theo de Raadt(dera...@openbsd.org) on 2022.09.30 11:11:42 -0600:
> I'm sure there are other people who have other desirable features which I
> haven't listed. For instance, could wc.c be the scaffold to use for the
> long-desired web browser to be included in OpenBSD?

Oh, it's clearly incomplete until it can send its result by mail.



Re: HiFive Unmatched clean poweroff using the power button

2022-10-01 Thread Jeremie Courreges-Anglas
On Thu, Aug 18 2022, Jeremie Courreges-Anglas  wrote:
> Some time ago I wanted to get a clean poweroff from the power button on
> my Unmatched, so that I don't get fsck at reboot the morning after
> someone sleeps in the room where the machine lives.  kettenis kindly
> provided sfgpio(4) to get interrupts working on dapmic(4) instead of my
> initial hack that used polling.
>
> One issue I struggled with for a bit is that masking irqs also masks the
> wake-up events, particularly the events we use for dapmic_reset().
> With the diff below most interrupt events are masked off during runtime,
> until we're shutting down/rebooting.  Maybe this is too paranoid and
> I should let more events go through the intr handler, eg for wake-up
> events that I don't envision yet?  (I would love to get wake on lan
> support in cad(4) but my attempts led nowhere so far.)
>
> Also interesting, the fault register needs to be cleared at boot, else
> the interrupt will keep triggering after eg a hard button-driven poweroff.
> We could log the faults found in FAULT_LOG at boot time to know why the
> machine has stopped.  But I'm more concerned about what to do if we get
> a fault at runtime (see the XXX).  When I tried to disestablish the
> interrupt handler with fdt_intr_disestablish(sc->sc_ih) from
> dapmic_reset(), I got a fault. Maybe something worth investigating.

The FAULT_LOG register handling seems appropriate after reading entry
6.2 in:

  
https://www.renesas.com/eu/en/document/apn/shared-irq-line-considerations-pm-059

I wish I had read this document earlier. :)

> The code below is based off da9063_datasheet_2v2.pdf.  I don't know of
> other machines we run on that use this controller, the only entry in
> dmesglog is matthieu's Unmatched machine.
>
> Tests, input and oks welcome.

In the case where my hifive unmatched lives, I have wired both the
shutdown and the reset buttons. Yesterday I tried to see whether the
reset button could be handled as a clean reboot, but looking at the
hifive unmatched board schematics it seems that the PMIC has no control
over it (even if tweaking nRES_MODE).  So the diff stands for scrutiny.
I would like input and maybe oks to get this in. ;)

Index: dev/fdt/dapmic.c
===
RCS file: /cvs/src/sys/dev/fdt/dapmic.c,v
retrieving revision 1.2
diff -u -p -r1.2 dapmic.c
--- dev/fdt/dapmic.c6 Apr 2022 18:59:28 -   1.2
+++ dev/fdt/dapmic.c17 Aug 2022 21:59:57 -
@@ -19,6 +19,9 @@
 #include 
 #include 
 #include 
+#include 
+#include 
+#include 
 
 #include 
 #include 
@@ -28,11 +31,31 @@
 
 #include 
 
+#include 
+
 extern void (*cpuresetfn)(void);
 extern void (*powerdownfn)(void);
 
 /* Registers */
+#define FAULT_LOG  0x05
 #define EVENT_A0x06
+#define  EVENT_A_EVENTS_D  (1 << 7)
+#define  EVENT_A_EVENTS_C  (1 << 6)
+#define  EVENT_A_EVENTS_B  (1 << 5)
+#define  EVENT_A_E_nONKEY  (1 << 0)
+#define EVENT_B0x07
+#define EVENT_C0x08
+#define EVENT_D0x09
+#define IRQ_MASK_A 0x0a
+#define  IRQ_MASK_A_M_RESERVED ((1 << 7) | (1 << 6) | (1 << 5))
+#define  IRQ_MASK_A_M_SEQ_RDY  (1 << 4)
+#define  IRQ_MASK_A_M_ADC_RDY  (1 << 3)
+#define  IRQ_MASK_A_M_TICK (1 << 2)
+#define  IRQ_MASK_A_M_ALARM(1 << 1)
+#define  IRQ_MASK_A_M_nONKEY   (1 << 0)
+#define IRQ_MASK_B 0x0b
+#define IRQ_MASK_C 0x0c
+#define IRQ_MASK_D 0x0d
 #define CONTROL_F  0x13
 #define  CONTROL_F_WAKE_UP (1 << 2)
 #define  CONTROL_F_SHUTDOWN(1 << 1)
@@ -55,11 +78,20 @@ extern void (*powerdownfn)(void);
 #define ALARM_Y0x4b
 #define  ALARM_Y_TICK_ON   (1 << 7)
 
+#ifdef DAPMIC_DEBUG
+# define DPRINTF(args) do { printf args; } while (0)
+#else
+# define DPRINTF(args) do {} while (0)
+#endif
+
 struct dapmic_softc {
struct device sc_dev;
i2c_tag_t sc_tag;
i2c_addr_t sc_addr;
 
+   int (*sc_ih)(void *);
+   struct task sc_task;
+
struct todr_chip_handle sc_todr;
 };
 
@@ -80,8 +112,11 @@ int dapmic_clock_read(struct dapmic_soft
 intdapmic_clock_write(struct dapmic_softc *, struct clock_ymdhms *);
 intdapmic_gettime(struct todr_chip_handle *, struct timeval *);
 intdapmic_settime(struct todr_chip_handle *, struct timeval *);
+void   dapmic_reset_irq_mask(struct dapmic_softc *);
 void   dapmic_reset(void);
 void   dapmic_powerdown(void);
+intdapmic_intr(void *);
+void   dapmic_shutdown_task(void *);
 
 int
 dapmic_match(struct device *parent, void *match, void *aux)
@@ -96,6 +131,7 @@ dapmic_attach(struct device *parent, str
 {
struct dapmic_softc *sc = (struct dapmic_softc *)self;
struct i2c_attach_args *ia = aux;
+   int node = *(int *)ia->ia_cookie;
 

Re: Data Independent Timing on arm64

2022-10-01 Thread Theo de Raadt
Mark Kettenis  wrote:

> Armv8.4 introduced a feature that provides data independent timing for
> data processing instructions.  This feature is obviously introduced to
> mitigate timing side-channel attacks.  Presumably enabling the feature
> has some impact on performance as it would disable certain
> optimizations to guarantee constant-time execution of instructions.

But what impact does it have on all regular code?  I cannot find that
answer in the quoted email thread.  And I don't even see the question
being asked, because those people aren't talking about changing the
cpu to this mode permanently & completely.

> The only hardware that implements this feature that I've seen so far
> is Apple's Icestorm/Firestorm (M1) and Avalanche/Blizzard (M2) cores.
> I ran some benchmarks on an M2 Macbook Air.  In particular, I ran
> "eopenssl30 speed" bound to a "performance" core.

That is testing the performance of a program which uses a very narrow
set of cpu behaviours.  For example, it has almost no crossings in & out
of the kernel: system calls and page faults.  The class of operations
being tested are mostly pure compute against the register set rather
than memory, and when it does perform memory loads, it does so in quite
linear fashion.  It also does relatively few memory writes.

It is a program that would be slower if they implemented the feature
poorly, but using such a small subset of system behaviours, I don't
think it can identify things which might be slower in part, and thus
have an effect on whole system performance.

> I could not detect a significant slowdown with this feature enabled.

Then why did they make it a chicken bit?  Why did the cpu makers not
simply make the cpus always act this way?  There must be a reason,
probably undisclosed.  They have been conservative for some reason.

Is there an impact on the performance of building a full snapshot?  That
at least has a richer use of all code flows, with lots of kernel crossings,
as opposed to the openssl speed command.

> Therefore I think we should enable this feature by default on OpenBSD.

Maybe.  But I am a bit unconvinced.

> The use of this feature is still being discussed in the wider
> community.  See for example the following thread:
> 
>   
> https://lore.kernel.org/linux-arm-kernel/ywgcrqutxmx0w...@gmail.com/T/#mfcba14511c69461bd8921fef758baae120d090dc
> 

That discussion is talking about providing the ability for programs to
request that mode of operation.  They are not talking about switching
into that mode for the kernel and all processes.

That seems like a big difference.

From a security perspective it might make some sense, but this isn't
like some x86 catastrophe-level speculation.  I'm not sure there is
enough evidence yet that this is required for all modes of operation.

> On arm64, the feature can be controlled from userland, so even if we
> turn it on by default, userland code can still make its own decisions
> on whether it wants the feature enabled or disabled.  We may have to
> expose the ID_AA64PFR0_EL1 register to userland when code shows up
> that attempts to do that.

I suspect that is how this will go.  Programs with specific libraries would
then request the feature on behalf of their entire runtime.  Something
like a constructor (or startup function) in libcrypto would enable it.
Meaning this program "might" care about timingsafe behaviour, so let's
enable it, for the remainder of the life of that program.




Data Independent Timing on arm64

2022-10-01 Thread Mark Kettenis
Armv8.4 introduced a feature that provides data independent timing for
data processing instructions.  This feature is obviously introduced to
mitigate timing side-channel attacks.  Presumably enabling the feature
has some impact on performance as it would disable certain
optimizations to guarantee constant-time execution of instructions.

The only hardware that implements this feature that I've seen so far
is Apple's Icestorm/Firestorm (M1) and Avalanche/Blizzard (M2) cores.
I ran some benchmarks on an M2 Macbook Air.  In particular, I ran
"eopenssl30 speed" bound to a "performance" core.  I could not detect
a significant slowdown with this feature enabled.  Therefore I think
we should enable this feature by default on OpenBSD.  If at some point
a processor core appears where this feature does have a significant
performance impact, we can reconsider.

The use of this feature is still being discussed in the wider
community.  See for example the following thread:

  
https://lore.kernel.org/linux-arm-kernel/ywgcrqutxmx0w...@gmail.com/T/#mfcba14511c69461bd8921fef758baae120d090dc

On arm64, the feature can be controlled from userland, so even if we
turn it on by default, userland code can still make its own decisions
on whether it wants the feature enabled or disabled.  We may have to
expose the ID_AA64PFR0_EL1 register to userland when code shows up
that attempts to do that.

The diff below enables the feature for both the kernel and for
userland processes.

ok?


Index: arch/arm64/arm64/cpu.c
===
RCS file: /cvs/src/sys/arch/arm64/arm64/cpu.c,v
retrieving revision 1.70
diff -u -p -r1.70 cpu.c
--- arch/arm64/arm64/cpu.c  15 Sep 2022 01:57:52 -  1.70
+++ arch/arm64/arm64/cpu.c  1 Oct 2022 10:05:15 -
@@ -756,6 +756,7 @@ void
 cpu_init(void)
 {
uint64_t id_aa64mmfr1, sctlr;
+   uint64_t id_aa64pfr0;
uint64_t tcr;
 
WRITE_SPECIALREG(ttbr0_el1, pmap_kernel()->pm_pt0pa);
@@ -774,6 +775,11 @@ cpu_init(void)
sctlr &= ~SCTLR_SPAN;
WRITE_SPECIALREG(sctlr_el1, sctlr);
}
+
+   /* Enable DIT. */
+   id_aa64pfr0 = READ_SPECIALREG(id_aa64pfr0_el1);
+   if (ID_AA64PFR0_DIT(id_aa64pfr0) >= ID_AA64PFR0_DIT_IMPL)
+   __asm volatile (".arch armv8.4-a; msr dit, #1");
 
/* Initialize debug registers. */
WRITE_SPECIALREG(mdscr_el1, DBG_MDSCR_TDCC);
Index: arch/arm64/arm64/machdep.c
===
RCS file: /cvs/src/sys/arch/arm64/arm64/machdep.c,v
retrieving revision 1.72
diff -u -p -r1.72 machdep.c
--- arch/arm64/arm64/machdep.c  5 Sep 2022 19:18:56 -   1.72
+++ arch/arm64/arm64/machdep.c  1 Oct 2022 10:05:15 -
@@ -433,7 +433,7 @@ setregs(struct proc *p, struct exec_pack
tf->tf_sp = stack;
tf->tf_lr = pack->ep_entry;
tf->tf_elr = pack->ep_entry; /* ??? */
-   tf->tf_spsr = PSR_M_EL0t;
+   tf->tf_spsr = PSR_M_EL0t | PSR_DIT;
 
retval[1] = 0;
 }
Index: arch/arm64/include/armreg.h
===
RCS file: /cvs/src/sys/arch/arm64/include/armreg.h,v
retrieving revision 1.21
diff -u -p -r1.21 armreg.h
--- arch/arm64/include/armreg.h 29 Aug 2022 02:01:18 -  1.21
+++ arch/arm64/include/armreg.h 1 Oct 2022 10:05:15 -
@@ -602,9 +602,13 @@
 #definePSR_I   0x0080
 #definePSR_A   0x0100
 #definePSR_D   0x0200
+#definePSR_SSBS0x1000
 #definePSR_IL  0x0010
 #definePSR_SS  0x0020
 #definePSR_PAN 0x0040
+#definePSR_UAO 0x0080
+#definePSR_DIT 0x0100
+#definePSR_TCO 0x0200
 #definePSR_V   0x1000
 #definePSR_C   0x2000
 #definePSR_Z   0x4000



Re: wc(1): add -L flag to write length of longest line

2022-10-01 Thread Marc Espie
On Fri, Sep 30, 2022 at 02:22:34AM +0200, Joerg Sonnenberger wrote:
> On Thu, Sep 29, 2022 at 08:39:16PM +1000, Jonathan Gray wrote:
> > wc counts items in files.  Finding the longest item indeed sounds
> > like a task better suited to awk.
> 
> Finding outliers, means and counting are all parts of the same basic
> class of operations. A good implementation of all of them requires
> constant space and a fixed time per input character. An implementation
> in awk will generally not have that property.

Why?  Is awk really that bad at managing memory?

Anyway, we have perl in base, and it will definitely fit the bill.

Heck, I can do it in shell with constant space and a fixed time per input
character.

Real world computing 101: complexity is not all that matters, the constant
in O(1) is often what will actually kill you.

Real world computing 102: any utility left unfettered for long enough will
grow a lisp interpreter and go into a mud-fight with emacs.