On Tue, Jan 22, 2019 at 02:05:26AM +0000, Zhang, Lei wrote:
> Hi, Mark
>
> Thanks for your comments, and sorry for late.
>
> > -----Original Message-----
> > * Under what conditions can the fault occur? e.g. is this in place of
> > some other fault, or completely spurious?
> This fault can occur completely spurious under a specific hardware
> condition and instructions order.

Ok. Can you be more specific regarding the conditions under which this
occurs? e.g. can this only occur with certain instruction sequences?

> > * Does this only occur for data abort? i.e. not instruction aborts?
> Yes. This fault only occurs for data abort.
>
> > * How often does this fault occur?
> In my test, this fault occurs once every several times in the OS boot
> sequence, and after the completion of OS boot, this fault have never
> occurred.
> In my opinion, this fault rarely occurs after the completion of OS
> boot.

I'm very concerned that this could occur during boot (even if rarely),
as that implies this is being taken EL1->EL1 or EL2->EL2.

Which exception levels can the fault be taken from? e.g. is it possible
for this fault to be taken from EL2 to EL2, or from EL3 to EL3?

> > * Does this only apply to Stage-1, or can the same faults be taken at
> > Stage-2?
> This fault can be taken only at Stage-1.
>
> > I'm a bit surprised by the single retry. Is there any guarantee that a
> > thread will eventually stop delivering this fault code?
> I guarantee that a thread will stop delivering this fault code by the
> this patch.
> The hardware condition which cause this fault is reset at exception
> entry, therefore execution of at least one instruction is guaranteed
> by this single retry.

Ok, so we can guarantee forward progress, but in the worst case that's
down to single-step performance levels.

> > Note that all CPUs and threads share the do_bad_ignore_first variable,
> > so this is going to behave non-deterministically and kill threads in
> > some cases.

I see now that I'd misread the code, and we'll always retry the fault
(on A64FX), so this is not true.

> > This code is also preemptible, so checking the MIDR here doesn't make
> > much sense. Either this is always uniform (and we can check once in the
> > errata framework), or it's variable (e.g. on a big.LITTLE system)
> > and we need to avoid preemption up until this point.

... though this may be a problem if A64FX is integrated into a
non-uniform system (and we could unwittingly kill threads).

> > Rather than dynamically checking the MIDR, this should use the errata
> > framework, and if any A64FX CPU is discovered, set an erratum cap like
> > ARM64_WORKAROUND_CONFIG_FUJITSU_ERRATUM_010001, so we can do something
> > like:
> I try to provide a new patch to reflect your comments in today.
> Unfortunately this bug may occurs before init_cpu_hwcaps_indirect_list
> called.

As above, I'm very concerned that this could be taken from kernel
context. There are a number of cases where we cannot handle such faults:

* During boot, when we hand-over between agents (e.g. UEFI->kernel).
* Before VBAR_EL1 is initialized.
* During exception entry/return sequences (including when the KPTI
  trampoline vectors are installed).
* While the KVM vectors are installed (for VHE).

Are there any constraints on when the fault can be raised? Under which
conditions does this happen?

Thanks,
Mark.
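
For illustration only, the "check once in the errata framework" idea from
the quoted review can be modelled in user space roughly as below. This is
a sketch under stated assumptions, not the kernel's code and not the patch
under discussion: system_caps, detect_cpu(), should_retry_fault() and the
CAP_FUJITSU_ERRATUM_010001 bit are hypothetical names, and the MIDR
implementer/part values used for A64FX (0x46 / 0x001) are assumptions to
verify against the CPU documentation. The point it tries to show is that
the MIDR is examined once per CPU at discovery time, so the preemptible
fault path only tests a system-wide flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* MIDR field extraction (implementer in bits 31:24, part in bits 15:4). */
#define MIDR_IMPLEMENTER(midr)  (((midr) >> 24) & 0xffu)
#define MIDR_PARTNUM(midr)      (((midr) >> 4) & 0xfffu)

/* Assumed identification values for A64FX; verify before relying on them. */
#define IMP_FUJITSU             0x46u
#define PART_A64FX              0x001u

/* Hypothetical capability bit standing in for the erratum cap. */
#define CAP_FUJITSU_ERRATUM_010001  (1u << 0)

/* System-wide capability mask: set if ANY discovered CPU is affected. */
static uint32_t system_caps;

static bool midr_is_a64fx(uint32_t midr)
{
    return MIDR_IMPLEMENTER(midr) == IMP_FUJITSU &&
           MIDR_PARTNUM(midr) == PART_A64FX;
}

/* Called once per CPU as it is discovered; the only place MIDR is read. */
static void detect_cpu(uint32_t midr)
{
    if (midr_is_a64fx(midr))
        system_caps |= CAP_FUJITSU_ERRATUM_010001;
}

/*
 * Fault path: consults only the system-wide flag, so the answer does not
 * depend on which CPU the thread happens to be running on after preemption.
 */
static bool should_retry_fault(void)
{
    return (system_caps & CAP_FUJITSU_ERRATUM_010001) != 0;
}

/* Small usage walk-through with made-up MIDR values. */
static void demo(void)
{
    system_caps = 0;
    detect_cpu(0x410fd080u);        /* implementer 0x41: not affected */
    assert(!should_retry_fault());
    detect_cpu(0x461f0010u);        /* implementer 0x46, part 0x001 */
    assert(should_retry_fault());
}
```

Note that this model does nothing for the case raised in the reply above:
a fault taken before the capability has been detected would still see the
flag clear.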