On Tue, Jan 22, 2019 at 02:05:26AM +0000, Zhang, Lei wrote:
> Hi, Mark
> 
> Thanks for your comments, and sorry for late.
> 
> > -----Original Message-----
> > * Under what conditions can the fault occur? e.g. is this in place of
> >   some other fault, or completely spurious?

> This fault can occur completely spurious under a specific hardware
> condition and instructions order.

Ok.

Can you be more specific regarding the conditions under which this
occurs? e.g. can this only occur with certain instruction sequences?

> > * Does this only occur for data abort? i.e. not instruction aborts?

> Yes. This fault only occurs for data abort.
> 
> > * How often does this fault occur?

> In my test, this fault occurs once every several times in the OS boot
> sequence, and after the completion of OS boot, this fault have never
> occurred.
> In my opinion, this fault rarely occurs after the completion of OS
> boot.

I'm very concerned that this could occur during boot (even if rarely),
as that implies this is being taken EL1->EL1 or EL2->EL2.

Which exception levels can the fault be taken from?

e.g. is it possible for this fault to be taken from EL2 to EL2, or from
EL3 to EL3?

> > * Does this only apply to Stage-1, or can the same faults be taken at
> >   Stage-2?
> This fault can be taken only at Stage-1.
> 
> > I'm a bit surprised by the single retry. Is there any guarantee that a
> > thread will eventually stop delivering this fault code?

> I guarantee that a thread will stop delivering this fault code by the
> this patch.
> The hardware condition which cause this fault is reset at exception
> entry, therefore execution of at least one instruction is guaranteed
> by this single retry.

Ok, so we can guarantee forward progress, but in the worst case that's
down to single-step performance levels.

> > Note that all CPUs and threads share the do_bad_ignore_first variable,
> > so this is going to behave non-deterministically and kill threads in
> > some cases.

I see now that I'd misread the code, and we'll always retry the fault
(on A64FX), so this is not true.

> > This code is also preemptible, so checking the MIDR here doesn't make
> > much sense. Either this is always uniform (and we can check once in the
> > errata framework), or it's variable (e.g. on a big.LITTLE system)
> > and we need to avoid preemption up until this point.

... though this may be a problem if A64FX is integrated into a
non-uniform system (and we could unwittingly kill threads).

> > Rather than dynamically checking the MIDR, this should use the errata
> > framework, and if any A64FX CPU is discovered, set an erratum cap like
> > ARM64_WORKAROUND_CONFIG_FUJITSU_ERRATUM_010001, so we can do something
> > like:

> I try to provide a new patch to reflect your comments in today.
> Unfortunately this bug may occurs before init_cpu_hwcaps_indirect_list
> called.

As above, I'm very concerned that this could be taken from kernel
context. There are a number of cases where we cannot handle such faults:

* During boot, when we hand-over between agents (e.g. UEFI->kernel).

* Before VBAR_EL1 is initialized.

* During exception entry/return sequences (including when the KPTI
  trampoline vectors are installed).

* While the KVM vectors are installed (for VHE).

Are there any constraints on when the fault can be raised? Under which
conditions does this happen?

Thanks,
Mark.

Reply via email to