RE: [RESEND PATCH] Make Fujitsu Erratum 010001 patch can be applied on A64FX v1r0

2019-03-17 Thread Zhang, Lei
Hi guys,

> -----Original Message-----
> From: linux-arm-kernel On Behalf Of Mark Rutland
> Sent: Saturday, March 16, 2019 12:13 AM
> To: Okamoto, Takayuki/岡本 高幸
> Cc: 'Catalin Marinas'; 'Will Deacon'; 'linux-kernel@vger.kernel.org';
> Zhang, Lei/張 雷; 'James Morse'; 'linux-arm-ker...@lists.infradead.org'
> Subject: Re: [RESEND PATCH] Make Fujitsu Erratum 010001 patch can be
> applied on A64FX v1r0
> 
> On Fri, Mar 15, 2019 at 12:22:36PM +, Okamoto, Takayuki wrote:
> > I resend the patch due to whitespace munging.
> >
> > > -----Original Message-----
> > > From: James Morse
> > > Sent: Wednesday, February 27, 2019 3:44 AM
> > > To: james.mo...@arm.com; linux-arm-ker...@lists.infradead.org
> > > Cc: linux-kernel@vger.kernel.org; Catalin Marinas; Mark Rutland;
> > > Will Deacon; Zhang, Lei
> > > Subject: [PATCH v5] arm64: Add workaround for Fujitsu A64FX erratum
> > > 010001
> > >
> > > +/* Fujitsu Erratum 010001 affects A64FX 1.0 and 1.1, (v0r0 and v1r0) */
> > > +#define MIDR_FUJITSU_ERRATUM_010001       MIDR_FUJITSU_A64FX
> > > +#define MIDR_FUJITSU_ERRATUM_010001_MASK  (~MIDR_VARIANT(1))
> >
> > This workaround for the erratum should be applied to both A64FX v1r0
> > and v0r0. However, patch v5 is only enabled on A64FX v0r0
> > (MIDR.Variant == 0 && MIDR.Revision == 0).
> > This issue is caused by the macro MIDR_FUJITSU_ERRATUM_010001_MASK.
> >
> > I have tested on both A64FX v1r0 and v0r0. This new patch takes effect
> > only on A64FX.
> >
> > --
> > Changed to be applied for not only A64FX v0r0 but also v1r0.
> >
> > Signed-off-by: Zhang Lei 
> > ---
> >  arch/arm64/include/asm/cputype.h | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/arch/arm64/include/asm/cputype.h
> > b/arch/arm64/include/asm/cputype.h
> > index 2afb133..1fb47b5 100644
> > --- a/arch/arm64/include/asm/cputype.h
> > +++ b/arch/arm64/include/asm/cputype.h
> > @@ -129,7 +129,7 @@
> >
> >  /* Fujitsu Erratum 010001 affects A64FX 1.0 and 1.1, (v0r0 and v1r0) */
> >  #define MIDR_FUJITSU_ERRATUM_010001       MIDR_FUJITSU_A64FX
> > -#define MIDR_FUJITSU_ERRATUM_010001_MASK  (~MIDR_VARIANT(1))
> 
> The bug is that MIDR_VARIANT() is meant to extract the variant from a full
> MIDR value, not generate an in-place field value.
> 
> > +#define MIDR_FUJITSU_ERRATUM_010001_MASK   (~(0x1 << MIDR_VARIANT_SHIFT))
> 
> I believe this can be:
> 
> #define MIDR_FUJITSU_ERRATUM_010001_MASK  (~MIDR_VAR_REV(1, 0))

Thanks for your comments.
I also considered using the MIDR_CPU_VAR_REV macro,
but (~MIDR_CPU_VAR_REV(1, 0)) reads as "NOT v1r0".
I think that may cause confusion, so I chose the
simple form (~(0x1 << MIDR_VARIANT_SHIFT)).
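
To make the difference between the two masks concrete, here is a minimal
user-space sketch. It assumes the mask is used in a comparison of the form
(midr & MASK) == MIDR_FUJITSU_ERRATUM_010001, which is consistent with v5
matching only v0r0 but is not shown in this thread. The field layout mirrors
cputype.h as I understand it; a64fx_midr() and matches_erratum() are
hypothetical helpers, not kernel functions.

```c
#include <stdbool.h>
#include <stdint.h>

/* MIDR_EL1 field layout (assumed to mirror arch/arm64 cputype.h). */
#define MIDR_VARIANT_SHIFT  20
#define MIDR_VARIANT_MASK   (0xfu << MIDR_VARIANT_SHIFT)

/* Extracts the variant from a full MIDR value; it does not build a field. */
#define MIDR_VARIANT(midr)  (((midr) & MIDR_VARIANT_MASK) >> MIDR_VARIANT_SHIFT)

/* A64FX model: implementer 0x46, architecture 0xf, part number 0x001. */
#define MIDR_FUJITSU_A64FX  ((0x46u << 24) | (0xfu << 16) | (0x001u << 4))

/* v5 mask: MIDR_VARIANT(1) extracts 0, so the mask is all ones. */
#define MASK_V5     (~MIDR_VARIANT(1u))
/* Fixed mask: ignores bit 0 of the variant field only. */
#define MASK_FIXED  (~(0x1u << MIDR_VARIANT_SHIFT))

/* Hypothetical helper: build an A64FX MIDR for a given variant/revision. */
static uint32_t a64fx_midr(uint32_t variant, uint32_t revision)
{
	return MIDR_FUJITSU_A64FX | (variant << MIDR_VARIANT_SHIFT) | revision;
}

/* Hypothetical helper: the masked comparison assumed above. */
static bool matches_erratum(uint32_t midr, uint32_t mask)
{
	return (midr & mask) == MIDR_FUJITSU_A64FX;
}
```

With MASK_V5, matches_erratum() accepts a64fx_midr(0, 0) and rejects
a64fx_midr(1, 0). With MASK_FIXED it accepts both and still rejects
a64fx_midr(2, 0).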

> But otherwise this looks fine to me.

Will this patch be merged to v5.1?

Thanks,
Zhang Lei




RE: [PATCH v4] arm64: Add workaround for Fujitsu A64FX erratum 010001

2019-02-26 Thread Zhang, Lei
Hi James,

> -----Original Message-----
> From: linux-arm-kernel On Behalf Of James Morse
> Sent: Tuesday, February 26, 2019 2:29 AM
> To: Zhang, Lei/張 雷
> Cc: Mark Rutland; 'Catalin Marinas'; 'Will Deacon';
> 'linux-kernel@vger.kernel.org'; 'linux-arm-ker...@lists.infradead.org'
> Subject: Re: [PATCH v4] arm64: Add workaround for Fujitsu A64FX erratum
> 010001
> 
> Hi Zhang,
> 
> On 23/02/2019 13:06, Zhang, Lei wrote:
> > Zhang, Lei wrote:
> >> I think you mean it may be a problem to modify the KPTI trampoline
> >> because some patches about KPTI will be merged to mainline in the
> >> near future.
> >> I understood that.
> >> I should discuss with my colleagues whether we can set NFDx=0 all the
> >> time on A64FX.
> >
> > The result of our investigation also supports your suggestion.
> > We fully agree with you that your proposed method (never set NFDx=1
> > on A64FX) is the best way to resolve this erratum.
> >
> > For this erratum, James's patch should be merged to mainline instead
> > of my previous patches (v1 to v4).
> > Since KPTI fully covers the effect of NFD1 for A64FX, KPTI is
> > recommended to be used in conjunction with James’s patch.
> 
> >> And thanks for your patch.
> >> If we can set NFDx=0 all the time, I will review, test and report the
> >> result.
> >
> > I have already tested James's patch on A64FX and found no problems at
> > all.
> >
> > Tested-by:zhang.lei
> 
> Thanks, I'll post it properly with this tag.
I saw the v5 patch you posted. Thanks a lot.

> 
> 
> >> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> >> index a4168d366127..b0b7f1c4e816 100644
> >> --- a/arch/arm64/Kconfig
> >> +++ b/arch/arm64/Kconfig
> >> @@ -643,6 +643,25 @@ config QCOM_FALKOR_ERRATUM_E1041
> >>
> >>  If unsure, say Y.
> >>
> >> +config FUJITSU_ERRATUM_010001
> >> +  bool "Fujitsu-A64FX erratum E#010001: Undefined fault may occur wrongly"
> >> +  default y
> >> +  help
> >> +This option adds workaround for Fujitsu-A64FX erratum E#010001.
> >> +On some variants of the Fujitsu-A64FX cores version (1.0, 1.1), memory
> >> +accesses may cause undefined fault (Data abort, DFSC=0b11).
> >> +This fault occurs under a specific hardware condition when a
> >> +load/store instruction performs an address translation using:
> >> +case-1  TTBR0_EL1 with TCR_EL1.NFD0 == 1.
> >> +case-2  TTBR0_EL2 with TCR_EL2.NFD0 == 1.
> >> +case-3  TTBR1_EL1 with TCR_EL1.NFD1 == 1.
> >> +case-4  TTBR1_EL2 with TCR_EL2.NFD1 == 1.
> >> +
> >> +The workaround is to ensure these bits are clear in TCR_ELx.
> >> +The workaround only affects the Fujitsu-A64FX.
> >
> > I think it is better to add a notice here as follows:
> >
> >   Recommend to enable KPTI (UNMAP_KERNEL_AT_EL0 = y).
> 
> That unmap option is on by default, you can't turn it off without
> CONFIG_EXPERT. While I agree, I don't think we need to spell this out.
I agree with you that there is no need to mention it here.
Thank you for your suggestion.
 
Best Regards,
Zhang Lei



RE: [PATCH v4] arm64: Add workaround for Fujitsu A64FX erratum 010001

2019-02-23 Thread Zhang, Lei
Hi guys,

> -----Original Message-----
> From: linux-arm-kernel On Behalf Of Zhang, Lei
> Sent: Friday, February 15, 2019 9:36 PM
> To: 'James Morse'; Mark Rutland
> Cc: 'Catalin Marinas'; 'Will Deacon'; 'linux-kernel@vger.kernel.org';
> 'linux-arm-ker...@lists.infradead.org'
> Subject: RE: [PATCH v4] arm64: Add workaround for Fujitsu A64FX erratum
> 010001
> 
> 
> I think you mean it may be a problem to modify the KPTI trampoline because
> some patches about KPTI will be merged to mainline in the near future.
> I understood that.
> I should discuss with my colleagues whether we can set NFDx=0 all the time on
> A64FX.

The result of our investigation also supports your suggestion.
We fully agree with you that your proposed method (never set NFDx=1 on A64FX)
is the best way to resolve this erratum.
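
As an illustration of "never set NFDx=1 on A64FX", the following user-space
sketch computes the TCR value that would be programmed for a given CPU. The
bit positions for NFD0/NFD1 (53 and 54) and the masked MIDR comparison are
assumptions based on my reading of the architecture and of the later
mask fix; tcr_for_cpu() is a hypothetical helper, not the code in James's
patch.

```c
#include <stdint.h>

/* TCR_EL1 non-fault translation table walk disable bits (assumed positions). */
#define TCR_NFD0  (UINT64_C(1) << 53)
#define TCR_NFD1  (UINT64_C(1) << 54)

#define MIDR_VARIANT_SHIFT  20
/* A64FX model: implementer 0x46, architecture 0xf, part number 0x001. */
#define MIDR_FUJITSU_A64FX  ((0x46u << 24) | (0xfu << 16) | (0x001u << 4))
/* Ignore bit 0 of the variant so that v0r0 and v1r0 both match. */
#define A64FX_ERRATUM_MASK  (~(0x1u << MIDR_VARIANT_SHIFT))

/*
 * Hypothetical helper: return the TCR value to program on this CPU.
 * On an affected A64FX the NFD0/NFD1 bits are always cleared; on any
 * other CPU the requested value is returned unchanged.
 */
static uint64_t tcr_for_cpu(uint32_t midr, uint64_t tcr)
{
	if ((midr & A64FX_ERRATUM_MASK) == MIDR_FUJITSU_A64FX)
		tcr &= ~(TCR_NFD0 | TCR_NFD1);
	return tcr;
}
```

Because the bits are cleared once when the value is computed, nothing on the
kernel entry or exit path has to change, which is the property that makes
this approach simpler than toggling NFD1 in the KPTI trampoline.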

For this erratum, James's patch should be merged to mainline
instead of my previous patches (v1 to v4).
Since KPTI fully covers the effect of NFD1 for A64FX, KPTI is
recommended to be used in conjunction with James’s patch.

> And thanks for your patch.
> If we can set NFDx=0 all the time, I will review, test and report the result.

I have already tested James's patch on A64FX and found no problems at
all.

Tested-by:zhang.lei


> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> index a4168d366127..b0b7f1c4e816 100644
> --- a/arch/arm64/Kconfig
> +++ b/arch/arm64/Kconfig
> @@ -643,6 +643,25 @@ config QCOM_FALKOR_ERRATUM_E1041
> 
> If unsure, say Y.
> 
> +config FUJITSU_ERRATUM_010001
> + bool "Fujitsu-A64FX erratum E#010001: Undefined fault may occur wrongly"
> + default y
> + help
> +   This option adds workaround for Fujitsu-A64FX erratum E#010001.
> +   On some variants of the Fujitsu-A64FX cores version (1.0, 1.1), memory
> +   accesses may cause undefined fault (Data abort, DFSC=0b11).
> +   This fault occurs under a specific hardware condition when a
> +   load/store instruction performs an address translation using:
> +   case-1  TTBR0_EL1 with TCR_EL1.NFD0 == 1.
> +   case-2  TTBR0_EL2 with TCR_EL2.NFD0 == 1.
> +   case-3  TTBR1_EL1 with TCR_EL1.NFD1 == 1.
> +   case-4  TTBR1_EL2 with TCR_EL2.NFD1 == 1.
> +
> +   The workaround is to ensure these bits are clear in TCR_ELx.
> +   The workaround only affects the Fujitsu-A64FX.

I think it is better to add a notice here as follows:

  Recommend to enable KPTI (UNMAP_KERNEL_AT_EL0 = y).

> +
> +   If unsure, say Y.
> +
>  endmenu

Thanks a lot.

Best Regards,
Zhang Lei 



RE: [PATCH v4] arm64: Add workaround for Fujitsu A64FX erratum 010001

2019-02-15 Thread Zhang, Lei
Hi guys,


> -----Original Message-----
> From: James Morse [mailto:james.mo...@arm.com]
> Sent: Friday, February 15, 2019 3:23 AM
> To: Mark Rutland; Zhang, Lei 
> Cc: 'Will Deacon'; 'Catalin Marinas'; 'linux-kernel@vger.kernel.org';
> 'linux-arm-ker...@lists.infradead.org'
> Subject: Re: [PATCH v4] arm64: Add workaround for Fujitsu A64FX erratum 010001

> I think we should do this: never set NFDx on A64FX. I don't think we can
> maintain the TCR swivel before any memory access in the KPTI trampoline.
> (It already uses the FAR as a scratch register!)
> 
> The errata means we can't use these bits. It's simpler than trying to work
> around the symptoms.

I think you mean it may be a problem to modify the KPTI trampoline
because some patches about KPTI will be merged to mainline in the near future.
I understood that.
I should discuss with my colleagues whether we can set NFDx=0 all the time on
A64FX.

And thanks for your patch.
If we can set NFDx=0 all the time, I will review, test and report the result.

Best Regards,
Zhang Lei



[PATCH v4] arm64: Add workaround for Fujitsu A64FX erratum 010001

2019-02-13 Thread Zhang, Lei
Hi guys,

Thanks for your comments.
I am sending the revised patch, version 4, which includes a whole
description of the patch.

This patch adds a workaround for Fujitsu A64FX erratum 010001.

There are some discussions on former versions, as follows:

[PATCH] arm64 memory accesses may cause undefined fault on Fujitsu-A64FX
  https://lkml.org/lkml/2019/1/18/403
  
[PATCH v2 0/1] arm64: Add workaround for Fujitsu A64FX erratum 010001
  https://lkml.org/lkml/2019/1/22/137
  
[PATCH v2 1/1] arm64: Add workaround for Fujitsu A64FX erratum 010001
  https://lkml.org/lkml/2019/1/22/138
  
[PATCH v3 0/1] arm64: Add workaround for Fujitsu A64FX erratum 010001
  https://www.spinics.net/lists/arm-kernel/msg703111.html
  
[v3,1/1] Arm64: Add workaround for Fujitsu A64FX erratum 010001
  https://patchwork.kernel.org/patch/10786139/

Please merge this patch.

Note that this patch is for linux-5.0-rc2, which sets TCR_ELx.NFD1 to '1'
only once in the boot sequence and does not set TCR_ELx.NFD0.
If a newer kernel handles TCR_ELx.{NFD0,NFD1} in a different way,
I will update the patch as soon as possible.


Changes since [v3]

 * Add description of the patch.
 * Add dependency to Kconfig.
  - Make the default value of FUJITSU_ERRATUM_010001 depend on RANDOMIZE_BASE.

Changes since [v2]

 * Change TCR_ELx.NFD1.
  - Set TCR_ELx.NFD1 to 0 on kernel entry.
  - Set TCR_ELx.NFD1 to 1 on kernel exit.

Changes since [v1]

 * Use the errata framework to work around Fujitsu A64FX erratum 010001.



On the Fujitsu-A64FX cores ver(1.0, 1.1), memory accesses may
cause an undefined fault (Data abort, DFSC=0b11).
This fault occurs under a specific hardware condition when a
load/store instruction performs an address translation.
Any load/store instruction (Armv8 or SVE) other than a
non-fault access might cause this undefined fault.

Since this erratum occurs only when TCR_ELx.NFD1=1,
I keep TCR_ELx.NFD1=0 while in EL1/EL2.

By doing so, the erratum can occur only in EL0.
I deal with this erratum in EL0 with a new fault handler
that ignores this undefined fault.

Signed-off-by: Zhang Lei 
---
 Documentation/arm64/silicon-errata.txt |  1 +
 arch/arm64/Kconfig | 23 +++
 arch/arm64/include/asm/cpucaps.h   |  3 ++-
 arch/arm64/include/asm/cputype.h   |  4 
 arch/arm64/kernel/cpu_errata.c |  8 
 arch/arm64/kernel/entry.S  | 16 
 arch/arm64/mm/fault.c  | 16 +++-
 arch/arm64/mm/proc.S   | 20 
 8 files changed, 89 insertions(+), 2 deletions(-)

diff --git a/Documentation/arm64/silicon-errata.txt b/Documentation/arm64/silicon-errata.txt
index 1f09d04..26d64e9 100644
--- a/Documentation/arm64/silicon-errata.txt
+++ b/Documentation/arm64/silicon-errata.txt
@@ -80,3 +80,4 @@ stable kernels.
 | Qualcomm Tech. | Falkor v1       | E1009           | QCOM_FALKOR_ERRATUM_1009    |
 | Qualcomm Tech. | QDF2400 ITS     | E0065           | QCOM_QDF2400_ERRATUM_0065   |
 | Qualcomm Tech. | Falkor v{1,2}   | E1041           | QCOM_FALKOR_ERRATUM_1041    |
+| Fujitsu        | A64FX           | E#010001        | FUJITSU_ERRATUM_010001      |
diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index a4168d3..7c76c66 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -643,6 +643,29 @@ config QCOM_FALKOR_ERRATUM_E1041
 
  If unsure, say Y.
 
+config FUJITSU_ERRATUM_010001
+   bool "Fujitsu-A64FX erratum E#010001: Undefined fault may occur wrongly"
+depends on RANDOMIZE_BASE
+default RANDOMIZE_BASE
+   help
+ This option adds workaround for Fujitsu-A64FX erratum E#010001.
+ On some variants of the Fujitsu-A64FX cores ver(1.0, 1.1), memory accesses
+ may cause undefined fault (Data abort, DFSC=0b11).
+ This fault occurs under a specific hardware condition when a load/store
+ instruction performs an address translation using:
+ case-1  TTBR0_EL1 with TCR_EL1.NFD0 == 1.
+ case-2  TTBR0_EL2 with TCR_EL2.NFD0 == 1.
+ case-3  TTBR1_EL1 with TCR_EL1.NFD1 == 1.
+ case-4  TTBR1_EL2 with TCR_EL2.NFD1 == 1.
+
+ The workaround is to set TCR_ELx.NFD1 to '0' at kernel-entry and
+ to '1' at kernel-exit, and also to replace the fault handler
+ for Data abort DFSC=0b11 with a new fault handler that ignores this
+ undefined fault.
+ The workaround only affects the Fujitsu-A64FX.
+
+ If unsure, say Y.
+
 endmenu
 
 
diff --git a/arch/arm64/include/asm/cpucaps.h b/arch/arm64/include/asm/cpucaps.h
index 82e9099..3a0b375 100644
--- a/arch/arm64/include/asm/cpucaps.h
+++ b/arch/arm64/include/asm/cpucaps.h
@@ -60,7 +60,8 @@
 #define ARM64_HAS_ADDRESS_AUTH_IMP_DEF 39
 #define ARM64_HAS_GENERIC_AUTH_ARCH40
 #define ARM64_HAS_GENERIC_AUTH_IMP_DEF 41
+#define ARM64_WORKAROUND_FUJITSU

RE: [PATCH v3 0/1] arm64: Add workaround for Fujitsu A64FX erratum 010001

2019-02-05 Thread Zhang, Lei
Hi Will,

> -----Original Message-----
> From: linux-arm-kernel
> [mailto:linux-arm-kernel-boun...@lists.infradead.org] On Behalf Of
> Will Deacon
> Sent: Friday, February 01, 2019 7:52 PM
> To: Zhang, Lei 
> Cc: 'Mark Rutland'; 'Catalin Marinas'; 'James Morse';
> 'linux-kernel@vger.kernel.org';
> 'linux-arm-ker...@lists.infradead.org'
> Subject: Re: [PATCH v3 0/1] arm64: Add workaround for Fujitsu A64FX
> erratum 010001

> So I guess we should boot with NFD1 clear, and then set it only when we
> realise we're not on an A64FX?

My patch does something similar in __cpu_setup, where we
set NFD1=1 on all processors except A64FX.

Do you mean that we had better change the place where we
set/clear NFD1?

Thanks,
Zhang Lei



RE: [PATCH v3 0/1] arm64: Add workaround for Fujitsu A64FX erratum 010001

2019-02-05 Thread Zhang, Lei
Hi Catalin,

> -----Original Message-----
> From: Catalin Marinas [mailto:catalin.mari...@arm.com]
> Sent: Wednesday, January 30, 2019 3:11 AM
> To: Zhang, Lei 
> Cc: 'linux-kernel@vger.kernel.org'; 'Mark Rutland';
> 'linux-arm-ker...@lists.infradead.org'; 'will.dea...@arm.com';
> 'james.mo...@arm.com'
> Subject: Re: [PATCH v3 0/1] arm64: Add workaround for Fujitsu A64FX
> erratum 010001
> 
> Could you please copy the whole description from the cover letter to the
> actual patch and only send one email (full description as in here
> together with the patch)? If we commit this to the kernel, it would be
> useful to have the information in the log for reference later on.

Thank you for your suggestion. I will send one email with the whole description.

> So this looks like new information on the hardware behaviour since the
> v2 of the patch. Can this fault occur for any type of instruction
> accessing the memory or only for SVE instructions?

With this erratum, any load/store instruction (Armv8 or SVE) other than
a non-fault access might cause a spurious fault.

> How likely is it to trigger this erratum? In other words, aren't we
> better off with a spurious fault that we ignore rather than toggling the
> TCR_ELx.NFD1 bit?

Although the erratum occurs extremely rarely, this patch is required
to handle the issue pointed out by James and Mark in:
  https://lkml.org/lkml/2019/1/22/533,
  https://lkml.org/lkml/2019/1/22/642.

As James and Mark pointed out, if the erratum occurs at EL1/EL2 before
the system registers ELR and SPSR are saved, these registers will
be overwritten and we will lose that information.

So, we set TCR_ELx.NFD1=0 while in EL1/EL2.
Please see the supplemental explanation at the end of this mail.

> The problem is that this bit may be cached in the TLB (I haven't checked
> the ARM ARM but that's usually the case with the TCR_ELx bits). If
> that's the case, you can't guarantee a change unless you also perform
> a TLBI VMALL. Arguably, if Fujitsu's microarchitecture doesn't cache the
> NFD bits in the TLB, we could apply the workaround but I'd rather have
> the spurious trap if it's not too often.

It is not necessary to perform a TLBI VMALL on the A64FX microarchitecture
to guarantee a change of TCR_ELx.{NFD0,NFD1}.

> Could speculative loads also trigger this? Another option would be to
> toggle it during kernel_neon_begin/end (with the caveat of TLBI as
> mentioned above).

No, a speculative load does not trigger this erratum. 

Here are supplemental explanations:

Since this erratum occurs only when TCR_ELx.NFD1=1,
we keep TCR_ELx.NFD1=0 while in EL1/EL2.
By doing so, the erratum can occur only in EL0, and the
spurious trap can be handled by the fault handler.

To keep TCR_ELx.NFD1=0 in EL1/EL2, there are two critical
sections that must be handled for the implementation to be complete.
One is the transition from EL0 to EL1/EL2 and the other
is the transition from EL1/EL2 to EL0.

For the former case, I set TCR_ELx.NFD1=0 in the tramp_map_kernel code.
There is no load/store instruction before TCR_ELx.NFD1 is set to 0
at EL1/EL2, so the undefined fault cannot occur.

For the latter case, I set TCR_ELx.NFD1=1 in the tramp_unmap_kernel code.
There is no load/store instruction after TCR_ELx.NFD1 is set to 1
at EL1/EL2, so the undefined fault cannot occur.

To handle the spurious fault in EL0,
I replace the fault handler for Data abort DFSC=0b11 with
a new fault handler that ignores the spurious fault caused by the erratum.
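
The two transitions described above can be modelled in user space as pure
functions on a TCR value. This is only a sketch of the intended invariant
(NFD1 is 0 whenever the kernel runs and 1 whenever EL0 runs); the real
change lives in the trampoline assembly, the bit position of NFD1 (54) is
an assumption, and both function names are hypothetical.

```c
#include <stdint.h>

/* TCR_ELx.NFD1 (assumed to be bit 54). */
#define TCR_NFD1  (UINT64_C(1) << 54)

/*
 * Hypothetical model of the EL0 -> EL1/EL2 transition: NFD1 is cleared
 * before the kernel performs its first load/store.
 */
static uint64_t tcr_on_kernel_entry(uint64_t tcr)
{
	return tcr & ~TCR_NFD1;
}

/*
 * Hypothetical model of the EL1/EL2 -> EL0 transition: NFD1 is set again
 * after the kernel has performed its last load/store.
 */
static uint64_t tcr_on_kernel_exit(uint64_t tcr)
{
	return tcr | TCR_NFD1;
}
```

Both functions leave every other TCR bit untouched, so a round trip through
entry and exit returns to the value that was in effect for EL0.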

Thanks,
Zhang Lei



RE: [PATCH v3 0/1] arm64: Add workaround for Fujitsu A64FX erratum 010001

2019-01-31 Thread Zhang, Lei
Hi James,

> -----Original Message-----
> From: linux-arm-kernel
> [mailto:linux-arm-kernel-boun...@lists.infradead.org] On Behalf Of
> James Morse
> Sent: Thursday, January 31, 2019 12:00 AM
> To: Zhang, Lei/張 雷
> Cc: 'Mark Rutland'; 'Catalin Marinas'; 'will.dea...@arm.com';
> 'linux-kernel@vger.kernel.org';
> 'linux-arm-ker...@lists.infradead.org'
> Subject: Re: [PATCH v3 0/1] arm64: Add workaround for Fujitsu A64FX
> erratum 010001
> 
> 
> e03e61c3173c ("arm64: kaslr: Set TCR_EL1.NFD1 when
> CONFIG_RANDOMIZE_BASE=y") ?
> 
> So you'd never see it if you disabled CONFIG_RANDOMIZE_BASE?
For security, it is necessary to set CONFIG_RANDOMIZE_BASE=y.

Thanks,
Zhang Lei 



[PATCH v3 0/1] arm64: Add workaround for Fujitsu A64FX erratum 010001

2019-01-29 Thread Zhang, Lei
On some variants of the Fujitsu-A64FX cores ver(1.0, 1.1),
memory accesses may cause an undefined fault (Data abort, DFSC=0b11).
This problem will be fixed in the next version of Fujitsu-A64FX.

This fault occurs under a specific hardware condition
when a load/store instruction performs an address translation using:
  case-1  TTBR0_EL1 with TCR_EL1.NFD0 == 1.
  case-2  TTBR0_EL2 with TCR_EL2.NFD0 == 1.
  case-3  TTBR1_EL1 with TCR_EL1.NFD1 == 1.
  case-4  TTBR1_EL2 with TCR_EL2.NFD1 == 1.
This fault is completely spurious.

Since the kernel sets TCR_ELx.NFD1 to '1' in versions
past 4.17, case-3 or case-4 may happen.

This fault can be taken only at stage-1, 
so this fault is taken from EL0 to EL1/EL2, from EL1 to EL1, 
or from EL2 to EL2.

I would like to post a workaround to avoid this problem on
the existing Fujitsu-A64FX versions.

There are 2 points in this workaround.
Point1: trap from EL1 to EL1, EL2 to EL2
Set TCR_ELx.NFD1 to '0' in kernel-entry,
and set it to '1' in kernel-exit.

From the viewpoint of the Arm specification, there is no problem with
resetting TCR_ELx.{NFD0,NFD1} while in EL1/EL2, because
TCR_ELx.{NFD0,NFD1} controls whether to perform a translation
table walk in response to an access from EL0.

I confirmed that:
・There is no load/store instruction between
  tramp_ventry and setting TCR_ELx.NFD1 to '0'.
・There is no load/store instruction between
  setting TCR_ELx.NFD1 to '1' and tramp_exit.

Point2: trap from EL0 to EL1/EL2
Since this fault also occurs in EL0,
replace the fault handler for Data abort
DFSC=0b11 with a new one that ignores this undefined fault.
I guarantee that a thread will stop delivering this fault code when
this undefined fault is ignored.

The hardware condition which causes this fault is reset at exception entry,
therefore execution of at least one instruction is
guaranteed by this single retry.


This workaround is based on linux-5.0-rc2,
in which TCR_ELx.NFD1 is set to '1'
only once in the boot sequence,
and TCR_ELx.NFD0 is not set by the kernel.
I will update my patch if a new kernel changes how
TCR_ELx.{NFD0,NFD1} is handled.

Changes since [v1]
Following Mark's review:

 * Adopted errata framework.

Changes since [v2]
Following Mark's and James's review:
 
 * Added framework to change TCR_ELx.NFD1.
  - Change TCR_ELx.NFD1 to 0 on kernel entry.
  - Change TCR_ELx.NFD1 to 1 on kernel exit.

I would greatly appreciate it if someone could test this patch on different
chips to verify that it has no harmful effect on them.

If there is no problem on other chips, please merge this patch.

The patch is based on linux-5.0-rc2.

Zhang Lei (1):
  Arm64: Add workaround for Fujitsu A64FX erratum 010001

 Documentation/arm64/silicon-errata.txt |  1 +
 arch/arm64/Kconfig | 22 ++
 arch/arm64/include/asm/cpucaps.h   |  3 ++-
 arch/arm64/include/asm/cputype.h   |  4 
 arch/arm64/kernel/cpu_errata.c |  8 
 arch/arm64/kernel/entry.S  | 16 
 arch/arm64/mm/fault.c  | 16 +++-
 arch/arm64/mm/proc.S   | 20 
 8 files changed, 88 insertions(+), 2 deletions(-)

-- 
1.8.3.1


[PATCH v3 1/1] Arm64: Add workaround for Fujitsu A64FX erratum 010001

2019-01-29 Thread Zhang, Lei
Add workaround for Fujitsu A64FX erratum 010001

Signed-off-by: Zhang Lei 
---
 Documentation/arm64/silicon-errata.txt |  1 +
 arch/arm64/Kconfig | 22 ++
 arch/arm64/include/asm/cpucaps.h   |  3 ++-
 arch/arm64/include/asm/cputype.h   |  4 
 arch/arm64/kernel/cpu_errata.c |  8 
 arch/arm64/kernel/entry.S  | 16 
 arch/arm64/mm/fault.c  | 16 +++-
 arch/arm64/mm/proc.S   | 20 
 8 files changed, 88 insertions(+), 2 deletions(-)

diff --git a/Documentation/arm64/silicon-errata.txt b/Documentation/arm64/silicon-errata.txt
index 1f09d04..26d64e9 100644
--- a/Documentation/arm64/silicon-errata.txt
+++ b/Documentation/arm64/silicon-errata.txt
@@ -80,3 +80,4 @@ stable kernels.
 | Qualcomm Tech. | Falkor v1       | E1009           | QCOM_FALKOR_ERRATUM_1009    |
 | Qualcomm Tech. | QDF2400 ITS     | E0065           | QCOM_QDF2400_ERRATUM_0065   |
 | Qualcomm Tech. | Falkor v{1,2}   | E1041           | QCOM_FALKOR_ERRATUM_1041    |
+| Fujitsu        | A64FX           | E#010001        | FUJITSU_ERRATUM_010001      |
diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index a4168d3..60e193f
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -643,6 +643,28 @@ config QCOM_FALKOR_ERRATUM_E1041
 
  If unsure, say Y.
 
+config FUJITSU_ERRATUM_010001
+   bool "Fujitsu-A64FX erratum E#010001: Undefined fault may occur wrongly"
+   default y
+   help
+ This option adds workaround for Fujitsu-A64FX erratum E#010001.
+ On some variants of the Fujitsu-A64FX cores ver(1.0, 1.1), memory accesses
+ may cause undefined fault (Data abort, DFSC=0b11).
+ This fault occurs under a specific hardware condition when a load/store
+ instruction performs an address translation using:
+ case-1  TTBR0_EL1 with TCR_EL1.NFD0 == 1.
+ case-2  TTBR0_EL2 with TCR_EL2.NFD0 == 1.
+ case-3  TTBR1_EL1 with TCR_EL1.NFD1 == 1.
+ case-4  TTBR1_EL2 with TCR_EL2.NFD1 == 1.
+
+ The workaround is to set TCR_ELx.NFD1 to '0' at kernel-entry and
+ to '1' at kernel-exit, and also to replace the fault handler
+ for Data abort DFSC=0b11 with a new one that ignores this
+ undefined fault.
+ This only affects the Fujitsu-A64FX.
+
+ If unsure, say Y.
+
 endmenu
 
 
diff --git a/arch/arm64/include/asm/cpucaps.h b/arch/arm64/include/asm/cpucaps.h
index 82e9099..3a0b375 100644
--- a/arch/arm64/include/asm/cpucaps.h
+++ b/arch/arm64/include/asm/cpucaps.h
@@ -60,7 +60,8 @@
 #define ARM64_HAS_ADDRESS_AUTH_IMP_DEF 39
 #define ARM64_HAS_GENERIC_AUTH_ARCH40
 #define ARM64_HAS_GENERIC_AUTH_IMP_DEF 41
+#define ARM64_WORKAROUND_FUJITSU_A64FX_011 42
 
-#define ARM64_NCAPS42
+#define ARM64_NCAPS43
 
 #endif /* __ASM_CPUCAPS_H */
diff --git a/arch/arm64/include/asm/cputype.h b/arch/arm64/include/asm/cputype.h
index 951ed1a..70203f9 100644
--- a/arch/arm64/include/asm/cputype.h
+++ b/arch/arm64/include/asm/cputype.h
@@ -76,6 +76,7 @@
 #define ARM_CPU_IMP_BRCM   0x42
 #define ARM_CPU_IMP_QCOM   0x51
 #define ARM_CPU_IMP_NVIDIA 0x4E
+#define ARM_CPU_IMP_FUJITSU0x46
 
 #define ARM_CPU_PART_AEM_V80xD0F
 #define ARM_CPU_PART_FOUNDATION0xD00
@@ -104,6 +105,8 @@
 #define NVIDIA_CPU_PART_DENVER 0x003
 #define NVIDIA_CPU_PART_CARMEL 0x004
 
+#define FUJITSU_CPU_PART_A64FX 0x001
+
 #define MIDR_CORTEX_A53 MIDR_CPU_MODEL(ARM_CPU_IMP_ARM, ARM_CPU_PART_CORTEX_A53)
 #define MIDR_CORTEX_A57 MIDR_CPU_MODEL(ARM_CPU_IMP_ARM, ARM_CPU_PART_CORTEX_A57)
 #define MIDR_CORTEX_A72 MIDR_CPU_MODEL(ARM_CPU_IMP_ARM, ARM_CPU_PART_CORTEX_A72)
@@ -122,6 +125,7 @@
 #define MIDR_QCOM_KRYO MIDR_CPU_MODEL(ARM_CPU_IMP_QCOM, QCOM_CPU_PART_KRYO)
 #define MIDR_NVIDIA_DENVER MIDR_CPU_MODEL(ARM_CPU_IMP_NVIDIA, NVIDIA_CPU_PART_DENVER)
 #define MIDR_NVIDIA_CARMEL MIDR_CPU_MODEL(ARM_CPU_IMP_NVIDIA, NVIDIA_CPU_PART_CARMEL)
+#define MIDR_FUJITSU_A64FX MIDR_CPU_MODEL(ARM_CPU_IMP_FUJITSU, FUJITSU_CPU_PART_A64FX)
 
 #ifndef __ASSEMBLY__
 
diff --git a/arch/arm64/kernel/cpu_errata.c b/arch/arm64/kernel/cpu_errata.c
index 9950bb0..fc0737f 100644
--- a/arch/arm64/kernel/cpu_errata.c
+++ b/arch/arm64/kernel/cpu_errata.c
@@ -739,6 +739,14 @@ static bool has_ssbd_mitigation(const struct arm64_cpu_capabilities *entry,
ERRATA_MIDR_RANGE(MIDR_CORTEX_A76, 0, 0, 2, 0),
},
 #endif
+#ifdef CONFIG_FUJITSU_ERRATUM_010001
+   {
+   .desc = "Fujitsu erratum 010001",
+   .capability = ARM64_WORKAROUND_FUJITSU_A64FX_011,
+   ERRATA_MIDR_RANGE(MIDR_FUJITSU_A64FX, 0, 0, 1, 0),
+   },
+#endif
+
{
}
 };
diff --git a/arch/arm64/kernel/e
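
For reference, the ERRATA_MIDR_RANGE(MIDR_FUJITSU_A64FX, 0, 0, 1, 0) entry
above selects every A64FX revision from v0r0 up to and including v1r0. The
following user-space sketch shows that range check under the assumption
that it compares the model fields for equality and the combined
variant/revision value against a minimum and maximum, as the arm64 MIDR
helpers do to my knowledge. midr_in_range() and a64fx_erratum_applies() are
hypothetical names.

```c
#include <stdbool.h>
#include <stdint.h>

/* MIDR_EL1 field layout (assumed to mirror arch/arm64 cputype.h). */
#define MIDR_REVISION_MASK      0xfu
#define MIDR_PARTNUM_SHIFT      4
#define MIDR_PARTNUM_MASK       (0xfffu << MIDR_PARTNUM_SHIFT)
#define MIDR_ARCHITECTURE_SHIFT 16
#define MIDR_ARCHITECTURE_MASK  (0xfu << MIDR_ARCHITECTURE_SHIFT)
#define MIDR_VARIANT_SHIFT      20
#define MIDR_VARIANT_MASK       (0xfu << MIDR_VARIANT_SHIFT)
#define MIDR_IMPLEMENTOR_SHIFT  24
#define MIDR_IMPLEMENTOR_MASK   (0xffu << MIDR_IMPLEMENTOR_SHIFT)

#define MIDR_CPU_MODEL(imp, partnum) \
	(((imp) << MIDR_IMPLEMENTOR_SHIFT) | \
	 (0xfu << MIDR_ARCHITECTURE_SHIFT) | \
	 ((partnum) << MIDR_PARTNUM_SHIFT))
#define MIDR_CPU_MODEL_MASK \
	(MIDR_IMPLEMENTOR_MASK | MIDR_PARTNUM_MASK | MIDR_ARCHITECTURE_MASK)
#define MIDR_CPU_VAR_REV(var, rev)  (((var) << MIDR_VARIANT_SHIFT) | (rev))

/* Values taken from the patch: implementer 0x46, part number 0x001. */
#define MIDR_FUJITSU_A64FX  MIDR_CPU_MODEL(0x46u, 0x001u)

/* Hypothetical helper: same model, variant/revision within [rv_min, rv_max]. */
static bool midr_in_range(uint32_t midr, uint32_t model,
			  uint32_t rv_min, uint32_t rv_max)
{
	uint32_t cpu_model = midr & MIDR_CPU_MODEL_MASK;
	uint32_t rv = midr & (MIDR_REVISION_MASK | MIDR_VARIANT_MASK);

	return cpu_model == model && rv >= rv_min && rv <= rv_max;
}

/* Hypothetical helper: the v0r0..v1r0 range used by the errata entry. */
static bool a64fx_erratum_applies(uint32_t midr)
{
	return midr_in_range(midr, MIDR_FUJITSU_A64FX,
			     MIDR_CPU_VAR_REV(0u, 0u), MIDR_CPU_VAR_REV(1u, 0u));
}
```

Note that a range check of this shape also accepts intermediate values such
as v0r1, whereas a masked equality test only accepts the exact
variant/revision patterns that survive the mask.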

RE: [PATCH v2 1/1] arm64: Add workaround for Fujitsu A64FX erratum 010001

2019-01-29 Thread Zhang, Lei
Hi Catalin,
> -----Original Message-----
> From: linux-arm-kernel
> [mailto:linux-arm-kernel-boun...@lists.infradead.org] On Behalf Of
> Catalin Marinas
> Sent: Saturday, January 26, 2019 3:08 AM
> To: Zhang, Lei/張 雷
> Cc: 'Mark Rutland'; 'will.dea...@arm.com';
> 'linux-kernel@vger.kernel.org';
> 'linux-arm-ker...@lists.infradead.org'
> Subject: Re: [PATCH v2 1/1] arm64: Add workaround for Fujitsu A64FX
> erratum 010001
> 
> IIUC, this can happen very early when the errata framework isn't yet
> ready. Given that this is not on a fast path (you already took a fault),
> I don't think it's worth optimising for cpus_have_cap() (and
> ARM64_WORKAROUND_FUJITSU_A64FX_011). I've seen Mark's comments on
> why checking MIDR in a preemptible context is not a good idea but I
> suspect your platform is homogeneous (i.e. not big.LITTLE).
Thanks for your comment.
I will post a new patch today to resolve the fast path problem.
By the way, our platform is homogeneous.


Best Regards,
Lei Zhang
zhang@jp.fujitsu.com

 



RE: [PATCH] arm64 memory accesses may cause undefined fault on Fujitsu-A64FX

2019-01-24 Thread Zhang, Lei
Hi, Mark, James

> -----Original Message-----
> From: linux-arm-kernel
> [mailto:linux-arm-kernel-boun...@lists.infradead.org] On Behalf Of
> Zhang, Lei
> Sent: Wednesday, January 23, 2019 9:51 PM
> To: 'Mark Rutland'; 'james.mo...@arm.com'
> Cc: 'catalin.mari...@arm.com'; 'will.dea...@arm.com';
> 'linux-kernel@vger.kernel.org';
> 'linux-arm-ker...@lists.infradead.org'; Zhang, Lei/張 雷
> Subject: RE: [PATCH] arm64 memory accesses may cause undefined fault on
> Fujitsu-A64FX
> 
> First, thanks for your comments.
> I think James's comments are quite similar to the comment above.
> I am reviewing this point now.
> I will respond to your questions after I check this.
As you commented, this patch cannot avoid this problem
completely. So I will post a new patch that resolves this
problem in a different way.

Lei Zhang
zhang@jp.fujitsu.com




RE: [PATCH] arm64 memory accesses may cause undefined fault on Fujitsu-A64FX

2019-01-23 Thread Zhang, Lei
Hi, Mark, James
> -----Original Message-----
> From: Mark Rutland [mailto:mark.rutl...@arm.com]
> Sent: Wednesday, January 23, 2019 12:24 AM
> To: Zhang, Lei/張 雷
> Cc: 'catalin.mari...@arm.com'; 'will.dea...@arm.com';
> 'linux-arm-ker...@lists.infradead.org';
> 'linux-kernel@vger.kernel.org'
> Subject: Re: [PATCH] arm64 memory accesses may cause undefined fault on
> Fujitsu-A64FX
> 
 
> As above, I'm very concerned that this could be taken from kernel
> context. There are a number of cases where we cannot handle such faults:
First, thanks for your comments.
I think James's comments are quite similar to the comment above.
I am reviewing this point now.
I will respond to your questions after I check this.

Thanks
Lei Zhang




[PATCH v2 0/1] arm64: Add workaround for Fujitsu A64FX erratum 010001

2019-01-22 Thread Zhang, Lei
On some variants of the Fujitsu-A64FX cores ver(1.0, 1.1),
memory accesses may cause an undefined fault (Data abort, DFSC=0b11).
This problem will be fixed in the next version of Fujitsu-A64FX.
I would like to post a workaround to avoid this problem on the existing version.
The workaround is to replace the fault handler for Data abort
DFSC=0b11 with a new one that ignores this undefined fault,
which will only affect the Fujitsu-A64FX.

Details of this problem:
> * Under what conditions can the fault occur? e.g. is this in place of
>   some other fault, or completely spurious?
This fault can occur completely spuriously under a
specific hardware condition and instruction order.
 
> * Does this only occur for data abort? i.e. not instruction aborts?
Yes. This fault only occurs for data aborts.

> * How often does this fault occur?
In my test, this fault occurs once in every several runs
of the OS boot sequence, and after the completion of OS boot,
this fault has never occurred.
In my opinion, this fault rarely occurs after the completion of OS boot.

> * Does this only apply to Stage-1, or can the same faults be taken at
>   Stage-2?
This fault can be taken only at Stage-1.

> I'm a bit surprised by the single retry. Is there any guarantee that a
> thread will eventually stop delivering this fault code?
I guarantee that, with this patch, a thread will stop delivering this fault
code.
The hardware condition which causes this fault is reset at exception entry,
therefore execution of at least one instruction is
guaranteed by this single retry.

Changes since [v1]
Following Mark's review:

 * Adopted errata framework.

I have confirmed the following:
 * Fujitsu A64FX - The problem doesn't happen.
 * QEMU  - No problems to boot.

I would greatly appreciate it if someone could test this patch on different
chips to verify that it has no harmful effect on them.

If there is no problem on other chips, please merge this patch.

The patch is based on linux-5.0-rc2.

Zhang Lei (1):
  arm64: Add workaround for Fujitsu A64FX erratum 010001.

 Documentation/arm64/silicon-errata.txt |  1 +
 arch/arm64/Kconfig | 13 +
 arch/arm64/include/asm/cpucaps.h   |  3 ++-
 arch/arm64/include/asm/cputype.h   |  4 
 arch/arm64/kernel/cpu_errata.c |  8 
 arch/arm64/mm/fault.c  | 24 +++-
 6 files changed, 51 insertions(+), 2 deletions(-)

-- 
1.8.3.1


[PATCH v2 1/1] arm64: Add workaround for Fujitsu A64FX erratum 010001

2019-01-22 Thread Zhang, Lei
On some variants of the Fujitsu-A64FX cores (versions 1.0 and 1.1),
memory accesses may cause an undefined fault (Data abort,
DFSC=0b11) due to a CPU erratum (Fujitsu #010001).

This patch introduces a workaround for the problem.
The workaround changes the fault handler for Data abort
DFSC=0b11 to ignore this undefined fault; it affects only
the Fujitsu-A64FX.

Signed-off-by: Lei Zhang 
Tested-by: Lei Zhang 
---
 Documentation/arm64/silicon-errata.txt |  1 +
 arch/arm64/Kconfig | 13 +
 arch/arm64/include/asm/cpucaps.h   |  3 ++-
 arch/arm64/include/asm/cputype.h   |  4 
 arch/arm64/kernel/cpu_errata.c |  8 
 arch/arm64/mm/fault.c  | 24 +++-
 6 files changed, 51 insertions(+), 2 deletions(-)

diff --git a/Documentation/arm64/silicon-errata.txt b/Documentation/arm64/silicon-errata.txt
index 1f09d04..26d64e9 100644
--- a/Documentation/arm64/silicon-errata.txt
+++ b/Documentation/arm64/silicon-errata.txt
@@ -80,3 +80,4 @@ stable kernels.
 | Qualcomm Tech. | Falkor v1       | E1009           | QCOM_FALKOR_ERRATUM_1009    |
 | Qualcomm Tech. | QDF2400 ITS     | E0065           | QCOM_QDF2400_ERRATUM_0065   |
 | Qualcomm Tech. | Falkor v{1,2}   | E1041           | QCOM_FALKOR_ERRATUM_1041    |
+| Fujitsu        | A64FX           | E#010001        | FUJITSU_ERRATUM_010001      |
diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index a4168d3..9c09b2b 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -643,6 +643,19 @@ config QCOM_FALKOR_ERRATUM_E1041
 
  If unsure, say Y.
 
+config FUJITSU_ERRATUM_010001
+   bool "Fujitsu-A64FX erratum E#010001: Undefined fault may occur wrongly"
+   default y
+   help
+ This option adds workaround for Fujitsu-A64FX erratum E#010001.
+ On some variants of the Fujitsu-A64FX cores ver(1.0, 1.1), memory accesses
+ may cause undefined fault (Data abort, DFSC=0b11).
+ The workaround is to replace the fault handler for Data abort DFSC=0b11
+ with a new one to ignore this undefined fault, which will only affect
+ the Fujitsu-A64FX.
+
+ If unsure, say Y.
+
 endmenu
 
 
diff --git a/arch/arm64/include/asm/cpucaps.h b/arch/arm64/include/asm/cpucaps.h
index 82e9099..3a0b375 100644
--- a/arch/arm64/include/asm/cpucaps.h
+++ b/arch/arm64/include/asm/cpucaps.h
@@ -60,7 +60,8 @@
 #define ARM64_HAS_ADDRESS_AUTH_IMP_DEF 39
 #define ARM64_HAS_GENERIC_AUTH_ARCH 40
 #define ARM64_HAS_GENERIC_AUTH_IMP_DEF 41
+#define ARM64_WORKAROUND_FUJITSU_A64FX_011 42
 
-#define ARM64_NCAPS 42
+#define ARM64_NCAPS 43
 
 #endif /* __ASM_CPUCAPS_H */
diff --git a/arch/arm64/include/asm/cputype.h b/arch/arm64/include/asm/cputype.h
index 951ed1a..70203f9 100644
--- a/arch/arm64/include/asm/cputype.h
+++ b/arch/arm64/include/asm/cputype.h
@@ -76,6 +76,7 @@
 #define ARM_CPU_IMP_BRCM   0x42
 #define ARM_CPU_IMP_QCOM   0x51
 #define ARM_CPU_IMP_NVIDIA 0x4E
+#define ARM_CPU_IMP_FUJITSU   0x46
 
 #define ARM_CPU_PART_AEM_V8   0xD0F
 #define ARM_CPU_PART_FOUNDATION   0xD00
@@ -104,6 +105,8 @@
 #define NVIDIA_CPU_PART_DENVER 0x003
 #define NVIDIA_CPU_PART_CARMEL 0x004
 
+#define FUJITSU_CPU_PART_A64FX 0x001
+
 #define MIDR_CORTEX_A53 MIDR_CPU_MODEL(ARM_CPU_IMP_ARM, ARM_CPU_PART_CORTEX_A53)
 #define MIDR_CORTEX_A57 MIDR_CPU_MODEL(ARM_CPU_IMP_ARM, ARM_CPU_PART_CORTEX_A57)
 #define MIDR_CORTEX_A72 MIDR_CPU_MODEL(ARM_CPU_IMP_ARM, ARM_CPU_PART_CORTEX_A72)
@@ -122,6 +125,7 @@
 #define MIDR_QCOM_KRYO MIDR_CPU_MODEL(ARM_CPU_IMP_QCOM, QCOM_CPU_PART_KRYO)
 #define MIDR_NVIDIA_DENVER MIDR_CPU_MODEL(ARM_CPU_IMP_NVIDIA, NVIDIA_CPU_PART_DENVER)
 #define MIDR_NVIDIA_CARMEL MIDR_CPU_MODEL(ARM_CPU_IMP_NVIDIA, NVIDIA_CPU_PART_CARMEL)
+#define MIDR_FUJITSU_A64FX MIDR_CPU_MODEL(ARM_CPU_IMP_FUJITSU, FUJITSU_CPU_PART_A64FX)
 
 #ifndef __ASSEMBLY__
 
diff --git a/arch/arm64/kernel/cpu_errata.c b/arch/arm64/kernel/cpu_errata.c
index 9950bb0..fc0737f 100644
--- a/arch/arm64/kernel/cpu_errata.c
+++ b/arch/arm64/kernel/cpu_errata.c
@@ -739,6 +739,14 @@ static bool has_ssbd_mitigation(const struct arm64_cpu_capabilities *entry,
ERRATA_MIDR_RANGE(MIDR_CORTEX_A76, 0, 0, 2, 0),
},
 #endif
+#ifdef CONFIG_FUJITSU_ERRATUM_010001
+   {
+   .desc = "Fujitsu erratum 010001",
+   .capability = ARM64_WORKAROUND_FUJITSU_A64FX_011,
+   ERRATA_MIDR_RANGE(MIDR_FUJITSU_A64FX, 0, 0, 1, 0),
+   },
+#endif
+
{
}
 };
diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
index efb7b2c..37e4f18 100644
--- a/arch/arm64/mm/fault.c
+++ b/arch/arm64/mm/fault.c
@@ -666,6 +666,28 @@ static int do_sea(unsigned long addr, unsigned int esr, struct pt_regs *regs)
return 0;
 }
 
+static int do_bad_unkn

RE: [PATCH] arm64 memory accesses may cause undefined fault on Fujitsu-A64FX

2019-01-21 Thread Zhang, Lei
Hi, Mark

Thanks for your comments, and sorry for the late reply.

> -Original Message-
> * Under what conditions can the fault occur? e.g. is this in place of
>   some other fault, or completely spurious?
This fault is completely spurious. It occurs under
a specific hardware condition and instruction order.
 
> * Does this only occur for data abort? i.e. not instruction aborts?
Yes. This fault only occurs for data abort.

> * How often does this fault occur?
In my tests, this fault occurs about once in every several runs
of the OS boot sequence, and after the completion of OS boot
it has never occurred.
In my opinion, this fault rarely occurs
after the completion of OS boot.

> * Does this only apply to Stage-1, or can the same faults be taken at
>   Stage-2?
This fault can be taken only at Stage-1.

> I'm a bit surprised by the single retry. Is there any guarantee that a
> thread will eventually stop delivering this fault code?
Yes. With this patch, a thread is guaranteed to stop
delivering this fault code.
The hardware condition that causes this fault is
reset at exception entry, so the single retry guarantees
that at least one instruction is executed.

> Note that all CPUs and threads share the do_bad_ignore_first variable,
> so this is going to behave non-deterministically and kill threads in
> some cases.
> 
> This code is also preemptible, so checking the MIDR here doesn't make
> much sense. Either this is always uniform (and we can check once in the
> errata framework), or it's variable (e.g. on a big.LITTLE system) and we
> need to avoid preemption up until this point.
> 
> Rather than dynamically checking the MIDR, this should use the errata
> framework, and if any A64FX CPU is discovered, set an erratum cap like
> ARM64_WORKAROUND_CONFIG_FUJITSU_ERRATUM_010001, so we can do something
> like:
I will try to provide a new patch that reflects your comments today.
Unfortunately, this fault may occur before
init_cpu_hwcaps_indirect_list() is called.
This means the erratum cap may not be available yet. I am trying to
figure out the best way to resolve this problem.
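
For reference, the MIDR range check discussed above can be sketched in
user space as follows. This is an illustrative model, not the kernel
implementation: midr_model(), var_rev() and midr_in_range() are hypothetical
helpers, the field layout follows the architectural MIDR_EL1 definition, and
the implementer and part number values are the ones used in the patch.
Setting the architecture field to 0xf is an assumption that mirrors how the
kernel builds its model constants.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* MIDR_EL1 field layout (architectural). */
#define MIDR_REVISION_MASK     0x0000000fu  /* bits [3:0]   */
#define MIDR_PARTNUM_MASK      0x0000fff0u  /* bits [15:4]  */
#define MIDR_ARCH_MASK         0x000f0000u  /* bits [19:16] */
#define MIDR_VARIANT_MASK      0x00f00000u  /* bits [23:20] */
#define MIDR_IMPLEMENTOR_MASK  0xff000000u  /* bits [31:24] */

#define IMP_FUJITSU  0x46u   /* implementer code from the patch */
#define PART_A64FX   0x001u  /* part number from the patch */

/* Hypothetical helper: build a model value with variant/revision zeroed. */
uint32_t midr_model(uint32_t imp, uint32_t part)
{
	return (imp << 24) | (0xfu << 16) | (part << 4);
}

/* Hypothetical helper: pack a variant/revision pair, e.g. v1r0. */
uint32_t var_rev(uint32_t var, uint32_t rev)
{
	return (var << 20) | rev;
}

/* True if midr is the given model and its variant/revision is in range. */
bool midr_in_range(uint32_t midr, uint32_t model,
		   uint32_t rv_min, uint32_t rv_max)
{
	uint32_t model_mask = MIDR_IMPLEMENTOR_MASK | MIDR_ARCH_MASK |
			      MIDR_PARTNUM_MASK;
	uint32_t rv = midr & (MIDR_VARIANT_MASK | MIDR_REVISION_MASK);

	return (midr & model_mask) == model && rv >= rv_min && rv <= rv_max;
}
```

With the range v0r0 to v1r0, both A64FX v0r0 and v1r0 match, while v1r1
and any other CPU model do not. Note that a range expressed this way also
covers every revision of variant 0.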

---
Best regards,
Lei Zhang
zhang@jp.fujitsu.com



[PATCH] arm64 memory accesses may cause undefined fault on Fujitsu-A64FX

2019-01-18 Thread Zhang, Lei
On some variants of the Fujitsu-A64FX cores (versions 1.0 and 1.1),
memory accesses may cause an undefined fault (Data abort, DFSC=0b11).
This problem will be fixed in the next version of the Fujitsu-A64FX.
I would like to post a workaround that avoids this problem
on existing versions.
The workaround replaces the fault handler for Data abort
DFSC=0b11 with a new one that ignores this undefined fault;
it affects only the Fujitsu-A64FX.

I have tested this patch on A64FX and QEMU (2.9.0). The tests passed.
I will test this patch on ThunderX and report the result.
I would greatly appreciate it if someone could test this patch on different
chips to verify that it has no harmful effect on them.

If there is no problem on other chips, please merge this patch.

Below is my patch, based on linux-5.0-rc2.

Signed-off-by: Lei Zhang 
Tested-by: Lei Zhang 
---
 Documentation/arm64/silicon-errata.txt |1 +
 arch/arm64/Kconfig |   13 +
 arch/arm64/include/asm/cputype.h   |4 
 arch/arm64/mm/fault.c  |   23 +++
 4 files changed, 41 insertions(+)

diff --git a/Documentation/arm64/silicon-errata.txt b/Documentation/arm64/silicon-errata.txt
index 1f09d04..26d64e9 100644
--- a/Documentation/arm64/silicon-errata.txt
+++ b/Documentation/arm64/silicon-errata.txt
@@ -80,3 +80,4 @@ stable kernels.
 | Qualcomm Tech. | Falkor v1       | E1009           | QCOM_FALKOR_ERRATUM_1009    |
 | Qualcomm Tech. | QDF2400 ITS     | E0065           | QCOM_QDF2400_ERRATUM_0065   |
 | Qualcomm Tech. | Falkor v{1,2}   | E1041           | QCOM_FALKOR_ERRATUM_1041    |
+| Fujitsu        | A64FX           | E#010001        | FUJITSU_ERRATUM_010001      |
diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index a4168d3..9c09b2b 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -643,6 +643,19 @@ config QCOM_FALKOR_ERRATUM_E1041
 
  If unsure, say Y.
 
+config FUJITSU_ERRATUM_010001
+   bool "Fujitsu-A64FX erratum E#010001: Undefined fault may occur wrongly"
+   default y
+   help
+ This option adds workaround for Fujitsu-A64FX erratum E#010001.
+ On some variants of the Fujitsu-A64FX cores ver(1.0, 1.1), memory accesses
+ may cause undefined fault (Data abort, DFSC=0b11).
+ The workaround is to replace the fault handler for Data abort DFSC=0b11
+ with a new one to ignore this undefined fault, which will only affect
+ the Fujitsu-A64FX.
+
+ If unsure, say Y.
+
 endmenu
 
 
diff --git a/arch/arm64/include/asm/cputype.h b/arch/arm64/include/asm/cputype.h
index 951ed1a..166aa50 100644
--- a/arch/arm64/include/asm/cputype.h
+++ b/arch/arm64/include/asm/cputype.h
@@ -76,6 +76,7 @@
 #define ARM_CPU_IMP_BRCM   0x42
 #define ARM_CPU_IMP_QCOM   0x51
 #define ARM_CPU_IMP_NVIDIA 0x4E
+#define ARM_CPU_IMP_FUJITSU   0x46
 
 #define ARM_CPU_PART_AEM_V8   0xD0F
 #define ARM_CPU_PART_FOUNDATION   0xD00
@@ -104,6 +105,8 @@
 #define NVIDIA_CPU_PART_DENVER 0x003
 #define NVIDIA_CPU_PART_CARMEL 0x004
 
+#define FUJITSU_CPU_PART_A64FX 0x001
+
 #define MIDR_CORTEX_A53 MIDR_CPU_MODEL(ARM_CPU_IMP_ARM, ARM_CPU_PART_CORTEX_A53)
 #define MIDR_CORTEX_A57 MIDR_CPU_MODEL(ARM_CPU_IMP_ARM, ARM_CPU_PART_CORTEX_A57)
 #define MIDR_CORTEX_A72 MIDR_CPU_MODEL(ARM_CPU_IMP_ARM, ARM_CPU_PART_CORTEX_A72)
@@ -122,6 +125,7 @@
 #define MIDR_QCOM_KRYO MIDR_CPU_MODEL(ARM_CPU_IMP_QCOM, QCOM_CPU_PART_KRYO)
 #define MIDR_NVIDIA_DENVER MIDR_CPU_MODEL(ARM_CPU_IMP_NVIDIA, NVIDIA_CPU_PART_DENVER)
 #define MIDR_NVIDIA_CARMEL MIDR_CPU_MODEL(ARM_CPU_IMP_NVIDIA, NVIDIA_CPU_PART_CARMEL)
+#define MIDR_FUJITSU_A64FX MIDR_CPU_MODEL(ARM_CPU_IMP_FUJITSU, FUJITSU_CPU_PART_A64FX)
 
 #ifndef __ASSEMBLY__
 
diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
index efb7b2c..c465b2f 100644
--- a/arch/arm64/mm/fault.c
+++ b/arch/arm64/mm/fault.c
@@ -666,6 +666,25 @@ static int do_sea(unsigned long addr, unsigned int esr, struct pt_regs *regs)
return 0;
 }
 
+static bool do_bad_ignore_first = false;
+static int do_bad_ignore(unsigned long addr, unsigned int esr, struct pt_regs *regs)
+{
+	if (do_bad_ignore_first == true)
+		return 0;
+	if (do_bad_ignore_first == false) {
+		unsigned int current_cpu_midr = read_cpuid_id();
+		const struct midr_range fujitsu_a64fx_midr_range = {
+			MIDR_FUJITSU_A64FX, MIDR_CPU_VAR_REV(0, 0), MIDR_CPU_VAR_REV(1, 0)
+		};
+
+		if (is_midr_in_range(current_cpu_midr, &fujitsu_a64fx_midr_range) == true) {
+			do_bad_ignore_first = true;
+			return 0;
+		}
+	}
+	return 1; /* "fault" same as do_bad */
+}
+
 static const struct fault_info fault_info[] = {
{ do_bad,   SIGKILL, SI_KERNEL, "ttbr address size fault"   },
{ do_bad,

RE: [PATCH 00/10] GICv3 support for kexec/kdump on EFI systems

2018-09-27 Thread Zhang, Lei
Hi Marc

> -Original Message-
> From: linux-arm-kernel
> [mailto:linux-arm-kernel-boun...@lists.infradead.org] On Behalf Of Marc
> Zyngier
> Sent: Saturday, September 22, 2018 5:00 AM
> To: linux-kernel@vger.kernel.org; linux-arm-ker...@lists.infradead.org
> Cc: Jeffrey Hugo; Thomas Gleixner; Jason Cooper; Jeremy Linton; Ard
> Biesheuvel
> Subject: [PATCH 00/10] GICv3 support for kexec/kdump on EFI systems
> 
> The GICv3 architecture has the remarkable feature that once LPI tables
> have been assigned to redistributors and that LPI delivery is enabled,
> there is no guarantee that LPIs can be turned off (and most
> implementations do not allow it), nor can it be reprogrammed to use
> other tables.
> 
> This is a bit of a problem for kexec, where the secondary kernel
> completely looses track of the previous allocations. If the secondary
> kernel doesn't allocate the tables exactly the same way, no LPIs will
> be delivered by the GIC (which continues to use the old tables), and
> memory previously allocated for the pending tables will be slowly
> corrupted, one bit at a time.
> 
> The workaround for this is based on a series[1] by Ard Biesheuvel,
> which adds the required infrastructure for memory reservations to be
> passed from one kernel to another using an EFI table.
> 
> This infrastructure is then used to register the allocation of GIC
> tables with EFI, and allow the GIC driver to safely reuse the existing
> programming if it detects that the tables have been correctly
> registered. On non-EFI systems, there is not much we can do.
> 
> This has been tested on a TX2 system both as a host and a guest. I'd
> welcome additional testing of different HW. For convenience, I've
> stashed a branch containing the whole thing at [2].

We have tested this on our A64FX chip, on which the EnableLPIs bit
becomes RES1 once a write changes it from 0 to 1.
The result is that the kexec operation works successfully on our chip,
and PCI devices that rely on LPIs also work after kexec.

Details:
We ran the "kexec -e" command, and the message "Using preallocated
redistributor tables" was shown.
After kexec, we can use our SSD normally.
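
To make the behaviour we tested more concrete, here is a small user-space
model in C. It is only a sketch under stated assumptions, not the driver
code: struct rdist_model, rdist_write_ctlr() and choose_lpi_setup() are
hypothetical names, and the LPI_UNUSABLE outcome is our reading of the
cover letter's remark that little can be done when the tables were not
registered.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define GICR_CTLR_ENABLE_LPIS 0x1u  /* EnableLPIs is bit 0 of GICR_CTLR */

/* Hypothetical model of one redistributor as seen by the new kernel. */
struct rdist_model {
	uint32_t ctlr;           /* modelled GICR_CTLR value */
	bool tables_reserved;    /* previous kernel registered its LPI tables */
};

/*
 * Write model for an implementation where EnableLPIs becomes RES1
 * after a 0 -> 1 transition: later writes cannot clear the bit.
 */
void rdist_write_ctlr(struct rdist_model *rd, uint32_t val)
{
	uint32_t sticky = rd->ctlr & GICR_CTLR_ENABLE_LPIS;

	rd->ctlr = val | sticky;
}

enum lpi_setup {
	LPI_ALLOC_NEW,            /* LPIs are off: allocate fresh tables */
	LPI_REUSE_PREALLOCATED,   /* LPIs are on and tables were registered */
	LPI_UNUSABLE              /* LPIs are on but tables are unknown (assumption) */
};

/* Decide what the kexec'd kernel can do with this redistributor. */
enum lpi_setup choose_lpi_setup(const struct rdist_model *rd)
{
	if (!(rd->ctlr & GICR_CTLR_ENABLE_LPIS))
		return LPI_ALLOC_NEW;
	return rd->tables_reserved ? LPI_REUSE_PREALLOCATED : LPI_UNUSABLE;
}
```

The LPI_REUSE_PREALLOCATED case corresponds to the "Using preallocated
redistributor tables" message we observed after kexec.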

Test environment
CPU: A64FX

Kernel version: based on v4.19-rc4
https://git.kernel.org/pub/scm/linux/kernel/git/maz/arm-platforms.git/log/?h=irq/gicv3-kdump
8bc67da irqchip/gic-v3-its: Allow use of LPI tables in reserved memory

kexec version: kexec-tools-2.0.14-17.2.el7.aarch64

Tested-by: Lei Zhang 

Thanks a lot.

Best Regards,
Lei,Zhang
FUJITSU LIMITED.




RE: [PATCH 0/7] irqchip/gic-v3: LPI allocation refactoring

2018-07-17 Thread Zhang, Lei
Hi Marc

These patches are necessary for our device; thank you very much for them.
We have tested your patches on our prototype CPU chip.
All tests PASSED.

Details of our tests are below.
We tested two points.
point 1: No regression for existing devices such as NVMe and network interface cards.
what we did: iozone benchmark on NVMe, ssh command.
point 2: Our original device works well.
what we did: Ran the test set for our original device.

We have also reviewed the patches and found no problems.
However, we found a grammatical mistake in your comments.
> + * The consequence of the above is that allocation is cost is low, but
I propose the following correction.

+ * The consequence of the above is that allocation cost is low, but

Best Regards,
Lei Zhang

> -Original Message-
> From: Marc Zyngier [mailto:marc.zyng...@arm.com]
> Sent: Wednesday, June 20, 2018 10:52 PM
> To: linux-kernel@vger.kernel.org
> Cc: Thomas Gleixner; Ard Biesheuvel; Shanker Donthineni; Shameer
> Kolothum; MaJun; Laurentiu Tudor; Zhang, Lei/張 雷
> Subject: [PATCH 0/7] irqchip/gic-v3: LPI allocation refactoring
> 
> The GICv3 LPI allocator has served us well so far, but a number of new
> use cases have recently showed up:
> 
> - A new extension to the GICv3 architecture allows a hypervisor to
>   dramatically restrict the range of available LPIs. This means that
>   our current policy of allocating LPIs in blocks of 32 may quickly
>   deplete the number of devices that get LPIs
> 
> - New and currently undisclosed busses seem to come with thousands of
>   devices, each requiring a single LPI. Again, our current allocation
>   policy means they quickly run out of LPIs.
> 
> Simply expanding the bitmap doesn't seem to be a great idea, so let's
> change the LPI allocator altogether. This means we can move individual
> busses to a more minimal allocation scheme, though we only do it for
> PCI at the moment (Platform MSI looks like the Far West, and I'm
> clueless about the FSL MC thing).
> 
> This is a pretty invasive change, and I'm thus cc'ing the usual
> suspects that have access to weird and wonderful HW to verify
> everything still works as expected, and let me know if we can relax
> the allocation for their own pet bus implementation.
> 
> Only lightly tested in a KVM guest (PCI).
> 
> 
> Marc Zyngier (7):
>   irqchip/gic-v3-its: Refactor LPI allocator
>   irqchip/gic-v3-its: Use full range of LPIs
>   irqchip/gic-v3-its: Move minimum LPI requirements to individual busses
>   irqchip/gic-v3-its: Drop chunk allocation compatibility
>   irqchip/gic-v3: Expose GICD_TYPER in the rdist structure
>   irqchip/gic-v3-its: Honor hypervisor enforced LPI range
>   irqchip/gic-v3-its: Reduce minimum LPI allocation to 1 for PCI devices
> 
>  drivers/irqchip/irq-gic-v3-its-fsl-mc-msi.c   |   3 +
>  drivers/irqchip/irq-gic-v3-its-pci-msi.c  |  16 +-
>  drivers/irqchip/irq-gic-v3-its-platform-msi.c |   2 +
>  drivers/irqchip/irq-gic-v3-its.c  | 225
> --
>  drivers/irqchip/irq-gic-v3.c  |   4 +-
>  include/linux/irqchip/arm-gic-v3.h|   3 +-
>  6 files changed, 169 insertions(+), 84 deletions(-)
> 
> --
> 2.17.1
> 
>