[PATCH 2/4] Documentation: x86: Remove cdpl2 unspported statement and fix capitalisation

2019-06-07 Thread James Morse
"L2 cache does not support code and data prioritization". This isn't
true, elsewhere the document says it can be enabled with the cdpl2
mount option.

While we're here, these sample strings have lower-case code/data,
which isn't how the kernel exports them.

Signed-off-by: James Morse 
---
 Documentation/x86/resctrl_ui.rst | 14 ++
 1 file changed, 10 insertions(+), 4 deletions(-)

diff --git a/Documentation/x86/resctrl_ui.rst b/Documentation/x86/resctrl_ui.rst
index 066f94e53418..638cd987937d 100644
--- a/Documentation/x86/resctrl_ui.rst
+++ b/Documentation/x86/resctrl_ui.rst
@@ -418,16 +418,22 @@ L3 schemata file details (CDP enabled via mount option to 
resctrl)
 When CDP is enabled L3 control is split into two separate resources
 so you can specify independent masks for code and data like this::
 
-   L3data:=;=;...
-   L3code:=;=;...
+   L3DATA:=;=;...
+   L3CODE:=;=;...
 
 L2 schemata file details
 
-L2 cache does not support code and data prioritization, so the
-schemata format is always::
+CDP is supported at L2 using the 'cdpl2' mount option. The schemata
+format is either::
 
L2:=;=;...
 
+or
+
+   L2DATA:=;=;...
+   L2CODE:=;=;...
+
+
 Memory bandwidth Allocation (default mode)
 --
 
-- 
2.20.1



[PATCH 3/4] Documentation: x86: Clarify MBA takes MB as referring to mba_sc

2019-06-07 Thread James Morse
"If the MBA is specified in MB then user can enter the max b/w in MB"
is a tautology. How can the user know if the schemata takes a percentage
or a MB/s value?

This is referring to whether the software controller is interpreting
the schemata's value. Make this clear.

Signed-off-by: James Morse 
---
 Documentation/x86/resctrl_ui.rst | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/Documentation/x86/resctrl_ui.rst b/Documentation/x86/resctrl_ui.rst
index 638cd987937d..866b66aa289b 100644
--- a/Documentation/x86/resctrl_ui.rst
+++ b/Documentation/x86/resctrl_ui.rst
@@ -677,8 +677,8 @@ allocations can overlap or not. The allocations specifies 
the maximum
 b/w that the group may be able to use and the system admin can configure
 the b/w accordingly.
 
-If the MBA is specified in MB(megabytes) then user can enter the max b/w in MB
-rather than the percentage values.
+If resctrl is using the software controller (mba_sc) then user can enter the
+max b/w in MB rather than the percentage values.
 ::
 
   # echo "L3:0=3;1=c\nMB:0=1024;1=500" > /sys/fs/resctrl/p0/schemata
-- 
2.20.1



[PATCH 4/4] Documentation: x86: fix some typos

2019-06-07 Thread James Morse
These are all obvious typos.

Signed-off-by: James Morse 
---
 Documentation/x86/resctrl_ui.rst | 10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/Documentation/x86/resctrl_ui.rst b/Documentation/x86/resctrl_ui.rst
index 866b66aa289b..5368cedfb530 100644
--- a/Documentation/x86/resctrl_ui.rst
+++ b/Documentation/x86/resctrl_ui.rst
@@ -40,7 +40,7 @@ mount options are:
Enable the MBA Software Controller(mba_sc) to specify MBA
bandwidth in MBps
 
-L2 and L3 CDP are controlled seperately.
+L2 and L3 CDP are controlled separately.
 
 RDT features are orthogonal. A particular system may support only
 monitoring, only control, or both monitoring and control.  Cache
@@ -118,7 +118,7 @@ related to allocation:
  Corresponding region is pseudo-locked. No
  sharing allowed.
 
-Memory bandwitdh(MB) subdirectory contains the following files
+Memory bandwidth(MB) subdirectory contains the following files
 with respect to allocation:
 
 "min_bandwidth":
@@ -209,7 +209,7 @@ All groups contain the following files:
CPUs to/from this group. As with the tasks file a hierarchy is
maintained where MON groups may only include CPUs owned by the
parent CTRL_MON group.
-   When the resouce group is in pseudo-locked mode this file will
+   When the resource group is in pseudo-locked mode this file will
only be readable, reflecting the CPUs associated with the
pseudo-locked region.
 
@@ -380,7 +380,7 @@ where L2 external  is 10GBps (hence aggregate L2 external 
bandwidth is
 240GBps) and L3 external bandwidth is 100GBps. Now a workload with '20
 threads, having 50% bandwidth, each consuming 5GBps' consumes the max L3
 bandwidth of 100GBps although the percentage value specified is only 50%
-<< 100%. Hence increasing the bandwidth percentage will not yeild any
+<< 100%. Hence increasing the bandwidth percentage will not yield any
 more bandwidth. This is because although the L2 external bandwidth still
 has capacity, the L3 external bandwidth is fully used. Also note that
 this would be dependent on number of cores the benchmark is run on.
@@ -398,7 +398,7 @@ In order to mitigate this and make the interface more user 
friendly,
 resctrl added support for specifying the bandwidth in MBps as well.  The
 kernel underneath would use a software feedback mechanism or a "Software
 Controller(mba_sc)" which reads the actual bandwidth using MBM counters
-and adjust the memowy bandwidth percentages to ensure::
+and adjust the memory bandwidth percentages to ensure::
 
"actual bandwidth < user specified bandwidth".
 
-- 
2.20.1



[PATCH 1/4] Documentation: x86: Contiguous cbm isn't all X86

2019-06-07 Thread James Morse
Since commit 4d05bf71f157 ("x86/resctrl: Introduce AMD QOS feature")
resctrl has supported non-contiguous cache bit masks. The interface
for this is currently try-it-and-see.

Update the documentation to say Intel CPUs have this requirement,
instead of X86.

Cc: Babu Moger 
Signed-off-by: James Morse 
---
 Documentation/x86/resctrl_ui.rst | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/Documentation/x86/resctrl_ui.rst b/Documentation/x86/resctrl_ui.rst
index 225cfd4daaee..066f94e53418 100644
--- a/Documentation/x86/resctrl_ui.rst
+++ b/Documentation/x86/resctrl_ui.rst
@@ -342,7 +342,7 @@ For cache resources we describe the portion of the cache 
that is available
 for allocation using a bitmask. The maximum value of the mask is defined
 by each cpu model (and may be different for different cache levels). It
 is found using CPUID, but is also provided in the "info" directory of
-the resctrl file system in "info/{resource}/cbm_mask". X86 hardware
+the resctrl file system in "info/{resource}/cbm_mask". Intel hardware
 requires that these masks have all the '1' bits in a contiguous block. So
 0x3, 0x6 and 0xC are legal 4-bit masks with two bits set, but 0x5, 0x9
 and 0xA are not.  On a system with a 20-bit mask each bit represents 5%
-- 
2.20.1



[PATCH 0/4] Documentation: x86: resctrl_ui.txt fixes and clarification

2019-06-07 Thread James Morse
While updating resctrl_ui.rst for arm64 support these things stuck
out as being unclear or no longer true.

(I've shortened the CC list to just RDT+Documentation)


Thanks,

James Morse (4):
  Documentation: x86: Contiguous cbm isn't all X86
  Documentation: x86: Remove cdpl2 unspported statement and fix
capitalisation
  Documentation: x86: Clarify MBA takes MB as referring to mba_sc
  Documentation: x86: fix some typos

 Documentation/x86/resctrl_ui.rst | 30 ++
 1 file changed, 18 insertions(+), 12 deletions(-)

-- 
2.20.1



Re: [RFC RESEND PATCH] kvm: arm64: export memory error recovery capability to user space

2018-12-21 Thread James Morse
Hi Peter,

On 19/12/2018 19:02, Peter Maydell wrote:
> On Mon, 17 Dec 2018 at 15:56, James Morse  wrote:
>> I don't think this really matters. Its only the NMIlike notifications that 
>> the
>> guest doesn't have to register or poll. The ones we support today extend the
>> architectures existing behaviour: you would have taken an external-abort on a
>> real system, whether you know about the additional metadata doesn't matter 
>> to Qemu.
> 
> Consider the case where we booted the guest using a DTB and no ACPI
> table at all -- we certainly can't just call QEMU code that tries to
> add entries to a nonexistent table.

Sure, because you know which of the two sets of firmware-table you're providing.

I'm taking the behaviour of physical machines as the template for what we should
do here. I can boot a DT-only kernel on Seattle. Firmware has no idea I did
this, it will still take DRAM uncorrected-error IRQs in firmware, and generate
CPER records in the POLLed areas. But the kernel will never look, because it
booted with DT.
What happens if the kernel goes on to access the corrupt location? It either
gets corrupt values back, or an external abort, depending on the design of the
memory-controller.

X-gene uses an IRQ for its firmware-first notification. Booted with DT that
interrupt can be asserted, but as the OS has didn't know to register it, its
never taken. We eventually get the same corrupt-values/external-abort behaviour.

KVM/Linux is acting as the memory controller using stage2. When an error is
detected by the host it unmaps the page from stage2, and refuses to map it again
until its fixed up in Qemu's memory map (which can happen automatically). If the
kernel can't fix it itself, the AO signal is like the DRAM-IRQ above, and the AR
like the external abort.
We don't have a parallel to the 'gets corrupt values back' behaviour as Linux
will always unmap hwpoison pages from user-space/guests.

If the host-kernel wasn't build with CONFIG_MEMORY_FAILURE, its like the memory
controller doesn't support any of the above. I think knowing this is the closest
to what you want.


> My main point is that there
> needs to be logic in Dongjiu's QEMU patches that checks more than
> just "does this KVM feature exist". I'm not sufficiently familiar
> with all this RAS stuff to be certain what those checks should
> be and what the right choices are; I just know we need to check
> *something*...

I think this is the crux of where we don't see this the same way.
The v8.2 RAS stuff is new, RAS support on arm64 is not. Kernel support arrived
at roughly the same time, but not CPU support. There are v8.0 systems that
support RAS. There are DT systems that can do the same with edac drivers.
The physical v8.0 systems that do this, are doing it without any extra CPU 
support.

I think x86's behaviour here includes some history, which we don't have.
>From the order of the HEST entries, it looks like the machine-check stuff came
first, then firmware-first using a 'GHES' entry in that table.
I think Qemu on x86 only supports the emulated machine check stuff, so it needs
to check KVM has the widget to do this.
If Qemu on x86 supported firmware-first, I don't think there would be anything
to check. (details below)


>>> Let us see the X86's QEMU logic:
>>> 1. Before the vCPU created, it will set a default env->mcg_cap value with
>>
>>> MCE_CAP_DEF flag, MCG_SER_P means it expected the guest CPU model supports
>>> RAS error recovery.[1] 2. when the vCPU initialize, it will check whether 
>>> host
>>> kernel support this feature[2]. Only when host kernel and default 
>>> env->mcg_cap
>>> value all expected this feature, then it will setup vCPU support RAS error
>>> recovery[3].
>>
>> This looks like KVM exposing a CPU capability to Qemu, which then configures 
>> the
>> behaviour KVM gives to the guest. This doesn't tell you anything about what 
>> the
>> guest supports.
> 
> It tells you what the *guest CPU* supports, which for x86 is a combination
> of (a) what did the user/machine model ask for and (b) what can KVM
> actually implement. I don't much care whether the guest OS supports
> anything or not, that's its business... but it does seem odd to me
> that the equivalent Arm code is not similarly saying "what were we
> asked for, and what can we do?".

The flow is something like:
For AO, generate CPER records, and notify the OS via NOTIFY_POLL (which isn't
really a notification) or some flavour of IRQ.
To do this, Qemu needs to be able to write to its reserved area of guest memory,
and possibly trigger an interrupt.

For AR, generate CPER records and noti

Re: [RFC RESEND PATCH] kvm: arm64: export memory error recovery capability to user space

2018-12-17 Thread James Morse
Hi Peter,

On 14/12/2018 14:33, Peter Maydell wrote:
> On Fri, 14 Dec 2018 at 13:56, James Morse  wrote:
>> On 14/12/2018 10:15, Dongjiu Geng wrote:
>>> When user space do memory recovery, it will check whether KVM and
>>> guest support the error recovery, only when both of them support,
>>> user space will do the error recovery. This patch exports this
>>> capability of KVM to user space.
>>
>> I can understand user-space only wanting to do the work if host and guest
>> support the feature. But 'error recovery' isn't a KVM feature, its a Linux
>> kernel feature.
>>
>> KVM will send it's user-space a SIGBUS with MCEERR code whenever its trying 
>> to
>> map a page at stage2 that the kernel-mm code refuses this because its 
>> poisoned.
>> (e.g. check_user_page_hwpoison(), get_user_pages() returns -EHWPOISON)
>>
>> This is exactly the same as happens to a normal user-space process.
>>
>> I think you really want to know if the host kernel was built with
>> CONFIG_MEMORY_FAILURE.
> 
> Does userspace need to care about that? Presumably if the host kernel
> wasn't built with that support then it will simply never deliver
> any memory failure events to QEMU, which is fine.

Aha, I thought this is what you wanted.
Always being prepared to handle the signals is the best choice.


> The point I was trying to make in the email Dongjiu references
> (https://patchwork.codeaurora.org/patch/652261/) is simply that
> "QEMU gets memory-failure notifications from the host kernel"
> does not imply "the guest is prepared to receive memory
> failure notifications", and so the code path which handles
> the SIGBUS must do some kind of check for whether the guest

> CPU is a type which expects them

I don't understand this bit.

The CPU support is just about barriers for containment and reporting a
standardised classification to software. Firmware-first replaces all this. It
doesn't depend on any CPU feature.
APM-X-Gene has firmware-first support, it uses some kind of external processor
that takes the error-interrupt from DRAM and generates CPER records, before
triggering the firmware-first notification.

> and that the board code
> set up the ACPI tables that it wants to fill in.
ACPI has some complex stuff around claiming 'platform-wide capabilities'. Qemu
could use this to know if the guest understands APEI.

Section 6.2.11.2 "Platform-Wide OSPM Capabilities" of ACPI v6.2 describes the
\_SB._OSC method, which has an APEI support bit. This is used in some kind of
handshake.

Linux does this during boot if its built with APEI GHES support. Linux seems to
think the APEI bit enables firmware-first:
| [   63.804907] GHES: APEI firmware first mode is enabled by APEI bit.

... but its not clear from the spec. (APEI is more than firmware-first)

(where do these things go? Platform AML in the DSDT)


I don't think this controls anything on a real system, (we've seen X-Gene
generate CPER records before Linux started booting), and I don't think it really
matters as 'what happens if the guest doesn't know' falls out of the way these
SIGBUS codes map back onto the firmware-first notifications:

For 'AO' signals you can dump CPER records in a NOTIFY_POLLed area. If the guest
doesn't care, it can avert is eyes. If you used one of the NOTIFY_$(interrupt)
types, the guest can not-register the interrupt.

The AR signals map to external-abort. On a firmware-first system EL3 takes
these, generates some extra metadata using CPER records in the agreed location,
and re-injects an emulated external-abort.
If Qemu takes an AR signal, this is effectively an external-abort, the page has
been accessed and the kernel will not map it because the page is poisoned. These
would have been an external-abort on a real system, its not a problem if the
guest doesn't know about the extra CPER metadata.

Centriq is an example of a system that does this external-abort+CPER-metadata
without the v8.2 CPU extensions.

All v8.0 CPUs have synchronous/asynchronous external abort, there is nothing new
going on here, its just extra metadata. (critically: the physical address of the
fault)


Thanks,

James


Re: [RFC RESEND PATCH] kvm: arm64: export memory error recovery capability to user space

2018-12-17 Thread James Morse
Hi gengdongjiu, Peter,

I think the root issue here is the name of the cpufeature 'RAS Extensions', this
doesn't mean RAS is new, or even requires these features. It's just standardised
records, classification and a barrier.
Not only is it possible to build a platform that supports RAS without this
extensions: there are at least three platforms out there that do!


On 15/12/2018 00:12, gengdongjiu wrote:
>> On Fri, 14 Dec 2018 at 13:56, James Morse  wrote:
>>> On 14/12/2018 10:15, Dongjiu Geng wrote:
>>>> When user space do memory recovery, it will check whether KVM and 
>>>> guest support the error recovery, only when both of them support, 
>>>> user space will do the error recovery. This patch exports this 
>>>> capability of KVM to user space.
>>>
>>> I can understand user-space only wanting to do the work if host and 
>>> guest support the feature. But 'error recovery' isn't a KVM feature, 
>>> its a Linux kernel feature.

[...]

> Thanks Peter's explanation. Frankly speaking, I agree Peter's suggestion.
> 
> To James, I explain more to you, as peter said QEMU needs to check whether
> the guest CPU is a type which can handle the error though guest ACPI table.

I don't think this really matters. Its only the NMIlike notifications that the
guest doesn't have to register or poll. The ones we support today extend the
architectures existing behaviour: you would have taken an external-abort on a
real system, whether you know about the additional metadata doesn't matter to 
Qemu.


> Let us see the X86's QEMU logic:
> 1. Before the vCPU created, it will set a default env->mcg_cap value with

> MCE_CAP_DEF flag, MCG_SER_P means it expected the guest CPU model supports
> RAS error recovery.[1] 2. when the vCPU initialize, it will check whether host
> kernel support this feature[2]. Only when host kernel and default env->mcg_cap
> value all expected this feature, then it will setup vCPU support RAS error
> recovery[3].

This looks like KVM exposing a CPU capability to Qemu, which then configures the
behaviour KVM gives to the guest. This doesn't tell you anything about what the
guest supports. This doesn't tell you if the host-kernel supports
memory_failure(). You can think of this as being equivalent to the VSESR_EL2
support. Just because the CPU has it doesn't mean the host or guest kernel have
been built to know what to do.

I test NOTIFY_SEA by injecting an address into memory_failure() using
CONFIG_HWPOISON_INJECT. This causes kvmtool to take an AR signal next time the
guest accesses the page, which then gets presented to the guest as an
external-abort, with the CPER records describing the abort created by kvmtool.
This is all on v8.0 hardware, nothing about the CPU is relevant here.


> -For James's 
> comments-
>> KVM doesn't detect these errors.
>> The hardware detects them and notifies the OS via one of a number of 
>> mechanisms.
>> This gets plumbed into memory_failure(), which sets a flag that the mm 
>> code uses to prevent the page being used again.
> 
>> KVM is only involved when it tries to map a page at stage2 and the mm 
>> code rejects it with -EHWPOISON. This is the same as the architectures
>> do_page_fault() checking for (fault & VM_FAULT_HWPOISON) out of 
>> handle_mm_fault(). We don't have a KVM cap for this, nor do we need one.
> --
> James, for your above comments, I completed understand, but KVM also delivered
> the SIGBUS,

kvm_send_hwpoison_signal()? This is just making guest-accesses look like
Qemu-acesses to linux. It's just plumbing.

You could just as easily take the signal from memory_failure()s kill_proc() 
code.


> which means KVM supports guest memory RAS error recovery, so maybe
> we need to tell user space this capability.

It was merged with ARCH_SUPPORTS_MEMORY_FAILURE. You're really asking if the
host kernel supports CONFIG_MEMORY_FAILURE, and its plumbed in in all the right
places.

It's not practical for user-space to know this, handling the signal when it
arrives is the best thing to do.


> -- For James's comments 
> ---
>> The CPU RAS Extensions are not at all relevant here. It is perfectly 
>> possible to support memory-failure without them, AMD-Seattle and 
>> APM-X-Gene do this. These systems would report not-supported here, but the 
>> kernel does support this stuff.
>>

Re: [RFC RESEND PATCH] kvm: arm64: export memory error recovery capability to user space

2018-12-14 Thread James Morse
Hi Dongjiu Geng,

On 14/12/2018 10:15, Dongjiu Geng wrote:
> When user space do memory recovery, it will check whether KVM and
> guest support the error recovery, only when both of them support,
> user space will do the error recovery. This patch exports this
> capability of KVM to user space.

I can understand user-space only wanting to do the work if host and guest
support the feature. But 'error recovery' isn't a KVM feature, its a Linux
kernel feature.

KVM will send it's user-space a SIGBUS with MCEERR code whenever its trying to
map a page at stage2 that the kernel-mm code refuses this because its poisoned.
(e.g. check_user_page_hwpoison(), get_user_pages() returns -EHWPOISON)

This is exactly the same as happens to a normal user-space process.

I think you really want to know if the host kernel was built with
CONFIG_MEMORY_FAILURE. The not-at-all-portable way to tell this from user-space
is the presence of /proc/sys/vm/memory_failure_* files.
(It looks like the prctl():PR_MCE_KILL/PR_MCE_KILL_GET options silently update
an ignored policy if the kernel isn't built with CONFIG_MEMORY_FAILURE, so they
aren't helpful)


> diff --git a/Documentation/virtual/kvm/api.txt 
> b/Documentation/virtual/kvm/api.txt
> index cd209f7..241e2e2 100644
> --- a/Documentation/virtual/kvm/api.txt
> +++ b/Documentation/virtual/kvm/api.txt
> @@ -4895,3 +4895,12 @@ Architectures: x86
>  This capability indicates that KVM supports paravirtualized Hyper-V IPI send
>  hypercalls:
>  HvCallSendSyntheticClusterIpi, HvCallSendSyntheticClusterIpiEx.
> +
> +8.21 KVM_CAP_ARM_MEMORY_ERROR_RECOVERY
> +
> +Architectures: arm, arm64
> +
> +This capability indicates that guest memory error can be detected by the KVM 
> which
> +supports the error recovery.

KVM doesn't detect these errors.
The hardware detects them and notifies the OS via one of a number of mechanisms.
This gets plumbed into memory_failure(), which sets a flag that the mm code uses
to prevent the page being used again.

KVM is only involved when it tries to map a page at stage2 and the mm code
rejects it with -EHWPOISON. This is the same as the architectures
do_page_fault() checking for (fault & VM_FAULT_HWPOISON) out of
handle_mm_fault(). We don't have a KVM cap for this, nor do we need one.


> diff --git a/arch/arm64/kvm/reset.c b/arch/arm64/kvm/reset.c
> index b72a3dd..90d1d9a 100644
> --- a/arch/arm64/kvm/reset.c
> +++ b/arch/arm64/kvm/reset.c
> @@ -82,6 +82,7 @@ int kvm_arch_vm_ioctl_check_extension(struct kvm *kvm, long 
> ext)
>   r = kvm_arm_support_pmu_v3();
>   break;
>   case KVM_CAP_ARM_INJECT_SERROR_ESR:
> + case KVM_CAP_ARM_MEMORY_ERROR_RECOVERY:
>   r = cpus_have_const_cap(ARM64_HAS_RAS_EXTN);
>   break;

The CPU RAS Extensions are not at all relevant here. It is perfectly possible to
support memory-failure without them, AMD-Seattle and APM-X-Gene do this. These
systems would report not-supported here, but the kernel does support this stuff.
Just because the CPU supports this, doesn't mean the kernel was built with
CONFIG_MEMORY_FAILURE. The CPU reports may be ignored, or upgraded to SIGKILL.



Thanks,

James


Re: [PATCH v11 0/4] set VSESR_EL2 by user space and support NOTIFY_SEI notification

2018-04-12 Thread James Morse
Hi gengdongjiu,

On 12/04/18 07:09, gengdongjiu wrote:
> On 2018/4/10 22:15, James Morse wrote:
>> On 09/04/18 22:36, Dongjiu Geng wrote:
>>> 1. Detect whether KVM can set set guest SError syndrome
>>> 2. Support to Set VSESR_EL2 and inject SError by user space.
>>> 3. Support live migration to keep SError pending state and VSESR_EL2 value.
>>> 4. ACPI 6.1 adds support for NOTIFY_SEI as a GHES notification mechanism, 
>>> so support this
>>>notification in software, KVM or kernel ARCH code call 
>>> handle_guest_sei() to let ACP driver
>>>to handle this notification.
>>
>> Please don't post code during the merge-window, will this apply to 
>> v4.17-rc1? We
>> can't know until its tagged.

Posting code during the merge-window isn't helpful as the kernel is a moving
target, its better to wait for an 'rc' to base it on.

> I do not know when it is merge-window. About the apply version, it does not 
> have limited.

'git fetch' Linus' tree and look at the tags. 'v4.16' lost its '-rc' suffixes,
and there isn't a 'v4.17-rc1' yet, so we are still in the merge window.

Linus sends a message to LKML. eg:
https://lkml.org/lkml/2018/4/1/175

net-next closes shortly before the merge window, and re-opens afterwards. There
is a handy web page:
http://vger.kernel.org/~davem/net-next.html


>> This series is doing two separate things, please split it into two series.
> OK, thanks!
> 
>>
>> But on the ACPI front: I don't see how any OS can support your NOTIFY_SEI 
>> when
>> firmware is ignoring the normal world's PSTATE.A.
>>
>> The latest lobe of that discussion was on the list here:
>> https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1611496.html
> I have replied the mail.
> I still have some questions that need to clarify with you.
> After clarification, we will follow that.
> The question is in the reply of this mail 
> "https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1611496.html";

Lets keep that discussion on v9 then.


>> As it is, we would need to spot SError being delivered while SError is 
>> masked,
>> spray nasty messages about firmware being horrifically buggy, then panic(). 
>> For
>> a corrected error, this looks bad, but its preferable to letting firmware
>> silently overwrite the exception registers, causing linux to spin through the
>> vectors 'eret' with all exceptions masked.
>> I still think its best to wait for firmware that does the right thing.

> Let us  discuss that in another mail.
> In a summary, I think firmware follow below rule can be OK, right?
> 1. The exception came from the EL that SError should be routed to(according 
> to hcr_EL2.{AMO, TGE}),but PSTATE.A was set, EL3 firmware can't deliver 
> SError;

> 2. The exception came from the EL that SError should not be routed 
> to(according to hcr_EL2.{AMO, TGE}),even though the PSTATE.A was set,EL3 
> firmware still deliver SError

Problem here, more on v9.


Thanks,

James
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v9 3/7] acpi: apei: Add SEI notification type support for ARMv8

2018-04-12 Thread James Morse
Hi gengdongjiu,

On 12/04/18 06:00, gengdongjiu wrote:
> 2018-02-16 1:55 GMT+08:00 James Morse :
>> On 05/02/18 11:24, gengdongjiu wrote:
>>>> Is the emulated SError routed following the routing rules for HCR_EL2.{AMO,
>>>> TGE}?
>>>
>>> Yes, it is.
>>
>> ... and yet ...
>>
>>
>>>> What does your firmware do when it wants to emulate SError but its masked?
>>>> (e.g.1: The physical-SError interrupted EL2 and the SPSR shows EL2 had
>>>> PSTATE.A  set.
>>>>  e.g.2: The physical-SError interrupted EL2 but HCR_EL2 indicates the
>>>> emulated  SError should go to EL1. This effectively masks SError.)
>>>
>>> Currently we does not consider much about the mask status(SPSR).
>>
>> .. this is a problem.
>>
>> If you ignore SPSR_EL3 you may deliver an SError to EL1 when the exception
>> interrupted EL2. Even if you setup the EL1 register correctly, EL1 can't 
>> eret to
>> EL2. This should never happen, SError is effectively masked if you are 
>> running
>> at an EL higher than the one its routed to.
>>
>> More obviously: if the exception came from the EL that SError should be 
>> routed
>> to, but PSTATE.A was set, you can't deliver SError. Masking SError is the 
>> only

> James, I  summarized the masking and routing rules for SError to
> confirm with you for the firmware first solution,

You also said "Currently we does not consider much about the mask status(SPSR)."


> 1. If the HCR_EL2.{AMO,TGE} is set,

If one or the other of these bits is set: (AMO==1 || TGE==1)

> which means the SError should route to EL2,
> When system happens SError and trap to EL3,   If EL3 find
> HCR_EL2.{AMO,TGE} and SPSR_EL3.A are both set,
> and find this SError come from EL2, it will not deliver an SError:
> store the RAS error in the BERT and 'reboot'; but if
> it find that this SError come from EL1 or EL0, it also need to deliver
> an SError, right?

Yes.


> 2. If the HCR_EL2.{AMO,TGE} is not set,

If neither of these bits is set: (AMO==0 && TGE == 0)

> which means the SError should route to EL1,
> When system happens SError and trap to EL3, If EL3 find
> HCR_EL2.{AMO,TGE} and SPSR_EL3.A are both not set,

(I'm reading this as all three of these bits are clear)

> and find this SError come from EL1, it will not deliver an SError:
> store the RAS error in the BERT and 'reboot'; 

No, (AMO==0 && TGE == 0) means SError is routed to EL1, this exception
interrupted EL1 and the A bit was clear, so EL1 can take an SError.

The two cases here are:
AMO==0,TGE==0 means SError should be routed to EL1. If SPSR_EL3 says the
exception interrupted EL1 and the A bit was set, you need to do the BERT trick.

If SPSR_EL3 says the exception interrupted EL2, you need to do the BERT trick
regardless of the A bit, as SError is implicitly masked by running at a higher
exception level than it was routed to.


>From your v11 reply:
> 2. The exception came from the EL that SError should not be routed
> to(according to hcr_EL2.{AMO, TGE}),even though the PSTATE.A was set,EL3
> firmware still deliver SError

(this is re-iterating the two-cases above:)
'not be routed to' is one of two things: Route-to-EL2+interruted-EL1, or
Route-to-EL1+interrupted-EL2.

Route-to-EL2+interrupted-EL1 is fine, regardless of SPSR_EL3.A the emulated
SError can be delivered to EL2, as EL2 can't mask SError when executing at a
lower EL.

Route-to-EL1+interrupted-EL2 is the problem. SError is implicitly masked by
running at a higher EL. Regardless of SPSR_EL3.A, the emulated SError can not be
delivered.
KVM does this on the way out of a guest, if an SError occurs during this time
the CPU will wait until execution returns to EL1 before delivering the SError.
Your firmware has to do the same.

Table D1-15 in "D1.14.2 Asynchronous exception masking" has a table with all the
combinations. The ARM-ARM is what we need to match with this behaviour.


> but if it find that this SError come from EL0, it also need to deliver an
> SError, right?

I thought interrupted-EL0 could always be delivered: but re-reading the
ARM-ARM's "D1.14.2 Asynchronous exception masking", if asynchronous exceptions
are routed to EL1 then EL0&EL1 are treated the same.
So if SError is routed to EL1, the exception interrupted EL0, and SPSR_EL3.A was
set, you still can't deliver the emulated-SError you have to do the BERT-trick.
Linux doesn't do this today, but another OS might (e.g. UEFI), and we might do
this in the future.

This is really tricky for firmware to get right. Another alternative would be to
put the CPER records in a Polled buffer, unless something needs doing right now,
in which case a BERT-reboot is probably best.


Thanks,

James
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v11 0/4] set VSESR_EL2 by user space and support NOTIFY_SEI notification

2018-04-10 Thread James Morse
Hi Dongjiu Geng,

On 09/04/18 22:36, Dongjiu Geng wrote:
> 1. Detect whether KVM can set set guest SError syndrome
> 2. Support to Set VSESR_EL2 and inject SError by user space.
> 3. Support live migration to keep SError pending state and VSESR_EL2 value.
> 4. ACPI 6.1 adds support for NOTIFY_SEI as a GHES notification mechanism, so 
> support this
>notification in software, KVM or kernel ARCH code call handle_guest_sei() 
> to let ACP driver
>to handle this notification.

Please don't post code during the merge-window, will this apply to v4.17-rc1? We
can't know until its tagged.


This series is doing two separate things, please split it into two series.

But on the ACPI front: I don't see how any OS can support your NOTIFY_SEI when
firmware is ignoring the normal world's PSTATE.A.

The latest lobe of that discussion was on the list here:
https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1611496.html


As it is, we would need to spot SError being delivered while SError is masked,
spray nasty messages about firmware being horrifically buggy, then panic(). For
a corrected error, this looks bad, but its preferable to letting firmware
silently overwrite the exception registers, causing linux to spin through the
vectors 'eret' with all exceptions masked.
I still think its best to wait for firmware that does the right thing.


Thanks,

James
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v11 2/4] arm/arm64: KVM: Add KVM_GET/SET_VCPU_EVENTS

2018-04-10 Thread James Morse
Hi Dongjiu Geng,

On 09/04/18 22:36, Dongjiu Geng wrote:
> This new IOCTL exports user-invisible states related to SError.
> Together with appropriate user space changes, it can inject
> SError with specified syndrome to guest by setup kvm_vcpu_events
> value.

> Also it can support live migration.

Could you explain what user-space is expected to do for this?
(this is also relevant for snapshot-ing/suspending VMs)

It's probably worth noting that this solves an existing problem: KVM may make an
SError pending, but user-space has no way to discover/migrate this.


> diff --git a/Documentation/virtual/kvm/api.txt 
> b/Documentation/virtual/kvm/api.txt
> index 8a3d708..45719b4 100644
> --- a/Documentation/virtual/kvm/api.txt
> +++ b/Documentation/virtual/kvm/api.txt
> @@ -819,11 +819,13 @@ struct kvm_clock_data {
>  
>  Capability: KVM_CAP_VCPU_EVENTS
>  Extended by: KVM_CAP_INTR_SHADOW
> -Architectures: x86
> +Architectures: x86, arm, arm64
>  Type: vm ioctl
>  Parameters: struct kvm_vcpu_event (out)
>  Returns: 0 on success, -1 on error
>  
> +X86:
> +
>  Gets currently pending exceptions, interrupts, and NMIs as well as related
>  states of the vcpu.
>  
> @@ -865,15 +867,31 @@ Only two fields are defined in the flags field:
>  - KVM_VCPUEVENT_VALID_SMM may be set in the flags field to signal that
>smi contains a valid state.
>  
> +ARM, ARM64:
> +
> +Gets currently pending SError exceptions as well as related states of the 
> vcpu.
> +
> +struct kvm_vcpu_events {
> + struct {
> + __u8 serror_pending;
> + __u8 serror_has_esr;
> + /* Align it to 4 bytes */
> + __u8 pad[2];
> + __u64 serror_esr;
> + } exception;
> +};
> +

I'm not convinced we should change this struct from the layout/size x86 has. Its
confusing for the documentation, is this API call really the same on all
architectures?

What if we want to add some future interrupt, NMI or related state? We've found
ourselves needing to add this API, it seems odd to remove its other uses on x86.
We can't put them back in the future.

Having a different layout would force user-space to ifdef/duplicate any code
that accesses this between architectures.



The compiler will want that __u64 to be naturally aligned to 8-bytes, so your
4-byte padding still causes some secret compiler-padding to be inserted.
Different versions of the compiler may put it in different places.


>  4.32 KVM_SET_VCPU_EVENTS
>  
>  Capability: KVM_CAP_VCPU_EVENTS
>  Extended by: KVM_CAP_INTR_SHADOW
> -Architectures: x86
> +Architectures: x86, arm, arm64
>  Type: vm ioctl
>  Parameters: struct kvm_vcpu_event (in)
>  Returns: 0 on success, -1 on error
>  
> +X86:
> +
>  Set pending exceptions, interrupts, and NMIs as well as related states of the
>  vcpu.
>  
> @@ -894,6 +912,12 @@ shall be written into the VCPU.
>  
>  KVM_VCPUEVENT_VALID_SMM can only be set if KVM_CAP_X86_SMM is available.
>  
> +ARM, ARM64:
> +
> +Set pending SError exceptions as well as related states of the vcpu.
> +
> +See KVM_GET_VCPU_EVENTS for the data structure.
> +
>  
>  4.33 KVM_GET_DEBUGREGS
>  


> diff --git a/arch/arm64/include/uapi/asm/kvm.h 
> b/arch/arm64/include/uapi/asm/kvm.h
> index 9abbf30..855cc9a 100644
> --- a/arch/arm64/include/uapi/asm/kvm.h
> +++ b/arch/arm64/include/uapi/asm/kvm.h
> @@ -39,6 +39,7 @@
>  #define __KVM_HAVE_GUEST_DEBUG
>  #define __KVM_HAVE_IRQ_LINE
>  #define __KVM_HAVE_READONLY_MEM
> +#define __KVM_HAVE_VCPU_EVENTS
>  
>  #define KVM_COALESCED_MMIO_PAGE_OFFSET 1
>  
> @@ -153,6 +154,17 @@ struct kvm_sync_regs {
>  struct kvm_arch_memory_slot {
>  };
>  
> +/* for KVM_GET/SET_VCPU_EVENTS */
> +struct kvm_vcpu_events {
> + struct {
> + __u8 serror_pending;
> + __u8 serror_has_esr;

> + /* Align it to 4 bytes */
> + __u8 pad[2];

(padding noted above)


> + __u64 serror_esr;
> + } exception;
> +};
> +
>  /* If you need to interpret the index values, here is the key: */
>  #define KVM_REG_ARM_COPROC_MASK  0x0FFF
>  #define KVM_REG_ARM_COPROC_SHIFT 16


> diff --git a/arch/arm64/kvm/guest.c b/arch/arm64/kvm/guest.c
> index 5c7f657..42e1222 100644
> --- a/arch/arm64/kvm/guest.c
> +++ b/arch/arm64/kvm/guest.c
> @@ -277,6 +277,37 @@ int kvm_arch_vcpu_ioctl_set_sregs(struct kvm_vcpu *vcpu,
>   return -EINVAL;
>  }
>  
> +int kvm_arm_vcpu_get_events(struct kvm_vcpu *vcpu,
> + struct kvm_vcpu_events *events)
> +{
> + events->exception.serror_pending = (vcpu_get_hcr(vcpu) & HCR_VSE);
> + events->exception.serror_has_esr =
> + cpus_have_const_cap(ARM64_HAS_RAS_EXTN) &&
> + (!!vcpu_get_vsesr(vcpu));

> + events->exception.serror_esr = vcpu_get_vsesr(vcpu);

This will return a stale ESR even if nothing is pending. On systems without the
RAS extensions it will return 'ESR_ELx_ISV' if kvm_inject_vabt() has ever been
called for this C

Re: [PATCH v11 1/4] arm64: KVM: export the capability to set guest SError syndrome

2018-04-10 Thread James Morse
Hi Dongjiu Geng,

On 09/04/18 22:36, Dongjiu Geng wrote:
> Before user space injects a SError, it needs to know whether it can
> specify the guest Exception Syndrome, so KVM should tell user space
> whether it has such capability.

(you could improve the commit message by briefly explaining how/why user-space
would want to do this. As this is patch 1, you don't have the context of the
previous patch to say that some systems can provide an ESR with virtual-SError)


> diff --git a/Documentation/virtual/kvm/api.txt 
> b/Documentation/virtual/kvm/api.txt
> index fc3ae95..8a3d708 100644
> --- a/Documentation/virtual/kvm/api.txt
> +++ b/Documentation/virtual/kvm/api.txt
> @@ -4415,3 +4415,14 @@ Parameters: none
>  This capability indicates if the flic device will be able to get/set the
>  AIS states for migration via the KVM_DEV_FLIC_AISM_ALL attribute and allows
>  to discover this without having to create a flic device.
> +
> +8.14 KVM_CAP_ARM_SET_SERROR_ESR
> +
> +Architectures: arm, arm64
> +
> +This capability indicates that userspace can specify syndrome value reported 
> to

(Nit: 'the syndrome value')

> +guest OS when guest takes a virtual SError interrupt exception.

(Nit: 'the guest')

> +If KVM has this capability, userspace can only specify the ISS field for the 
> ESR
> +syndrome, can not specify the EC field which is not under control by KVM.

(Nit: 'it can not specify...')

> +If this virtual SError is taken to EL1 using AArch64, this value will be 
> reported
> +into ISS filed of ESR_EL1.

(Nit: 'in the ISS field')


> diff --git a/arch/arm64/kvm/reset.c b/arch/arm64/kvm/reset.c
> index 3256b92..38c8a64 100644
> --- a/arch/arm64/kvm/reset.c
> +++ b/arch/arm64/kvm/reset.c
> @@ -77,6 +77,9 @@ int kvm_arch_dev_ioctl_check_extension(struct kvm *kvm, 
> long ext)
>   case KVM_CAP_ARM_PMU_V3:
>   r = kvm_arm_support_pmu_v3();
>   break;
> + case KVM_CAP_ARM_INJECT_SERROR_ESR:
> + r = cpus_have_const_cap(ARM64_HAS_RAS_EXTN);
> + break;
>   case KVM_CAP_SET_GUEST_DEBUG:
>   case KVM_CAP_VCPU_ATTRIBUTES:
>   r = 1;

'dev_ioctl' feels a bit weird, but we already have cpu_has_32bit_el1() in here.


> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index 8fb90a0..3587b33 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -934,6 +934,7 @@ struct kvm_ppc_resize_hpt {
>  #define KVM_CAP_S390_AIS_MIGRATION 150
>  #define KVM_CAP_PPC_GET_CPU_CHAR 151
>  #define KVM_CAP_S390_BPB 152
> +#define KVM_CAP_ARM_INJECT_SERROR_ESR 153
>  
>  #ifdef KVM_CAP_IRQ_ROUTING

(patch 1&2 should probably be swapped around, as on its own this does thing).

Reviewed-by: James Morse 


Thanks,

James
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v9 5/7] arm64: kvm: Introduce KVM_ARM_SET_SERROR_ESR ioctl

2018-03-15 Thread James Morse
Hi gengdongjiu,

On 08/03/18 06:18, gengdongjiu wrote:
> Hi James,
>sorry for my late response due to chines new year.

Happy new year,


> 2018-02-16 1:55 GMT+08:00 James Morse :
>> On 12/02/18 10:19, gengdongjiu wrote:
>>> On 2018/2/10 1:44, James Morse wrote:
>>>> The point? We can't know what a CPU without the RAS extensions puts in 
>>>> there.
>>>>
>>>> Why Does this matter? When migrating a pending SError we have to know the
>>>> difference between 'use this 64bit value', and 'the CPU will generate it'.
>>>> If I make an SError pending with ESR=0 on a CPU with VSESR, I can't 
>>>> migrated to
>>>> a system that generates an impdef SError-ESR, because I can't know it will 
>>>> be 0.
>>
>>> For the target system, before taking the SError, no one can know whether 
>>> its syndrome value
>>> is IMPLEMENTATION DEFINED or architecturally defined.
>>
>> For a virtual-SError, the hypervisor knows what it generated. (do I have
>> VSESR_EL2? What did I put in there?).
>>
>>
>>> when the virtual SError is taken, the ESR_ELx.IDS will be updated, then we 
>>> can know
>>> whether the ESR value is impdef or architecturally defined.
>>
>> True, the guest can't know anything about a pending virtual SError until it
>> takes it. Why is this a problem?
>>
>>
>>> It seems migration is only allowed only when target system and source 
>>> system all support
>>> RAS extension, because we do not know whether its syndrome is 
>>> IMPLEMENTATION DEFINED or
>>> architecturally defined.
>>
>> I don't think Qemu allows migration between hosts with differing guest-ID
>> registers. But we shouldn't depend on this, and we may want to hide the v8.2 
>> RAS
>> features from the guest's ID register, but still use them from the host.
>>
>> The way I imagined it working was we would pack the following information 
>> into
>> that events struct:
>> {
>> bool serror_pending;
>> bool serror_has_esr;
>> u64  serror_esr;
>> }
> 
> I have used your suggestion struct

Ah! This is where it came from. Sorry, this was just to illustrate the
information/sizes we wanted to transfer I didn't mean it literally.

I should have said "64 bits of ESR, so that we can transfer anything that is
added to VSESR_EL2 in the future, a flag somewhere to indicate an serror is
pending, and another flag to indicate the ESR has a value we should use".


Thanks/Sorry!

James


>> The problem I was trying to describe is because there is no value of 
>> serror_esr
>> we can use to mean 'Ignore this, I'm a v8.0 CPU'. VSESR_EL2 is a 64bit 
>> register,
>> any bits we abuse may get a meaning we want to use in the future.
>>
>> When it comes to migration, v8.{0,1} systems can only GET/SET events where
>> serror_has_esr == false, they can't use the serror_esr. On v8.2 systems we
>> should require serror_has_esr to be true.
> yes, I agreed.
> 
>>
>> If we need to support migration from v8.{0,1} to v8.2, we can make up an 
>> impdef
>> serror_esr.
> 
> For the Qemu migration, I need to check more the QEMU code.


> Hi Andrew,
>   I use KVM_GET/SET_VCPU_EVENTS IOCTL to migrate the Serror
> exception status of VM,
> The even struct is shown below:
> 
> {
>   bool serror_pending;
>   bool serror_has_esr;
>  u64  serror_esr;
> }
> 
> Only when the target machine is armv8.2, it needs to set the
> serror_esr(SError Exception Syndrome Register).
> for the armv8.0,  software can not set the serror_esr(SError Exception
> Syndrome Register).
> so when migration from v8.{0,1} to v8.2, QEMU should make up an impdef
> serror_esr for the v8.2 target.
> can you give me some suggestion how to set that register in the QEMU?
> I do not familar with the QEMU migration.
> Thanks very much.

--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v10 2/5] arm64: KVM: export the capability to set guest SError syndrome

2018-03-15 Thread James Morse
Hi Dongjiu Geng,

On 03/03/18 16:09, Dongjiu Geng wrote:
> Before user space injects a SError, it needs to know whether it can
> specify the guest Exception Syndrome, so KVM should tell user space
> whether it has such capability.

> diff --git a/Documentation/virtual/kvm/api.txt 
> b/Documentation/virtual/kvm/api.txt
> index fc3ae95..8a3d708 100644
> --- a/Documentation/virtual/kvm/api.txt
> +++ b/Documentation/virtual/kvm/api.txt
> @@ -4415,3 +4415,14 @@ Parameters: none
>  This capability indicates if the flic device will be able to get/set the
>  AIS states for migration via the KVM_DEV_FLIC_AISM_ALL attribute and allows
>  to discover this without having to create a flic device.
> +
> +8.14 KVM_CAP_ARM_SET_SERROR_ESR
> +
> +Architectures: arm, arm64
> +
> +This capability indicates that userspace can specify syndrome value reported 
> to
> +guest OS when guest takes a virtual SError interrupt exception.

"when userspace triggers a virtual SError"... how?


> +If KVM has this capability, userspace can only specify the ISS field for the 
> ESR
> +syndrome, can not specify the EC field which is not under control by KVM.

Where do I put the ESR?
If you re-order this after the patch that adds the API, you can describe how
this can be used.


Thanks,

James



> +If this virtual SError is taken to EL1 using AArch64, this value will be 
> reported
> +into ISS filed of ESR_EL1.
> diff --git a/arch/arm64/kvm/reset.c b/arch/arm64/kvm/reset.c
> index 3256b92..38c8a64 100644
> --- a/arch/arm64/kvm/reset.c
> +++ b/arch/arm64/kvm/reset.c
> @@ -77,6 +77,9 @@ int kvm_arch_dev_ioctl_check_extension(struct kvm *kvm, 
> long ext)
>   case KVM_CAP_ARM_PMU_V3:
>   r = kvm_arm_support_pmu_v3();
>   break;
> + case KVM_CAP_ARM_INJECT_SERROR_ESR:
> + r = cpus_have_const_cap(ARM64_HAS_RAS_EXTN);
> + break;
>   case KVM_CAP_SET_GUEST_DEBUG:
>   case KVM_CAP_VCPU_ATTRIBUTES:
>   r = 1;
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index 8fb90a0..3587b33 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -934,6 +934,7 @@ struct kvm_ppc_resize_hpt {
>  #define KVM_CAP_S390_AIS_MIGRATION 150
>  #define KVM_CAP_PPC_GET_CPU_CHAR 151
>  #define KVM_CAP_S390_BPB 152
> +#define KVM_CAP_ARM_INJECT_SERROR_ESR 153
>  
>  #ifdef KVM_CAP_IRQ_ROUTING
>  
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v10 3/5] arm/arm64: KVM: Introduce set and get per-vcpu event

2018-03-15 Thread James Morse
Hi Dongjiu Geng,

On 03/03/18 16:09, Dongjiu Geng wrote:
> RAS Extension provides VSESR_EL2 register to specify
> virtual SError syndrome value, this patch adds a new
> IOCTL to export user-invisible states related to
> SError exceptions. User space can setup the
> kvm_vcpu_events to inject specified SError, also it
> can support live migration.

> diff --git a/Documentation/virtual/kvm/api.txt 
> b/Documentation/virtual/kvm/api.txt
> index 8a3d708..26ae151 100644
> --- a/Documentation/virtual/kvm/api.txt
> +++ b/Documentation/virtual/kvm/api.txt
> @@ -819,11 +819,13 @@ struct kvm_clock_data {
>  
>  Capability: KVM_CAP_VCPU_EVENTS
>  Extended by: KVM_CAP_INTR_SHADOW
> -Architectures: x86
> +Architectures: x86, arm, arm64
>  Type: vm ioctl
>  Parameters: struct kvm_vcpu_event (out)
>  Returns: 0 on success, -1 on error
>  
> +X86:
> +
>  Gets currently pending exceptions, interrupts, and NMIs as well as related
>  states of the vcpu.
>  
> @@ -865,15 +867,29 @@ Only two fields are defined in the flags field:
>  - KVM_VCPUEVENT_VALID_SMM may be set in the flags field to signal that
>smi contains a valid state.
>  
> +ARM, ARM64:
> +
> +Gets currently pending SError exceptions as well as related states of the 
> vcpu.
> +
> +struct kvm_vcpu_events {
> +   struct {
> +   bool serror_pending;
> +   bool serror_has_esr;
> +   u64 serror_esr;
> +   } exception;
> +};

Don't put bool in an ABI struct. The encoding is up to the compiler.
The compiler will insert padding in this struct to make serror_esr naturally
aligned. Different compilers may do it differently. You'll see that the existing
struct kvm_vcpu_events has 'pad' fields to ensure each element in the struct is
naturally aligned.

serror_pending and serror_has_esr need to be in a flags field.

I thought the logic for re-using the CAP was so user-space could re-use
save/restore code to transfer whatever we put in here during migration. If the
struct is a different size the code has to be different anyway.
My understanding of Drew and Christoffer's comments was that we should re-use
the existing struct. (but now that I look at it, its not so clear).

(If we reuse the struct, we can put the esr in exception.error_code, if we can
get away with it: It would be good to union exception up with a u64, then use
that. This would let us transfer anything we need in those RES0 bits of the
64bit VSESR_EL2).


>  4.32 KVM_SET_VCPU_EVENTS
>  
>  Capability: KVM_CAP_VCPU_EVENTS
>  Extended by: KVM_CAP_INTR_SHADOW
> -Architectures: x86
> +Architectures: x86, arm, arm64
>  Type: vm ioctl
>  Parameters: struct kvm_vcpu_event (in)
>  Returns: 0 on success, -1 on error
>  
> +X86:
> +
>  Set pending exceptions, interrupts, and NMIs as well as related states of the
>  vcpu.
>  
> @@ -894,6 +910,12 @@ shall be written into the VCPU.
>  
>  KVM_VCPUEVENT_VALID_SMM can only be set if KVM_CAP_X86_SMM is available.
>  
> +ARM, ARM64:
> +
> +Set pending SError exceptions as well as related states of the vcpu.
> +
> +See KVM_GET_VCPU_EVENTS for the data structure.
> +
>  
>  4.33 KVM_GET_DEBUGREGS
>  

> diff --git a/arch/arm64/include/uapi/asm/kvm.h 
> b/arch/arm64/include/uapi/asm/kvm.h
> index 9abbf30..32c0eae 100644
> --- a/arch/arm64/include/uapi/asm/kvm.h
> +++ b/arch/arm64/include/uapi/asm/kvm.h
> @@ -39,6 +39,7 @@
>  #define __KVM_HAVE_GUEST_DEBUG
>  #define __KVM_HAVE_IRQ_LINE
>  #define __KVM_HAVE_READONLY_MEM
> +#define __KVM_HAVE_VCPU_EVENTS
>  
>  #define KVM_COALESCED_MMIO_PAGE_OFFSET 1
>  
> @@ -153,6 +154,15 @@ struct kvm_sync_regs {
>  struct kvm_arch_memory_slot {
>  };
>  
> +/* for KVM_GET/SET_VCPU_EVENTS */
> +struct kvm_vcpu_events {
> + struct {
> + bool serror_pending;
> + bool serror_has_esr;
> + u64 serror_esr;
> + } exception;
> +};
> +

>  /* If you need to interpret the index values, here is the key: */
>  #define KVM_REG_ARM_COPROC_MASK  0x0FFF
>  #define KVM_REG_ARM_COPROC_SHIFT 16
> diff --git a/arch/arm64/kvm/guest.c b/arch/arm64/kvm/guest.c
> index 5c7f657..62d49c2 100644
> --- a/arch/arm64/kvm/guest.c
> +++ b/arch/arm64/kvm/guest.c
> @@ -277,6 +277,32 @@ int kvm_arch_vcpu_ioctl_set_sregs(struct kvm_vcpu *vcpu,
>   return -EINVAL;
>  }
>  
> +int kvm_arm_vcpu_get_events(struct kvm_vcpu *vcpu,
> + struct kvm_vcpu_events *events)
> +{
> + events->exception.serror_pending = (vcpu_get_hcr(vcpu) & HCR_VSE);
> + events->exception.serror_has_esr =
> + cpus_have_const_cap(ARM64_HAS_RAS_EXTN) &&
> + (!!vcpu_get_vsesr(vcpu));
> + events->exception.serror_esr = vcpu_get_vsesr(vcpu);
> +
> + return 0;

Nothing checks the return value. Why is it here?

> +}
> +
> +int kvm_arm_vcpu_set_events(struct kvm_vcpu *vcpu,
> + struct kvm_vcpu_events *events)
> +{
> + bool injected = events->exception.serror_pending;
> + bool has_es

Re: [PATCH v10 1/5] arm64: KVM: Prepare set virtual SEI syndrome value

2018-03-15 Thread James Morse
Hi Dongjiu Geng,

On 03/03/18 16:09, Dongjiu Geng wrote:
> Export one API to specify virtual SEI syndrome value
> for guest, and add a helper to get the VSESR_EL2 value.

This patch adds two helpers that nothing calls... its not big, please merge it
with the patch that uses these.


> diff --git a/arch/arm64/include/asm/kvm_emulate.h 
> b/arch/arm64/include/asm/kvm_emulate.h
> index 413dc82..3294885 100644
> --- a/arch/arm64/include/asm/kvm_emulate.h
> +++ b/arch/arm64/include/asm/kvm_emulate.h
> @@ -71,6 +71,11 @@ static inline void vcpu_set_hcr(struct kvm_vcpu *vcpu, 
> unsigned long hcr)
>   vcpu->arch.hcr_el2 = hcr;
>  }
>  
> +static inline unsigned long vcpu_get_vsesr(struct kvm_vcpu *vcpu)
> +{
> + return vcpu->arch.vsesr_el2;
> +}
> +
>  static inline void vcpu_set_vsesr(struct kvm_vcpu *vcpu, u64 vsesr)
>  {
>   vcpu->arch.vsesr_el2 = vsesr;
> diff --git a/arch/arm64/include/asm/kvm_host.h 
> b/arch/arm64/include/asm/kvm_host.h
> index a73f63a..3dc49b7 100644
> --- a/arch/arm64/include/asm/kvm_host.h
> +++ b/arch/arm64/include/asm/kvm_host.h
> @@ -354,6 +354,8 @@ void handle_exit_early(struct kvm_vcpu *vcpu, struct 
> kvm_run *run,
>  int kvm_perf_init(void);
>  int kvm_perf_teardown(void);
>  
> +void kvm_set_sei_esr(struct kvm_vcpu *vcpu, u64 syndrome);
> +
>  struct kvm_vcpu *kvm_mpidr_to_vcpu(struct kvm *kvm, unsigned long mpidr);
>  
>  static inline void __cpu_init_hyp_mode(phys_addr_t pgd_ptr,
> diff --git a/arch/arm64/kvm/inject_fault.c b/arch/arm64/kvm/inject_fault.c
> index 60666a0..78ecb28 100644
> --- a/arch/arm64/kvm/inject_fault.c
> +++ b/arch/arm64/kvm/inject_fault.c
> @@ -186,3 +186,8 @@ void kvm_inject_vabt(struct kvm_vcpu *vcpu)
>  {
>   pend_guest_serror(vcpu, ESR_ELx_ISV);
>  }
> +
> +void kvm_set_sei_esr(struct kvm_vcpu *vcpu, u64 syndrome)
> +{
> + pend_guest_serror(vcpu, syndrome & ESR_ELx_ISS_MASK);

If you move the ISS_MASK into pend_guest_serror(), you wouldn't need this at 
all.

It would be better if any validation were in the user-space helpers so we can
check user-space hasn't put something funny in the top bits.

> +}
> 


Thanks,

James
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v9 5/7] arm64: kvm: Introduce KVM_ARM_SET_SERROR_ESR ioctl

2018-02-15 Thread James Morse
Hi gengdongjiu,

On 12/02/18 10:19, gengdongjiu wrote:
> On 2018/2/10 1:44, James Morse wrote:
>> The point? We can't know what a CPU without the RAS extensions puts in there.
>>
>> Why Does this matter? When migrating a pending SError we have to know the
>> difference between 'use this 64bit value', and 'the CPU will generate it'.
>> If I make an SError pending with ESR=0 on a CPU with VSESR, I can't migrated 
>> to
>> a system that generates an impdef SError-ESR, because I can't know it will 
>> be 0.

> For the target system, before taking the SError, no one can know whether its 
> syndrome value
> is IMPLEMENTATION DEFINED or architecturally defined.

For a virtual-SError, the hypervisor knows what it generated. (do I have
VSESR_EL2? What did I put in there?).


> when the virtual SError is taken, the ESR_ELx.IDS will be updated, then we 
> can know
> whether the ESR value is impdef or architecturally defined.

True, the guest can't know anything about a pending virtual SError until it
takes it. Why is this a problem?


> It seems migration is only allowed only when target system and source system 
> all support
> RAS extension, because we do not know whether its syndrome is IMPLEMENTATION 
> DEFINED or
> architecturally defined.

I don't think Qemu allows migration between hosts with differing guest-ID
registers. But we shouldn't depend on this, and we may want to hide the v8.2 RAS
features from the guest's ID register, but still use them from the host.

The way I imagined it working was we would pack the following information into
that events struct:
{
bool serror_pending;
bool serror_has_esr;
u64  serror_esr;
}

The problem I was trying to describe is because there is no value of serror_esr
we can use to mean 'Ignore this, I'm a v8.0 CPU'. VSESR_EL2 is a 64bit register,
any bits we abuse may get a meaning we want to use in the future.

When it comes to migration, v8.{0,1} systems can only GET/SET events where
serror_has_esr == false, they can't use the serror_esr. On v8.2 systems we
should require serror_has_esr to be true.

If we need to support migration from v8.{0,1} to v8.2, we can make up an impdef
serror_esr.

We will need to decide what KVM does when SET is called but an SError was
already pending. 2.5.3 "Multiple SError interrupts" of [0] has something to say.


Happy new year,

James


[0]
https://static.docs.arm.com/ddi0587/a/RAS%20Extension-release%20candidate_march_29.pdf
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v9 3/7] acpi: apei: Add SEI notification type support for ARMv8

2018-02-15 Thread James Morse
Hi gengdongjiu, liu jun

On 05/02/18 11:24, gengdongjiu wrote:
> James Morse wrote:
>> I'd like to pick these patches onto the end of that series, but first I want 
>> to
>> know what NOTIFY_SEI means for any OS. The ACPI spec doesn't say, and
>> because its asynchronous, route-able and mask-able, there are many more
>> corners than NOTFIY_SEA.
>>
>> This thing is a notification using an emulated SError exception. (emulated
>> because physical-SError must be routed to EL3 for firmware-first, and
>> virtual-SError belongs to EL2).
>>
>> Does your firmware emulate SError exactly as the TakeException() pseudo code
>> in the Arm-Arm?
> 
> Yes, it is.
> 
>> Is the emulated SError routed following the routing rules for HCR_EL2.{AMO,
>> TGE}?
> 
> Yes, it is.

... and yet ...


>> What does your firmware do when it wants to emulate SError but its masked?
>> (e.g.1: The physical-SError interrupted EL2 and the SPSR shows EL2 had
>> PSTATE.A  set.
>>  e.g.2: The physical-SError interrupted EL2 but HCR_EL2 indicates the
>> emulated  SError should go to EL1. This effectively masks SError.)
> 
> Currently we does not consider much about the mask status(SPSR).

.. this is a problem.

If you ignore SPSR_EL3 you may deliver an SError to EL1 when the exception
interrupted EL2. Even if you setup the EL1 register correctly, EL1 can't eret to
EL2. This should never happen, SError is effectively masked if you are running
at an EL higher than the one its routed to.

More obviously: if the exception came from the EL that SError should be routed
to, but PSTATE.A was set, you can't deliver SError. Masking SError is the only
way the OS has to indicate it can't take an exception right now. VBAR_EL1 may be
'wrong' if we're doing some power-management, the FAR/ELR/ESR/SPSR registers may
contain live values that the OS would lose if you deliver another exception over
the top.

If you deliver an emulated-SError as the OS eret's, your new ELR will point at
the eret instruction and the CPU will spin on this instruction forever.

You have to honour the masking and routing rules for SError, otherwise no OS can
run safely with this firmware.


> I remember that you ever suggested firmware should reboot if the mask status
> is set(SPSR), right?

Yes, this is my suggestion of what to do if you can't deliver an SError: store
the RAS error in the BERT and 'reboot'.


> I CC "liu jun"  who is our UEFI firmware Architect,
> if you have firmware requirements, you can raise again.

(UEFI? I didn't think there was any of that at EL3, but I'm not familiar with
all the 'PI' bits).

The requirement is your emulated-SError from EL3 looks exactly like a
physical-SError as if EL3 wasn't implemented.
Your CPU has to handle cases where it can't deliver an SError, your emulation
has to do the same.

This is not something any OS can work around.


>> Answers to these let us determine whether a bug is in the firmware or the
>> kernel. If firmware is expecting the OS to do something special, I'd like to 
>> know
>> about it from the beginning!
> 
> I know your meaning, thanks for raising it again.


Happy new year,

James
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v9 5/7] arm64: kvm: Introduce KVM_ARM_SET_SERROR_ESR ioctl

2018-02-09 Thread James Morse
Hi gengdongjiu,

On 05/02/18 06:19, gengdongjiu wrote:
> On 2018/1/31 3:21, James Morse wrote:
>> On 24/01/18 20:06, gengdongjiu wrote:
>>>> On 06/01/18 16:02, Dongjiu Geng wrote:
>>>>> The ARM64 RAS SError Interrupt(SEI) syndrome value is specific to the
>>>>> guest and user space needs a way to tell KVM this value. So we add a
>>>>> new ioctl. Before user space specifies the Exception Syndrome Register
>>>>> ESR(ESR), it firstly checks that whether KVM has the capability to set
>>>>> the guest ESR, If has, will set it. Otherwise, nothing to do.
>>>>>
>>>>> For this ESR specifying, Only support for AArch64, not support AArch32.
>>>>
>>>> After this patch user-space can trigger an SError in the guest. If it 
>>>> wants to migrate the guest, how does the pending SError get migrated?
>>>>
>>>> I think we need to fix migration first. Andrew Jones suggested using
>>>> KVM_GET/SET_VCPU_EVENTS:
>>>> https://www.spinics.net/lists/arm-kernel/msg616846.html
>>>>
>>>> Given KVM uses kvm_inject_vabt() on v8.0 hardware too, we should cover 
>>>> systems without the v8.2 RAS Extensions with the same API. I
>>>> think this means a bit to read/write whether SError is pending, and 
>>>> another to indicate the ESR should be set/read.
>>>> CPUs without the v8.2 RAS Extensions can reject pending-SError that had an 
>>>> ESR.
>>>
>>> For the CPUs without the v8.2 RAS Extensions, its ESR is always 0, 
>>> we only can inject a SError with ESR 0 to guest, cannot set its ESR.
>>
>> 0? It's always implementation-defined. On Juno it seems to be always-0, but
>> other systems may behave differently. (Juno may generate another ESR value 
>> when
>> I'm not watching it...)

> For the armv8.0 cpu without RAS Extensions, it does not have vsesr_el2, so 
> when
> guest take a virtual SError,

> its ESR is 0, can not control the virtual SError's syndrom value, it does not 
> have
> such registers to control that.

My point was its more nuanced than this: the ARM-ARM's
TakeVirtualSErrorException() pseudo-code lets virtual-SError have an
implementation defined syndrome. I've never seen Juno generate anything other
than '0', but it might do something different on a thursday.

The point? We can't know what a CPU without the RAS extensions puts in there.

Why Does this matter? When migrating a pending SError we have to know the
difference between 'use this 64bit value', and 'the CPU will generate it'.
If I make an SError pending with ESR=0 on a CPU with VSESR, I can't migrated to
a system that generates an impdef SError-ESR, because I can't know it will be 0.


> Does Juno not have RAS Extension? 

It's two types of v8.0 CPU, no RAS extensions.


> if yes, I think we can only inject an SError, but can not change its ESR 
> value,
> because it does not have vsesr_el2.

I agree, this means we need to be able to tell the difference between 'pending'
and 'pending with this ESR'.


>> Just because we can't control the ESR doesn't mean injecting an SError isn't
>> something user-space may want to do.

> yes, we may need to support user-space injects an SError even through CPU 
> does not have RAS Extension.
> 
>> If we tackle migration of pending-SError first, I think that will give us the
>> API to create a new pending SError with/without an ESR as appropriate.
> 
> sure, we should. Last week, I checked the KVM_GET/SET_VCPU_EVENTS IOCTL, it 
> should meet our
> migration requirements

Great!


Thanks,

James



>>> The IOCTL KVM_GET/SET_VCPU_EVENTS has been used by X86.
>>
>> We would be re-using the struct to have values with slightly different 
>> meanings.
>> But for migration the upshot is the same, call KVM_GET_VCPU_EVENTS on one
>> system, and pass the struct to KVM_SET_VCPU_EVENTS on the new system. If 
>> we're
>> lucky Qemu may be able to do this in shared x86/arm64 code.
>>
> Thanks for the reminder, I know your meaning.
> In the x86, the kvm_vcpu_events includes exception/interrupt/nmi/smi. For the 
> RAS, we
> only involves the exception, so Qemu handling logic is different. Anyway, I 
> will try to
> share code for the two platform in Qemu.
> 
> 
> /* for KVM_GET/SET_VCPU_EVENTS */
> struct kvm_vcpu_events {
>   struct {
>   __u8 injected;
>   __u8 nr;
>   __u8 has_error_code;
>   __u8 pad;
>   __u32 error_code;
>   } exception

Re: [PATCH v9 3/7] acpi: apei: Add SEI notification type support for ARMv8

2018-01-30 Thread James Morse
Hi gengdongjiu,

On 23/01/18 09:23, gengdongjiu wrote:
> On 2018/1/23 3:39, James Morse wrote:
>> gengdongjiu wrote:
>>> This error source parsing and handling method
>>> is similar with the SEA.
>>
>> There are problems with doing this:
>>
>> Oct. 18, 2017, 10:26 a.m. James Morse wrote:
>> | How do SEA and SEI interact?
>> |
>> | As far as I can see they can both interrupt each other, which isn't 
>> something
>> | the single in_nmi() path in APEI can handle. I thinks we should fix this
>> | first.
>>
>> [..]
>>
>> | SEA gets away with a lot of things because its synchronous. SEI isn't. Xie
>> | XiuQi pointed to the memory_failure_queue() code. We can use this directly
>> | from SEA, but not SEI. (what happens if an SError arrives while we are
>> | queueing memory_failure work from an IRQ).
>> |
>> | The one that scares me is the trace-point reporting stuff. What happens if 
>> an
>> | SError arrives while we are enabling a trace point? (these are static-keys
>> | right?)
>> |
>> |  I don't think we can just plumb SEI in like this and be done with it.
>> |  (I'm looking at teasing out the estatus cache code from being x86:NMI 
>> only.
>> |  This way we solve the same 'cant do this from NMI context' with the same
>> |  code'.)
>>
>>
>> I will post what I've got for this estatus-cache thing as an RFC, its not 
>> ready
>> to be considered yet.

> Yes, I know you are dong that. Your serial's patch will consider all above 
> things, right?

Assuming I got it right, yes. It currently makes the race Xie XiuQi spotted
worse, which I want to fix too. (details on the cover letter)


> If your patch can be consider that, this patch can based on your patchset. 
> thanks.

I'd like to pick these patches onto the end of that series, but first I want to
know what NOTIFY_SEI means for any OS. The ACPI spec doesn't say, and because
its asynchronous, route-able and mask-able, there are many more corners than
NOTFIY_SEA.

This thing is a notification using an emulated SError exception. (emulated
because physical-SError must be routed to EL3 for firmware-first, and
virtual-SError belongs to EL2).

Does your firmware emulate SError exactly as the TakeException() pseudo code in
the Arm-Arm?
Is the emulated SError routed following the routing rules for HCR_EL2.{AMO, 
TGE}?
What does your firmware do when it wants to emulate SError but its masked?
(e.g.1: The physical-SError interrupted EL2 and the SPSR shows EL2 had PSTATE.A
 set.
 e.g.2: The physical-SError interrupted EL2 but HCR_EL2 indicates the emulated
 SError should go to EL1. This effectively masks SError.)


Answers to these let us determine whether a bug is in the firmware or the
kernel. If firmware is expecting the OS to do something special, I'd like to
know about it from the beginning!


>>> Expose API ghes_notify_sei() to external users. External
>>> modules can call this exposed API to parse APEI table and
>>> handle the SEI notification.
>>
>> external modules? You mean called by the arch code when it gets this 
>> NOTIFY_SEI?

> yes, called by kernel ARCH code, such as below, I remember I have discussed 
> with you.

Sure. The phrase 'external modules' usually means the '.ko' files that live in
/lib/modules, nothing outside the kernel tree should be doing this stuff.


Thanks,

James

--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v9 5/7] arm64: kvm: Introduce KVM_ARM_SET_SERROR_ESR ioctl

2018-01-30 Thread James Morse
Hi gengdongjiu,

On 24/01/18 20:06, gengdongjiu wrote:
>> On 06/01/18 16:02, Dongjiu Geng wrote:
>>> The ARM64 RAS SError Interrupt(SEI) syndrome value is specific to the
>>> guest and user space needs a way to tell KVM this value. So we add a
>>> new ioctl. Before user space specifies the Exception Syndrome Register
>>> ESR(ESR), it firstly checks that whether KVM has the capability to set
>>> the guest ESR, If has, will set it. Otherwise, nothing to do.
>>>
>>> For this ESR specifying, Only support for AArch64, not support AArch32.
>>
>> After this patch user-space can trigger an SError in the guest. If it wants 
>> to migrate the guest, how does the pending SError get migrated?
>>
>> I think we need to fix migration first. Andrew Jones suggested using
>> KVM_GET/SET_VCPU_EVENTS:
>> https://www.spinics.net/lists/arm-kernel/msg616846.html
>>
>> Given KVM uses kvm_inject_vabt() on v8.0 hardware too, we should cover 
>> systems without the v8.2 RAS Extensions with the same API. I
>> think this means a bit to read/write whether SError is pending, and another 
>> to indicate the ESR should be set/read.
>> CPUs without the v8.2 RAS Extensions can reject pending-SError that had an 
>> ESR.
> 
> For the CPUs without the v8.2 RAS Extensions, its ESR is always 0, 
> we only can inject a SError with ESR 0 to guest, cannot set its ESR.

0? It's always implementation-defined. On Juno it seems to be always-0, but
other systems may behave differently. (Juno may generate another ESR value when
I'm not watching it...)

Just because we can't control the ESR doesn't mean injecting an SError isn't
something user-space may want to do.
If we tackle migration of pending-SError first, I think that will give us the
API to create a new pending SError with/without an ESR as appropriate.


> About how about to use the KVM_GET/SET_VCPU_EVENTS, I will check the code, and
> consider your suggestion at the same time.

(Not my suggestion, It was Andrew Jones idea.)

> The IOCTL KVM_GET/SET_VCPU_EVENTS has been used by X86.

We would be re-using the struct to have values with slightly different meanings.
But for migration the upshot is the same, call KVM_GET_VCPU_EVENTS on one
system, and pass the struct to KVM_SET_VCPU_EVENTS on the new system. If we're
lucky Qemu may be able to do this in shared x86/arm64 code.


>>> diff --git a/arch/arm64/kvm/guest.c b/arch/arm64/kvm/guest.c index
>>> 5c7f657..738ae90 100644
>>> --- a/arch/arm64/kvm/guest.c
>>> +++ b/arch/arm64/kvm/guest.c
>>> @@ -277,6 +277,11 @@ int kvm_arch_vcpu_ioctl_set_sregs(struct kvm_vcpu 
>>> *vcpu,
>>> return -EINVAL;
>>>  }
>>>
>>> +int kvm_arm_set_sei_esr(struct kvm_vcpu *vcpu, u32 *syndrome) {
>>> +   return -EINVAL;
>>> +}
>>
>> Does nothing in the patch that adds the support? This is a bit odd.
>> (oh, its hiding in patch 6...)
> 
> To make this patch simple and small, I add it in patch 6.

But that made the functionality of this patch: A new API to return -EINVAL from
the kernel.

Swapping the patches round would have avoided this.
Regardless, I think this will fold out in a rebase.


Thanks,

James
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v9 6/7] arm64: kvm: Set Virtual SError Exception Syndrome for guest

2018-01-23 Thread James Morse
Hi Dongjiu Geng,

On 06/01/18 16:02, Dongjiu Geng wrote:
> RAS Extension add a VSESR_EL2 register which can provide
> the syndrome value reported to software on taking a virtual
> SError interrupt exception. This patch supports to specify
> this Syndrome.
> 
> In the RAS Extensions we can not set all-zero syndrome value
> for SError, which means 'RAS error: Uncategorized' instead of
> 'no valid ISS'. So set it to IMPLEMENTATION DEFINED syndrome
> by default.
> 
> We also need to support userspace to specify a valid syndrome
> value, Because in some case, the recovery is driven by userspace.
> This patch can support that userspace specify it.
> 
> In the guest/host world switch, restore this value to VSESR_EL2
> only when HCR_EL2.VSE is set. This value no need to be saved
> because it is stale vale when guest exit.

A version of this patch has been queued by Catalin.

Now that the cpufeature bits are queued, I think this can be split up into two
separate series for v4.16-rc1, one to tackle NOTIFY_SEI and the associated
plumbing. The second for the KVM 'make SError pending' API.


> Signed-off-by: Dongjiu Geng 
> [Set an impdef ESR for Virtual-SError]
> Signed-off-by: James Morse 

I didn't sign-off this patch. If you pick some bits from another version and
want to credit someone else you can 'CC:' them or just mention it in the
commit-message.


> diff --git a/arch/arm64/include/asm/sysreg.h b/arch/arm64/include/asm/sysreg.h
> index 47b967d..3b035cc 100644
> --- a/arch/arm64/include/asm/sysreg.h
> +++ b/arch/arm64/include/asm/sysreg.h
> @@ -86,6 +86,9 @@
>  #define REG_PSTATE_PAN_IMM   sys_reg(0, 0, 4, 0, 4)
>  #define REG_PSTATE_UAO_IMM   sys_reg(0, 0, 4, 0, 3)
>  
> +/* virtual SError exception syndrome register */
> +#define REG_VSESR_EL2  sys_reg(3, 4, 5, 2, 3)

Irrelevant-Nit: sys-regs usually have a 'SYS_' prefix, and are in instruction
encoding order lower down the file.

(These PSTATE PAN things are a bit odd as they were used to generate and
instruction before the fancy {read,write}_sysreg() helpers were added).


>  #define SET_PSTATE_PAN(x) __emit_inst(0xd500 | REG_PSTATE_PAN_IMM |  
> \
> (!!x)<<8 | 0x1f)
>  #define SET_PSTATE_UAO(x) __emit_inst(0xd500 | REG_PSTATE_UAO_IMM |  
> \
> diff --git a/arch/arm64/kvm/guest.c b/arch/arm64/kvm/guest.c
> index 738ae90..ffad42b 100644
> --- a/arch/arm64/kvm/guest.c
> +++ b/arch/arm64/kvm/guest.c
> @@ -279,7 +279,16 @@ int kvm_arch_vcpu_ioctl_set_sregs(struct kvm_vcpu *vcpu,
>  
>  int kvm_arm_set_sei_esr(struct kvm_vcpu *vcpu, u32 *syndrome)

Bits of this are spread between patches 5 and 6. If you put them in the other
order this wouldn't happen.

(but after a rebase most of this patch should disappear)

>  {
> - return -EINVAL;
> + u64 reg = *syndrome;
> +
> + /* inject virtual system Error or asynchronous abort */
> + kvm_inject_vabt(vcpu);

So this writes an impdef ESR, because its the existing code-path in KVM.


> + if (reg)
> + /* set vsesr_el2[24:0] with value that user space specified */
> + kvm_vcpu_set_vsesr(vcpu, reg & ESR_ELx_ISS_MASK);

And then you overwrite it. Which is a bit odd as there is a helper to do both in
one go:


> +
> + return 0;
>  }

>  int __attribute_const__ kvm_target_cpu(void)

> diff --git a/arch/arm64/kvm/inject_fault.c b/arch/arm64/kvm/inject_fault.c
> index 3556715..fb94b5e 100644
> --- a/arch/arm64/kvm/inject_fault.c
> +++ b/arch/arm64/kvm/inject_fault.c
> @@ -246,14 +246,25 @@ void kvm_inject_undefined(struct kvm_vcpu *vcpu)
>   inject_undef64(vcpu);
>  }
>  
> +static void pend_guest_serror(struct kvm_vcpu *vcpu, u64 esr)
> +{
> + kvm_vcpu_set_vsesr(vcpu, esr);
> + vcpu_set_hcr(vcpu, vcpu_get_hcr(vcpu) | HCR_VSE);
> +}

How come you don't use this in kvm_arm_set_sei_esr()?



Thanks,

James
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v9 5/7] arm64: kvm: Introduce KVM_ARM_SET_SERROR_ESR ioctl

2018-01-23 Thread James Morse
Hi Dongjiu Geng,

On 06/01/18 16:02, Dongjiu Geng wrote:
> The ARM64 RAS SError Interrupt(SEI) syndrome value is specific to the
> guest and user space needs a way to tell KVM this value. So we add a
> new ioctl. Before user space specifies the Exception Syndrome Register
> ESR(ESR), it firstly checks that whether KVM has the capability to
> set the guest ESR, If has, will set it. Otherwise, nothing to do.
> 
> For this ESR specifying, Only support for AArch64, not support AArch32.

After this patch user-space can trigger an SError in the guest. If it wants to
migrate the guest, how does the pending SError get migrated?

I think we need to fix migration first. Andrew Jones suggested using
KVM_GET/SET_VCPU_EVENTS:
https://www.spinics.net/lists/arm-kernel/msg616846.html

Given KVM uses kvm_inject_vabt() on v8.0 hardware too, we should cover systems
without the v8.2 RAS Extensions with the same API. I think this means a bit to
read/write whether SError is pending, and another to indicate the ESR should be
set/read.
CPUs without the v8.2 RAS Extensions can reject pending-SError that had an ESR.

user-space can then use the 'for migration' calls to make a 'new' SError 
pending.

Now that the cpufeature bits are queued, I think this can be split up into two
separate series for v4.16-rc1, one to tackle NOTIFY_SEI and the associated
plumbing. The second for the KVM 'make SError pending' API.


> diff --git a/arch/arm64/kvm/guest.c b/arch/arm64/kvm/guest.c
> index 5c7f657..738ae90 100644
> --- a/arch/arm64/kvm/guest.c
> +++ b/arch/arm64/kvm/guest.c
> @@ -277,6 +277,11 @@ int kvm_arch_vcpu_ioctl_set_sregs(struct kvm_vcpu *vcpu,
>   return -EINVAL;
>  }
>  
> +int kvm_arm_set_sei_esr(struct kvm_vcpu *vcpu, u32 *syndrome)
> +{
> + return -EINVAL;
> +}

Does nothing in the patch that adds the support? This is a bit odd.
(oh, its hiding in patch 6...)


Thanks,

James

--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v9 3/7] acpi: apei: Add SEI notification type support for ARMv8

2018-01-22 Thread James Morse
Hi Dongjiu Geng,

(versions of patches 1,2 and 4 have been queued by Catalin)

(Nit 'ACPI / APEI:' is the normal subject prefix for ghes.c, this helps the
maintainers know which patches they need to pay attention to when you are
touching multiple trees)

On 06/01/18 16:02, Dongjiu Geng wrote:
> ARMv8.2 requires implementation of the RAS extension.

> In
> this extension, it adds SEI(SError Interrupt) notification
> type, this patch adds new GHES error source SEI handling
> functions. 

This reads as if this patch is handling SError RAS notifications generated by a
CPU with the RAS extensions. These are about CPU->Software notifications. APEI
and GHES are a firmware first mechanism which is Software->Software.
Reading the v8.2 documents won't help anyone with the APEI/GHES code.

Please describe this from the ACPI view, "ACPI 6.x adds support for NOTIFY_SEI
as a GHES notification mechanism... ",  its up to the arch code to spot a v8.2
RAS Error based on the cpu caps.


> This error source parsing and handling method
> is similar with the SEA.

There are problems with doing this:

Oct. 18, 2017, 10:26 a.m. James Morse wrote:
| How do SEA and SEI interact?
|
| As far as I can see they can both interrupt each other, which isn't something
| the single in_nmi() path in APEI can handle. I thinks we should fix this
| first.

[..]

| SEA gets away with a lot of things because its synchronous. SEI isn't. Xie
| XiuQi pointed to the memory_failure_queue() code. We can use this directly
| from SEA, but not SEI. (what happens if an SError arrives while we are
| queueing memory_failure work from an IRQ).
|
| The one that scares me is the trace-point reporting stuff. What happens if an
| SError arrives while we are enabling a trace point? (these are static-keys
| right?)
|
|  I don't think we can just plumb SEI in like this and be done with it.
|  (I'm looking at teasing out the estatus cache code from being x86:NMI only.
|  This way we solve the same 'cant do this from NMI context' with the same
|  code'.)


I will post what I've got for this estatus-cache thing as an RFC, its not ready
to be considered yet.


> Expose API ghes_notify_sei() to external users. External
> modules can call this exposed API to parse APEI table and
> handle the SEI notification.

external modules? You mean called by the arch code when it gets this NOTIFY_SEI?


Thanks,

James
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v8 7/7] arm64: kvm: handle SError Interrupt by categorization

2018-01-22 Thread James Morse
Hi gengdongjiu,

On 16/12/17 03:44, gengdongjiu wrote:
> On 2017/12/16 2:52, James Morse wrote:
>>> signal, it will record the CPER and trigger a IRQ to notify guest, as shown 
>>> below:
>>>
>>> SIGBUS_MCEERR_AR trigger Synchronous External Abort.
>>> SIGBUS_MCEERR_AO trigger GPIO IRQ.
>>>
>>> For the SIGBUS_MCEERR_AO and SIGBUS_MCEERR_AR, we have already specify 
>>> trigger method, which all
>>>
>>> not involve _trigger_ an SError.
>> It's a policy choice. How does your virtual CPU notify RAS errors to its 
>> virtual
>> software? You could use SError for SIGBUS_MCEERR_AR, it depends on what type 
>> of
>> CPU you are trying to emulate.
>>
>> I'd suggest using NOTIFY_SEA for SIGBUS_MCEERR_AR as it avoids problems where
>> the guest doesn't take the SError immediately, instead tries to re-execute 
>> the

> I agree it is better to use NOTIFY_SEA for SIGBUS_MCEERR_AR in this case.

>> code KVM has unmapped from stage2 because its corrupt. (You could detect this
>> happening in Qemu and try something else)

> For something else, using NOTIFY_SEI for SIGBUS_MCEERR_AR?

Sorry that was unclear. If you use NOTIFY_SEI, the guest may have PSTATE.A set,
in which case the the CPU will patiently wait for it to unmask, (or consume it
with an ESB-instruction), before delivering the notification. The guest may not
have unmasked SError because its hammering the same page taking the same fault
again and again. Pending the asynchronous notification and re-running the vcpu
doesn't guarantee progress will be made.

In this case user-space can spot its pended an asynchronous notification (for
the same address!) more than once in the last few seconds, and try something
else, like firing a guest:reboot watchdog on another CPU.


> At current implementation,
> It seems only have this case that "KVM has unmapped from stage2", do you 
> thing we
> still have something else?

I'm wary that this only works for errors where we know the guest PC accessed the
faulting location.

The arch code will send this signal too if user-space touches the PG_poisoned
page. (I recall you checked Qemu spots this case and acts differently).
Migration is the obvious example for Qemu read/writing guest memory.

On x86 the MachineCheck code sends these signals too, so our kernel-first
implementation may do the same. As a response to a RAS error notified by
synchronous-external-abort, this is fine. But we need to remember '_AR' implies
the error is related to the code the signal interrupted, which wouldn't be true
for an error notified by SError.


>> Synchronous/asynchronous external abort matters to the CPU, but once the 
>> error
>> has been notified to software the reasons for this distinction disappear. 
>> Once
>> the error has been handled, all trace of this distinction is gone.
>>
>> CPER records only describe component failures. You are trying to re-create 
>> some
>> state that disappeared with one of the firmware-first abstractions. Trying to
>> re-create this information isn't worth the effort as the distinction doesn't
>> matter to linux, only to the CPU.
>>
>>
>>> so there is no chance for Qemu to trigger the SError when gets the 
>>> SIGBUS_MCEERR_A{O,R}.
>> You mean there is no reason for Qemu to trigger an SError when it gets a 
>> signal
>> from the kernel.
>>
>> The reasons the CPU might have to generate an SError don't apply to linux and
>> KVM user space. User-space will never get a signal for an uncontained error, 
>> we
>> will always panic(). We can't give user-space a signal for imprecise 
>> exceptions,
>> as it can't return from the signal. The classes of error that are left are
>> covered by polled/irq and NOTIFY_SEA.
>>
>> Qemu can decide to generate RAS SErrors for SIGBUS_MCEERR_AR if it really 
>> wants
>> to, (but I don't think you should, the kernel may have unmapped the page at 
>> PC
>> from stage2 due to corruption).

> yes, you also said you do not want to generate RAS SErrors for 
> SIGBUS_MCEERR_AR,
> so Qemu does not know in which condition to generate RAS SErrors.

There are two things going on here, firstly the guest may have masked PSTATE.A,
and be hammering an unmapped page. (this this 'sorry that was unclear' case
above). This would happen if the exception-entry code or stack became corrupt
when an exception was taken.
The second is what does existing non-RAS-aware software do? For SError it
panic()s, whereas for synchronous external abort there are some cases that can
be handled. (e.g. on linux: synchronous external abort from user-space).


Re: [PATCH v8 7/7] arm64: kvm: handle SError Interrupt by categorization

2018-01-22 Thread James Morse
Hi gengdongjiu,

On 21/01/18 02:45, gengdongjiu wrote:
> For the ESR_ELx_AET_UER, this exception is precise, closing the VM may
> be better[1].
> But if you think panic is better until we support kernel-first, it is
> also OK to me.

I'm not convinced SError while a guest was running means only guest memory could
be affected. Mechanisms like KSM means the error could affect multiple guests.

Both firmware-fist and kernel-first will give us the address, with which we can
know which processes are affected, isolated the memory and signal affected
processes.

Until we have one of these panic() is the only way we have to contain an error,
but its an interim fix.
Not panic()ing the host for an error that should be contained to the guest is a
fudge, we don't actually know its safe (KSM, page-table etc). I want to improve
on this with {firmware, kernel}-first support (or both!), I don't want to expose
that this is happening to user-space, as once we have one of {firmware,
kernel}-first, it shouldn't happen.


>> This is inventing something new for RAS errors not claimed by firmware-first.
>> If we have kernel-first too, this will never happen. (unless your system is
>> losing the error description).

> In fact, if we have kernel-first, I think we still need to judge the
> error type by ESR, right?

The kernel-first mechanism should consider the ESR/FAR, yes, but once the error
has been claimed and handled, KVM shouldn't care about any of these values.
(maybe we'll sanity check for uncontained errors, just in case the error escaped
to the RAS code...)

My point here was exposing 'unhandled' (ignored) RAS errors to user-space
creates an ABI: someone will complain once we start handling the error, and they
no longer get a notification via this 'unhandled' interface. Code written to use
this interface becomes useless/untested.


> If the handle_guest_sei() , may be the system does not support firmware-first,
> so we judge the ESR value,

...and panic()/ignore as appropriate.

I agree not all systems will support firmware-first, (big-endian is the obvious
example), but if we get kernel-first support this ESR guessing can disappear,
I'm against exposing it to user-space in the meantime.


Thanks,

James
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v8 7/7] arm64: kvm: handle SError Interrupt by categorization

2018-01-12 Thread James Morse
Hi gengdongjiu,

On 15/12/17 03:30, gengdongjiu wrote:
> On 2017/12/7 14:37, gengdongjiu wrote:
>>> We need to tackle (1) and (3) separately. For (3) we need some API that lets
>>> Qemu _trigger_ an SError in the guest, with a specified ESR. But, we don't 
>>> have
>>> a way of migrating pending SError yet... which is where I got stuck last 
>>> time I
>>> was looking at this.
>> I understand you most idea.
>>
>> But In the Qemu one signal type can only correspond to one behavior, can not 
>> correspond to two behaviors,
>> otherwise Qemu will do not know how to do.
>>
>> For the Qemu, if it receives the SIGBUS_MCEERR_AR signal, it will populate 
>> the CPER
>> records and inject a SEA to guest through KVM IOCTL "KVM_SET_ONE_REG"; if 
>> receives the SIGBUS_MCEERR_AO
>> signal, it will record the CPER and trigger a IRQ to notify guest, as shown 
>> below:
>>
>> SIGBUS_MCEERR_AR trigger Synchronous External Abort.
>> SIGBUS_MCEERR_AO trigger GPIO IRQ.
>>
>> For the SIGBUS_MCEERR_AO and SIGBUS_MCEERR_AR, we have already specify 
>> trigger method, which all
>>
>> not involve _trigger_ an SError.
>>
>> so there is no chance for Qemu to trigger the SError when gets the 
>> SIGBUS_MCEERR_A{O,R}.
> 
> As I explained above:
> 
> If Qemu received SIGBUS_MCEERR_AR, it will record CPER and trigger 
> Synchronous External Abort;
> If Qemu received SIGBUS_MCEERR_AO, it will record CPER and trigger GPIO IRQ;

> So Qemu does not know when to _trigger_ an SError.

There is no answer to this. How the CPU decides is specific to the CPU design.
How Qemu decides is going to be specific to the machine it emulates.

My understanding is there is some overlap for which RAS errors are reported as
synchronous external abort, and which use SError. (Obviously the imprecise ones
are all SError). Which one the CPU uses depends on how the CPU is designed.

When you take an SIGBUS_MCEERR_AR from KVM, its because KVM can't complete a
stage2 fault because the page is marked with PG_poisoned. These started out as a
synchronous exception, but you could still report these with SError.

We don't have a way to signal user-space about imprecise exceptions, this isn't
a KVM specific problem.


> so here I "return a error" to Qemu if ghes_notify_sei() return failure in 
> [1], if you opposed KVM "return error",
> do you have a better idea about it? thanks

If ghes_notify_sei() fails to claim the error, we should drop through to
kernel-first-handling. We don't have that yet, just the stub that ignores errors
where we can make progress.

If neither firmware-first nor kernel-first claim a RAS error, we're in trouble.
I'd like to panic() as we got a RAS notification but no description of the
error. We can't do this until we have kernel-first support, hence that stub.


> About the way of migrating pending SError, I think it is a separate case, 
> because Qemu still does not know
> how and when to trigger the SError.

I agree, but I think we should fix this first before we add another user of this
unmigratable hypervisor state.

(I recall someone saying migration is needed for any new KVM/cpu features, but I
can't find the thread)


> [1]:
> static int kvm_handle_guest_sei(struct kvm_vcpu *vcpu, struct kvm_run *run)
> {
> ...
> +   case ESR_ELx_AET_UER:   /* The error has not been propagated */
> +   /*
> +* Userspace only handle the guest SError Interrupt(SEI) if 
> the
> +* error has not been propagated
> +*/
> +   run->exit_reason = KVM_EXIT_EXCEPTION;
> +   run->ex.exception = ESR_ELx_EC_SERROR;

I'm against telling user space RAS errors ever happened, only the final
user-visible error when the kernel can't fix it.

This is inventing something new for RAS errors not claimed by firmware-first.
If we have kernel-first too, this will never happen. (unless your system is
losing the error description).


Your system has firmware-first, why isn't it claiming the notification?
If its not finding CPER records written by firmware, check firmware and the UEFI
memory map agree on the attributes to be used when read/writing that area.


> +   run->ex.error_code = KVM_SEI_SEV_RECOVERABLE;
> +   return 0;


Thanks,

James

--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v8 7/7] arm64: kvm: handle SError Interrupt by categorization

2018-01-12 Thread James Morse
Hi gengdongjiu,

On 16/12/17 04:47, gengdongjiu wrote:
> [...]
>>
>>> + case ESR_ELx_AET_UER:   /* The error has not been propagated */
>>> + /*
>>> +  * Userspace only handle the guest SError Interrupt(SEI) if 
>>> the
>>> +  * error has not been propagated
>>> +  */
>>> + run->exit_reason = KVM_EXIT_EXCEPTION;
>>> + run->ex.exception = ESR_ELx_EC_SERROR;
>>> + run->ex.error_code = KVM_SEI_SEV_RECOVERABLE;
>>> + return 0;
>>
>> We should not pass RAS notifications to user space. The kernel either handles
>> them, or it panics(). User space shouldn't even know if the kernel supports 
>> RAS
> 
> For the  ESR_ELx_AET_UER(Recoverable error), let us see its definition
> below, which get from [0]

[..]

> so we can see the  exception is precise and PE can recover execution
> from the preferred return address of the exception, 

> so let guest handling it is
> better, for example, if it is guest application RAS error, we can kill
> the guest application instead of panic whole OS; if it is guest kernel
> RAS error, guest will panic.

If the kernel takes an unhandled RAS error it should panic - we don't know where
the error is.

I understand you want to kill-off guest tasks as a result of RAS errors, but
this needs to go through the whole APEI->memory_failure()->sigbus machinery so
that the kernel knows the kernel can keep running.

This saves us signalling user-space when we don't need to. An example:
code-corruption. Linux can happily re-read affected user-space executables from
disk, there is absolutely nothing user-space can do about it.
Handling errors first in the kernel allows us to do recovery for all the
affected processes, not just the one that happens to be running right now.


> Host does not know which application of guest has error, so host can
> not handle it,

It has to work this out, otherwise the errors we can handle never get a chance.

This kernel is expected to look at the error description, (which for some reason
we aren't talking about here), e.g. the CPER records, and determine what
recovery action is necessary for this error.
For memory errors this may be re-reading from disk, or at the worst case,
unmapping from all user-space users (including KVM's stage2) and raining signals
on all affected processes.

For a memory error the important piece of information is the physical address.
Only the kernel can do anything with this, it determines who owns the affected
memory and what needs doing to recover from the error.

If you pass the notification to user-space, all it can do is signal the guest to
"stop doing whatever it is you're doing". The guest may have been able to
re-read pages from disk, or otherwise handle the error.
Has the error been handled? No: The error remains latent in the system.


> panic OS is not a good choice for the Recoverable error.

If we don't know where the error is, and we can't make progress, its the only
sane choice.

This code is never expected to run! (why are we arguing about it?) We should get
RAS errors as GHES notifications from firmware via some mechanism. If those are
NOTIFY_SEI then APEI should claim the notification and kick off the appropriate
handling based on the CPER records. If/when we get kernel-first, that can claim
the SError. What we're left with is RAS notifications that no-one claimed
because there was no error-description found.



James
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v8 7/7] arm64: kvm: handle SError Interrupt by categorization

2017-12-15 Thread James Morse
Hi gengdongjiu,

On 07/12/17 06:37, gengdongjiu wrote:
> I understand you most idea.
> 
> But In the Qemu one signal type can only correspond to one behavior, can not 
> correspond to two behaviors,
> otherwise Qemu will do not know how to do.
> 
> For the Qemu, if it receives the SIGBUS_MCEERR_AR signal, it will populate 
> the CPER
> records and inject a SEA to guest through KVM IOCTL "KVM_SET_ONE_REG"; if 
> receives the SIGBUS_MCEERR_AO
> signal, it will record the CPER and trigger a IRQ to notify guest, as shown 
> below:
> 
> SIGBUS_MCEERR_AR trigger Synchronous External Abort.
> SIGBUS_MCEERR_AO trigger GPIO IRQ.
> 
> For the SIGBUS_MCEERR_AO and SIGBUS_MCEERR_AR, we have already specify 
> trigger method, which all
> 
> not involve _trigger_ an SError.

It's a policy choice. How does your virtual CPU notify RAS errors to its virtual
software? You could use SError for SIGBUS_MCEERR_AR, it depends on what type of
CPU you are trying to emulate.

I'd suggest using NOTIFY_SEA for SIGBUS_MCEERR_AR as it avoids problems where
the guest doesn't take the SError immediately, instead tries to re-execute the
code KVM has unmapped from stage2 because its corrupt. (You could detect this
happening in Qemu and try something else)


Synchronous/asynchronous external abort matters to the CPU, but once the error
has been notified to software the reasons for this distinction disappear. Once
the error has been handled, all trace of this distinction is gone.

CPER records only describe component failures. You are trying to re-create some
state that disappeared with one of the firmware-first abstractions. Trying to
re-create this information isn't worth the effort as the distinction doesn't
matter to linux, only to the CPU.


> so there is no chance for Qemu to trigger the SError when gets the 
> SIGBUS_MCEERR_A{O,R}.

You mean there is no reason for Qemu to trigger an SError when it gets a signal
from the kernel.

The reasons the CPU might have to generate an SError don't apply to linux and
KVM user space. User-space will never get a signal for an uncontained error, we
will always panic(). We can't give user-space a signal for imprecise exceptions,
as it can't return from the signal. The classes of error that are left are
covered by polled/irq and NOTIFY_SEA.

Qemu can decide to generate RAS SErrors for SIGBUS_MCEERR_AR if it really wants
to, (but I don't think you should, the kernel may have unmapped the page at PC
from stage2 due to corruption).


I think the problem here is you're applying the CPU->software behaviour and
choices to software->software. By the time user-space gets the error, the
behaviour is different.



Thanks,

James
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v8 7/7] arm64: kvm: handle SError Interrupt by categorization

2017-12-06 Thread James Morse
Hi gengdongjiu,

On 06/12/17 10:26, gengdongjiu wrote:
> On 2017/11/15 0:00, James Morse wrote:
>>> +* error has not been propagated
>>> +*/
>>> +   run->exit_reason = KVM_EXIT_EXCEPTION;
>>> +   run->ex.exception = ESR_ELx_EC_SERROR;
>>> +   run->ex.error_code = KVM_SEI_SEV_RECOVERABLE;
>>> +   return 0;
>> We should not pass RAS notifications to user space. The kernel either handles
>> them, or it panics(). User space shouldn't even know if the kernel supports 
>> RAS
>> until it gets an MCEERR signal.
>>
>> You're making your firmware-first notification an EL3->EL0 signal, bypassing 
>> the OS.
>>
>> If we get a RAS SError and there are no CPER records or values in the ERR 
>> nodes,
>> we should panic as it looks like the CPU/firmware is broken. (spurious RAS 
>> errors)

> do you think whether we need to set the guest ESR by user space?  if need, I 
> need to
> notify user space that there is a SError happen and need to set ESR for guest 
> in some place of
> KVM.

I think you are still coming from a world where user-space gets raw RAS
notifications via KVM. This should not happen because the notification method is
private to firmware and the kernel. KVM is just in the way when a guest is 
running.

Notifications reaching KVM should be plumbed into the APEI-firmware-first-code
or eventually, a kernel-first mechanism if APEI doesn't 'claim' them.

The kernel RAS code may signal user-space with the symptoms of the error, and
user-space may decided to generate a new RAS notification for the guest.

This should function in exactly the same way, regardless of which notification
method is in use between the kernel and firmware. (its the only way to make this
future-proof).

Which notification user-space chooses to use entirely depends on what (if
anything) it advertised to the guest in the HEST. User-space has to be in
control of triggering any SError, not just overriding the ESR when KVM has
decided it wants to kill the guest.


> so here I return a error code to user space. you mean we should not pass RAS 
> notifications
> to user space, so could you give some suggestion how to notify user space to 
> set guest ESR.

KVM shouldn't give the guest an SError when it takes a RAS notification, it
should pass the notification to the kernel RAS code. It only needs to 'fall
through' to some default cause if both APEI and kernel-first deny-all-knowledge
of this notification.


The end-to-end flow is then (assuming no-VHE):
(1)An error occurs, taking the CPU to EL3.
EL3: triage the error, generate CPER, notify the OS
EL2: KVM takes the notification, exits the guest, returns to host EL1.
EL1: KVM handle_exit() calls APEI to handle the error.
This is the end of KVMs involvement in RAS - its just plumbing.

(2)APEI processes the CPER records and signals affected processes.
If KVM's user-space is affected, KVM will spot the pending signal when it goes
to re-enter the guest, and exit to user-space instead.
Qemu takes the SIGBUS_MCEERR_A{O,R}.

(3) Qemu decides it wants to hand the guest a RAS error, it populates the CPER
records (in memory only Qemu knows about), then drives the KVM API to make the
appropriate notification appear.


(1) only happens if the guest was running when the error arrived. GHES has ~4
flavours of IRQ which may be used to describe corruption in guest memory. Steps
(2) and (3) are exactly the same in this case.

Qemu may decide to trigger RAS errors all by itself, (probably for testing and
debugging), in which case (1) and (2) don't happen, but (3), is exactly the 
same.


This way platform-firmware/host-kernel can use kernel-first or firmware-first
with any of the notifications, independently from Qemu/guest-kernel making a
different kernel-first or firmware-first with different notifications.

Passing information out of KVM breaks this, forcing Qemu to know about the
mechanism platform-firmware is using.


We need to tackle (1) and (3) separately. For (3) we need some API that lets
Qemu _trigger_ an SError in the guest, with a specified ESR. But, we don't have
a way of migrating pending SError yet... which is where I got stuck last time I
was looking at this.



James
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v8 7/7] arm64: kvm: handle SError Interrupt by categorization

2017-11-14 Thread James Morse
Hi Dongjiu Geng,

On 10/11/17 19:54, Dongjiu Geng wrote:
> If it is not RAS SError, directly inject virtual SError,
> which will keep the old way. If it is RAS SError, firstly
> let host ACPI module to handle it.

> For the ACPI handling,
> if the error address is invalid, APEI driver will not
> identify the address to hwpoison memory and can not notify
> guest to do the recovery.

The guest can't do any recover either. There is no recovery you can do without
some information about what the error is.

This is your memory corruption at an unknown address? We should reboot.

(I agree memory_failure.c's::me_kernel() is ignoring kernel errors, we should
try and fix this. It makes some sense for polled or irq notifications, but not
SEA/SEI).


> In order to safe, KVM continues
> categorizing errors and handle it separately.

> If the RAS error is not propagated, let host user space to
> handle it. 

No. Host user space should not know anything about the kernel or platform RAS
support. Doing so creates an ABI link between EL3 firmware and Qemu. This is
totally unmaintainable.

This thing needs to be portable. The kernel should handle the error, and report
any symptoms to user-space. e.g. 'this memory is gone'.

We shouldn't special case KVM.


> The reason is that sometimes we can only kill the
> guest effected application instead of panic whose guest OS.
> Host user space specifies a valid ESR and inject virtual
> SError, guest can just kill the current application if the
> non-consumed error coming from guest application.
> 
> Signed-off-by: Dongjiu Geng 
> Signed-off-by: Quanming Wu 

The last Signed-off-by should match the person posting the patch. It's a chain
of custody for GPL-signoff purposes, not a 'partially-written-by'. If you want
to credit Quanming Wu you can add CC and they can Ack/Review your patch.


> diff --git a/arch/arm64/kvm/handle_exit.c b/arch/arm64/kvm/handle_exit.c
> index 7debb74..1afdc87 100644
> --- a/arch/arm64/kvm/handle_exit.c
> +++ b/arch/arm64/kvm/handle_exit.c
> @@ -178,6 +179,66 @@ static exit_handle_fn kvm_get_exit_handler(struct 
> kvm_vcpu *vcpu)
>   return arm_exit_handlers[hsr_ec];
>  }
>  
> +/**
> + * kvm_handle_guest_sei - handles SError interrupt or asynchronous aborts
> + * @vcpu:the VCPU pointer
> + *
> + * For RAS SError interrupt, firstly let host kernel handle it.
> + * If the AET is [ESR_ELx_AET_UER], then let user space handle it,
> + */
> +static int kvm_handle_guest_sei(struct kvm_vcpu *vcpu, struct kvm_run *run)
> +{
> + unsigned int esr = kvm_vcpu_get_hsr(vcpu);
> + bool impdef_syndrome =  esr & ESR_ELx_ISV;  /* aka IDS */
> + unsigned int aet = esr & ESR_ELx_AET;
> +
> + /*
> +  * This is not RAS SError
> +  */
> + if (!cpus_have_const_cap(ARM64_HAS_RAS_EXTN)) {
> + kvm_inject_vabt(vcpu);
> + return 1;
> + }

> + /* The host kernel may handle this abort. */
> + handle_guest_sei();

This has to claim the SError as a notification. If APEI claims the error, KVM
doesn't need to do anything more. You ignore its return code.


> +
> + /*
> +  * In below two conditions, it will directly inject the
> +  * virtual SError:
> +  * 1. The Syndrome is IMPLEMENTATION DEFINED
> +  * 2. It is Uncategorized SEI
> +  */
> + if (impdef_syndrome ||
> + ((esr & ESR_ELx_FSC) != ESR_ELx_FSC_SERROR)) {
> + kvm_inject_vabt(vcpu);
> + return 1;
> + }
> +
> + switch (aet) {
> + case ESR_ELx_AET_CE:/* corrected error */
> + case ESR_ELx_AET_UEO:   /* restartable error, not yet consumed */
> + return 1;   /* continue processing the guest exit */

> + case ESR_ELx_AET_UER:   /* The error has not been propagated */
> + /*
> +  * Userspace only handle the guest SError Interrupt(SEI) if the
> +  * error has not been propagated
> +  */
> + run->exit_reason = KVM_EXIT_EXCEPTION;
> + run->ex.exception = ESR_ELx_EC_SERROR;
> + run->ex.error_code = KVM_SEI_SEV_RECOVERABLE;
> + return 0;

We should not pass RAS notifications to user space. The kernel either handles
them, or it panics(). User space shouldn't even know if the kernel supports RAS
until it gets an MCEERR signal.

You're making your firmware-first notification an EL3->EL0 signal, bypassing 
the OS.

If we get a RAS SError and there are no CPER records or values in the ERR nodes,
we should panic as it looks like the CPU/firmware is broken. (spurious RAS 
errors)


> + default:
> + /*
> +  * Until now, the CPU supports RAS and SEI is fatal, or host
> +  * does not support to handle the SError.
> +  */
> + panic("This Asynchronous SError interrupt is dangerous, panic");
> + }
> +
> + return 0;
> +}
> +
>  /*
>   * Return > 0 to return to guest, < 0 on error, 0 (and set exit_reason) on
>   * prop

Re: [PATCH v8 0/7] Support RAS virtualization in KVM

2017-11-14 Thread James Morse
Hi Dongjiu Geng,

On 10/11/17 19:54, Dongjiu Geng wrote:
> This series patches mainly do below things:
> 
> 1. Trap RAS ERR* registers Accesses to EL2 from Non-secure EL1,
>KVM will will do a minimum simulation, there registers are simulated
>to RAZ/WI in KVM.
> 2. Route synchronous External Abort exceptions from Non-secure EL0
>and EL1 to EL2. When exception EL3 routing is enabled by firmware,
>system will trap to EL3 firmware instead of EL2 KVM, then firmware
>judges whether El2 routing is enabled, if enabled, jump to EL2 KVM, 
>otherwise jump to EL1 host kernel.
> 3. Enable APEI ARv8 SEI notification to parse the CPER records for SError
>in the ACPI GHES driver, KVM will call handle_guest_sei() to let ACPI
>driver to parse the CPER record for SError which happened in the guest
> 4. Although we can use APEI driver to handle the guest SError, but not all
>system support SEI notification, such as kernel-first. So here KVM will
>also classify the Error through Exception Syndrome Register and do 
> different
>approaches according to Asynchronous Error Type

> 5. If the guest SError error is not propagated and not consumed, then KVM 
> return
>recoverable error status to user-space, user-space will specify the guest 
> ESR

I thought we'd gone over this. There should be no RAS errors/notifications in
user space. Only the symptoms should be sent, using the SIGBUS_MCEERR_A{O,R} if
the kernel has handled as much as it can. This hides the actual mechanisms the
kernel and firmware used.

User-space should not have to know how to handle RAS errors directly. This is a
service the operating system provides for it. This abstraction means the smae
user-space code is portable between x86, arm64, powerpc etc.

What if the firmware uses another notification method? User space should expect
the kernel to hide things like this from it.

If the kernel has no information to interpret a notification, how is user space
supposed to know?

I understand you are trying to work around your 'memory corruption at an unknown
address'[0] problem, but if the kernel can't know where this corrupt memory is
it should really reboot. What stops this corrupt data being swapped to disk?

Killing 'the thing' that was running at the time is not sufficient because we
don't know that this 'got' all the users of the corrupt memory. KSM can merge
pages between guests. This is the difference between the error persisting
forever killing off all the VMs one by one, and the corrupt page being silently
re-read from disk clearing the error.


>and inject a virtual SError. For other Asynchronous Error Type, KVM 
> directly
>injects virtual SError with IMPLEMENTATION DEFINED ESR or KVM is panic if 
> the
>error is fatal. In the RAS extension, guest virtual ESR must be set, 
> because
>all-zero  means 'RAS error: Uncategorized' instead of 'no valid ISS', so 
> set
>this ESR to IMPLEMENTATION DEFINED by default if user space does not 
> specify it.


Thanks,

James


[0] https://www.spinics.net/lists/arm-kernel/msg605345.html
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v7 1/4] arm64: kvm: route synchronous external abort exceptions to EL2

2017-10-18 Thread James Morse
Hi Dongjiu Geng,

On 17/10/17 15:14, Dongjiu Geng wrote:
> ARMv8.2 adds a new bit HCR_EL2.TEA which controls to
> route synchronous external aborts to EL2, and adds a
> trap control bit HCR_EL2.TERR which controls to
> trap all Non-secure EL1&0 error record accesses to EL2.

The bulk of this patch is about trap-and-emulating these ERR registers, but
that's not reflected in the title:
> KVM: arm64: Emulate RAS error registers and set HCR_EL2's TERR & TEA


> This patch enables the two bits for the guest OS.
> when an synchronous abort is generated in the guest OS,
> it will trap to EL3 firmware, EL3 firmware will check the

*buzz*
This depends on SCR_EL3.EA, which this patch doesn't touch and the normal-world
can't even know about. This is what your system does, the commit message should
be about the change to Linux.

(I've said this before)


> HCR_EL2.TEA value to decide to jump to hypervisor or host
> OS. Enabling HCR_EL2.TERR makes error record access
> from guest trap to EL2.
> 
> Add some minimal emulation for RAS-Error-Record registers.
> In the emulation, ERRIDR_EL1 and ERRSELR_EL1 are zero.
> Then, the others ERX* registers are RAZ/WI.

> diff --git a/arch/arm64/include/asm/kvm_emulate.h 
> b/arch/arm64/include/asm/kvm_emulate.h
> index fe39e68..47983db 100644
> --- a/arch/arm64/include/asm/kvm_emulate.h
> +++ b/arch/arm64/include/asm/kvm_emulate.h
> @@ -47,6 +47,13 @@ static inline void vcpu_reset_hcr(struct kvm_vcpu *vcpu)
>   vcpu->arch.hcr_el2 = HCR_GUEST_FLAGS;
>   if (is_kernel_in_hyp_mode())
>   vcpu->arch.hcr_el2 |= HCR_E2H;
> + if (cpus_have_const_cap(ARM64_HAS_RAS_EXTN)) {

This ARM64_HAS_RAS_EXTN isn't in mainline, nor is it added by your series. I
know where it comes from, but other reviewers may not. If you have dependencies
on another series, please call them out in the cover letter.

This is the first cpus_have_const_cap() user in this header file, it probably 
needs:
#include 


> + /* route synchronous external abort exceptions to EL2 */
> + vcpu->arch.hcr_el2 |= HCR_TEA;
> + /* trap error record accesses */
> + vcpu->arch.hcr_el2 |= HCR_TERR;
> + }
> +
>   if (test_bit(KVM_ARM_VCPU_EL1_32BIT, vcpu->arch.features))
>   vcpu->arch.hcr_el2 &= ~HCR_RW;
>  }
> diff --git a/arch/arm64/include/asm/kvm_host.h 
> b/arch/arm64/include/asm/kvm_host.h
> index d686300..af55b3bc 100644
> --- a/arch/arm64/include/asm/kvm_host.h
> +++ b/arch/arm64/include/asm/kvm_host.h
> @@ -105,6 +105,8 @@ enum vcpu_sysreg {
>   TTBR1_EL1,  /* Translation Table Base Register 1 */
>   TCR_EL1,/* Translation Control Register */
>   ESR_EL1,/* Exception Syndrome Register */

> + ERRIDR_EL1, /* Error Record ID Register */

Page 39 of [0]: 'ERRIDR_EL1 is a 64-bit read-only ...'.


> + ERRSELR_EL1,/* Error Record Select Register */

We're emulating these as RAZ/WI, do we really need to allocate
vcpu->arch.ctxt.sys_regs space for them? If we always return 0 for ERRIDR, then
we don't need to keep ERRSELR as 'the value read back [..] is UNKNOWN'.

I think we only need space for these once their value needs to be migrated,
user-space doesn't need to know they exist until then.


>   AFSR0_EL1,  /* Auxiliary Fault Status Register 0 */
>   AFSR1_EL1,  /* Auxiliary Fault Status Register 1 */
>   FAR_EL1,/* Fault Address Register */

> diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c
> index 2e070d3..a74617b 100644
> --- a/arch/arm64/kvm/sys_regs.c
> +++ b/arch/arm64/kvm/sys_regs.c
> @@ -775,6 +775,36 @@ static bool access_pmovs(struct kvm_vcpu *vcpu, struct 
> sys_reg_params *p,
>   return true;
>  }
>  
> +static bool access_error_reg(struct kvm_vcpu *vcpu, struct sys_reg_params *p,
> +  const struct sys_reg_desc *r)
> +{
> + /* accessing ERRIDR_EL1 */
> + if (r->CRm == 3 && r->Op2 == 0) {
> + if (p->is_write)
> + vcpu_sys_reg(vcpu, ERRIDR_EL1) = 0;

As above, this register is read-only.


> + return trap_raz_wi(vcpu, p, r);

If we can get rid of the vcpu storage this just becomes trap_raz_wi() .


> + }
> +
> + /* accessing ERRSELR_EL1 */
> + if (r->CRm == 3 && r->Op2 == 1) {
> + if (p->is_write)
> + vcpu_sys_reg(vcpu, ERRSELR_EL1) = 0;
> +
> + return trap_raz_wi(vcpu, p, r);
> + }

Same here.


> +
> + /*
> +  * If ERRSELR_EL1.SEL is greater than or equal to ERRIDR_EL1.NUM,
> +  * the ERX* registers are RAZ/WI.
> +  */
> + if ((vcpu_sys_reg(vcpu, ERRSELR_EL1) & 0xff) >=
> + (vcpu_sys_reg(vcpu, ERRIDR_EL1) && 0xff))
> + return trap_raz_wi(vcpu, p, r);

The trick here is ERRIDR_EL1 is read only, KVM can choose how many records it
emulates. Let's pick zero:
> Zero indicates no records can be accessed through the Error Record System
> registers.

Which lets 

Re: [PATCH v4 3/3] arm64: kvm: inject SError with user space specified syndrome

2017-07-06 Thread James Morse
Hi gengdongjiu,

On 05/07/17 09:14, gengdongjiu wrote:
> On 2017/7/4 18:14, James Morse wrote:
>> Can you give us a specific example of an error you are trying to handle?

> For example:
> guest OS user space accesses device type memory, but happen SError. because 
> the
> SError is asynchronous faults, it does not take immediately. when guest OS 
> call "SVC" to enter guest os
> kernel space, the ESB instruction(Error Synchronization Barrier) will defter 
> this SError. so the SError happen immediately.

Ah, this isn't necessarily a 'RAS notification' SError/SEI, it may be a
'vanilla', v8.0 SError.

You've given a guest access to a physical device (how?), the guest has done
something, which caused the device to respond with SError.

Do you have a specific use-case for this? What is the ESR? What kinds of CPER
records does firmware generate? (if any)


We have to be careful here as devices can still generate asynchronous-interrupts
using SError, these aren't contained by ESB barriers. For these we should fall
back to KVM's v8.0 SError behaviour. KVM can tell them apart as the APEI code
doesn't claim the SError as an SEI notification, and with the RAS extensions the
ESR has the 'IDS' bit set.


>> How would a non-KVM user space process handle the error?

> it is indeed, non-KVM user space can not get the notification from hypervisor 
> or host kernel. thanks for the pointing out
> do you mean still Signal SIGBUS from memory_failure?

No, I was assuming this was a RAS notification SEI, (because your patch 1/3 of
touched the RAS cpu-features) being given to user space to handle.

Instead, can I ask how the host should handle this SError if it had accessed the
device itself?


I agree device pass-through is going to be a special case for KVM, but before
the host can deliver a device RAS error into the guest that was using the
device, it needs to fully understand what the error means:

The error may mean that the careful configuration that makes device-passthrough
safe no longer works, letting the guest continue to access the device may let it
damage another guest or the hyper visor.


We may need a way for the host RAS code to identify the driver responsible, to
handle the device error, or delegate it if that's appropriate.


[...]

>> So (a): a physical-CPU hardware error occurs, and then (c) we tell 
>> Qemu/kvmtool
>> via a KVM-specific API.
>>
>> Don't do this, it doesn't work for non-KVM users. You are exposing 
>> host-specific
>> implementation details to user space. What if I discover the same error via a
>> Polling GHES, or one of the IRQ flavours?

> James, you mainly concern the way that "tell Qemu/kvmtool via a KVM-specific 
> API", right?
> so how about still delivered SIGBUS same as the SEA(Synchronous External 
> Abort)?

> by the way, what is your meaning of below words?
>  >"What if I discover the same error via a Polling GHES, or one of the IRQ 
> flavours?"

This was my mistaken assumption that you were passing an APEI RAS SEI
notification to user space via a KVM specific API. This wouldn't work for
applications not using KVM, or notifications not using SEI.

Here I was asking what happens if the notification used NOTIFY_POLL or
NOTIFY_IRQ (instead of NOTIFY_SEI) in the GHES, but this isn't relevant as it
doesn't look like this is a APEI RAS notification.

[...]


>> If there is another type of CPER record where we should notify userspace, 
>> please
>> do it from mm/memory-failure.c, drivers/acpi/apei/ghes.c or
>> drivers/firmware/efi/cper.c. These should consider all user-space 
>> applications,
>> not just users of KVM, and not just on arm64.
> 
> here I have a question, in the "drivers/acpi/apei/ghes.c" code, it only 
> handle the memory section of CPER.

Yes, we are certainly missing processing for the other record types.


> if the section type of CPER is processor, it will not notify user-space. so 
> only let userspace handle the memory section is reasonable?

I think the only errors that user-space can know more than the kernel are memory
errors. These are the only RAS errors we should expect user space to handle. All
the others fall into either 'corrected by the kernel' or 'fatal for userspace -
SIGKILL'.


>> For memory errors we already have BUS_MCEERR_AR - action-required, and
>> BUS_MCEERR_AO - action-optional.
>>
>> For a TLB error, (Table 250 of UEFI 2.6), what is Qemu expected to do? Linux 
>> has
>> to classify the error and handle it as far as possible. In most cases the 
>> error
>> is either handled (no notification required), or fatal. Memory errors are the
>> only example I've found so far where an appl

Re: [PATCH v4 3/3] arm64: kvm: inject SError with user space specified syndrome

2017-07-04 Thread James Morse
Hi gengdongjiu,

Can you give us a specific example of an error you are trying to handle?
How would a non-KVM user space process handle the error?

KVM-users should be regular user space processes, we should not have a KVM-way
and everyone-else-way of handling errors.


On 04/07/17 05:46, gengdongjiu wrote:
> On 2017/7/3 16:39, Christoffer Dall wrote:
>> On Mon, Jun 26, 2017 at 08:46:39PM +0800, Dongjiu Geng wrote:
>>> when SError happen, kvm notifies user space to record the CPER,
>>> user space specifies and passes the contents of ESR_EL1 on taking
>>> a virtual SError interrupt to KVM, KVM enables virtual system
>>> error or asynchronous abort with this specifies syndrome. This
>>> patch modify the world-switch to restore VSESR_EL2, VSESR_EL2
>>> saves the virtual SError syndrome, it becomes the ESR_EL1 value when
>>> HCR_EL2.VSE injects an SError. This register is added by the
>>> RAS Extensions.
>>
>> This commit message is confusing and doesn't help me understand the
>> patch.
> (1) what is the rationale for the guest OS SError interrupt(SEI) handling in 
> the RAS solution?

>   a). In the firmware-first RAS solution, when guest OS happen a SError 
> interrupt (SEI), it will firstly trap to EL3(SCR_EL3.EA = 1);
>   b). The firmware logs, triages, and delegates the error exception to the 
> hypervisor. As the error came from guest OS  EL1, firmware
>   does by faking an SError interrupt exception entry to EL2.
>   c). Control transfers to the hypervisor's delegated error recovery 
> agent.Because HCR_EL2.AMO is set to 1, the hypervisor can use a
>   Virtual SError interrupt to delegate an asynchronous abort to EL1, by 
> setting HCR_EL2.VSE to 1 and using VESR_EL2 to pass syndrome.

So (a): a physical-CPU hardware error occurs, and then (c) we tell Qemu/kvmtool
via a KVM-specific API.

Don't do this, it doesn't work for non-KVM users. You are exposing host-specific
implementation details to user space. What if I discover the same error via a
Polling GHES, or one of the IRQ flavours?

User space should not have to know, or care, how linux is notified about APEI
RAS errors.


> (2) what is this patch mainly do?
>   As mentioned above, the hypervisor needs to enable virtual SError and pass 
> the virtual syndrome to the guest OS.
> 
>   a). when Control transfers to the hypervisor from firmware by faking an 
> SError interrupt, the hypervisor delivered the syndrome_info(esr_el2) and
>   host VA address( Qemu translate this VA address to the virtual machine 
> physical address(IPA)) using below new added "serror_intr" struct.
>   /* KVM_EXIT_SERROR_INTR */
>   struct {
>   __u32 syndrome_info;
>   __u64 address;
>   } serror_intr;

This is for a guest exit to host user-space. Here you are telling Qemu that a
physical CPU hardware error occurred. Qemu/kvmtool should not be expected to
parse the ESR, this is the job of the operating system.

When you're using ACPI firmware-first, SError/SEI is just a notification, the
important data is in the CPER records, which Qemu can't access, (and should be
processed by Linux APEI code).


It looks like you've calculated an address from FAR_EL2/HPFAR_EL2. For an
SError, these are meaningless.

(These registers hold real values for Synchronous External Abort, but for
 firmware-first we should prefer the CPER records.)


>   b). Qemu gets the address(host VA) delivered by KVM, translate this host VA 
> address to virtual machine physical address(IPA), and runtime record this 
> virtual
>  machine physical address(IPA) to the guest OS's APEI table.

I agree with this step, but you're acting on the wrong data. (You're converting
fault_ipa -> virtual address -> fault_ipa, something isn't right ...)

Qemu should react to a signal like BUS_MCEERR_A{R,O} from memory_failure(). This
mechanism serves all user space processes, not just kvm users. This is where the
user-space virtual address should come from. Qemu/kvmtool have to generate the
guest IPA once they discover the affected memory was presented to the guest
through KVM.


Your KVM-specific mechanism exposes too much raw information (raw ESR values to
user space), and only serves applications using KVM.

If there is another type of CPER record where we should notify userspace, please
do it from mm/memory-failure.c, drivers/acpi/apei/ghes.c or
drivers/firmware/efi/cper.c. These should consider all user-space applications,
not just users of KVM, and not just on arm64.


>   c). Qemu gets the syndrome_info delivered by KVM, it refers to this 
> syndrome value(but can be different from it) to specify the virtual SError 
> interrupt's syndrome through setting VESR_EL2.

'but can be different from it' is because a classification step is required, the
operating system should do this. We should only signal Qemu/kvmtool for errors
that can actually be handled. Some APEI notifications may be for corrected
errors, (I would hope these always 

Re: [PATCH 18/20] arm64: ptrace: handle ptrace_request differently for aarch32 and ilp32

2017-06-27 Thread James Morse
Hi Yury, Zhou,

On 23/06/17 23:28, Yury Norov wrote:
> On Fri, Jun 23, 2017 at 06:03:37PM +0100, James Morse wrote:
>> Hi Yury,
>>
>> On 04/06/17 13:00, Yury Norov wrote:
>>> ILP32 has context-related structures different from both aarch32 and
>>> aarch64/lp64. In this patch compat_arch_ptrace() renamed to
>>> compat_a32_ptrace(), and compat_arch_ptrace() only makes choice between
>>> compat_a32_ptrace() and new compat_ilp32_ptrace() handler.
>>>
>>> compat_ilp32_ptrace() calls generic compat_ptrace_request() for all
>>> requests except PTRACE_GETSIGMASK and PTRACE_SETSIGMASK, which need
>>> special handling.
>>
>> Can you elaborate on this special handling?
>>
>> How come we don't need to wrap PTRACE_{G,S}ETSIGMASK for aarch32 compat?
>> >From kernel/signal32.c that uses compat_sigset_t too.
>>
>> It looks like aarch64, ilp32 and aarch32 all use the same size sigset_t,
>> so doesn't compat_ptrace_request() already do everything we need?
>>
>> ...
>>
>> Is this fixing an endian problem? If so, can we document it as such. Do we
>> already have the same bug for aarch32 compat?
> 
> Originally, the problem was found by Zhou Chengming: 
> https://lkml.org/lkml/2016/6/27/18
> But I think you right, this is the fix for endian.
> 
> It lookd like aarch32 is buggy, but IIUC to confirm it, the BE arm64
> machine is needed. I use qemu and AFAIR it has no BE support.
> 
> Zhou, can you test it on your machine and if the bug will be reproduced,
> send the patch for aarch32?

I've reproduced this on big endian compat-aarch32: yes its broken. I will respin
Zhou's patch as a fix.


Thanks,

James

--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 18/20] arm64: ptrace: handle ptrace_request differently for aarch32 and ilp32

2017-06-23 Thread James Morse
Hi Yury,

On 04/06/17 13:00, Yury Norov wrote:
> ILP32 has context-related structures different from both aarch32 and
> aarch64/lp64. In this patch compat_arch_ptrace() renamed to
> compat_a32_ptrace(), and compat_arch_ptrace() only makes choice between
> compat_a32_ptrace() and new compat_ilp32_ptrace() handler.
> 
> compat_ilp32_ptrace() calls generic compat_ptrace_request() for all
> requests except PTRACE_GETSIGMASK and PTRACE_SETSIGMASK, which need
> special handling.

Can you elaborate on this special handling?

How come we don't need to wrap PTRACE_{G,S}ETSIGMASK for aarch32 compat?
>From kernel/signal32.c that uses compat_sigset_t too.

It looks like aarch64, ilp32 and aarch32 all use the same size sigset_t,
so doesn't compat_ptrace_request() already do everything we need?

...

Is this fixing an endian problem? If so, can we document it as such. Do we
already have the same bug for aarch32 compat?


Thanks,

James
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 16/20] arm64: signal32: move ilp32 and aarch32 common code to separated file

2017-06-19 Thread James Morse
Hi Yury,

On 04/06/17 13:00, Yury Norov wrote:
> Signed-off-by: Yury Norov 

Can I offer a body for the commit message:
ILP32 needs to mix 32bit struct siginfo and 64bit sigframe for its signal
handlers. Move the existing compat code for copying siginfo to user space and
manipulating signal masks into signal32_common.c so it can be used to deliver
aarch32 and ilp32 signals.


> diff --git a/arch/arm64/include/asm/signal32.h 
> b/arch/arm64/include/asm/signal32.h
> index e68fcce538e1..1c4ede717bd2 100644
> --- a/arch/arm64/include/asm/signal32.h
> +++ b/arch/arm64/include/asm/signal32.h
> @@ -13,6 +13,9 @@
>   * You should have received a copy of the GNU General Public License
>   * along with this program.  If not, see .
>   */
> +
> +#include 
> +
>  #ifndef __ASM_SIGNAL32_H
>  #define __ASM_SIGNAL32_H

Nit: This should go inside the guard.


> diff --git a/arch/arm64/kernel/signal32_common.c 
> b/arch/arm64/kernel/signal32_common.c
> new file mode 100644
> index ..5bddc25dca12
> --- /dev/null
> +++ b/arch/arm64/kernel/signal32_common.c
> @@ -0,0 +1,135 @@
[...]
> +#include 
> +#include 
> +#include 

What do you need ratelimit.h for?


> +#include 
> +
> +#include 

I can't see anything using these ESR_ macros in here...


> +#include 

This was for the VFP save/restore code, which you didn't move...


> +#include 
> +#include 

[...]


> +int copy_siginfo_to_user32(compat_siginfo_t __user *to, const siginfo_t 
> *from)
[...]
> + case __SI_FAULT:
> + err |= __put_user((compat_uptr_t)(unsigned long)from->si_addr,
> +   &to->si_addr);

This looks tricky. si_addr comes from FAR_EL1 when user-space touches something
it shouldn't. This could be a 64bit value as ilp32 processes can still branch to
64bit addresses in registers and generate loads that cross the invisible 4GB
boundary. Here you truncate the 64bit address.
Obviously this can't happen at all with aarch32, and for C programs its into
undefined-behaviour territory, but it doesn't feel right to pass an address to
user-space that we know is wrong... but we don't have an alternative.

This looks like a class of problem particular to ilp32/x32: 'accessed an address
you can't encode with a signal'. After a quick dig in x86's x32 code, it looks
like they only pass the first 32bits of si_addr too.

One option is to mint a new si_code to go with SIGBUS meaning something like
'address overflowed si_addr'. Alternatively we could just kill tasks that do 
this.


Thanks,

James
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 05/20] arm64: rename COMPAT to AARCH32_EL0 in Kconfig

2017-06-19 Thread James Morse
Hi Yury,

On 04/06/17 12:59, Yury Norov wrote:
> From: Andrew Pinski 
> 
> In this patchset  ILP32 ABI support is added. Additionally to AARCH32,
> which is binary-compatible with ARM, ILP32 is (mostly) ABI-compatible.
> 
> From now, AARCH32_EL0 (former COMPAT) config option means the support of
> AARCH32 userspace, ARM64_ILP32 - support of ILP32 ABI (see next patches),
> and COMPAT indicates that one of them, or both, is enabled.
> 
> Where needed, CONFIG_COMPAT is changed over to use CONFIG_AARCH32_EL0 instead

Nit: You have 'COMPAT' around compat_hwcap_str's definition, but its only user
is wrapped in 'AARCH32_EL0'.


After this patch
arch/arm64/kernel/perf_callchain.c::perf_callchain_user() still has:
>   if (!compat_user_mode(regs)) {
>   /* AARCH64 mode */
...
>   } else {
> #ifdef CONFIG_COMPAT
>   /* AARCH32 compat mode */
...
> #endif
>   }

I think this one should become CONFIG_AARCH32_EL0. compat to this code means the
fp is 'compat_fp' in x11, and it should read a 32bit call chain from user-space.

This is confusing as 'is_compat_task()' matches one of aarch32 or ilp32, but
compat_user_mode(regs) only matches aarch32 as it checks the saved spsr. I can't
see any problem caused by this today, but its going to bite someone in the
future. Can this be renamed aarch32_user_mode()? (turns out 'a32' is the name of
just one of aarch32's instruction sets[0].)


Thanks,

James

[0] 
http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.faqs/ka16137.html

--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 14/20] arm64: ilp32: add sys_ilp32.c and a separate table (in entry.S) to use it

2017-06-08 Thread James Morse
Hi Yury,

On 04/06/17 13:00, Yury Norov wrote:
> From: Andrew Pinski 
> 
> Add a separate syscall-table for ILP32, which dispatches either to native
> LP64 system call implementation or to compat-syscalls, as appropriate.

(I'm still reading through this series trying to understand it, but spotted 
this: )

> diff --git a/arch/arm64/kernel/entry.S b/arch/arm64/kernel/entry.S
> index 279bc2ab10c3..7d52fe1ec6bd 100644
> --- a/arch/arm64/kernel/entry.S
> +++ b/arch/arm64/kernel/entry.S
> @@ -577,6 +594,7 @@ el0_svc_compat:
>* AArch32 syscall handling
>*/
>   adrpstbl, compat_sys_call_table // load compat syscall table 
> pointer
> + ldr x16, [tsk, #TSK_TI_FLAGS]
>   uxtwscno, w7// syscall number in w7 (r7)
>   mov sc_nr, #__NR_compat_syscalls
>   b   el0_svc_naked
> @@ -798,15 +816,21 @@ ENDPROC(ret_from_fork)
>   .align  6
>  el0_svc:
>   adrpstbl, sys_call_table// load syscall table pointer
> + ldr x16, [tsk, #TSK_TI_FLAGS]
>   uxtwscno, w8// syscall number in w8
>   mov sc_nr, #__NR_syscalls
> +#ifdef CONFIG_ARM64_ILP32
> + tst x16, #_TIF_32BIT_AARCH64
> + b.eqel0_svc_naked   // We are using LP64  syscall 
> table
> + adrpstbl, sys_call_ilp32_table  // load ilp32 syscall table 
> pointer
> + delouse_input_regs
> +#endif
>  el0_svc_naked:   // compat entry point
>   stp x0, scno, [sp, #S_ORIG_X0]  // save the original x0 and 
> syscall number
>   enable_dbg_and_irq
>   ct_user_exit 1
>  

> - ldr x16, [tsk, #TSK_TI_FLAGS]   // check for syscall hooks

If built with CONFIG_CONTEXT_TRACKING, ct_user_exit will call
context_tracking_user_exit(), this will clobber x16 which you depend on not
changing below:


> - tst x16, #_TIF_SYSCALL_WORK
> + tst x16, #_TIF_SYSCALL_WORK // check for syscall hooks

>   b.ne__sys_trace
>   cmp scno, sc_nr // check upper syscall limit
>   b.hsni_sys


Thanks,

James
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3 2/3] hwmon: xgene: Add hwmon driver

2016-09-09 Thread James Morse
Hi,

On 09/09/16 04:18, AKASHI Takahiro wrote:
> On Thu, Sep 08, 2016 at 11:47:59AM +0100, James Morse wrote:
>> On 08/09/16 09:14, Arnd Bergmann wrote:
>>> On Wednesday, September 7, 2016 3:37:05 PM CEST Guenter Roeck wrote:
>>>> On Wed, Sep 07, 2016 at 11:41:44PM +0200, Arnd Bergmann wrote:
>>>>> On Thursday, July 21, 2016 1:55:56 PM CEST Hoan Tran wrote:
>>>>>> +   ctx->comm_base_addr = cppc_ss->base_address;
>>>>>> +   if (ctx->comm_base_addr) {
>>>>>> +   ctx->pcc_comm_addr =
>>>>>> +   
>>>>>> acpi_os_ioremap(ctx->comm_base_addr,
>>>>>> +   cppc_ss->length);
>>>>>>
>>>>>
>>>>> This causes the arm64 allmodconfig build to fail now, according to
>>>>> kernelci:
>>>>>
>>>>>   1  ERROR: "memblock_is_memory" [drivers/hwmon/xgene-hwmon.ko] 
>>>>> undefined!
>>>>>
>>>>> Should this perhaps call ioremap() or memremap() instead?
>>>>>
>>>> Hmmm ... almost sounds to me like blaming the messenger. e7cd190385d1 
>>>> ("arm64:
>>>> mark reserved memblock regions explicitly in iomem") starts using a 
>>>> function
>>>> in acpi_os_ioremap() which is not exported. On top of that, 
>>>> memblock_is_memory()
>>>> is declared as __init_memblock, which makes me really uncomfortable.
>>>> If acpi_os_ioremap() must not be used by modules, and possibly only during
>>>> early (?) initialization, maybe its declaration should state those 
>>>> limitations ?
>>>
>>> Ah, I didn't notice that. I guess both patches were correct individually and
>>> got added to linux-next around the same time but caused allmodconfig to 
>>> blow up
>>> when used together.
>>>
>>> Adding everyone who was involved in the memblock patch to Cc here, maybe one
>>> of them has an idea what the correct fix is. There are only two other 
>>> drivers
>>> using acpi_os_ioremap() and one of them is x86-specific, so it's still 
>>> likely
>>> that drivers are not actually supposed to use this symbol. Making
>>> acpi_os_ioremap() an exported function in arm64 would also work.
>>
>> You could use acpi_os_map_iomem()/acpi_os_unmap_iomem() from acpi/acpi_io.h.
>> If there isn't an existing mapping these end up in acpi_os_ioremap(), and are
>> already EXPORT_SYMBOL_GPL().
> 
> acpi_os_ioremap() is re-defined in arm64/include/asm/acpi.h.
> 
> The problem is that, as memblock_is_memory() is declared as __init,

__init_memblock ...

... as is memblock_is_map_memory(), which we call from pfn_valid() which is
EXPORT_SYMBOL()'d
and used from modules, (e.g. mac80211.ko). So something fishy is going on...

>From include/linux/memblock.h:
> #ifdef CONFIG_ARCH_DISCARD_MEMBLOCK
> #define __init_memblock __meminit
> #define __initdata_memblock __meminitdata
> #else
> #define __init_memblock
> #define __initdata_memblock
> #endif

arm64 doesn't define ARCH_DISCARD_MEMBLOCK, so we always keep these symbols.
If we didn't, pfn_valid() would break too.


Thanks,

James


--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3 2/3] hwmon: xgene: Add hwmon driver

2016-09-08 Thread James Morse
Hi,

On 08/09/16 09:14, Arnd Bergmann wrote:
> On Wednesday, September 7, 2016 3:37:05 PM CEST Guenter Roeck wrote:
>> On Wed, Sep 07, 2016 at 11:41:44PM +0200, Arnd Bergmann wrote:
>>> On Thursday, July 21, 2016 1:55:56 PM CEST Hoan Tran wrote:
 +   ctx->comm_base_addr = cppc_ss->base_address;
 +   if (ctx->comm_base_addr) {
 +   ctx->pcc_comm_addr =
 +   
 acpi_os_ioremap(ctx->comm_base_addr,
 +   cppc_ss->length);

>>>
>>> This causes the arm64 allmodconfig build to fail now, according to
>>> kernelci:
>>>
>>>   1  ERROR: "memblock_is_memory" [drivers/hwmon/xgene-hwmon.ko] 
>>> undefined!
>>>
>>> Should this perhaps call ioremap() or memremap() instead?
>>>
>> Hmmm ... almost sounds to me like blaming the messenger. e7cd190385d1 
>> ("arm64:
>> mark reserved memblock regions explicitly in iomem") starts using a function
>> in acpi_os_ioremap() which is not exported. On top of that, 
>> memblock_is_memory()
>> is declared as __init_memblock, which makes me really uncomfortable.
>> If acpi_os_ioremap() must not be used by modules, and possibly only during
>> early (?) initialization, maybe its declaration should state those 
>> limitations ?
> 
> Ah, I didn't notice that. I guess both patches were correct individually and
> got added to linux-next around the same time but caused allmodconfig to blow 
> up
> when used together.
> 
> Adding everyone who was involved in the memblock patch to Cc here, maybe one
> of them has an idea what the correct fix is. There are only two other drivers
> using acpi_os_ioremap() and one of them is x86-specific, so it's still likely
> that drivers are not actually supposed to use this symbol. Making
> acpi_os_ioremap() an exported function in arm64 would also work.

You could use acpi_os_map_iomem()/acpi_os_unmap_iomem() from acpi/acpi_io.h.
If there isn't an existing mapping these end up in acpi_os_ioremap(), and are
already EXPORT_SYMBOL_GPL().

(I'm still waiting for allmodconfig on linux-next to finish building)


Thanks,

James


--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Prefer kASLR over Hibernation

2016-04-12 Thread James Morse
On 11/04/16 19:03, Kees Cook wrote:
> On Mon, Apr 11, 2016 at 1:00 AM, James Morse  wrote:
>> On 06/04/16 20:44, Kees Cook wrote:
>>> When building with both CONFIG_HIBERNATION and CONFIG_RANDOMIZE_BASE,
>>> one or the other must be chosen at boot-time. Until now, hibernation
>>> was selected when no choice was made on the command line.
>>>
>>> To make the security benefits of kASLR more widely available to end
>>> users (since the use of hibernation is becoming more rare and kASLR,
>>> already available on x86, will be available on arm64 and MIPS soon),
>>> this changes the default to preferring kASLR over hibernation. Users
>>> wanting hibernation can turn off kASLR by adding "nokaslr" to the kernel
>>> command line.
>>
>> While hibernate isn't yet merged for arm64, it does work with kASLR in 
>> v4.6-rc*,
>> it would be a shame to have to choose at boot time, (but that's my problem to
>> fix if/when its merged).
> 
> Ah, interesting, so they work together on arm64? (i.e. you've actually
> tested a boot loader that provides the seed for kASLR to operate?)

Almost: due to a lack of firmware support I hacked the efi stub to read a 'seed'
from a system counter.

To check it works I printed the address of 'panic' out during boot:
> [0.353712] DEBUG: &panic == ff960819a4e8

Then hibernated, and powered the board back on, the resume kernel gives:
> [0.353528] DEBUG: &panic == ff840819a4e8

But after it has restored the hibernate image, I can dig in /proc/kallsyms to
see the original value:
> root@localhost:~# cat /proc/kallsyms  | grep "T panic"
> ff960819a4e8 T panic


> Maybe RANDOMIZE_BASE_DEFAULT (default to y) and if x86 && hibernation,
> select =n. and use that for selecting it?

Looks good to me.


Thanks,

James
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Prefer kASLR over Hibernation

2016-04-11 Thread James Morse
Hi Kees,

On 06/04/16 20:44, Kees Cook wrote:
> When building with both CONFIG_HIBERNATION and CONFIG_RANDOMIZE_BASE,
> one or the other must be chosen at boot-time. Until now, hibernation
> was selected when no choice was made on the command line.
> 
> To make the security benefits of kASLR more widely available to end
> users (since the use of hibernation is becoming more rare and kASLR,
> already available on x86, will be available on arm64 and MIPS soon),
> this changes the default to preferring kASLR over hibernation. Users
> wanting hibernation can turn off kASLR by adding "nokaslr" to the kernel
> command line.

While hibernate isn't yet merged for arm64, it does work with kASLR in v4.6-rc*,
it would be a shame to have to choose at boot time, (but that's my problem to
fix if/when its merged).


> diff --git a/kernel/power/hibernate.c b/kernel/power/hibernate.c
> index fca9254280ee..be5041354b1e 100644
> --- a/kernel/power/hibernate.c
> +++ b/kernel/power/hibernate.c
> @@ -1159,6 +1164,13 @@ static int __init kaslr_nohibernate_setup(char *str)
>   return nohibernate_setup(str);
>  }
>  
> +static int __init nokaslr_hibernate_setup(char *str)
> +{
> + noresume = 0;
> + nohibernate = 0;
> + return 1;
> +}
> +
>  static int __init page_poison_nohibernate_setup(char *str)
>  {
>  #ifdef CONFIG_PAGE_POISONING_ZERO
> @@ -1183,4 +1195,5 @@ __setup("resumewait", resumewait_setup);
>  __setup("resumedelay=", resumedelay_setup);
>  __setup("nohibernate", nohibernate_setup);
>  __setup("kaslr", kaslr_nohibernate_setup);
> +__setup("nokaslr", nokaslr_hibernate_setup);

So one or the other option has to be specified at boot?

The kASLR patches for arm64 in v4.6-rc1 enable kASLR at boot if you chose to
select it at compile time. I guess this patch is preparation for doing the same
on x86?

I didn't hit this kaslr_nohibernate_setup() call during testing because of the
on by default behaviour. Is it worth exposing this via Kconfig? Something like:
ARCH_RANDOMIZE_BASE_DEFAULT_ON ?


Thanks,

James
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] arm64: erratum: Workaround for Kryo reserved system register read

2016-04-10 Thread James Morse
On 08/04/16 11:24, Marc Zyngier wrote:
> On 08/04/16 10:58, Suzuki K Poulose wrote:
>> On 07/04/16 18:31, Marc Zyngier wrote:
>>
 +  All system register encodings above use the form
 +
 +  Op0, Op1, CRn, CRm, Op2.
 +
 +  Note that some of the encodings listed above include
 +  the system register space reserved for the following
 +  identification registers which may appear in future revisions
 +  of the ARM architecture beyond ARMv8.0.
 +  This space includes:
 +  ID_AA64PFR[2-7]_EL1
 +  ID_AA64DFR[2-3]_EL1
 +  ID_AA64AFR[2-3]_EL1
 +  ID_AA64ISAR[2-7]_EL1
 +  ID_AA64MMFR[2-7]_EL1
>>
>>
>> AFAIK, the id space is unassigned. So the naming above could cause confusion
>> if the register is named something else.
> 
> It is reserved *at the moment*, but already has a defined behaviour. My
> worry is that when some new architecture revision comes around, we start
> using these registers without thinking much about it (because we should
> be able to). At this point, your SoC will catch fire and nobody will
> have a clue about the problem because it is not apparent in the code.
> 
> I'd really like to see something a bit more forward looking that covers
> that space for good.

At the risk of volunteering...
Registering these instructions with the undef hooks would be ideal, but they
won't catch this instruction abort. I guess refactor them to be generic faulting
instruction hooks, and have a list for the existing undef cases, and a new one
for this instruction abort.

This won't cover early code in head.S, or KVM code that runs at EL2. Is this
sufficient, or should any approach cover those too?


Thanks,

James
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html