Re: get_online_cpus() from a preemptible() context (bug?)

2017-11-08 Thread James Morse
Hi Peter,

On 06/11/17 21:07, Peter Zijlstra wrote:
> On Mon, Nov 06, 2017 at 06:51:35PM +0000, James Morse wrote:
>>> If you look at percpu_down_read(), you'll note it'll disable preemption
>>> before calling __percpu_down_read().
>>
>> Yes, this is how __percpu_down_read() protects the combination of it's 
>> fast/slow
>> paths.
>>
>> But next percpu_down_read() calls preempt_enable(), I can't see what stops us
>> migrating before percpu_up_read() preempt_disable()s to call 
>> __this_cpu_dec(),
>> which now affects a different variable.
>>
> 
> Ah, so the two operations that comment talks about are:
> 
> percpu_down_read_preempt_disable()
>   preempt_disable();
> 1)__this_cpu_inc(*sem->read_count);
>   if (unlikely(!rcu_sync_is_idle(&sem->rss)))
>   __percpu_down_read()
> smp_mb()
> if (likely(!smp_load_acquire(&sem->readers_block))) // false
> __percpu_up_read()
>   smp_mb()
> 2)   __this_cpu_dec(*sem->read_count);
>   rcuwait_wake_up(&sem->writer);
> preempt_enable_no_resched();
> 
> If you want more detail on this, I'll actually have to go think :-)

I think this was the answer to a much smarter question than mine!

I've tried (and failed) to break it instead. To answer my own question:

I thought this was potentially-broken because the __this_cpu_{add,dec}() out in
{get,put}_online_cpus() will operate on different per-cpu read_count variables
if we migrate. (not the pair above)

This isn't a problem as the only thing that reads the read_count is
readers_active_check(), which per_cpu_sum()s them all together before comparing
against zero. As they are all unsigned-ints it uses unsigned-overflow to do the
right thing. This even works if a CPU holding a vital part of the read_count is
offline, as per_cpu_sum() uses for_each_possible_cpu().


Thanks!

James


Re: get_online_cpus() from a preemptible() context (bug?)

2017-11-06 Thread James Morse
Hi Peter,

(combining your replies)

On 06/11/17 10:32, Peter Zijlstra wrote:
> On Fri, Nov 03, 2017 at 02:45:45PM +0000, James Morse wrote:
>> I'm trying to work out what stops a thread being pre-empted and migrated 
>> between
>> calling get_online_cpus() and put_online_cpus().

> Nothing; why would you think it would?

To stop the this_cpu_*() operations in down/up being applied on different CPUs,
affecting a different percpu:read_count.


> All those functions guarantee is
> that any CPU observed as being online says online (and its converse,
> that a CPU observed as being offline, says offline, although less people
> care about that one).


>> According to __percpu_down_read(), its the pre-empt count:
>>>  * Due to having preemption disabled the decrement happens on
>>>  * the same CPU as the increment, avoiding the
>>>  * increment-on-one-CPU-and-decrement-on-another problem.
>>
>>
>> So this:
>>> void cpus_read_lock(void)
>>> {
>>>percpu_down_read(&cpu_hotplug_lock);
>>> +
>>> +   /* Can we migrated before we release this per-cpu lock? */
>>> +   WARN_ON(preemptible());
>>>  }
>>
>> should never fire?

> It should.. You're reading a comment on __percpu_down_read() and using
> percpu_down_read(), _not_ the same function ;-)

Yes, sorry, I thought you did a better job of describing the case I'm trying to
work-out.


> If you look at percpu_down_read(), you'll note it'll disable preemption
> before calling __percpu_down_read().

Yes, this is how __percpu_down_read() protects the combination of it's fast/slow
paths.

But next percpu_down_read() calls preempt_enable(), I can't see what stops us
migrating before percpu_up_read() preempt_disable()s to call __this_cpu_dec(),
which now affects a different variable.


> And yes, that whole percpu-rwsem code is fairly magical :-)

I think I'll file this under magical. That rcu_sync_is_idle() must know
something I don't!


Thanks!

James






[PATCH 3/4] arm64: mm: Remove arch_apei_flush_tlb_one()

2017-11-06 Thread James Morse
Nothing calls arch_apei_flush_tlb_one() anymore, instead relying on
__set_fixmap() to do the invalidation. Remove it.

Move the IPI-considered-harmful comment to __set_fixmap().

Signed-off-by: James Morse 
Acked-by: Will Deacon 
Tested-by: Tyler Baicar 
---
 arch/arm64/include/asm/acpi.h | 12 
 arch/arm64/mm/mmu.c   |  4 
 2 files changed, 4 insertions(+), 12 deletions(-)

diff --git a/arch/arm64/include/asm/acpi.h b/arch/arm64/include/asm/acpi.h
index 59cca1d6ec54..32f465a80e4e 100644
--- a/arch/arm64/include/asm/acpi.h
+++ b/arch/arm64/include/asm/acpi.h
@@ -126,18 +126,6 @@ static inline const char *acpi_get_enable_method(int cpu)
  */
 #define acpi_disable_cmcff 1
 pgprot_t arch_apei_get_mem_attribute(phys_addr_t addr);
-
-/*
- * Despite its name, this function must still broadcast the TLB
- * invalidation in order to ensure other CPUs don't end up with junk
- * entries as a result of speculation. Unusually, its also called in
- * IRQ context (ghes_iounmap_irq) so if we ever need to use IPIs for
- * TLB broadcasting, then we're in trouble here.
- */
-static inline void arch_apei_flush_tlb_one(unsigned long addr)
-{
-   flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
-}
 #endif /* CONFIG_ACPI_APEI */
 
 #ifdef CONFIG_ACPI_NUMA
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index f1eb15e0e864..267d2b79d52d 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -778,6 +778,10 @@ void __init early_fixmap_init(void)
}
 }
 
+/*
+ * Unusually, this is also called in IRQ context (ghes_iounmap_irq) so if we
+ * ever need to use IPIs for TLB broadcasting, then we're in trouble here.
+ */
 void __set_fixmap(enum fixed_addresses idx,
   phys_addr_t phys, pgprot_t flags)
 {
-- 
2.15.0.rc2



[PATCH 4/4] ACPI / APEI: Remove arch_apei_flush_tlb_one()

2017-11-06 Thread James Morse
Nothing calls arch_apei_flush_tlb_one() anymore, instead relying on
__set_pte_vaddr() to do the invalidation when called from clear_fixmap()
Remove arch_apei_flush_tlb_one().

Signed-off-by: James Morse 
Reviewed-by: Borislav Petkov 
---
 arch/x86/kernel/acpi/apei.c | 5 -
 include/acpi/apei.h | 1 -
 2 files changed, 6 deletions(-)

diff --git a/arch/x86/kernel/acpi/apei.c b/arch/x86/kernel/acpi/apei.c
index ea3046e0b0cf..bb8d300fecbd 100644
--- a/arch/x86/kernel/acpi/apei.c
+++ b/arch/x86/kernel/acpi/apei.c
@@ -52,8 +52,3 @@ void arch_apei_report_mem_error(int sev, struct 
cper_sec_mem_err *mem_err)
apei_mce_report_mem_error(sev, mem_err);
 #endif
 }
-
-void arch_apei_flush_tlb_one(unsigned long addr)
-{
-   __flush_tlb_one(addr);
-}
diff --git a/include/acpi/apei.h b/include/acpi/apei.h
index c46694abea28..82c451698c98 100644
--- a/include/acpi/apei.h
+++ b/include/acpi/apei.h
@@ -50,7 +50,6 @@ int erst_clear(u64 record_id);
 
 int arch_apei_enable_cmcff(struct acpi_hest_header *hest_hdr, void *data);
 void arch_apei_report_mem_error(int sev, struct cper_sec_mem_err *mem_err);
-void arch_apei_flush_tlb_one(unsigned long addr);
 
 #endif
 #endif
-- 
2.15.0.rc2



[PATCH 1/4] ACPI / APEI: Replace ioremap_page_range() with fixmap

2017-11-06 Thread James Morse
Replace ghes_io{re,un}map_pfn_{nmi,irq}()s use of ioremap_page_range()
with __set_fixmap() as ioremap_page_range() may sleep to allocate a new
level of page-table, even if its passed an existing final-address to
use in the mapping.

The GHES driver can only be enabled for architectures that select
HAVE_ACPI_APEI: Add fixmap entries to both x86 and arm64.

clear_fixmap() does the TLB invalidation in __set_fixmap() for arm64
and __set_pte_vaddr() for x86. In each case its the same as the
respective arch_apei_flush_tlb_one().

CC: Qiang Zheng 
Reported-by: Fengguang Wu 
Suggested-by: Linus Torvalds 
Signed-off-by: James Morse 
Reviewed-by: Borislav Petkov 
Tested-by: Tyler Baicar 
Tested-by: Toshi Kani 
[ For the arm64 bits: ]
Acked-by: Will Deacon 
[ For the x86 bits: ]
Acked-by: Ingo Molnar 

---
Changes since RFC:
 * Added #ifdefs around the entries in fixmap.h
 * Added a paragraph about HAVE_ACPI_APEI to the commit message
 * Merged the first three patches for improved history,

 arch/arm64/include/asm/fixmap.h |  7 +++
 arch/x86/include/asm/fixmap.h   |  6 ++
 drivers/acpi/apei/ghes.c| 45 ++---
 3 files changed, 28 insertions(+), 30 deletions(-)

diff --git a/arch/arm64/include/asm/fixmap.h b/arch/arm64/include/asm/fixmap.h
index caf86be815ba..4052ec39e8db 100644
--- a/arch/arm64/include/asm/fixmap.h
+++ b/arch/arm64/include/asm/fixmap.h
@@ -51,6 +51,13 @@ enum fixed_addresses {
 
FIX_EARLYCON_MEM_BASE,
FIX_TEXT_POKE0,
+
+#ifdef CONFIG_ACPI_APEI_GHES
+   /* Used for GHES mapping from assorted contexts */
+   FIX_APEI_GHES_IRQ,
+   FIX_APEI_GHES_NMI,
+#endif /* CONFIG_ACPI_APEI_GHES */
+
__end_of_permanent_fixed_addresses,
 
/*
diff --git a/arch/x86/include/asm/fixmap.h b/arch/x86/include/asm/fixmap.h
index dcd9fb55e679..b0c505fe9a95 100644
--- a/arch/x86/include/asm/fixmap.h
+++ b/arch/x86/include/asm/fixmap.h
@@ -104,6 +104,12 @@ enum fixed_addresses {
FIX_GDT_REMAP_BEGIN,
FIX_GDT_REMAP_END = FIX_GDT_REMAP_BEGIN + NR_CPUS - 1,
 
+#ifdef CONFIG_ACPI_APEI_GHES
+   /* Used for GHES mapping from assorted contexts */
+   FIX_APEI_GHES_IRQ,
+   FIX_APEI_GHES_NMI,
+#endif
+
__end_of_permanent_fixed_addresses,
 
/*
diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index 3c3a37b8503b..f3269816b997 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -51,6 +51,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 
@@ -112,7 +113,7 @@ static DEFINE_MUTEX(ghes_list_mutex);
  * Because the memory area used to transfer hardware error information
  * from BIOS to Linux can be determined only in NMI, IRQ or timer
  * handler, but general ioremap can not be used in atomic context, so
- * a special version of atomic ioremap is implemented for that.
+ * the fixmap is used instead.
  */
 
 /*
@@ -126,8 +127,8 @@ static DEFINE_MUTEX(ghes_list_mutex);
 /* virtual memory area for atomic ioremap */
 static struct vm_struct *ghes_ioremap_area;
 /*
- * These 2 spinlock is used to prevent atomic ioremap virtual memory
- * area from being mapped simultaneously.
+ * These 2 spinlocks are used to prevent the fixmap entries from being used
+ * simultaneously.
  */
 static DEFINE_RAW_SPINLOCK(ghes_ioremap_lock_nmi);
 static DEFINE_SPINLOCK(ghes_ioremap_lock_irq);
@@ -159,52 +160,36 @@ static void ghes_ioremap_exit(void)
 
 static void __iomem *ghes_ioremap_pfn_nmi(u64 pfn)
 {
-   unsigned long vaddr;
phys_addr_t paddr;
pgprot_t prot;
 
-   vaddr = (unsigned long)GHES_IOREMAP_NMI_PAGE(ghes_ioremap_area->addr);
-
paddr = pfn << PAGE_SHIFT;
prot = arch_apei_get_mem_attribute(paddr);
-   ioremap_page_range(vaddr, vaddr + PAGE_SIZE, paddr, prot);
+   __set_fixmap(FIX_APEI_GHES_NMI, paddr, prot);
 
-   return (void __iomem *)vaddr;
+   return (void __iomem *) fix_to_virt(FIX_APEI_GHES_NMI);
 }
 
 static void __iomem *ghes_ioremap_pfn_irq(u64 pfn)
 {
-   unsigned long vaddr, paddr;
+   unsigned long paddr;
pgprot_t prot;
 
-   vaddr = (unsigned long)GHES_IOREMAP_IRQ_PAGE(ghes_ioremap_area->addr);
-
paddr = pfn << PAGE_SHIFT;
prot = arch_apei_get_mem_attribute(paddr);
+   __set_fixmap(FIX_APEI_GHES_IRQ, paddr, prot);
 
-   ioremap_page_range(vaddr, vaddr + PAGE_SIZE, paddr, prot);
-
-   return (void __iomem *)vaddr;
+   return (void __iomem *) fix_to_virt(FIX_APEI_GHES_IRQ);
 }
 
-static void ghes_iounmap_nmi(void __iomem *vaddr_ptr)
+static void ghes_iounmap_nmi(void)
 {
-   unsigned long vaddr = (unsigned long __force)vaddr_ptr;
-   void *base = ghes_ioremap_area->addr;
-
-   BUG_ON(vaddr != (unsigned long)GHES_IOREMAP_NMI_PAGE(base));
-   unmap_kernel_range_noflush(vaddr, PAGE_SIZE);
-   arch_apei_flush_tlb_one(vaddr);
+   clear_fixmap(FIX_APEI_GHES_NMI);
 }
 
-static void ghes_iounmap_irq(vo

[PATCH 2/4] ACPI / APEI: Remove ghes_ioremap_area

2017-11-06 Thread James Morse
Now that nothing is using the ghes_ioremap_area pages, rip them out.

Signed-off-by: James Morse 
Reviewed-by: Borislav Petkov 
Tested-by: Tyler Baicar 
---
 drivers/acpi/apei/ghes.c | 39 ++-
 1 file changed, 2 insertions(+), 37 deletions(-)

diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index f3269816b997..1a4f16b10052 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -114,19 +114,7 @@ static DEFINE_MUTEX(ghes_list_mutex);
  * from BIOS to Linux can be determined only in NMI, IRQ or timer
  * handler, but general ioremap can not be used in atomic context, so
  * the fixmap is used instead.
- */
-
-/*
- * Two virtual pages are used, one for IRQ/PROCESS context, the other for
- * NMI context (optionally).
- */
-#define GHES_IOREMAP_PAGES   2
-#define GHES_IOREMAP_IRQ_PAGE(base)(base)
-#define GHES_IOREMAP_NMI_PAGE(base)((base) + PAGE_SIZE)
-
-/* virtual memory area for atomic ioremap */
-static struct vm_struct *ghes_ioremap_area;
-/*
+ *
  * These 2 spinlocks are used to prevent the fixmap entries from being used
  * simultaneously.
  */
@@ -141,23 +129,6 @@ static atomic_t ghes_estatus_cache_alloced;
 
 static int ghes_panic_timeout __read_mostly = 30;
 
-static int ghes_ioremap_init(void)
-{
-   ghes_ioremap_area = __get_vm_area(PAGE_SIZE * GHES_IOREMAP_PAGES,
-   VM_IOREMAP, VMALLOC_START, VMALLOC_END);
-   if (!ghes_ioremap_area) {
-   pr_err(GHES_PFX "Failed to allocate virtual memory area for 
atomic ioremap.\n");
-   return -ENOMEM;
-   }
-
-   return 0;
-}
-
-static void ghes_ioremap_exit(void)
-{
-   free_vm_area(ghes_ioremap_area);
-}
-
 static void __iomem *ghes_ioremap_pfn_nmi(u64 pfn)
 {
phys_addr_t paddr;
@@ -1270,13 +1241,9 @@ static int __init ghes_init(void)
 
ghes_nmi_init_cxt();
 
-   rc = ghes_ioremap_init();
-   if (rc)
-   goto err;
-
rc = ghes_estatus_pool_init();
if (rc)
-   goto err_ioremap_exit;
+   goto err;
 
rc = ghes_estatus_pool_expand(GHES_ESTATUS_CACHE_AVG_SIZE *
  GHES_ESTATUS_CACHE_ALLOCED_MAX);
@@ -1300,8 +1267,6 @@ static int __init ghes_init(void)
return 0;
 err_pool_exit:
ghes_estatus_pool_exit();
-err_ioremap_exit:
-   ghes_ioremap_exit();
 err:
return rc;
 }
-- 
2.15.0.rc2



[PATCH 0/4] Switch GHES ioremap_page_range() to use fixmap

2017-11-06 Thread James Morse
GHES is doing ioremap_page_range() in both NMI and irq context, neither
are safe as it may sleep to allocate intermediate levels of page table.

Replace the NMI/irq GHES_IOREMAP_PAGES to use a fixmap entry each.

After this nothing uses ghes_ioremap_area or arch_apei_flush_tlb_one(),
rip them out.

This hasn't been tested on a system with x86's NOTIFY_NMI. Any more
more testing would be welcome. These patches are (still) based on rc7.

Changes since RFC:
 * Added #ifdefs around the entries in fixmap.h
 * Added a paragraph about HAVE_ACPI_APEI to the commit message
 * Merged the first three patches for improved history

I've tried to be clear with who-acked-what when merging the patches.
For reference, the arch-acks are here:
https://lkml.org/lkml/2017/11/2/254
https://lkml.org/lkml/2017/10/31/780


Thanks,

James Morse (4):
  ACPI / APEI:  Replace ioremap_page_range() with fixmap
  ACPI / APEI: Remove ghes_ioremap_area
  arm64: mm: Remove arch_apei_flush_tlb_one()
  ACPI / APEI: Remove arch_apei_flush_tlb_one()

 arch/arm64/include/asm/acpi.h   | 12 --
 arch/arm64/include/asm/fixmap.h |  7 
 arch/arm64/mm/mmu.c |  4 ++
 arch/x86/include/asm/fixmap.h   |  6 +++
 arch/x86/kernel/acpi/apei.c |  5 ---
 drivers/acpi/apei/ghes.c| 84 +
 include/acpi/apei.h |  1 -
 7 files changed, 34 insertions(+), 85 deletions(-)

-- 
2.15.0.rc2



Re: [RFC/RFT PATCH 0/6] Switch GHES ioremap_page_range() to use fixmap

2017-11-06 Thread James Morse
On 01/11/17 18:20, Kani, Toshimitsu wrote:
> On Wed, 2017-11-01 at 16:30 +0100, Borislav Petkov wrote:
>> On Wed, Nov 01, 2017 at 02:58:33PM +0000, James Morse wrote:
>>> Does anyone have an x86 machine that does firmware-first using NOTIFY_NMI?

>> AFAIK, the only one who has access to a reportedly somewhat working GHES
>> implementation is Toshi. CCed.
> 
> Thanks for the heads-up.  My x86 system only supports GHES with SCI
> error sources.  It uses MCE for synchronous error events.
> 
> So, for x86 SCI error sources:
> 
> Tested-by: Toshi Kani 

Thanks Toshi!


> nit: I think ghes_ioremap_pfn_[nmi|irq] should be renamed since they no
> longer use ioremap. 

I thought that would be too noisy...

I have some more rework to do in here before we can merge the other arm64 RAS
stuff, I can take that into account then.


Thanks,

James




Re: [RFC/RFT PATCH 3/6] ACPI / APEI: Replace ioremap_page_range() with fixmap

2017-11-06 Thread James Morse
Hi gengdongjiu

On 02/11/17 12:01, gengdongjiu wrote:
> James Morse wrote:
>> Can I take that as a 'Tested-by:'?
>>
>> These tags also let us record who has a system that can test changes to this 
>> driver.
> 
> sure.
> Thanks for the fixing.
> Qiang Zheng who is my colleague have tested it.
> 
> CC  Qiang.
> 
> Tested-by:  Qiang Zheng 

I can't do anything with this unless Qiang posts it. (I'll add to the CC list in
the next version)

>From Documentation/process/5.Posting.rst:
> Be careful in the addition of tags to your patches: only Cc: is appropriate
> for addition without the explicit permission of the person named.



Thanks,

James


get_online_cpus() from a preemptible() context (bug?)

2017-11-03 Thread James Morse
Hi Thomas, Peter,

I'm trying to work out what stops a thread being pre-empted and migrated between
calling get_online_cpus() and put_online_cpus().

According to __percpu_down_read(), its the pre-empt count:
>  * Due to having preemption disabled the decrement happens on
>  * the same CPU as the increment, avoiding the
>  * increment-on-one-CPU-and-decrement-on-another problem.


So this:
> void cpus_read_lock(void)
> {
>percpu_down_read(&cpu_hotplug_lock);
> +
> +   /* Can we migrated before we release this per-cpu lock? */
> +   WARN_ON(preemptible());
>  }

should never fire?

It does, some of the offenders:
> kmem_cache_create
> apply_workqueue_attrs
> stop_machine
> static_key_enable
> lru_add_drain_all
> __cpuhp_setup_state
> kmem_cache_shrink
> vmstat_shepherd
> __cpuhp_state_add_instance


Trying to leave preempt disabled between the down/up leads to
scheduling-while-atomic instead.

Can you point out what I've missed here?


Thanks,

James


Re: [RFC/RFT PATCH 0/6] Switch GHES ioremap_page_range() to use fixmap

2017-11-01 Thread James Morse
Hi Linus,

On 31/10/17 15:52, Linus Torvalds wrote:
> On Tue, Oct 31, 2017 at 8:38 AM, James Morse  wrote:
>>  7 files changed, 30 insertions(+), 85 deletions(-)
> 
> Lovely.
> 
> I obviously can't test it, but it looks fine. I *would* suggest just
> making the "add fixmap entries" commits with the code that actually
> uses them. There's no real reason to have two commits that just add
> two entries that aren't used yet.
> 
> If it was some meaningful helper function where a split of the commits
> makes each commit easier to read, that would be one thing.  As it is,
> the split just makes it harder to look at the history of the code (ie
> "I wonder where this was introduced - let's use 'git blame'. Oh,
> that's not useful").

I will squash the first three patches together to fix this, and add something
about HAVE_ACPI_APEI to the commit message.

(I'm still holding out for someone to test this on an x86 system with 
NOTIFY_NMI)


Thanks,

James


Re: [RFC/RFT PATCH 0/6] Switch GHES ioremap_page_range() to use fixmap

2017-11-01 Thread James Morse
Hi guys,

(+CC: Chen Gong and Huang Ying from the git log of [0])

On 31/10/17 15:38, James Morse wrote:
> RFC as I've only build-tested this on x86.

Does anyone have an x86 machine that does firmware-first using NOTIFY_NMI?


> Any more testing would be welcome.

('ls /sys/firmware/acpi/tables/', if you have a HEST and EINJ we should be able
to work something out)


Thanks,

James

[0] https://www.kernel.org/doc/Documentation/acpi/apei/einj.txt


Re: [RFC/RFT PATCH 3/6] ACPI / APEI: Replace ioremap_page_range() with fixmap

2017-11-01 Thread James Morse
Hi gengdonjiu,

On 01/11/17 04:13, gengdongjiu wrote:
> On 2017/10/31 23:38, James Morse wrote:
>> CC'd people I've seen posting CPER log fragments, could you give this a
>> test on your platforms?

> Thanks for the fixing, not found obviously issue.

Can I take that as a 'Tested-by:'?

These tags also let us record who has a system that can test changes to this 
driver.


Thanks,

James




Re: [PATCH v1 0/3] manually add Error Synchronization Barrier at exception handler entry and exit

2017-11-01 Thread James Morse
Hi Dongjiu Geng,

On 01/11/17 19:14, Dongjiu Geng wrote:
> Some hardware platform can support RAS Extension, but not support IESB,
> such as Huawei's platform, so software need to insert Synchronization Barrier
> operations at exception handler entry.
> 
> This series patches are based on  James's series patches "SError rework +
> RAS&IESB for firmware first support". In Huawei's platform, we do not
> support IESB, so software needs to insert that.

Surely you don't implement it because your CPU doesn't need it. Can
unrecoverable errors really cross an exception without becoming an SError?

The ESB instruction does the barrier, but it also consumes any pending SError.
As it is this series will silently consume-and-discard uncontainable errors.



Thanks,

James




[RFC/RFT PATCH 4/6] ACPI / APEI: Remove ghes_ioremap_area

2017-10-31 Thread James Morse
Now that nothing is using the ghes_ioremap_area pages, rip them out.

Signed-off-by: James Morse 
---
 drivers/acpi/apei/ghes.c | 39 ++-
 1 file changed, 2 insertions(+), 37 deletions(-)

diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index f3269816b997..1a4f16b10052 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -114,19 +114,7 @@ static DEFINE_MUTEX(ghes_list_mutex);
  * from BIOS to Linux can be determined only in NMI, IRQ or timer
  * handler, but general ioremap can not be used in atomic context, so
  * the fixmap is used instead.
- */
-
-/*
- * Two virtual pages are used, one for IRQ/PROCESS context, the other for
- * NMI context (optionally).
- */
-#define GHES_IOREMAP_PAGES   2
-#define GHES_IOREMAP_IRQ_PAGE(base)(base)
-#define GHES_IOREMAP_NMI_PAGE(base)((base) + PAGE_SIZE)
-
-/* virtual memory area for atomic ioremap */
-static struct vm_struct *ghes_ioremap_area;
-/*
+ *
  * These 2 spinlocks are used to prevent the fixmap entries from being used
  * simultaneously.
  */
@@ -141,23 +129,6 @@ static atomic_t ghes_estatus_cache_alloced;
 
 static int ghes_panic_timeout __read_mostly = 30;
 
-static int ghes_ioremap_init(void)
-{
-   ghes_ioremap_area = __get_vm_area(PAGE_SIZE * GHES_IOREMAP_PAGES,
-   VM_IOREMAP, VMALLOC_START, VMALLOC_END);
-   if (!ghes_ioremap_area) {
-   pr_err(GHES_PFX "Failed to allocate virtual memory area for 
atomic ioremap.\n");
-   return -ENOMEM;
-   }
-
-   return 0;
-}
-
-static void ghes_ioremap_exit(void)
-{
-   free_vm_area(ghes_ioremap_area);
-}
-
 static void __iomem *ghes_ioremap_pfn_nmi(u64 pfn)
 {
phys_addr_t paddr;
@@ -1270,13 +1241,9 @@ static int __init ghes_init(void)
 
ghes_nmi_init_cxt();
 
-   rc = ghes_ioremap_init();
-   if (rc)
-   goto err;
-
rc = ghes_estatus_pool_init();
if (rc)
-   goto err_ioremap_exit;
+   goto err;
 
rc = ghes_estatus_pool_expand(GHES_ESTATUS_CACHE_AVG_SIZE *
  GHES_ESTATUS_CACHE_ALLOCED_MAX);
@@ -1300,8 +1267,6 @@ static int __init ghes_init(void)
return 0;
 err_pool_exit:
ghes_estatus_pool_exit();
-err_ioremap_exit:
-   ghes_ioremap_exit();
 err:
return rc;
 }
-- 
2.15.0.rc2



[RFC/RFT PATCH 3/6] ACPI / APEI: Replace ioremap_page_range() with fixmap

2017-10-31 Thread James Morse
Replace ghes_io{re,un}map_pfn_{nmi,irq}()s use of ioremap_page_range()
with __set_fixmap() as ioremap_page_range() may sleep to allocate a new
level of page-table, even if its passed an existing final-address to
use in the mapping.

clear_fixmap() does the TLB invalidation in __set_fixmap() for arm64
and __set_pte_vaddr() for x86. In each case its the same as the
respective arch_apei_flush_tlb_one().

Reported-by: Fengguang Wu 
Suggested-by: Linus Torvalds 
Signed-off-by: James Morse 
CC: Tyler Baicar 
CC: Dongjiu Geng 
CC: Xie XiuQi 

---
CC'd people I've seen posting CPER log fragments, could you give this a
test on your platforms?

 drivers/acpi/apei/ghes.c | 45 +++--
 1 file changed, 15 insertions(+), 30 deletions(-)

diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index 3c3a37b8503b..f3269816b997 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -51,6 +51,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 
@@ -112,7 +113,7 @@ static DEFINE_MUTEX(ghes_list_mutex);
  * Because the memory area used to transfer hardware error information
  * from BIOS to Linux can be determined only in NMI, IRQ or timer
  * handler, but general ioremap can not be used in atomic context, so
- * a special version of atomic ioremap is implemented for that.
+ * the fixmap is used instead.
  */
 
 /*
@@ -126,8 +127,8 @@ static DEFINE_MUTEX(ghes_list_mutex);
 /* virtual memory area for atomic ioremap */
 static struct vm_struct *ghes_ioremap_area;
 /*
- * These 2 spinlock is used to prevent atomic ioremap virtual memory
- * area from being mapped simultaneously.
+ * These 2 spinlocks are used to prevent the fixmap entries from being used
+ * simultaneously.
  */
 static DEFINE_RAW_SPINLOCK(ghes_ioremap_lock_nmi);
 static DEFINE_SPINLOCK(ghes_ioremap_lock_irq);
@@ -159,52 +160,36 @@ static void ghes_ioremap_exit(void)
 
 static void __iomem *ghes_ioremap_pfn_nmi(u64 pfn)
 {
-   unsigned long vaddr;
phys_addr_t paddr;
pgprot_t prot;
 
-   vaddr = (unsigned long)GHES_IOREMAP_NMI_PAGE(ghes_ioremap_area->addr);
-
paddr = pfn << PAGE_SHIFT;
prot = arch_apei_get_mem_attribute(paddr);
-   ioremap_page_range(vaddr, vaddr + PAGE_SIZE, paddr, prot);
+   __set_fixmap(FIX_APEI_GHES_NMI, paddr, prot);
 
-   return (void __iomem *)vaddr;
+   return (void __iomem *) fix_to_virt(FIX_APEI_GHES_NMI);
 }
 
 static void __iomem *ghes_ioremap_pfn_irq(u64 pfn)
 {
-   unsigned long vaddr, paddr;
+   unsigned long paddr;
pgprot_t prot;
 
-   vaddr = (unsigned long)GHES_IOREMAP_IRQ_PAGE(ghes_ioremap_area->addr);
-
paddr = pfn << PAGE_SHIFT;
prot = arch_apei_get_mem_attribute(paddr);
+   __set_fixmap(FIX_APEI_GHES_IRQ, paddr, prot);
 
-   ioremap_page_range(vaddr, vaddr + PAGE_SIZE, paddr, prot);
-
-   return (void __iomem *)vaddr;
+   return (void __iomem *) fix_to_virt(FIX_APEI_GHES_IRQ);
 }
 
-static void ghes_iounmap_nmi(void __iomem *vaddr_ptr)
+static void ghes_iounmap_nmi(void)
 {
-   unsigned long vaddr = (unsigned long __force)vaddr_ptr;
-   void *base = ghes_ioremap_area->addr;
-
-   BUG_ON(vaddr != (unsigned long)GHES_IOREMAP_NMI_PAGE(base));
-   unmap_kernel_range_noflush(vaddr, PAGE_SIZE);
-   arch_apei_flush_tlb_one(vaddr);
+   clear_fixmap(FIX_APEI_GHES_NMI);
 }
 
-static void ghes_iounmap_irq(void __iomem *vaddr_ptr)
+static void ghes_iounmap_irq(void)
 {
-   unsigned long vaddr = (unsigned long __force)vaddr_ptr;
-   void *base = ghes_ioremap_area->addr;
-
-   BUG_ON(vaddr != (unsigned long)GHES_IOREMAP_IRQ_PAGE(base));
-   unmap_kernel_range_noflush(vaddr, PAGE_SIZE);
-   arch_apei_flush_tlb_one(vaddr);
+   clear_fixmap(FIX_APEI_GHES_IRQ);
 }
 
 static int ghes_estatus_pool_init(void)
@@ -360,10 +345,10 @@ static void ghes_copy_tofrom_phys(void *buffer, u64 
paddr, u32 len,
paddr += trunk;
buffer += trunk;
if (in_nmi) {
-   ghes_iounmap_nmi(vaddr);
+   ghes_iounmap_nmi();
raw_spin_unlock(&ghes_ioremap_lock_nmi);
} else {
-   ghes_iounmap_irq(vaddr);
+   ghes_iounmap_irq();
spin_unlock_irqrestore(&ghes_ioremap_lock_irq, flags);
}
}
-- 
2.15.0.rc2



[RFC/RFT PATCH 5/6] arm64: mm: Remove arch_apei_flush_tlb_one()

2017-10-31 Thread James Morse
Nothing calls arch_apei_flush_tlb_one() anymore, instead relying on
__set_fixmap() to do the invalidation. Remove it.

Move the IPI-considered-harmful comment to __set_fixmap().

Signed-off-by: James Morse 
---
 arch/arm64/include/asm/acpi.h | 12 
 arch/arm64/mm/mmu.c   |  4 
 2 files changed, 4 insertions(+), 12 deletions(-)

diff --git a/arch/arm64/include/asm/acpi.h b/arch/arm64/include/asm/acpi.h
index 59cca1d6ec54..32f465a80e4e 100644
--- a/arch/arm64/include/asm/acpi.h
+++ b/arch/arm64/include/asm/acpi.h
@@ -126,18 +126,6 @@ static inline const char *acpi_get_enable_method(int cpu)
  */
 #define acpi_disable_cmcff 1
 pgprot_t arch_apei_get_mem_attribute(phys_addr_t addr);
-
-/*
- * Despite its name, this function must still broadcast the TLB
- * invalidation in order to ensure other CPUs don't end up with junk
- * entries as a result of speculation. Unusually, its also called in
- * IRQ context (ghes_iounmap_irq) so if we ever need to use IPIs for
- * TLB broadcasting, then we're in trouble here.
- */
-static inline void arch_apei_flush_tlb_one(unsigned long addr)
-{
-   flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
-}
 #endif /* CONFIG_ACPI_APEI */
 
 #ifdef CONFIG_ACPI_NUMA
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index f1eb15e0e864..267d2b79d52d 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -778,6 +778,10 @@ void __init early_fixmap_init(void)
}
 }
 
+/*
+ * Unusually, this is also called in IRQ context (ghes_iounmap_irq) so if we
+ * ever need to use IPIs for TLB broadcasting, then we're in trouble here.
+ */
 void __set_fixmap(enum fixed_addresses idx,
   phys_addr_t phys, pgprot_t flags)
 {
-- 
2.15.0.rc2



[RFC/RFT PATCH 6/6] ACPI / APEI: Remove arch_apei_flush_tlb_one()

2017-10-31 Thread James Morse
Nothing calls arch_apei_flush_tlb_one() anymore, instead relying on
__set_pte_vaddr() to do the invalidation when called from clear_fixmap()
Remove arch_apei_flush_tlb_one().

Signed-off-by: James Morse 
---
 arch/x86/kernel/acpi/apei.c | 5 -
 include/acpi/apei.h | 1 -
 2 files changed, 6 deletions(-)

diff --git a/arch/x86/kernel/acpi/apei.c b/arch/x86/kernel/acpi/apei.c
index ea3046e0b0cf..bb8d300fecbd 100644
--- a/arch/x86/kernel/acpi/apei.c
+++ b/arch/x86/kernel/acpi/apei.c
@@ -52,8 +52,3 @@ void arch_apei_report_mem_error(int sev, struct 
cper_sec_mem_err *mem_err)
apei_mce_report_mem_error(sev, mem_err);
 #endif
 }
-
-void arch_apei_flush_tlb_one(unsigned long addr)
-{
-   __flush_tlb_one(addr);
-}
diff --git a/include/acpi/apei.h b/include/acpi/apei.h
index c46694abea28..82c451698c98 100644
--- a/include/acpi/apei.h
+++ b/include/acpi/apei.h
@@ -50,7 +50,6 @@ int erst_clear(u64 record_id);
 
 int arch_apei_enable_cmcff(struct acpi_hest_header *hest_hdr, void *data);
 void arch_apei_report_mem_error(int sev, struct cper_sec_mem_err *mem_err);
-void arch_apei_flush_tlb_one(unsigned long addr);
 
 #endif
 #endif
-- 
2.15.0.rc2



[RFC/RFT PATCH 1/6] arm64: fixmap: Add GHES fixmap entries

2017-10-31 Thread James Morse
GHES is switching to use fixmap for its dynamic mapping of CPER records,
to avoid using ioremap_page_range() in IRQ/NMI context.

Signed-off-by: James Morse 
---
 arch/arm64/include/asm/fixmap.h | 5 +
 1 file changed, 5 insertions(+)

diff --git a/arch/arm64/include/asm/fixmap.h b/arch/arm64/include/asm/fixmap.h
index caf86be815ba..4edcdcb01a68 100644
--- a/arch/arm64/include/asm/fixmap.h
+++ b/arch/arm64/include/asm/fixmap.h
@@ -51,6 +51,11 @@ enum fixed_addresses {
 
FIX_EARLYCON_MEM_BASE,
FIX_TEXT_POKE0,
+
+   /* Used for GHES mapping from assorted contexts */
+   FIX_APEI_GHES_IRQ,
+   FIX_APEI_GHES_NMI,
+
__end_of_permanent_fixed_addresses,
 
/*
-- 
2.15.0.rc2



[RFC/RFT PATCH 2/6] x86/mm/fixmap: Add GHES fixmap entries

2017-10-31 Thread James Morse
GHES is switching to use fixmap for its dynamic mapping of CPER records,
to avoid using ioremap_page_range() in IRQ/NMI context.

Signed-off-by: James Morse 
---
 arch/x86/include/asm/fixmap.h | 4 
 1 file changed, 4 insertions(+)

diff --git a/arch/x86/include/asm/fixmap.h b/arch/x86/include/asm/fixmap.h
index dcd9fb55e679..be3cc32db7f0 100644
--- a/arch/x86/include/asm/fixmap.h
+++ b/arch/x86/include/asm/fixmap.h
@@ -104,6 +104,10 @@ enum fixed_addresses {
FIX_GDT_REMAP_BEGIN,
FIX_GDT_REMAP_END = FIX_GDT_REMAP_BEGIN + NR_CPUS - 1,
 
+   /* Used for GHES mapping from assorted contexts */
+   FIX_APEI_GHES_IRQ,
+   FIX_APEI_GHES_NMI,
+
__end_of_permanent_fixed_addresses,
 
/*
-- 
2.15.0.rc2



[RFC/RFT PATCH 0/6] Switch GHES ioremap_page_range() to use fixmap

2017-10-31 Thread James Morse
GHES is doing ioremap_page_range() in both NMI and irq context, neither
are safe as it may sleep to allocate intermediate levels of page table.

Replace the NMI/irq GHES_IOREMAP_PAGES to use a fixmap entry each.

After this nothing uses ghes_ioremap_area or arch_apei_flush_tlb_one(),
rip them out.

RFC as I've only build-tested this on x86. For arm64 I've tested it on a
software model. Any more testing would be welcome. These patches are based
on rc7.

Thanks,

James Morse (6):
  arm64: fixmap: Add GHES fixmap entries
  x86/mm/fixmap: Add GHES fixmap entries
  ACPI / APEI:  Replace ioremap_page_range() with fixmap
  ACPI / APEI: Remove ghes_ioremap_area
  arm64: mm: Remove arch_apei_flush_tlb_one()
  ACPI / APEI: Remove arch_apei_flush_tlb_one()

 arch/arm64/include/asm/acpi.h   | 12 --
 arch/arm64/include/asm/fixmap.h |  5 +++
 arch/arm64/mm/mmu.c |  4 ++
 arch/x86/include/asm/fixmap.h   |  4 ++
 arch/x86/kernel/acpi/apei.c |  5 ---
 drivers/acpi/apei/ghes.c| 84 +
 include/acpi/apei.h |  1 -
 7 files changed, 30 insertions(+), 85 deletions(-)

-- 
2.15.0.rc2



Re: [PATCH v7 1/4] arm64: kvm: route synchronous external abort exceptions to EL2

2017-10-18 Thread James Morse
Hi Dongjiu Geng,

On 17/10/17 15:14, Dongjiu Geng wrote:
> ARMv8.2 adds a new bit HCR_EL2.TEA which controls to
> route synchronous external aborts to EL2, and adds a
> trap control bit HCR_EL2.TERR which controls to
> trap all Non-secure EL1&0 error record accesses to EL2.

The bulk of this patch is about trap-and-emulating these ERR registers, but
that's not reflected in the title:
> KVM: arm64: Emulate RAS error registers and set HCR_EL2's TERR & TEA


> This patch enables the two bits for the guest OS.
> when an synchronous abort is generated in the guest OS,
> it will trap to EL3 firmware, EL3 firmware will check the

*buzz*
This depends on SCR_EL3.EA, which this patch doesn't touch and the normal-world
can't even know about. This is what your system does, the commit message should
be about the change to Linux.

(I've said this before)


> HCR_EL2.TEA value to decide to jump to hypervisor or host
> OS. Enabling HCR_EL2.TERR makes error record access
> from guest trap to EL2.
> 
> Add some minimal emulation for RAS-Error-Record registers.
> In the emulation, ERRIDR_EL1 and ERRSELR_EL1 are zero.
> Then, the others ERX* registers are RAZ/WI.

> diff --git a/arch/arm64/include/asm/kvm_emulate.h 
> b/arch/arm64/include/asm/kvm_emulate.h
> index fe39e68..47983db 100644
> --- a/arch/arm64/include/asm/kvm_emulate.h
> +++ b/arch/arm64/include/asm/kvm_emulate.h
> @@ -47,6 +47,13 @@ static inline void vcpu_reset_hcr(struct kvm_vcpu *vcpu)
>   vcpu->arch.hcr_el2 = HCR_GUEST_FLAGS;
>   if (is_kernel_in_hyp_mode())
>   vcpu->arch.hcr_el2 |= HCR_E2H;
> + if (cpus_have_const_cap(ARM64_HAS_RAS_EXTN)) {

This ARM64_HAS_RAS_EXTN isn't in mainline, nor is it added by your series. I
know where it comes from, but other reviewers may not. If you have dependencies
on another series, please call them out in the cover letter.

This is the first cpus_have_const_cap() user in this header file, it probably 
needs:
#include 


> + /* route synchronous external abort exceptions to EL2 */
> + vcpu->arch.hcr_el2 |= HCR_TEA;
> + /* trap error record accesses */
> + vcpu->arch.hcr_el2 |= HCR_TERR;
> + }
> +
>   if (test_bit(KVM_ARM_VCPU_EL1_32BIT, vcpu->arch.features))
>   vcpu->arch.hcr_el2 &= ~HCR_RW;
>  }
> diff --git a/arch/arm64/include/asm/kvm_host.h 
> b/arch/arm64/include/asm/kvm_host.h
> index d686300..af55b3bc 100644
> --- a/arch/arm64/include/asm/kvm_host.h
> +++ b/arch/arm64/include/asm/kvm_host.h
> @@ -105,6 +105,8 @@ enum vcpu_sysreg {
>   TTBR1_EL1,  /* Translation Table Base Register 1 */
>   TCR_EL1,/* Translation Control Register */
>   ESR_EL1,/* Exception Syndrome Register */

> + ERRIDR_EL1, /* Error Record ID Register */

Page 39 of [0]: 'ERRIDR_EL1 is a 64-bit read-only ...'.


> + ERRSELR_EL1,/* Error Record Select Register */

We're emulating these as RAZ/WI, do we really need to allocate
vcpu->arch.ctxt.sys_regs space for them? If we always return 0 for ERRIDR, then
we don't need to keep ERRSELR as 'the value read back [..] is UNKNOWN'.

I think we only need space for these once their value needs to be migrated,
user-space doesn't need to know they exist until then.


>   AFSR0_EL1,  /* Auxiliary Fault Status Register 0 */
>   AFSR1_EL1,  /* Auxiliary Fault Status Register 1 */
>   FAR_EL1,/* Fault Address Register */

> diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c
> index 2e070d3..a74617b 100644
> --- a/arch/arm64/kvm/sys_regs.c
> +++ b/arch/arm64/kvm/sys_regs.c
> @@ -775,6 +775,36 @@ static bool access_pmovs(struct kvm_vcpu *vcpu, struct 
> sys_reg_params *p,
>   return true;
>  }
>  
> +static bool access_error_reg(struct kvm_vcpu *vcpu, struct sys_reg_params *p,
> +  const struct sys_reg_desc *r)
> +{
> + /* accessing ERRIDR_EL1 */
> + if (r->CRm == 3 && r->Op2 == 0) {
> + if (p->is_write)
> + vcpu_sys_reg(vcpu, ERRIDR_EL1) = 0;

As above, this register is read-only.


> + return trap_raz_wi(vcpu, p, r);

If we can get rid of the vcpu storage this just becomes trap_raz_wi() .


> + }
> +
> + /* accessing ERRSELR_EL1 */
> + if (r->CRm == 3 && r->Op2 == 1) {
> + if (p->is_write)
> + vcpu_sys_reg(vcpu, ERRSELR_EL1) = 0;
> +
> + return trap_raz_wi(vcpu, p, r);
> + }

Same here.


> +
> + /*
> +  * If ERRSELR_EL1.SEL is greater than or equal to ERRIDR_EL1.NUM,
> +  * the ERX* registers are RAZ/WI.
> +  */
> + if ((vcpu_sys_reg(vcpu, ERRSELR_EL1) & 0xff) >=
> + (vcpu_sys_reg(vcpu, ERRIDR_EL1) && 0xff))
> + return trap_raz_wi(vcpu, p, r);

The trick here is ERRIDR_EL1 is read only, KVM can choose how many records it
emulates. Let's pick zero:
> Zero indicates no records can be accessed through the Error Record System
> registers.

Which lets 

Re: [PATCH v5 2/2] acpi: apei: Add SEI notification type support for ARMv8

2017-10-18 Thread James Morse
Hi Dongjiu Geng,

On 17/10/17 09:02, Dongjiu Geng wrote:
> ARMv8.2 requires implementation of the RAS extension, in
> this extension it adds SEI(SError Interrupt) notification
> type, this patch adds new GHES error source SEI handling
> functions.

This paragraph is merging two things that aren't related.
The 'ARM v8.2 architecture extensions' have some RAS bits, which if your CPU
implements v8.2 are required.

ACPIv6.1 added NOTIFY_SEI as a notification type for ARMv8 systems.

This patch adds a GHES function for NOTIFY_SEI. Please leave the CPU RAS
extensions out of it.


> Because this error source parsing and handling
> methods are similar with the SEA. So share some SEA handling
> functions with the SEI
> 
> Expose one API ghes_notify_abort() to external users. External
> modules can call this exposed API to parse and handle the
> SEA or SEI.

This series doesn't add a caller/user for this new API, so why do we need to do
this now?

(I still haven't had a usable answer for 'what does your firmware do when SError
is masked', but I'll go beat that drum on the other thread).


More important for the APEI code is: How do SEA and SEI interact?

As far as I can see they can both interrupt each other, which isn't something
the single in_nmi() path in APEI can handle. I thinks we should fix this first.
(I'll try and polish my RFC that had a stab at that...)


SEA gets away with a lot of things because its synchronous. SEI isn't. Xie XiuQi
pointed to the memory_failure_queue() code. We can use this directly from SEA,
but not SEI. (what happens if an SError arrives while we are queueing
memory_failure work from an IRQ).

The one that scares me is the trace-point reporting stuff. What happens if an
SError arrives while we are enabling a trace point? (these are static-keys 
right?)


I don't think we can just plumb SEI in like this and be done with it.
(I'm looking at teasing out the estatus cache code from being x86:NMI only. This
way we solve the same 'cant do this from NMI context' with the same code'.)


Thanks,

James



boring nits below:

> Note: For the SEI(SError Interrupt), it is asynchronous external
> abort, the error address recorded by firmware may be not accurate.
> If not accurate, EL3 firmware needs to identify the address to a
> invalid value.

This paragraph keeps cropping up. Who expects an address with an SError?
We don't get one for IRQs, but that never needs stating.


> Cc: Borislav Petkov 
> Cc: James Morse 
> Signed-off-by: Dongjiu Geng 
> Tested-by: Tyler Baicar 

> Tested-by: Dongjiu Geng 

(It's expected you test your own code)



> diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
> index 2509e4f..c98c1b3 100644
> --- a/arch/arm64/mm/fault.c
> +++ b/arch/arm64/mm/fault.c
> @@ -585,7 +585,7 @@ static int do_sea(unsigned long addr, unsigned int esr, 
> struct pt_regs *regs)
>   if (interrupts_enabled(regs))
>   nmi_enter();
>  
> - ret = ghes_notify_sea();
> + ret = ghes_notify_abort(ACPI_HEST_NOTIFY_SEA);
>  
>   if (interrupts_enabled(regs))
>   nmi_exit();
> @@ -682,7 +682,7 @@ int handle_guest_sea(phys_addr_t addr, unsigned int esr)
>   int ret = -ENOENT;
>  
>   if (IS_ENABLED(CONFIG_ACPI_APEI_SEA))
> - ret = ghes_notify_sea();
> + ret = ghes_notify_abort(ACPI_HEST_NOTIFY_SEA);
>  
>   return ret;
>  }
> diff --git a/drivers/acpi/apei/Kconfig b/drivers/acpi/apei/Kconfig
> index de14d49..47fcb0c 100644
> --- a/drivers/acpi/apei/Kconfig
> +++ b/drivers/acpi/apei/Kconfig
> @@ -54,6 +54,21 @@ config ACPI_APEI_SEA
> option allows the OS to look for such hardware error record, and
> take appropriate action.
>  
> +config ACPI_APEI_SEI
> + bool "APEI Asynchronous SError Interrupt logging/recovering support"
> + depends on ARM64 && ACPI_APEI_GHES
> + default y
> + help
> +   This option should be enabled if the system supports
> +   firmware first handling of SEI (asynchronous SError interrupt).
> +
> +   SEI happens with asynchronous external abort for errors on device
> +   memory reads on ARMv8 systems. If a system supports firmware first
> +   handling of SEI, the platform analyzes and handles hardware error
> +   notifications from SEI, and it may then form a HW error record for
> +   the OS to parse and handle. This option allows the OS to look for
> +   such hardware error record, and take appropriate action.
> +
>  config ACPI_APEI_MEMORY_FAILURE
>   bool "APEI memory error recovering support"
>   depends on ACPI_APEI && MEMORY_FAIL

Re: [PATCH v5 2/2] acpi: apei: Add SEI notification type support for ARMv8

2017-10-18 Thread James Morse
Hi Borislav!

On 18/10/17 10:25, Borislav Petkov wrote:
> On Wed, Oct 18, 2017 at 05:17:27PM +0800, gengdongjiu wrote:
>> Thanks Borislav, can I write it as asynchronous exception or
>> asynchronous abort?
> 
> WTF?!

Yup.


> The thing is abbreviated as "SEI" and apparently means "System Error
> Interrupt". Nothing else.

ARM has 'external abort', which are either synchronous or asynchronous, both are
delivered as different types of exception.

Asynchronous external abort is treated as a special kind of interrupt, 'SError
Interrupt', (where SError stands for System Error, but its rarely written like
that). 'SEI' is a relatively new abbreviation for SError interrupt.


What should we call this thing? In the ACPI code I'd prefer 'SEI' as that is
what the ACPI spec calls it. Here we are talking about an GHES notification.

But in the arm64 arch code this should be called SError Interrupt as this is
what the ARM-ARM calls it. This code cares about exception routing and interrupt
masking.


But, I don't really care.


Thanks,

James


Re: [PATCH v6 5/7] arm64: kvm: route synchronous external abort exceptions to el2

2017-10-16 Thread James Morse
Hi gengdongjiu,

On 14/09/17 12:12, gengdongjiu wrote:
> On 2017/9/8 0:31, James Morse wrote:
>> KVM already handles external aborts from lower exception levels, no more work
>> needs doing for TEA.
> If it is firmware first solution, that is SCR_EL3.EA=1, all SError interrupt 
> and synchronous External
> Abort exceptions are taken to EL3, so EL3 firmware will handle it, KVM no 
> needs to handle it.
> 
> HCR_EL3.TEA is only for EL3 to check its value to decide to jump to 
> hypervisor or kernel.
> 
>>
>> What happens when a guest access the RAS-Error-Record registers?
>>
>> Before we can set HCR_EL2.TERR I think we need to add some minimal emulation 
>> for
>> the registers it traps. Most of them should be RAZ/WI, so it should be
>> straightforward. (I think KVMs default is to emulate an undef for unknown 
>> traps).

> Today I added the support to do some minimal emulation for RAS-Error-Record 
> registers, thanks
> for the good suggestion.

Where can I find this patch?
I'd like to repost it as part of the SError_rework/RAS/IESB series: this is one
of the bits KVM needs but I didn't touch as it looks like your updated version
of this patch should cover it.


Thanks,

James


Re: [PATCH 2/3] arm64: vdso: replace gettimeofday.S with arm vgettimeofday.C

2017-10-11 Thread James Morse
Hi Mark,

On 11/09/17 17:18, Mark Salyzyn wrote:
> Take an effort to recode the arm64 vdso code from assembler to C
> previously submitted by Andrew Pinski , rework
> it for use in both arm and arm64, overlapping any optimizations
> for each architecture.
> 
> apin...@cavium.com writes in the original patch:
> 
> This allows the compiler to optimize the divide by 1000 and remove
> the other divides.
> 
> On ThunderX, gettimeofday improves by 32%.  On ThunderX 2,
> gettimeofday improves by 18%.

Great! How was this tested? (I can compare the numbers for Juno and Seattle)


> Note I noticed a bug in the old implementation of __kernel_clock_getres;
> it was checking only the lower 32bits of the pointer; this would work
> for most cases but could fail in a few.

Is there a separate patch for this? Once we change it to C we can't easily
backport fixes.


I can't get my head round what vgettimeofday.c looks like after your previous
patch, some comments on the rest of it:

> diff --git a/arch/arm64/kernel/vdso/Makefile b/arch/arm64/kernel/vdso/Makefile
> index 62c84f7cb01b..6c127964ee62 100644
> --- a/arch/arm64/kernel/vdso/Makefile
> +++ b/arch/arm64/kernel/vdso/Makefile
> @@ -5,16 +5,21 @@
>  # Heavily based on the vDSO Makefiles for other archs.
>  #
>  
> -obj-vdso := gettimeofday.o note.o sigreturn.o
> +obj-vdso := vgettimeofday.o note.o sigreturn.o
>  
>  # Build rules
>  targets := $(obj-vdso) vdso.so vdso.so.dbg
>  obj-vdso := $(addprefix $(obj)/, $(obj-vdso))
>  
> -ccflags-y := -shared -fno-common -fno-builtin
> +ccflags-y := -shared -fno-common -fno-builtin -fno-stack-protector
> +ccflags-y += -DDISABLE_BRANCH_PROFILING
>  ccflags-y += -nostdlib -Wl,-soname=linux-vdso.so.1 \
>   $(call cc-ldoption, -Wl$(comma)--hash-style=sysv)

> +# Force -O2 to avoid libgcc dependencies
> +CFLAGS_REMOVE_vgettimeofday.o = -pg -Os
> +CFLAGS_vgettimeofday.o = -O2 -mcmodel=tiny

How is this -O2=!libgcc enforced? Is it just co-incidence and works with today's
compilers?

We already have -nostdlib -fno-builtin -fno-common, are these not enough?

I assume that if this goes wrong we will get a link error due to:
d67703a8a69e ("arm64: kill off the libgcc dependency")


>  # Disable gcov profiling for VDSO code
>  GCOV_PROFILE := n

Once this is in C I think we need these too:
> KASAN_SANITIZE := n
> UBSAN_SANITIZE := n
> KCOV_INSTRUMENT := n


> @@ -48,15 +53,9 @@ endef
>  include/generated/vdso-offsets.h: $(obj)/vdso.so.dbg FORCE
>   $(call if_changed,vdsosym)
>  
> -# Assembly rules for the .S files
> -$(obj-vdso): %.o: %.S FORCE
> - $(call if_changed_dep,vdsoas)

Doesn't this break if_changed_dep for the other S files? (notes.S, sigreturn.S
vdso.S)


>  # Actual build commands
>  quiet_cmd_vdsold = VDSOL   $@
>cmd_vdsold = $(CC) $(c_flags) -Wl,-n -Wl,-T $^ -o $@
> -quiet_cmd_vdsoas = VDSOA   $@
> -  cmd_vdsoas = $(CC) $(a_flags) -c -o $@ $<
>  
>  # Install commands for the unstripped file
>  quiet_cmd_vdso_install = INSTALL $@

> diff --git a/arch/arm64/kernel/vdso/datapage.h 
> b/arch/arm64/kernel/vdso/datapage.h
> new file mode 100644
> index ..e627e4f4ed93
> --- /dev/null
> +++ b/arch/arm64/kernel/vdso/datapage.h
> @@ -0,0 +1,58 @@
> +/*
> + * Userspace implementations of __get_datapage
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public License
> + * along with this program.  If not, see .
> + */
> +
> +#ifndef __VDSO_DATAPAGE_H
> +#define __VDSO_DATAPAGE_H
> +
> +#include 
> +#include 
> +
> +/*
> + * We use the hidden visibility to prevent the compiler from generating a GOT
> + * relocation. Not only is going through a GOT useless (the entry couldn't 
> and
> + * mustn't be overridden by another library), it does not even work: the 
> linker
> + * cannot generate an absolute address to the data page.
> + *
> + * With the hidden visibility, the compiler simply generates a PC-relative
> + * relocation (R_ARM_REL32), and this is what we need.
> + */
> +extern const struct vdso_data _vdso_data 
> __attribute__((visibility("hidden")));
> +
> +static inline const struct vdso_data *__get_datapage(void)
> +{
> + const struct vdso_data *ret;
> + /*
> +  * This simply puts &_vdso_data into ret. The reason why we don't use
> +  * `ret = &_vdso_data` is that the compiler tends to optimise this in a
> +  * very suboptimal way: instead of keeping &_vdso_data in a register,
> +  * it goes through a relocation almost every time _vdso_data must be
> +  * acc

Re: [PATCH 1/3] arm: vdso: Rewrite gettimeofday to more closely match arm64 variant

2017-10-11 Thread James Morse
Hi Mark,

On 11/09/17 17:18, Mark Salyzyn wrote:
> Take an effort to recode the arm64 vdso code from assembler to C
> previously submitted by Andrew Pinski , rework
> it for use in both arm and arm64, overlapping any optimizations
> for each architecture.
> 
> Side effects include renaming some vdso_datapage.h variables, adding
> a few more to enhance features and functionality to match.

There are quite a lot of different things going on here, all of which are
interleaved in this patch. Could you split them out into smaller simpler 
changes?

The bits I count are:
* Pulling 32bit specific eabi and fallback macro out into compiler.h,
* Adding support for __vdso_clock_getres,
* Flipping tk_is_cntvct's logic to become use_syscall,
* Variable renames to align 32/64 bit,

...are those 'coarse' versions of the timer calls being added, or just moved 
around?


Thanks,

James



Re: [PATCH v6 6/7] KVM: arm64: allow get exception information from userspace

2017-10-06 Thread James Morse
Hi gengdongjiu,

On 27/09/17 12:07, gengdongjiu wrote:
> On 2017/9/23 0:51, James Morse wrote:
>> If this wasn't a firmware-first notification, then you're right KVM hands the
>> guest an asynchronous external abort. This could be considered a bug in KVM. 
>> (we
>> can discuss with Marc and Christoffer what it should do), but:
>>
>> I'm not sure what scenario you could see this in: surely all your
>> CPU:external-aborts are taken to EL3 by SCR_EL3.EA and become firmware-first
>> notifications. So they should always be claimed by APEI.

> Yes, if it is firmware-first we should not exist such issue.

[...]

>> What you may be seeing is some awkwardness with the change in the SError ESR
>> with v8.2. Previously the VSE mechanism injected an impdef SError, (but they
>> were all impdef so it didn't matter).
>> With VSESR_EL2 KVM has to specify one, and all-zeros is a bad choice as this
>> means 'classified as a RAS error ... unknown!'.

>> I have a patch in the upcoming SError/RAS series that changes KVMs 
>> virtual-abort
>> code to specify an impdef ESR for this path.

https://www.spinics.net/lists/arm-kernel/msg609589.html


> Before I remember Marc and you suggest me specify the an impdef ESR (set the 
> vsesr_el2) in
> the userspace,

If Qemu/kvmtool wants to emulate a machine that notifies the guest about
firmware-first RAS Errors using SError/SEI, it needs to decide when to send
these SError and what ESR to specify.


> now do you want to specify an impdef ESR in KVM instead of usrspace?

No, that patch is just trying to fixup the existing cases where KVM already
injects an SError. I'm just trying to keep the behaviour the-same:

Before the RAS Extensions the guest would always get an impdef SError ESR.
After the RAS Extensions KVM has to pick an ESR, as otherwise the guest gets the
hardware-reset value of VSESR_EL2. On the FVP this is all-zeros, which is a RAS
encoding. That patch changes it to still be an impdef SError ESR.


> if setting the vsesr_el2 in the KVM, whether user-space need to specify?

I think we need a KVM CAP API to do this. With that patch it can be wired into
pend_guest_serror(), which takes the ESR to make pending if the CPU supports it.

It's not clear to me whether its useful for user-space to make an SError pending
even if it can't set the ESR

[...]

>> Because the SEI notification depends on v8.2 I'd like to get the SError/RAS
>> series posted (currently re-testing), then I'll pick up enough of the patches
>> you've posted for a consolidated version of the series, and we can take the
>> discussion from there.

> James, it is great, we can make a consolidated version of the series.

We need to get some of the wider issues fixed first, the big-ugly one is
memory_failure_queue() not being NMI safe. (this isn't a problem for SEA as this
would only become re-entrant if the kernel text was corrupt). It is a problem
for SEI and SDEI.


>> I'd still like to know what your firmware does if the normal-world believes 
>> its
>> masked physical-SError and you want to hand it an SEI notification.

Aha, thanks for this:

> firstly the physical-SError that happened in the EL2/EL1/EL1 can not be 
> masked if SCR_EL3.EA is set.

Masked for the CPU because the CPU can deliver the SError to EL3.

What about software? Code at EL1 and EL2 each have a PSTATE.A bit they may have
set. HCR_EL2.VSE respects EL1's PSTATE.A ... the question is does your firmware
respect the PSTATE.A value of the exception level that SError are routed to?


> when trap to EL3, firmware will record the error to APEI CPER from reading 
> ERR* RAS registers.
> 
> (1) if HCR_EL2.TEA is set to 1, exception come from EL0, El1. firmware knows 
> this

HCR_EL2.TEA covers synchronous-external-aborts. For SError you need to check
HCR_EL2.AMO. Some crazy hypervisor may set one and not the other.


> SError come from guest OS, copy the elr_el3 to elr_el2, copy ESR_El3 to
> ESR_EL2.

The EC value in the ELR describes current/lower exception level, you need to
re-encode this for EL2 if the exception came from EL2.


> if the SError exception come from guest EL0 or EL1, set ELR_EL3 with 
> VBAR_EL2 + 0x580(one EL2 SEI entry point),
> 
> execute "ERET", then jump to EL2 hypervisor.
>
> (2)if the SError exception come EL2 hypervisor, copy the elr_el3 to elr_el2, 
> copy ESR_El3 to ESR_EL,
> set ELR_EL3 with VBAR_EL2+0x380(one EL2 SEI entry point),
> 
>execute "ERET", then jump to EL2 hypervisor.

This SError came from EL2. You _must_ check SPSR_EL3.A is clear before returning
to the EL2 SError vector.

EL2 believes it has masked SError, it does this because it can't handle one
right now. If your firmware 

Re: [RFC PATCH v2 19/31] KVM: arm64: Describe AT instruction emulation design

2017-10-03 Thread James Morse
Hi Jintack,

On 03/10/17 04:11, Jintack Lim wrote:
> This design overview will help to digest the subsequent patches that
> implement AT instruction emulation.

> diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c
> index 8d04926..d8728cc 100644
> --- a/arch/arm64/kvm/sys_regs.c
> +++ b/arch/arm64/kvm/sys_regs.c
> @@ -1621,6 +1621,72 @@ static bool access_id_aa64mmfr0_el1(struct kvm_vcpu *v,
>   { SYS_DESC(SYS_SP_EL2), NULL, reset_special, SP_EL2, 0},
>  };
>  
> +/*
> + * AT instruction emulation
> + *
> + * We emulate AT instructions executed in the virtual EL2.

> + * Basic strategy for the stage-1 translation emulation is to load proper
> + * context, which depends on the trapped instruction and the virtual HCR_EL2,
> + * to the EL1 virtual memory control registers and execute S1E[01] 
> instructions
> + * in EL2. See below for more detail.

What happens if the guest memory containing some stage1-page-table has been
unmapped from stage2? (e.g. its swapped to disk).

(there is some background to this: I tried to implement the kvm_translate
ioctl() using this approach, running 'at s1e1*' from EL2. I ran into problems
when parts of the guest's stage1 page tables had been unmapped from stage2.)

>From memory, I found that the AT instructions would fault-in those pages when
run from EL1, but when executing the same instruction at EL2 they just failed
without any hint of which IPA needed mapping in.

I can try digging for any left over code if we want to setup a test case for 
this...


Thanks,

James


Re: [PATCH v6 6/7] KVM: arm64: allow get exception information from userspace

2017-09-22 Thread James Morse
Hi gengdongjiu,

On 21/09/17 08:55, gengdongjiu wrote:
> On 2017/9/14 21:00, James Morse wrote:
>> user-space can choose whether to use SEA or SEI, it doesn't have to choose 
>> the
>> same notification type that firmware used, which in turn doesn't have to be 
>> the
>> same as that used by the CPU to notify firmware.
>>
>> The choice only matters because these notifications hang on an existing 
>> pieces
>> of the Arm-architecture, so the notification can only add to the 
>> architecturally
>> defined meaning. (i.e. You can only send an SEA for something that can 
>> already
>> be described as a synchronous external abort).
>>
>> Once we get to user-space, for memory_failure() notifications, (which so far 
>> is
>> all we are talking about here), the only thing that could matter is whether 
>> the
>> guest hit a PG_hwpoison page as a stage2 fault. These can be described as
>> Synchronous-External-Abort.
>>
>> The Synchronous-External-Abort/SError-Interrupt distinction matters for the 
>> CPU
>> because it can't always make an error synchronous. For memory_failure()
>> notifications to a KVM guest we really can do this, and we already have this
>> behaviour for free. An example:
>>
>> A guest touches some hardware:poisoned memory, for whatever reason the CPU 
>> can't
>> put the world back together to make this a synchronous exception, so it 
>> reports
>> it to firmware as an SError-interrupt.
>> Linux gets an APEI notification and memory_failure() causes the affected 
>> page to
>> be unmapped from the guest's stage2, and SIGBUS_MCEERR_AO sent to user-space.
>>
>> Qemu/kvmtool can now notify the guest with an IRQ or POLLed notification. 
>> AO->
>> action optional, probably asynchronous.
>>
>> But in our example it wasn't really asynchronous, that was just a property of
>> the original CPU->firmware notification. What happens? The guest vcpu is 
>> re-run,
>> it re-runs the same instructions (this was a contained error so KVM's ELR 
>> points
>> at/before the instruction that steps in the problem). This time KVM takes a
>> stage2 fault, which the mm code will refuse to fixup because the relevant 
>> page
>> was marked as PG_hwpoision by memory_failure(). KVM signals Qemu/kvmtool with
>> SIGBUS_MCEERR_AR. Now Qemu/kvmtool can notify the guest using SEA.
> 
> CC Achin
> 
> I have some personal opinion, if you think it is not right, hope you can 
> point out.
> 
> Synchronous External Abort and SError Interrupt are hardware 
> exception(hardware concept),
> which is independent of software notification,
> in armv8 without RAS, the two concepts already exist. In the APEI spec, in 
> order to
> better describe the two exceptions, so use SEA and SEI notification to stand
for them.

> SEA notification stands for Synchronous External Abort, so may be it is not 
> only a
> notification, it also stands for a hardware error type.
> SEI notification stands for SError Interrupt, so may be it is not only a 
> notification,
> it also stands for a hardware error type.

> In the OS, it has different handling flow to the two exception(two 
> notification):
> when the guest OS running, if the hardware generates a Synchronous External 
> Abort, we
> told the guest OS this error is SError Interrupt instead of Synchronous
External Abort.

This should only happen when APEI doesn't claim the external-abort as a RAS
notification. If there were CPER records to process then the error is handled by
the host, and we can return to the guest.

If this wasn't a firmware-first notification, then you're right KVM hands the
guest an asynchronous external abort. This could be considered a bug in KVM. (we
can discuss with Marc and Christoffer what it should do), but:

I'm not sure what scenario you could see this in: surely all your
CPU:external-aborts are taken to EL3 by SCR_EL3.EA and become firmware-first
notifications. So they should always be claimed by APEI.


> guest OS uses SEI notification handling flow to deal with it, I am not sure 
> whether it
> will have problem, because the true hardware exception is Synchronous External
> Abort, but software treats it as SError interrupt to handle.

Once you're into a guest the original 'true hardware exception' shouldn't
matter. In this scenario KVM has handed the guest an SError, our question is 'is
it an SEI notification?':

For firmware first the guest OS should poke around in the CPER buffers, find
nothing to do, and return to the arch code for (future) kernel-first.
For kernel first the guest OS should trawl throug

Re: [PATCH v6 6/7] KVM: arm64: allow get exception information from userspace

2017-09-22 Thread James Morse
Hi gengdongjiu,

On 18/09/17 14:36, gengdongjiu wrote:
> On 2017/9/14 21:00, James Morse wrote:
>> On 13/09/17 08:32, gengdongjiu wrote:
>>> On 2017/9/8 0:30, James Morse wrote:
>>>> On 28/08/17 11:38, Dongjiu Geng wrote:
>>>> For BUS_MCEERR_A* from memory_failure() we can't know if they are caused by
>>>> an access or not.
>>
>> Actually it looks like we can: I thought 'BUS_MCEERR_AR' could be triggered 
>> via
>> some CPER flags, but its not. The only code that flags MF_ACTION_REQUIRED is
>> x86's kernel-first handling, which nicely matches this 'direct access' 
>> problem.
>> BUS_MCEERR_AR also come from KVM stage2 faults (and the x86 equivalent). 
>> Powerpc
>> also triggers these directly, both from what look to be synchronous paths, 
>> so I
>> think its fair to equate BUS_MCEERR_AR to a synchronous access and 
>> BUS_MCEERR_AO
>> to something_else.
> 
> James, thanks for your explanation.
> can I understand that your meaning that "BUS_MCEERR_AR" stands for 
> synchronous access and BUS_MCEERR_AO stands for asynchronous access?

Not 'stands for', as the AR is Action-Required and AO Action-Optional. My point
was I can't find a case where Action-Required is used for an error that isn't
synchronous.

We should run this past the people who maintain the existing BUS_MCEERR_AR
users, in case its just a severity to them.


> Then for "BUS_MCEERR_AO", how to distinguish it is asynchronous data 
> access(SError) and PCIE AER error?

How would userspace get one of these memory errors for a PCIe error?


> In the user space, we can check the si_code, if it is "BUS_MCEERR_AR", we use 
> SEA notification type for the guest;
> if it is "BUS_MCEERR_AO", we use SEI notification type for the guest.
> Because there are only two values for si_code("BUS_MCEERR_AR" and 
> BUS_MCEERR_AO), in which case we can use the GSIV(IRQ) notification type?

This is for Qemu/kvmtool to decide, it depends on what sort of machine they are
emulating.

For example, the physical machine's memory-controller may notify the CPU about
memory errors by triggering SError trapped to EL3, or with a dedicated FIQ, also
routed to EL3. By the time this gets to the host kernel the distinction doesn't
matter. The host has handled the error.

For a guest, your memory-controller is effectively the host kernel. It will give
you an BUS_MCEERR_AO signal for any guest memory that is affected, and a
BUS_MCEERR_AR if the guest directly accesses a page of affected memory.

What Qemu/kvmtool do with this is up to them. If they're emulating a machine
with no RAS features, printing an error and exit.

Otherwise BUS_MCEERR_AR could be notified as one of the flavours of IRQ, unless
the affected vcpu has interrupts masked, in which case an SEA notification gives
you some NMI-like behaviour.

For BUS_MCEERR_AO you could use SEI, IRQ or polled notification. My choice would
be IRQ, as you can't know if the guest supports SEI and it would be a shame to
kill it with an SError if the affected memory was free. SEA for synchronous
errors is still a good choice even if the guest doesn't support it as that
memory is still gone so its still a valid guest:Synchronous-external-abort.


[...]

>>> 1. Let us firstly discuss the SEA and SEI, there are different workflow for 
>>> the two different Errors.

>> user-space can choose whether to use SEA or SEI, it doesn't have to choose 
>> the
>> same notification type that firmware used, which in turn doesn't have to be 
>> the
>> same as that used by the CPU to notify firmware.
>>
>> The choice only matters because these notifications hang on an existing 
>> pieces
>> of the Arm-architecture, so the notification can only add to the 
>> architecturally
>> defined meaning. (i.e. You can only send an SEA for something that can 
>> already
>> be described as a synchronous external abort).
>>
>> Once we get to user-space, for memory_failure() notifications, (which so far 
>> is
>> all we are talking about here), the only thing that could matter is whether 
>> the
>> guest hit a PG_hwpoison page as a stage2 fault. These can be described as
>> Synchronous-External-Abort.
>>
>> The Synchronous-External-Abort/SError-Interrupt distinction matters for the 
>> CPU
>> because it can't always make an error synchronous. For memory_failure()
>> notifications to a KVM guest we really can do this, and we already have this
>> behaviour for free. An example:
>>
>> A guest touches some hardware:poisoned memory, for whatever reason the CPU 
>> can't
>>

Re: [PATCH] arm64: Unconditionally support ARCH_HAVE_NMI_SAFE_CMPXCHG

2017-09-15 Thread James Morse
Hi Stephen,

On 15/09/17 02:19, Stephen Boyd wrote:
> From what I can see there isn't anything about ACPI_APEI_SEA that
> means the arm64 architecture can or cannot support NMI safe
> cmpxchg, so the if condition here is not important. 

Yup, it was to match 'HAVE_NMI', which was new with ACPI_APEI_SEA and pulls in
some printk() stuff.

... how come you don't need to change HAVE_NMI too?


> Let's remove
> it. Doing that allows us to support ftrace histograms via
> CONFIG_HIST_TRIGGERS that depends on the arch having this config
> selected.

What does CONFIG_HIST_TRIGGERS need this for? I can't see any
cmpxchg use in kernel/trace/trace_events_hist.c


Regardless,

Acked-by: James Morse 


Thanks,

James

> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> index 0df64a6a56d4..27ce2ab7b080 100644
> --- a/arch/arm64/Kconfig
> +++ b/arch/arm64/Kconfig
> @@ -21,7 +21,7 @@ config ARM64
>   select ARCH_HAS_STRICT_KERNEL_RWX
>   select ARCH_HAS_STRICT_MODULE_RWX
>   select ARCH_HAS_TICK_BROADCAST if GENERIC_CLOCKEVENTS_BROADCAST
> - select ARCH_HAVE_NMI_SAFE_CMPXCHG if ACPI_APEI_SEA
> + select ARCH_HAVE_NMI_SAFE_CMPXCHG
>   select ARCH_USE_CMPXCHG_LOCKREF
>   select ARCH_SUPPORTS_MEMORY_FAILURE
>   select ARCH_SUPPORTS_ATOMIC_RMW
> 



Re: [PATCH v3 1/3] arm64/ras: support sea error recovery

2017-09-15 Thread James Morse
Hi Xie XiuQi,

On 11/09/17 15:11, Xie XiuQi wrote:
> I first describe the approach of this patchset:
> 
> A memory access error on the execution path usually triggers SEA.
> According to the existing process, errors occurred in the kernel,
> leading to direct panic, if it occurred the user-space, we should
> just kill process.
> 
> But there is a class of error, in fact, is not necessary to kill
> process, you can recover and continue to run the process. Such as
> the instruction data corrupted, where the memory page might be
> read-only, which is has not been modified, the disk might have the
> correct data, so you can directly drop the page, ant reload it when
> necessary.
> 
> So this patchset is just try to solve such problem: if the error is
> consumed in user-space and the error occurs on a clean page, you can
> directly drop the memory page without killing process.
> 
> This is implemented in memory_failure, which is generic process.

> The error reported by SEA should be handled before re-enter the process,
> or we must kill the process to prevent error propagation.
> 
> memory_failure_queue() is asynchronous, in which, error info was saved
> at ghes_proc, but handled in kworker. During this period there is a context
> switching, so we can not determine which process would be switch to. So
> memory_failure_queue is not suitable for handling the problem.

Thanks for this summary. I see the problem you're trying to solve is when
memory_failure() runs, in your scenario its not guaranteed to run before we
return to user space.

What is the user-visible symptom of this? SIGBUS, code=0 instead of SIGBUS,
code=...MCEERR_A?

..in which case I'm looking at this as a race with the memory_failure() bottom
half via schedule_work().

How does x86 avoid this same problem?


> And memory_failure is not nmi-safe, so it can not be called directly in the
> SEA context. So we just handle this error at SEA exit path, and before context
> switching.

(I need to look more into which locks memory_failure() is taking)


> In FFH mode, physical address can only be obtained by parsing the GHES table.
> But we only care about SEA, so the error handling is tied to the type of 
> notification.

I care about all the notification methods. Once the notification has been passed
to APEI I want them to behave the same so that we don't have subtle bugs between
the 11 different ways we could get a notification. This code is rarely tested
enough as it is.

>From the arch code I just want to call out to APEI asking 'is this yours?'. If
so I expect APEI to have done all the work, if not we take the v8.0 behaviour.


Here you need APEI and the arch code to spot 'SEA' and treat it differently,
invoking some arm64-specific behaviour for APEI, and some
not-really-arch-specific code under /arch/arm64. There is nothing arm64 specific
about your arm_process_error(), how come the core APEI code doesn't need to do 
this?


I think this is caused by the way memory_failure() schedules its work, and that
is where I'd like to try and fix this, so that its the same for all notification
methods and all (cough: both) architectures.


> The TIF flag is checked on a generic path, but it will only be set when SEA 
> occurs.
> And if we use unlikely optimization, it should have little impact on 
> performance.

Yes, the arch code checks _TIF_WORK_MASK in one go so there is no performance
problem for code that hasn't taken the RAS-Error. (and once we've taken a RAS
error performance is out the window!)


> And the TIF flag approach was used on x86 platform for years, until commit 
> d4812e169d

... so x86 doesn't do this ...

> (x86, mce: Get rid of TIF_MCE_NOTIFY and associated mce tricks)[0]. On 
> currently arm64
> platform, there is no IST interrupt[1] function, so we could not call 
> memory_failure
> directly in SEA context. So the way to use TIF notification, is also a good 
> choice,
> after all, the same way on x86 platform is verified.

Thanks, looks like I need to read more of the history of x86's kernel-first
handling...


Thanks,

James



Re: [PATCH v6 6/7] KVM: arm64: allow get exception information from userspace

2017-09-14 Thread James Morse
Hi gengdongjiu,

(re-ordered hunks)

On 13/09/17 08:32, gengdongjiu wrote:
> On 2017/9/8 0:30, James Morse wrote:
>> On 28/08/17 11:38, Dongjiu Geng wrote:
>> For BUS_MCEERR_A* from memory_failure() we can't know if they are caused by
>> an access or not.

Actually it looks like we can: I thought 'BUS_MCEERR_AR' could be triggered via
some CPER flags, but its not. The only code that flags MF_ACTION_REQUIRED is
x86's kernel-first handling, which nicely matches this 'direct access' problem.
BUS_MCEERR_AR also come from KVM stage2 faults (and the x86 equivalent). Powerpc
also triggers these directly, both from what look to be synchronous paths, so I
think its fair to equate BUS_MCEERR_AR to a synchronous access and BUS_MCEERR_AO
to something_else.

I don't think we need anything else.


>> When the mm code gets -EHWPOISON when trying to resolve a
>
> Because of that, so I allow  userspace getting exception information

... and there are cases where you can't get the exception information, and other
cases where it wasn't an exception at all.

[...]


>> What happens if the dram-scrub hardware spots an error in guest memory, but
>> the guest wasn't running? KVM won't have a relevant ESR value to give you.

> if the dram-scrub hardware spots an error in guest memory, it will generate
> IRQ in DDR controller, not SEA or SEI exception. I still do not consider the
> GSIV. For GSIV, may be we can only handle it in the host OS.

Great example: this IRQ pulls us out of a guest, we tromp through APEI and then
memory_failure(), the memory happened to belong to the same guest
(coincidence!), we send it some signal and now its user-space's problem.

Your KVM_REG_ARM64_FAULT mechanism is going to return stale data, even though
the notification interrupted the guest, and it was guest memory that was
affected. KVM doesn't have a relevant ESR.


I'm strongly against exposing 'which notification type' this error originally
came from because:
* it doesn't matter once we've got the CPER records,
* there isn't always an answer (there are/will-be other ways of tripping
  memory_failure())
* it creates ABI between firwmare, host userspace and guest userspace.
  Firmware's choice of notification type shouldn't affect anything other than
  the host kernel.


On 13/09/17 08:32, gengdongjiu wrote:
> On 2017/9/8 0:30, James Morse wrote:
>> On 28/08/17 11:38, Dongjiu Geng wrote:
>>> when userspace gets SIGBUS signal, it does not know whether
>>> this is a synchronous external abort or SError,
>>
>> Why would Qemu/kvmtool need to know if the original notification (if there 
>> was
>> one) was synchronous or asynchronous? This is between firmware and the 
>> kernel.

> there are two reasons:
> 
> 1. Let us firstly discuss the SEA and SEI, there are different workflow for 
> the two different Errors.
> 2. when record the CPER in the user space, it needs to know the error type, 
> because SEA and SEI are different Error source,
>so they have different offset in the APEI table, that is to say they will 
> be recorded to different place of the APEI table.

user-space can choose whether to use SEA or SEI, it doesn't have to choose the
same notification type that firmware used, which in turn doesn't have to be the
same as that used by the CPU to notify firmware.

The choice only matters because these notifications hang on an existing pieces
of the Arm-architecture, so the notification can only add to the architecturally
defined meaning. (i.e. You can only send an SEA for something that can already
be described as a synchronous external abort).

Once we get to user-space, for memory_failure() notifications, (which so far is
all we are talking about here), the only thing that could matter is whether the
guest hit a PG_hwpoison page as a stage2 fault. These can be described as
Synchronous-External-Abort.

The Synchronous-External-Abort/SError-Interrupt distinction matters for the CPU
because it can't always make an error synchronous. For memory_failure()
notifications to a KVM guest we really can do this, and we already have this
behaviour for free. An example:

A guest touches some hardware:poisoned memory, for whatever reason the CPU can't
put the world back together to make this a synchronous exception, so it reports
it to firmware as an SError-interrupt.
Linux gets an APEI notification and memory_failure() causes the affected page to
be unmapped from the guest's stage2, and SIGBUS_MCEERR_AO sent to user-space.

Qemu/kvmtool can now notify the guest with an IRQ or POLLed notification. AO->
action optional, probably asynchronous.

But in our example it wasn't really asynchronous, that was just a property of
the original CPU->firmware notification. What happens?

Re: [PATCH v6 5/7] arm64: kvm: route synchronous external abort exceptions to el2

2017-09-14 Thread James Morse
Hi gengdongjiu,

On 14/09/17 12:12, gengdongjiu wrote:
> On 2017/9/8 0:31, James Morse wrote:
>> KVM already handles external aborts from lower exception levels, no more work
>> needs doing for TEA.

> If it is firmware first solution, that is SCR_EL3.EA=1, all SError interrupt 
> and synchronous External
> Abort exceptions are taken to EL3, so EL3 firmware will handle it, KVM no 
> needs to handle it.

... and presumably your firmware generates a fake-Synchronous-external-abort to
hand to EL2 as an APEI SEA notification? My point: this is fine, KVM already
handles synchronous-external aborts, no more work needed for this trap, (in
contrast to the TERR, which you've fixed)


> HCR_EL3.TEA is only for EL3 to check its value to decide to jump to 
> hypervisor or kernel.

HCR_EL3!?!


>> What happens when a guest access the RAS-Error-Record registers?
>>
>> Before we can set HCR_EL2.TERR I think we need to add some minimal emulation 
>> for
>> the registers it traps. Most of them should be RAZ/WI, so it should be
>> straightforward. (I think KVMs default is to emulate an undef for unknown 
>> traps).

> Today I added the support to do some minimal emulation for RAS-Error-Record 
> registers, thanks
> for the good suggestion.

Thanks. Software has the bad habit of living much longer than we think, if KVM
traps part of the architecture then we have to emulate it... Some bright spark
might boot a future Linux-v4.42 guest on a Linux-v4.16 host.

I had a run through the RAS spec: if we make ERRIDR_EL1 RAZ/WI then we can do
the same with ERRSELR_EL1. Then following the rules for 'If ERRSELR_EL1.SEL is
[>=]  ERRIDR_EL1.NUM' that makes the ERX* registers RAZ/WI too.


>> Eventually we will want to back this with a page of memory that lets
>> Qemu/kvmtool configure what the guest can see. (i.e. the emulated machine's
>> errors for kernel-first handling.)


Thanks,

James


Re: [PATCH v6 3/7] acpi: apei: remove the unused code

2017-09-14 Thread James Morse
Hi gengdongjiu,

On 11/09/17 13:04, gengdongjiu wrote:
> On 2017/9/9 2:17, James Morse wrote:
>> On 04/09/17 12:43, gengdongjiu wrote:
>>> On 2017/9/1 1:50, James Morse wrote:
>>>> On 28/08/17 11:38, Dongjiu Geng wrote:
>>>> If you aren't handling the notification, why is this is in the HEST at all?
>>>> (and if its not: its not firmware-first)
>>
>>> For the SEI notification, may be we can parse and handle the CPER record 
>>> other than the Error physical address
>>
>> Sure, but I only see this cleanup patch in this series, where does APEI learn
>> about NOTIFY_SEI? As this is nothing will ever touch those CPER records, if
>> you're using GHESv2 firmware will be prevented from delivering subsequent
>> notifications.

> James, whether it is possible you can review the previous v5 patch which adds 
> the support for

Spreading 'current discussion' over two versions is a problem for anyone trying
to follow this series.

If you post a newer version its normal for people to delete the older versions.
When you post a new version you should be happy that its the latest and 
greatest.


> NOTIFY_SEI? thanks in advancecIn that patch, I share the SEI notification
handling with the SEA
> notification handling to avoid duplicated code.

You may be able to share some of the code, but I don't think you should share
the list of GHES between notification methods.
This leads to races between the firmware and OS: If CPU-A has received an SEI
firmware would have to avoid generating an SEA on CPU-B as the SEI-handler
running on CPU-A may find and process the second set of CPER records. CPU-B then
gets a spurious notification.

Why is this a problem? KVM needs to know if APEI handled the error, or whether
the Synchronous-External-Abort/SError-Interrupt was due to something else, in
which case we invoke todays default behaviour, which isn't appropriate for a RAS
event.


Thanks,

James


Re: [PATCH v2] arm64: fix unwind_frame() for filtered out fn for function graph tracing

2017-09-12 Thread James Morse
Hi Pratyush,

On 01/09/17 06:48, Pratyush Anand wrote:
> do_task_stat() calls get_wchan(), which further does unbind_frame().
> unbind_frame() restores frame->pc to original value in case function
> graph tracer has modified a return address (LR) in a stack frame to hook
> a function return. However, if function graph tracer has hit a filtered
> function, then we can't unwind it as ftrace_push_return_trace() has
> biased the index(frame->graph) with a 'huge negative'
> offset(-FTRACE_NOTRACE_DEPTH).
> 
> Moreover, arm64 stack walker defines index(frame->graph) as unsigned
> int, which can not compare a -ve number.
> 
> Similar problem we can have with calling of walk_stackframe() from
> save_stack_trace_tsk() or dump_backtrace().
> 
> This patch fixes unwind_frame() to test the index for -ve value and
> restore index accordingly before we can restore frame->pc.

I've just spotted arm64's profile_pc, which does this:
>From arch/arm64/kernel/time.c:profile_pc():
> #ifdef CONFIG_FUNCTION_GRAPH_TRACER
>   frame.graph = -1; /* no task info */
> #endif

Is this another elaborate way of hitting this problem?

I guess the options are skip any return-address restore in the unwinder if
frame.graph is -1. (and profile_pc may have a bug here). Or, put
current->curr_ret_stack in there.

profile_pc() always passes tsk=NULL, so the unwinder assumes its current...
kernel/profile.c pulls the pt_regs from a per-cpu irq_regs variable, that is
updated by handle_IPI ... so it looks like this should always be current...


Thanks,

James


> diff --git a/arch/arm64/kernel/stacktrace.c b/arch/arm64/kernel/stacktrace.c
> index 09d37d66b630..4c47147d0554 100644
> --- a/arch/arm64/kernel/stacktrace.c
> +++ b/arch/arm64/kernel/stacktrace.c
> @@ -75,6 +75,9 @@ int notrace unwind_frame(struct task_struct *tsk, struct 
> stackframe *frame)
>  #ifdef CONFIG_FUNCTION_GRAPH_TRACER
>   if (tsk->ret_stack &&
>   (frame->pc == (unsigned long)return_to_handler)) {
> + if (frame->graph < 0)
> + frame->graph += FTRACE_NOTRACE_DEPTH;
> +
>   /*
>* This is a case where function graph tracer has
>* modified a return address (LR) in a stack frame
> 






Re: [PATCH v6 3/7] acpi: apei: remove the unused code

2017-09-08 Thread James Morse
Hi gengdongjiu,

On 04/09/17 12:43, gengdongjiu wrote:
> On 2017/9/1 1:50, James Morse wrote:
>> On 28/08/17 11:38, Dongjiu Geng wrote:
>>> In current code logic, the two functions ghes_sea_add() and
>>> ghes_sea_remove() are only called when CONFIG_ACPI_APEI_SEA
>>> is defined. If not, it will return errors in the ghes_probe()
>>> and not contiue. Hence, remove the unnecessary handling when
>>> CONFIG_ACPI_APEI_SEI is not defined.
>>
>> This doesn't match what the patch does. I get this feeling this is needed for
>> some future patch you haven't included.
> 
> James, let check the code, when the ghes_probe, if the CONFIG_ACPI_APEI_SEA 
> is not defined.
> it will return -ENOTSUPP and goto error, and the ghes_sea_add has no chance 
> to execute.
> similar, if the probe is failed, it should not have chance to execute 
> ghes_sea_remove.

It's the 'unnecessary handling when CONFIG_ACPI_APEI_SEI' in the commit message
that confuses me: this patch doesn't reference that Kconfig symbol. I guess that
sentence needs removing for this v6?

Re-reading without that part of the commit-message:

You're relying on the compiler's dead-code elimination to spot unused static
functions and silently drop them. Great!
(there is the small risk that gcc 3.2[0] can't do this, x86 still has to support
this gcc version)

As this is just clean-up patch can you break it out of this series, it isn't
needed to add support for SEI.

(This series adds support for what should be an APEI notification, but the only
code that touches APEI removes some code from a different notification method.)


>> Setting NOTIFY_SEI as the GHES entry's notification type means the OS should
>> check the GHES->ErrorStatusAddress for CPER records when it receives an
>> SError-Interrupt, as it may be a notification of an error from this error 
>> source.

> previously I added the NOTIFY_SEI support,

(Yes, I saw that in v5 and expected this series to add some APEI support code )


> but consider the error address in CPER is not accurate and calling 
> memory_failure() may not make sense.
> so I remove it.

'not accurate'... this is going to be a problem, but lets keep that discussion
on the cover-letter.


>> If you aren't handling the notification, why is this is in the HEST at all?
>> (and if its not: its not firmware-first)

> For the SEI notification, may be we can parse and handle the CPER record 
> other than the Error physical address

Sure, but I only see this cleanup patch in this series, where does APEI learn
about NOTIFY_SEI? As this is nothing will ever touch those CPER records, if
you're using GHESv2 firmware will be prevented from delivering subsequent
notifications.


Thanks,

James

[0]
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/admin-guide/README.rst#n251


Re: [PATCH v3 1/3] arm64/ras: support sea error recovery

2017-09-08 Thread James Morse
Hi Xie XiuQi,

(Sorry a few versions of this went past before I caught up with it)

On 07/09/17 08:45, Xie XiuQi wrote:
> With ARM v8.2 RAS Extension, SEA are usually triggered when memory errors
> are consumed. In some cases, if the error address is in a clean page or a
> read-only page, there is a chance to recover. Such as error occurs in a
> instruction page, we can reread this page from disk instead of killing
> process.

> Because memory_failure() may sleep, we can not call it directly in SEA
> exception context.

This is why we have memory_failure_queue() instead, it ... bother. That doesn't
look nmi-safe. (I thought this ended with an llist, but clearly I was looking at
the wrong thing).

It doesn't look like this is a problem for NOTIFY_SEA as it would only interrupt
itself on the same CPU if the memory-failure code/data were corrupt. (which is
not a case we can handle). We need to fix this before any of the asynchronous
NMI-like RAS notifications for arm64 get merged.

(this is one problem, but I don't think its 'the' problem you are trying to
solve with this series).


> So we saved faulting physical address associated with
> a process in the ghes handler and set __TIF_SEA_NOTIFY.

A per-notification type TIF flag looks fishy, surely this would affect all
NMI-like RAS notification methods?


> When we return
> from SEA exception context and get into do_notify_resume() before the
> process running, we could check it and call memory_failure() to do
> recovery. It's safe, because we are in process context.

I'm afraid I don't think this is the best approach for fixing this.
Its tied to the notification type, but the notification should be irrelevant
once we call ghes_proc().
It adds code poking around in CPER and ACPI/GHES to the arm64 arch code, all of
this should be in the core common code.
Most importantly: this means arm64 behaves differently with regard to handling
memory errors to other architectures using ACPI. Two behaviours means twice the
code, review and bugs...


Delaying the handling until we re-enter user-space means faults that may affect
the kernel aren't handled until much later. Just because the fault was
synchronous and user-space was running doesn't mean only user space is affected.
Some examples I've collected so far: the zero-page may be corrupt, this is
mapped into every process and used by the kernel. Similarly corruption in the
vdso affects all user-space. The fault may affect the page tables, this affects
all users of the mm_struct.

(I'm sure we agree that an synchronous-external-abort interrupting the kernel is
fatal for the kernel, but the other way round isn't always true).

Setting a TIF flag to handle the error before re-entering user-space is a
problem as the scheduler may choose to pre-empt this task and run all the other
tasks before this eventually gets handled.


Assuming this is just a problem with memory_failure_queue(), two alternatives I
can suggest are making memory_failure_queue() nmi-safe, or abstracting
NOTIFY_NMI's estatus pool/cache to use for the arm64 NMI-like notifications too.

If there is more to this, can you explain the problem you're trying to solve?
(I suspect there may be an issue with multiple-signals being merged, or exactly
when memory_failure_queue()'s work gets run.) Can you outline the sequence of
events?


You're picking a physical address out of 'ARM Processor Error Information
Structure', these correspond with Cache, TLB, Bus or (the mysterious) 'micro
architectural error'. I don't see anything checking the error type.
Given the physical address, are you adding error-handling for cache-errors with
this series?


Thanks,

James



Re: [PATCH v6 0/7] Add RAS virtualization support for SEA/SEI notification type in KVM

2017-09-07 Thread James Morse
Hi gengdongjiu,

(I've re-ordered some of the hunks here:)

On 04/09/17 12:10, gengdongjiu wrote:
> On 2017/9/1 1:43, James Morse wrote:
>> On 28/08/17 11:38, Dongjiu Geng wrote:
>>> Not call memory_failure() to handle it. Because the error address recorded
>>> by APEI is not accurated, so can not identify the address to hwpoison
>>> memory.
>>
>> This looks like a firmware bug, what address do you get in your CPER
>> records? It should be a physical address.

> No, not firmware bug. At least in the armv8.0 CPU and huawei's armv8.2 CPU,
> the architecture decided it is not accurate, this abort is asynchronous not
> synchronous.

This is going to be a problem. (I'm chasing Achin to find out when this is
allowed to happen and what we're expected to do about it!)

I hope this isn't the default behaviour, but only happens in exceptionally rare
circumstances.


>> To report a memory-error you must have an address.
> maybe we can not get the accurate error address, can you get it in your armv8
> platform?

I only have software-models, they only generate the errors you tell them to.


I think I see why you're taking this approach with the series, the scenario is:
1. Firmware takes an SError due to a bad memory location from guest EL0.
2. The CPU doesn't provide the address of the memory location.

You want to confine this error as much as possible, in particular to the context
it came from (e.g. guest EL0). CPU context isn't something the CPER records can
describe (they describe failures in system components), hence your hybrid
{kernel,firmware}-first code.

I don't think its safe to kill guest-EL0 and hope this confined the error.

If the affected page of guest memory has never been written to by the guest, the
host will map in the global zero-page, (made read-only at stage2). If the
corruption is in this page it affects the host kernel, guests and user-space
processes. Just because the error came from guest-EL0 doesn't mean
kernel/hypervisor memory isn't affected.

This doesn't just affect that one page: KSM may have merged every copy of every
guest user-space's libc, which has subsequently become corrupt. The first
guest-EL0 to step in this triggers the fault, but it affects all the guests.
With the address all the guests can fix this error, and KSM will re-merge the
pages. Without the address every user-space process in every guest will
eventually be killed.

We aren't even guaranteed that the access that caused the fault came from your
guest EL0. The fault may be in the page tables belonging to the guest kernel,
even worse they may belong to they hypervisor's stage2 page tables.

(Thanks to Mark and Robin for these examples)


I think in this scenario your firmware should describe a memory-error with an
unknown address. (i.e. don't set the 'physical address valid' bit in CPER's
'Table 275 Memory Error Record'). When Linux gets one of these, it should
panic(): We know some memory is corrupt, we don't know where it is.


>> User-space may be signalled by the memory_failure() helper, and user-space >>
may choose to notify the guest about the memory-failure, but this would be a
>> new error.

> For the SError, it is asynchronous abort. so it is not better to call
> memory_failure() helper, because the error address is not accurate.
> memory_failure() will offline or poison the address, but the address is not
> accurate. so it is dangerous

By 'not accurate' do you mean the CPU provides an address, and its wrong.
(surely this is a CPU bug), or just no address is provided. (i.e. the
ERRADDR.AI 'address incorrect' bit is set).


>>> Because the error was taken from a lower Exception level, if the
>>> exception is SEA/SEI and HCR_EL2.TEA/HCR_EL2.AMO is 1, firmware
>>> sets ESR_EL2/FAR_EL2 to fake a exception trap to EL2, then
>>> transfers to hypervisor.
>>
>> What happens if you took an SError from EL2 and EL2 has PSTATE.A set masking
>> SError? (this is very common today: all kernel code runs like this).

> Firstly, the guest OS usually runs in the El1 or El0, not El2.
> if El2 happens an SError, it will trap to EL3 firmware even though the 
> PSTATE.A is set.
> Because the PSTATE.A can not mask it if the SError is trapped to EL3.

Sure, we agree that from the CPU's view when SCR_EL3.EA is set physical-SError
can't be masked when executing any EL below EL3.

My question was about the 'firmware sets ESR_EL2/FAR_EL2 to fake an exception
trap to EL2' step. While EL3 can take the physical-SError at any time the
normal-world is running, it can't always deliver a fake-SError to EL2, because
EL2 believes it has masked physical-SError.

With the SError rework this should only be masked while we are 

Re: [PATCH v6 4/7] arm64: kvm: support user space to query RAS extension feature

2017-09-07 Thread James Morse
Hi gengdongjiu,

On 05/09/17 08:18, gengdongjiu wrote:
> On 2017/9/1 2:04, James Morse wrote:
>> On 28/08/17 11:38, Dongjiu Geng wrote:
>>> Userspace will want to check if the CPU has the RAS
>>> extension. 
>>
>> ... but user-space wants to know if it can inject SErrors with a specified 
>> ESR.
>>
>> What if we gain another way of doing this that isn't via the RAS-extensions, 
>> now
>> user-space has to check for two capabilities.
>>
>>
>>> If it has, it wil specify the virtual SError syndrome
>>> value, otherwise it will not be set. This patch adds support for
>>> querying the availability of this extension.
>>
>> I'm against telling user-space what features the CPU has unless it can use 
>> them
>> directly. In this case we are talking about a KVM API, so we should describe 
>> the
>> API not the CPU.
> 
> shenglong (zhaoshengl...@huawei.com) who is Qemu maintainer suggested 
> checking the CPU RAS-extensions
> to decide whether generate the APEI table and record CPER for the guest OS in 
> the user space.
> he means if the host does not support RAS, user space may also not support 
> RAS.

The code to signal memory-failure to user-space doesn't depend on the CPU's
RAS-extensions.

If Qemu supports notifying the guest about RAS errors using CPER records, it
should generate a HEST describing firmware first. It can then choose the
notification methods, some of which may require optional KVM APIs to support.

Seattle has a HEST, it doesn't support the CPU RAS-extensions. The kernel can
notify user-space about memory_failure() on this machine. I would expect Qemu to
be able to receive signals and describe memory errors to a guest (1).

The question should be: 'How can Qemu know it can use SEI as a firmware-first
notification?' It needs a KVM API to trigger an SError in the guest with a
specified ESR. The name of the KVM CAP needs to reflect the API (2).

Just because this is the first KVM API that needs the CPU to have the RAS
extensions doesn't mean we should call it 'has RAS' and be done with it.

We will eventually need another KVM API to configure trapping and emulating
values in the RAS ERR registers so that Qemu can emulate a machine without
firmware-first. (This is likely to be a page of memory that backs the registers,
there will need to be another KVM CAP to describe this support (3)).


Exposing the CPUs support for RAS-extensions to support (2) means having
per-platform support for (1). This is either creating extra work, or not
supporting as many platforms as we could. Both are bad.
Once we have (3) as well, any developer needs to know that 'has RAS' just meant
the first API KVM implemented using RAS, and doesn't mean later APIs also using
RAS are supported by the kernel.


Thanks,

James


Re: [PATCH v6 5/7] arm64: kvm: route synchronous external abort exceptions to el2

2017-09-07 Thread James Morse
Hi Dongjiu Geng,

On 28/08/17 11:38, Dongjiu Geng wrote:
> ARMv8.2 adds a new bit HCR_EL2.TEA which controls to
> route synchronous external aborts to EL2, and adds a
> trap control bit HCR_EL2.TERR which controls to
> trap all Non-secure EL1&0 error record accesses to EL2.
>
> This patch enables the two bits for the guest OS.
> when an synchronous abort is generated in the guest OS,

> it will trap to EL3 firmware, EL3 firmware will check the
> HCR_EL2.TEA value to decide to jump to hypervisor or host
> OS.

(This is what you are using this for, the patch has nothing to do with EL3.)


> Enabling HCR_EL2.TERR makes error record access
> from guest trap to EL2.


KVM already handles external aborts from lower exception levels, no more work
needs doing for TEA.

What happens when a guest access the RAS-Error-Record registers?

Before we can set HCR_EL2.TERR I think we need to add some minimal emulation for
the registers it traps. Most of them should be RAZ/WI, so it should be
straightforward. (I think KVMs default is to emulate an undef for unknown 
traps).

Eventually we will want to back this with a page of memory that lets
Qemu/kvmtool configure what the guest can see. (i.e. the emulated machine's
errors for kernel-first handling.)


Thanks,

James


Re: [PATCH v6 6/7] KVM: arm64: allow get exception information from userspace

2017-09-07 Thread James Morse
Hi Dongjiu Geng,

On 28/08/17 11:38, Dongjiu Geng wrote:
> when userspace gets SIGBUS signal, it does not know whether
> this is a synchronous external abort or SError,

Why would Qemu/kvmtool need to know if the original notification (if there was
one) was synchronous or asynchronous? This is between firmware and the kernel.


I think I can see why you need this: to choose whether to emulate SEA or SEI,
but what if the guest wasn't running? Or the guest was running, but it wasn't
guest-memory that is affected.

What happens if the dram-scrub hardware spots an error in guest memory, but the
guest wasn't running? KVM won't have a relevant ESR value to give you.

What happens if we start swapping a page of guest memory to disk, and discover
the memory is corrupt. This is synchronous, but it wasn't the guest, and KVM
still can't give you an ESR.

What about CPER records discovered through the polled interface? What happens if
I write a PFN into the corrupt-pfn sysfs interface?


I think what you need is some way of knowing if the BUS_MCEERR_A* was directly
caused by a user-space (or guest) access, and if so was it a data or instruction
fetch. These can become SEA notifications.

KVM's user-space shouldn't be a special-case where the kernel behaves
differently: if we tinker with this it needs to make sense for all user space
processes and mean something on all architectures.

I think this information could be useful to other users of these signals, e.g. a
JVM could silently regenerate/reload code/data for a non-direct-access fault
instead of exit-ing (or throwing an exception) for a direct access.

For BUS_MCEERR_A* from memory_failure() we can't know if they are caused by an
access or not. When the mm code gets -EHWPOISON when trying to resolve a
user-space fault we know it was due to a direct-access. (I don't know if/how x86
can know if it was code or data). Faulting guest accesses through KVM are just a
special version of this where KVM fixes-up stage2.

... but for any of this to work we need the address of the corrupt memory.
(-> cover letter)


Thanks,

James


Re: [PATCH] arm64: KVM: VHE: reset PSTATE.UAO when switch to host

2017-09-07 Thread James Morse
Hi Dongjiu Geng,

On 07/09/17 06:54, Dongjiu Geng wrote:
> In VHE mode, host kernel runs in the EL2 and can enable
> 'User Access Override' when fs==KERNEL_DS so that it can
> access kernel memory. However, PSTATE.UAO is set to 0 on
> an exception taken from EL1 to EL2. Thus when VHE is used
> and exception taken from a guest UAO will be disabled and
> host will use the incorrect PSTATE.UAO. So check and reset
> the PSTATE.UAO when switching to host.

This would only be a problem if KVM were calling into world-switch with
fs==KERNEL_DS. I can't see where this happens.

kvm_arch_vcpu_ioctl_run() is the only place KVM calls world-switch, there are no
set_fs() calls in it, or on the path to it. The addr_limit should be USER_DS,
PSTATE.UAO will be clear, as it is when we come back from a guest.

This isn't broken today. I agree it will break if KVM decides to
set_fs(KERNEL_DS) around world switch, but until then we don't need this patch.


> Move the reset PSTATE.PAN on entry to EL2 together with
> PSTATE.UAO reset.

Moving this breaks PAN-at-HYP for systems with PAN but without VHE.


> diff --git a/arch/arm64/kvm/hyp/entry.S b/arch/arm64/kvm/hyp/entry.S
> index 12ee62d..7662ef5 100644
> --- a/arch/arm64/kvm/hyp/entry.S
> +++ b/arch/arm64/kvm/hyp/entry.S
> @@ -96,8 +96,6 @@ ENTRY(__guest_exit)
>  
>   add x1, x1, #VCPU_CONTEXT
>  
> - ALTERNATIVE(nop, SET_PSTATE_PAN(1), ARM64_HAS_PAN, CONFIG_ARM64_PAN)
> -
>   // Store the guest regs x2 and x3
>   stp x2, x3,   [x1, #CPU_XREG_OFFSET(2)]
>  
> diff --git a/arch/arm64/kvm/hyp/switch.c b/arch/arm64/kvm/hyp/switch.c
> index a733461..715b3941 100644
> --- a/arch/arm64/kvm/hyp/switch.c
> +++ b/arch/arm64/kvm/hyp/switch.c
> @@ -22,6 +22,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  static bool __hyp_text __fpsimd_enabled_nvhe(void)
>  {
> @@ -399,6 +400,17 @@ int __hyp_text __kvm_vcpu_run(struct kvm_vcpu *vcpu)
>  
>   __sysreg_restore_host_state(host_ctxt);
>  
> + if (has_vhe()) {
> + /*
> +  * PSTATE was not saved over guest enter/exit, re-enable
> +  * any detecte features that might not have been set
> +  * correctly.
> +  */
> + uao_thread_switch(current);

I don't see how addr_limit will ever be KERNEL_DS, so this is always clearing
PSTATE.UAO, which was already clear from the guest-exit exception.

(Also, the uao_thread_switch() code isn't accessible from EL2, neither is 
current)


> + asm(ALTERNATIVE("nop", SET_PSTATE_PAN(1),
> + ARM64_HAS_PAN, CONFIG_ARM64_PAN));

... and this is setting PSTATE.PAN on VHE, which was already set, and breaking
PAN-at-HYP on non-VHE systems.

Vladimir's commit message for that patch that added this enabling explained it
is needed for !VHE as SCTLR_EL2 when HCR_EL2.E2H is clear doesn't have a SPAN 
bit.

When we have VHE clearing SCTLR_EL2.SPAN (clearing because it was RES1 on v8.0)
will cause the CPU to set PSTATE.PAN when we take an exception.


> + }
> +
>   if (fp_enabled) {
>   __fpsimd_save_state(&guest_ctxt->gp_regs.fp_regs);
>   __fpsimd_restore_state(&host_ctxt->gp_regs.fp_regs);
> 


James



Re: [PATCH v6 4/7] arm64: kvm: support user space to query RAS extension feature

2017-08-31 Thread James Morse
Hi Dongjiu Geng,

On 28/08/17 11:38, Dongjiu Geng wrote:
> In ARMV8.2 RAS extension, a virtual SError exception syndrome
> register(VSESR_EL2) is added.  This value may be specified from
> userspace.

I agree that the CPU support for injecting SErrors with a specified ESR should
be exposed to KVM's user space...


> Userspace will want to check if the CPU has the RAS
> extension. 

... but user-space wants to know if it can inject SErrors with a specified ESR.

What if we gain another way of doing this that isn't via the RAS-extensions, now
user-space has to check for two capabilities.


> If it has, it wil specify the virtual SError syndrome
> value, otherwise it will not be set. This patch adds support for
> querying the availability of this extension.

I'm against telling user-space what features the CPU has unless it can use them
directly. In this case we are talking about a KVM API, so we should describe the
API not the CPU.


> diff --git a/arch/arm64/kvm/reset.c b/arch/arm64/kvm/reset.c
> index 3256b9228e75..b7313ee028e9 100644
> --- a/arch/arm64/kvm/reset.c
> +++ b/arch/arm64/kvm/reset.c
> @@ -77,6 +77,9 @@ int kvm_arch_dev_ioctl_check_extension(struct kvm *kvm, 
> long ext)
>   case KVM_CAP_ARM_PMU_V3:
>   r = kvm_arm_support_pmu_v3();
>   break;
> + case KVM_CAP_ARM_RAS_EXTENSION:

This should be called something more like 'KVM_CAP_ARM_INJECT_SERROR_ESR'


> + r = cpus_have_const_cap(ARM64_HAS_RAS_EXTN);
> + break;
>   case KVM_CAP_SET_GUEST_DEBUG:
>   case KVM_CAP_VCPU_ATTRIBUTES:
>   r = 1;


We can inject SError on systems without the RAS extensions using just the
HCR_EL2.VSE bit. We may want to make the 'ESR' part of the API optional, or
expose '1' for the without-ESR version and '2 for with-ESR, (however we choose
to implement that).

The risk is if we want to add a without-ESR version later, and the name we make
ABI now turned out to be a mistake. Marc or Christoffer probably have the best
view of this. (no-one has needed it so far...)


Thanks,

James



Re: [PATCH v6 3/7] acpi: apei: remove the unused code

2017-08-31 Thread James Morse
Hi Dongjiu Geng,

On 28/08/17 11:38, Dongjiu Geng wrote:
> In current code logic, the two functions ghes_sea_add() and
> ghes_sea_remove() are only called when CONFIG_ACPI_APEI_SEA
> is defined. If not, it will return errors in the ghes_probe()
> and not contiue. Hence, remove the unnecessary handling when
> CONFIG_ACPI_APEI_SEI is not defined.

This doesn't match what the patch does. I get this feeling this is needed for
some future patch you haven't included.


> change since v5:
> 1. remove the SEI notification type handling, because the SEI is
>asynchronous exception and the address is not accurate. so
>not call memory_failure() to handle it.

Setting NOTIFY_SEI as the GHES entry's notification type means the OS should
check the GHES->ErrorStatusAddress for CPER records when it receives an
SError-Interrupt, as it may be a notification of an error from this error 
source.

If you aren't handling the notification, why is this is in the HEST at all?
(and if its not: its not firmware-first)


James


> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
> index d661d452b238..c15a08db2c7c 100644
> --- a/drivers/acpi/apei/ghes.c
> +++ b/drivers/acpi/apei/ghes.c
> @@ -813,7 +813,6 @@ static struct notifier_block ghes_notifier_hed = {
>   .notifier_call = ghes_notify_hed,
>  };
>
> -#ifdef CONFIG_ACPI_APEI_SEA
>  static LIST_HEAD(ghes_sea);
>
>  /*
> @@ -848,19 +847,6 @@ static void ghes_sea_remove(struct ghes *ghes)
>   mutex_unlock(&ghes_list_mutex);
>   synchronize_rcu();
>  }
> -#else /* CONFIG_ACPI_APEI_SEA */
> -static inline void ghes_sea_add(struct ghes *ghes)
> -{
> - pr_err(GHES_PFX "ID: %d, trying to add SEA notification which is not
supported\n",
> -ghes->generic->header.source_id);
> -}
> -
> -static inline void ghes_sea_remove(struct ghes *ghes)
> -{
> - pr_err(GHES_PFX "ID: %d, trying to remove SEA notification which is not
supported\n",
> -ghes->generic->header.source_id);
> -}
> -#endif /* CONFIG_ACPI_APEI_SEA */
>
>  #ifdef CONFIG_HAVE_ACPI_APEI_NMI
>  /*
>



Re: [PATCH v6 1/7] arm64: cpufeature: Detect CPU RAS Extentions

2017-08-31 Thread James Morse
Hi Dongjiu Geng,

On 28/08/17 11:38, Dongjiu Geng wrote:
> From: Xie XiuQi 
> 
> ARM's v8.2 Extentions add support for Reliability, Availability and
> Serviceability (RAS). On CPUs with these extensions system software
> can use additional barriers to isolate errors and determine if faults
> are pending.
> 
> Add cpufeature detection and a barrier in the context-switch code.
> There is no need to use alternatives for this as CPUs that don't
> support this feature will treat the instruction as a nop.
> 
> Platform level RAS support may require additional firmware support.
> 
> Signed-off-by: Xie XiuQi 
> [Rebased, added esb and config option, reworded commit message]
> Signed-off-by: James Morse 

Nit: when re-posting patches from the list you need to add your signed-off-by.
See Documentation/process/submitting-patches.rst 'Developer's Certificate of
Origin 1.1'

This goes for your patch 2 as well.


> diff --git a/arch/arm64/kernel/process.c b/arch/arm64/kernel/process.c
> index c845c8c04d95..7a17b4a1bd9e 100644
> --- a/arch/arm64/kernel/process.c
> +++ b/arch/arm64/kernel/process.c
> @@ -370,6 +370,9 @@ __notrace_funcgraph struct task_struct 
> *__switch_to(struct task_struct *prev,
>*/
>   dsb(ish);
>  
> + /* Deliver any pending SError from prev */
> + esb();
> +

This patch was sitting on top of the SError rework. As the cover-letter
describes that was all there to make sure SError is unmasked when we execute
this esb(). Without it any pending SError will be deferred, its ESR is written
to DISR_EL1, which this patch doesn't check.

On its own, this patch is actively harmful to systems that don't have
firmware-first handling.

We probably need to produce a combined series...


Thanks,

James



Re: [PATCH v6 0/7] Add RAS virtualization support for SEA/SEI notification type in KVM

2017-08-31 Thread James Morse
Hi Dongjiu Geng,

On 28/08/17 11:38, Dongjiu Geng wrote:
> In the firmware-first RAS solution, corrupt data is detected in a
> memory location when guest OS application software executing at EL0
> or guest OS kernel El1 software are reading from the memory. The
> memory node records errors in an error record accessible using
> system registers.
> 
> Because SCR_EL3.EA is 1, then CPU will trap to El3 firmware, EL3
> firmware records the error to APEI table through reading system
> register.

Strictly speaking these are CPER records in a memory region pointed to by the
HEST->GHES ACPI table.


> Because the error was taken from a lower Exception level, if the
> exception is SEA/SEI and HCR_EL2.TEA/HCR_EL2.AMO is 1, firmware
> sets ESR_EL2/FAR_EL2 to fake a exception trap to EL2, then
> transfers to hypervisor.

What happens if you took an SError from EL2 and EL2 has PSTATE.A set masking
SError? (this is very common today: all kernel code runs like this).

What happens if the hypervisor then executes an ESB with PSTATE.A set? It
expects to see any pending SError deferred and its syndrome written to DISR_EL1,
but this register is RAZ/WI when you set SCR_EL3.EA. '4.4.2' of [0]


> For the synchronous external abort(SEA), Hypervisor calls the
> ghes_handle_memory_failure() to deal with this error,
> ghes_handle_memory_failure() function reads the APEI table and 
> callls memory_failure() to decide whether it needs to deliver
> SIGBUS signal to user space, the advantage of using SIGBUS signal
> to notify user space is that it can be compatible with Non-Kvm users.
> 
> For the SError Interrupt(SEI),KVM firstly classified the error.

KVM can't parse the CPER records, nor does it know where to look to find them.
KVM should call out to the APEI code so the host kernel can handle the error.

User-space may be signalled by the memory_failure() helper, and user-space may
choose to notify the guest about the memory-failure, but this would be a new 
error.


> Not call memory_failure() to handle it. Because the error address recorded
> by APEI is not accurated, so can not identify the address to hwpoison
> memory.

This looks like a firmware bug, what address do you get in your CPER records? It
should be a physical address.

To report a memory-error you must have an address.

If the error wasn't detected as a synchronous access then delivering a
synchronous-external-abort is inappropriate (I think we both agree on this), and
SError-interrupt doesn't have a way of specifying an address ... but the CPER
records do.

For firmware-first your SError-interrupt is just a notification, its the CPER
records the OS uses to handle the error.


> If the SError error comes from guest user mode and is not propagated,
> then signal user space to handle it, otherwise, directly injects virtual
> SError, or panic if the error is fatal.

What do you mean by propagated?

I don't think we should ever hand RAS notifications to user-space, the host
kernel should handle them, then describe the symptom (e.g. this region of your
va space is gone) to user-space.


> when user space handles the error,
> it will specify syndrome for the injected virtual SError. This syndrome value
> is set to the VSESR_EL2. VSESR_EL2 is a new ARMv8.2 RAS extensions register
> which provides the syndrome value reported to software on taking a virtual
> SError interrupt exception.


Thanks,

James

[0]
https://static.docs.arm.com/ddi0587/a/RAS%20Extension-release%20candidate_march_29.pdf



Re: [PATCH] arm64: fix unwind_frame() for filtered out fn via set_graph_notrace

2017-08-29 Thread James Morse
Hi Pratyush,

(Nit: get_maintainer.pl will give you the list of people to CC, you're reliant
on the maintainers being eagle-eyed to spot this!)

On 28/08/17 13:53, Pratyush Anand wrote:
> Testcase:
> cd /sys/kernel/debug/tracing/
> echo schedule > set_graph_notrace
> echo 1 > options/display-graph
> echo wakeup > current_tracer
> ps -ef | grep -i agent
> 
> Above commands result in
> PANIC: "Unable to handle kernel paging request at virtual address
> 801bcbde7000"

This didn't panic for me, it just killed the running task. (I guess you have
panic-on-oops set)


> vmcore analysis:
> 1)
> crash> bt
> PID: 1561   TASK: 8003cb7e4000  CPU: 0   COMMAND: "ps"
>  #0 [8003c4ff77b0] crash_kexec at 0816b9b8
>  #1 [8003c4ff77e0] die at 08088b34
>  #2 [8003c4ff7820] __do_kernel_fault at 0809b830
>  #3 [8003c4ff7850] do_bad_area at 08098b90
>  #4 [8003c4ff7880] do_translation_fault at 087c6cdc
>  #5 [8003c4ff78b0] do_mem_abort at 08081334
>  #6 [8003c4ff7ab0] el1_ia at 08082cc0
>  PC: 0808811c  [unwind_frame+300]
>  LR: 080858a8  [get_wchan+212]
>  SP: 8003c4ff7ab0  PSTATE: 6145
> X29: 8003c4ff7ab0  X28: 0001  X27: 
> X26:   X25:   X24: 
> X23: 8003c1c2  X22: 08c73000  X21: 8003c1c1c000
> X20: 000f  X19: 8003c1bc7000  X18: 0010
> X17:   X16: 0001  X15: ffed
> X14: 0010  X13:   X12: 0004
> X11:   X10: 02dd14c0   X9: 1999
>  X8: 003f   X7: 8003f71b   X6: 0018
>  X5:    X4: 8003c1c1fd20   X3: 8003c1c1fd10
>  X2: 0017ffe8   X1: 8003c4ff7af8   X0: 8003cbf67000
>  #7 [8003c4ff7b20] do_task_stat at 08304f0c
>  #8 [8003c4ff7c60] proc_tgid_stat at 08305b48
>  #9 [8003c4ff7ca0] proc_single_show at 082fdd10
>  #10 [8003c4ff7ce0] seq_read at 082b27bc
>  #11 [8003c4ff7d70] __vfs_read at 08289e54
>  #12 [8003c4ff7e30] vfs_read at 0828b14c
>  #13 [8003c4ff7e70] sys_read at 0828c2d0
>  #14 [8003c4ff7ed0] __sys_trace at 0808349c
>  PC: 0006  LR:   SP: ffed  PSTATE: 003f
> X12: 0010 X11:  X10: 0004  X9: 7febe8d0
>  X8:   X7: 1999  X6: 003f  X5: 000c
>  X4: 7fce9c78  X3: 000c  X2:   X1: 
>  X0: 0400
> 
> (2) Instruction at 0808811c caused IA/DA.
> 
> crash> dis -l 08088108 6
> /usr/src/debug//xx/arch/arm64/kernel/stacktrace.c:
> 84
> 0x08088108 :  ldr w2, [x1,#24]
> 0x0808810c :  sub w6, w2, #0x1
> 0x08088110 :  str w6, [x1,#24]
> 0x08088114 :  mov w6, #0x18 // #24
> 0x08088118 :  umull   x2, w2, w6
> 0x0808811c :  ldr x0, [x0,x2]
> 
> Corresponding c statement is
> frame->pc = tsk->ret_stack[frame->graph--].ret;
> 
> (3) So, it caused data abort while executing
> 0x0808811c :  ldr x0, [x0,x2]
> 
> x0 + x2 = 8003cbf67000 + 0017ffe8 = 801bcbde7000
> Access of 801bcbde7000 resulted in "Unable to handle kernel paging
> request"
> 
> from above data: frame->graph = task->curr_ret_stack which  should be,
> x2 / 0x18 = , which is -FTRACE_NOTRACE_DEPTH.

(0x18 is the size of struct ftrace_ret_stack for your config?)


> OK, so problem is here:
> do_task_stat() calls get_wchan(). Here p->curr_ret_stack  is
> -FTRACE_NOTRACE_DEPTH in the failed case.

> It means, when do_task_stat()
> has been called for task A (ps in this case) on CPUm,task A was in mid
> of execution on CPUn,

get_wchan() on a running processes can't work: the stack may be modified while
we walk it.
>From arch/arm64/kernel/process.c::get_wchan():
>   if (!p || p == current || p->state == TASK_RUNNING)
>   return 0;

As far as I can see, if task A is running on CPU-N then CPU-Ms attempt to
get_wchan() it will return 0.


> and was in the mid of mcount() execution where
> curr_ret_stack had been decremented in ftrace_push_return_trace() for
> hitting schedule() function, but it was yet to be incremented in
> ftrace_return_to_handler().

So if the function-graph-tracer has hooked the return values on a stack and hit
a filtered function, (schedule() in your example), we can't unwind it as
ftrace_push_return_trace() has biased the index with a 'huge negative' offset to
flag it as 'this should be filtered out'.

The arm64 stack walker isn't aware of this and interprets it as an unsigned
index. Nasty!


> Similar problem we can have with calling of walk_stackframe() from
> save_stack_trace_tsk() or dump_backtrace().
> 
> This patc

Re: [PATCHv2] arm64:kexec: have own crash_smp_send_stop() for crash dump for nonpanic cores

2017-08-11 Thread James Morse
Hi Hoeun,

On 07/08/17 06:09, Hoeun Ryu wrote:
>  Commit 0ee5941 : (x86/panic: replace smp_send_stop() with kdump friendly
> version in panic path) introduced crash_smp_send_stop() which is a weak
> function and can be overriden by architecture codes to fix the side effect

(overridden)


> caused by commit f06e515 : (kernel/panic.c: add "crash_kexec_post_
> notifiers" option).
> 
>  ARM64 architecture uses the weak version function and the problem is that
> the weak function simply calls smp_send_stop() which makes other CPUs
> offline and takes away the chance to save crash information for nonpanic
> CPUs in machine_crash_shutdown() when crash_kexec_post_notifiers kernel
> option is enabled.
> 
>  Calling smp_send_crash_stop() in machine_crash_shutdown() is useless
> because all nonpanic CPUs are already offline by smp_send_stop() in this
> case and smp_send_crash_stop() only works against online CPUs.


>  The result is that /proc/vmcore is not available with the error messages;
> "Warning: Zero PT_NOTE entries found", "Kdump: vmcore not initialized".

When I tried this I got one of these warnings for each secondary CPU, but the
vmcore file was still available. When I ran 'crash' on the vmcore it reported:
> CPUS: 6 [OFFLINE: 5]

Did I miss as step to reproduce this? If not, can we change this paragraph to
say something like:
> The result is that secondary CPUs registers are not saved by crash_save_cpu()
> and the vmcore file misreports these CPUs as being offline.


>  crash_smp_send_stop() is implemented to fix this problem by replacing the
> exising smp_send_crash_stop() and adding a check for multiple calling to

(existing)


> the function. The function (strong symbol version) saves crash information
> for nonpanic CPUs and machine_crash_shutdown() tries to save crash
> information for nonpanic CPUs only when crash_kexec_post_notifiers kernel
> option is disabled.
> 
> * crash_kexec_post_notifiers : false
> 
>   panic()
> __crash_kexec()
>   machine_crash_shutdown()
> crash_smp_send_stop()<= save crash dump for nonpanic cores
> 
> * crash_kexec_post_notifiers : true
> 
>   panic()
> crash_smp_send_stop()<= save crash dump for nonpanic cores
> __crash_kexec()
>   machine_crash_shutdown()
> crash_smp_send_stop()<= just return.


> diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
> index dc66e6e..73d8f5e 100644
> --- a/arch/arm64/kernel/smp.c
> +++ b/arch/arm64/kernel/smp.c
> @@ -977,11 +977,21 @@ void smp_send_stop(void)
>  }
>  
>  #ifdef CONFIG_KEXEC_CORE
> -void smp_send_crash_stop(void)
> +void crash_smp_send_stop(void)
>  {
> + static int cpus_stopped;
>   cpumask_t mask;
>   unsigned long timeout;
>  
> + /*
> +  * This function can be called twice in panic path, but obviously
> +  * we execute this only once.
> +  */
> + if (cpus_stopped)
> + return;
> +
> + cpus_stopped = 1;
> +

This cpus_stopped=1 can't happen on multiple CPUs at the same time as any second
call is guaranteed to be on the same CPU, both are behind panic()s
'atomic_cmpxchg()'.


Other than my '/proc/vmcore is not available' question above, this looks fine 
to me:
Reviewed-by: James Morse 
Tested-by: James Morse 


Thanks!

James





Re: [PATCH] KVM: arm64: add esr_el2 and far_el2 to sysreg

2017-08-08 Thread James Morse
Hi gengdongjiu,

On 07/08/17 18:43, gengdongjiu wrote:
> Another question, For the SEI, I want to also use SIGBUS both for the KVM 
> user and non-kvm user,
> if SEA and SEI Error all use the SIGBUS to notify user space(Qemu),

User-space shouldn't necessarily be notified about Synchronous External Aborts
or SError Interrupts. You're really asking about RAS firmware-first
notifications that use these as the notification mechanism.

We should not notify user-space that the guest happened to be interrupted by a
RAS firmware-first notification. It may not be relevant, and we can't know until
we parse the CPER records. The notification mechanism is between firmware and
the host kernel, we should never expose anything about it to user space or a 
guest.

Linux should act on the CPER records first to determine if the host kernel can
keep running. Once it has done this it can deliver signals to affected
processes, but which signal and its properties depends on the CPER records.

The example here is BUS_MCEERR_AO and BUS_MCEERR_AR. These notify userspace that
si_addr_lsb bits of memory are corrupt at si_addr, this is either
Action-Optional or Action-Required.

For arm64 we just needed to turn this code on, it already presents the minimum
necessary information to user-space in an architecture-agnostic way. We didn't
need to do anything to this code to support NOTIFY_SEA, the notification
mechanism is irrelevant, this is all driven by the CPER records.

If you have a class of error that isn't covered by the memory-failure code, then
we need to add something similar. This should be based on the CPER records, and
should work in exactly the same way for all processes on all ACPI platforms.


> do you agree my solution for the SEI? thanks.

No, you are trying to notify userspace that firmware notified the host. This
creates an ABI between EL3 firmware and EL0 user space that we can't possibly
support.

I think you've come to this because you are merging two steps together:
1. The OS uses the v8.2 RAS extensions to isolate errors and notify firmware.
2. If firmware has to tell the OS about the error, firmware generates CPER 
records.
3. Firmware triggers the GHES notification mechanism for this error source.
4. Linux receives the notification and calls ghes_proc(), (if KVM gets the
notification because a guest happened to be running, it should switch back to
the host and arrange for ghes_proc() to be called).
5. ghes_proc() parses the CPER records and calls other kernel helpers to handle
the specific type of error, e.g. memory_failure().
6. If the helper knows the kernel can keep running, the error is visible to
user-space and user space could do further processing to correct the error, an
error-specific signal is sent.
7. User-space reloads the webpage, notifies the guest or whatever is 
appropriate.

You are merging steps 3 and 7.

The notification method is an abstraction that only matters to steps 3&4, this
lets us add more without rewriting the world.
User-space signals are an abstraction between steps 6&7, this works across
architectures, even those not using APEI firmware first.
There are no shortcuts here. Doing anything else creates more work for user
space, another platform or another architecture.


Thanks,

James


Re: [PATCH] KVM: arm64: add esr_el2 and far_el2 to sysreg

2017-08-07 Thread James Morse
Hi gengdongjiu,

On 07/08/17 17:23, gengdongjiu wrote:
>   As James's suggestion, I move injection SEA Error logic to the user 
> space(Qemu), Qemu sets the related guest OS esr/elr/pstate/spsr

(because for firmware-first its the CPER records that matter, and only QEMU
knows where it reserved the memory for these, and what it told the guest it
would use as the notification method).

> through IOCTL KVM_SET_ONE_REG. For the SEA, when Qemu sets the esr_el1.IL 
> bit, it needs to refer to esr_el2.IL, else Qemu does not know the trapped
> instruction was a 16-bit or a 32-bit instruction, also it needs to set 
> far_el1 using far_el2, because this is synchronization abort.

The 32bit kernel doesn't support ACPI firmware first, and aarch64 doesn't
support 16-bit instructions.


James




Re: [PATCH] arm64:kexec: have own crash_smp_send_stop() for crash dump for nonpanic cores

2017-08-04 Thread James Morse
Hi Hoeun,

On 04/08/17 08:02, Hoeun Ryu wrote:
>  Commit 0ee5941 : (x86/panic: replace smp_send_stop() with kdump friendly
> version in panic path) introduced crash_smp_send_stop() which is a weak
> function and can be overriden by architecture codes to fix the side effect
> caused by commit f06e515 : (kernel/panic.c: add "crash_kexec_post_
> notifiers" option).

If I've understood correctly: if we boot with this option core code doesn't use
our machine_crash_shutdown(), and instead calls crash_smp_send_stop(), which we
don't have, so it uses the default smp_send_stop(), which doesn't save the regs.

Thanks for catching this!


Could we rename smp_send_crash_stop() crash_smp_send_stop() and add the
called-twice logic there? They are similar enough that I'm getting them muddled
already!


Thanks,

James


>  ARM64 architecture uses the weak version function and the problem is that
> the weak function simply calls smp_send_stop() which makes other CPUs
> offline and takes away the chance to save crash information for nonpanic
> CPUs in machine_crash_shutdown() when crash_kexec_post_notifiers kernel
> option is enabled.
> 
>  Calling smp_send_crash_stop() in the function is useless because all
> nonpanic CPUs are already offline by smp_send_stop() in this case and
> smp_send_crash_stop() only works against online CPUs.
> 
>  The result is that /proc/vmcore is not available with the error messages;
> "Warning: Zero PT_NOTE entries found", "Kdump: vmcore not initialized".
> 
>  crash_smp_send_stop() is implemented for ARM64 architecture to fix this
> problem and the function (strong symbol version) saves crash information
> for nonpanic CPUs using smp_send_crash_stop() and machine_crash_shutdown()
> tries to save crash information for nonpanic CPUs only when
> crash_kexec_post_notifiers kernel option is disabled.




Re: [PATCH v3 0/5] ARM64: disable irq between breakpoint and step exception

2017-08-03 Thread James Morse
Hi Pratyush,

On 02/08/17 19:46, Pratyush Anand wrote:
> In my understanding problems are:
> (1) Single stepping of unwanted instruction (ie. instruction  next to 
> enable_dbg
> from el1_irq)
> (2) We do not have memory at the end of el1_irq, so that we can set watchpoint
> exception generating instruction for single stepping.

Yes, for (2) the PSTATE.SS bit is saved in SPSR.SS when we take the irq, but it
isn't restored because we ERET from the irq handler with PSTATE.D clear.


> I think, we can find a way to take care for (2), but not sure how (1) can be
> taken care, without the approach I am taking.

We can fix (1) by making 'enable_dbg' inherit the debug state of the interrupted
EL1, unless the SPSR.SS bit is set, in which case we interrupted a
single-stepped instruction and shouldn't re-enable debug because we know must
MDSCR_EL1.SS be set.

This way a synchronous data-abort from a single-stepped instruction with
interrupts unmasked can unmask interrupts in el1_sync but keep debug disabled.
This should give us the least surprises if we call core-code.

I've posted a series that (in addition to your perf/watchpoint patches) fixes
all the issues I saw with your example. Can you test it fixes the
single-step:interrupt interaction on your platform?


Thanks,

James



Re: [PATCH v3 0/5] ARM64: disable irq between breakpoint and step exception

2017-08-02 Thread James Morse
Hi Pratyush,

On 01/08/17 05:18, Pratyush Anand wrote:
> On Monday 31 July 2017 10:45 PM, James Morse wrote:
>> On 31/07/17 11:40, Pratyush Anand wrote:
>>> samples/hw_breakpoint/data_breakpoint.c passes with x86_64 but fails with
>>> ARM64. Even though it has been NAKed previously on upstream [1, 2], I have
>>> tried to come up with patches which can resolve it for ARM64 as well.
>>>
>>> I noticed that even perf step exception can go into an infinite loop if CPU
>>> receives an interrupt while executing breakpoint/watchpoint handler. So,
>>> event though we are not concerned about above test, we will have to find a
>>> solution for the perf issue.

> You can easily reproduce the issue with following:
> # insmod data_breakpoint.ko ksym=__sysrq_enabled
> # cat /proc/sys/kernel/sysrq

Thanks, that happily dump-stacks forever. Your first three patches fix the
stepping over the watchpoint, I've had a go at fixing the interrupt interaction,
(instead of just masking interrupts).

gdb single-step works, as does kprobes, FWIW for those three:
Tested-by: James Morse 


>> What causes your infinite loop? 

> Flow is like this:
> - A SW or HW breakpoint exception is being generated on a cpu lets say CPU5
> - Breakpoint handler does something which causes an interrupt to be active on
> the same CPU. In fact there might be many other reasons for an interrupt to be
> active on a CPU while breakpoint handler was being executed.
> - So, as soon as we return from breakpoint exception, we go to the IRQ 
> exception
> handler, while we were expecting a single step exception.

What breaks when this happens?

With your reproducer and the first three patches I see it hitting the watchpoint
multiple times and stepping the irq handler.

I think we have two or three interacting bugs here. I'm not convinced masking
interrupts is the best fix as the data abort handler inherits this value. We
might mask interrupts for a fault that can't be handled with interrupts masked.

I will post some RFC/fixes, but need to get my head round the debug/exception
interaction in the ARM-ARM first!


Thanks,

James


Re: [PATCH v3 0/5] ARM64: disable irq between breakpoint and step exception

2017-07-31 Thread James Morse
Hi Pratyush,

On 31/07/17 11:40, Pratyush Anand wrote:
> samples/hw_breakpoint/data_breakpoint.c passes with x86_64 but fails with
> ARM64. Even though it has been NAKed previously on upstream [1, 2], I have
> tried to come up with patches which can resolve it for ARM64 as well.
> 
> I noticed that even perf step exception can go into an infinite loop if CPU
> receives an interrupt while executing breakpoint/watchpoint handler. So,
> event though we are not concerned about above test, we will have to find a
> solution for the perf issue.

This caught my eye as I've been reworking the order the DAIF flags get
set/cleared[0].

What causes your infinite loop? Is it single-stepping kernel_exit? If so patch 4
"arm64: entry.S mask all exceptions during kernel_exit" [1] may help.

If its more like "single stepping something we didn't expect" you will get the
same problem if we take an SError. (which with that series is unmasked ~all the
time).
Either way this looks like a new and exciting way of hitting the 'known issue'
described in patch 12 [3].


Would disabling MDSCR_EL1.SS if we took an exception solve your problem?

If so, I think we should add a new flag, 'TIF_KSINGLESTEP', causing us to
save/restore MDSCR_EL1.SS into pt_regs on el1 exceptions. This would let us
single-step without modifying the DAIF flags for the location we are stepping,
and allow taking any kind of exception from that location.

We should disable nested users of single-step, we can do that by testing the
flag, print a warning then pretend we missed the breakpoint. (hence it needs to
be separate from the user single-step flag).


Thanks,

James

[0] https://www.spinics.net/lists/arm-kernel/msg596684.html
[1] https://www.spinics.net/lists/arm-kernel/msg596686.html
[2] https://www.spinics.net/lists/arm-kernel/msg596689.html


Re: [PATCH] acpi: apei: clear error status before acknowledging the error

2017-07-31 Thread James Morse
Hi Tyler,

On 31/07/17 17:15, Baicar, Tyler wrote:
> On 7/29/2017 12:53 AM, Borislav Petkov wrote:
>> On Fri, Jul 28, 2017 at 04:25:03PM -0600, Tyler Baicar wrote:
>>> Currently we acknowledge errors before clearing the error status.
>>> This could cause a new error to be populated by firmware in-between
>>> the error acknowledgment and the error status clearing which would
>>> cause the second error's status to be cleared without being handled.
>>> So, clear the error status before acknowledging the errors.
>>>
>>> Also, make sure to acknowledge the error if the error status read
>>> fails.

>>> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
>>> index d661d45..6a6895a 100644
>>> --- a/drivers/acpi/apei/ghes.c
>>> +++ b/drivers/acpi/apei/ghes.c
>>> @@ -743,17 +743,15 @@ static int ghes_proc(struct ghes *ghes)
>>>   }
>>>   ghes_do_proc(ghes, ghes->estatus);
>>>   +out:
>> If the first ghes_read_estatus() fails and we jump straight to that
>> label...
>>
>>> +ghes_clear_estatus(ghes);
>>>   /*
>>>* GHESv2 type HEST entries introduce support for error 
>>> acknowledgment,
>>>* so only acknowledge the error if this support is present.
>>>*/
>>>   if (is_hest_type_generic_v2(ghes)) {
>>>   rc = ghes_ack_error(ghes->generic_v2);
>> ... and ACK the error anyway, even the status read failed, wouldn't that
>> confuse the firmware?

> I think the better thing to do in this case is still send the ack. If
> ghes_read_estatus() fails, then
> either we are unable to read the estatus or the estatus is empty/invalid. For
> the first case, there's
> not much that can be done. The second case would be a FW bug with populating 
> the
> estatus.

Wouldn't this mean acking on a timer for ghes_poll_func()?

What happens if:
> kernel: read error-status-block
> kernel: nothing here
> firmware: error! write to error-status-block
> kernel: write to ack register

(this is probably only a problem for polling as there is no notification)


> If we do not send the ack, then we will be in a scenario where FW will not 
> send
> any more errors.

Because we haven't yet handled the first one...

I thought GHESv2's ack was also used to catch errors that occur while an earlier
error is being handled. But from the text in ACPI 6.2's 18.3.2.8 the 'ack' is
only described as releasing the memory region, not completion of the error 
handler.


Thanks,

James




Re: [kernel-hardening] Re: [RFC PATCH 6/6] arm64: add VMAP_STACK and detect out-of-bounds SP

2017-07-20 Thread James Morse
Hi Ard,

On 20/07/17 06:35, Ard Biesheuvel wrote:
> On 20 July 2017 at 00:32, Laura Abbott  wrote:
>> I didn't notice any performance impact but I also wasn't trying that
>> hard. I did try this with a different configuration and ran into
>> stackspace errors almost immediately:
>>
>> [ 0.358026] smp: Brought up 1 node, 8 CPUs
>> [ 0.359359] SMP: Total of 8 processors activated.
>> [ 0.359542] CPU features: detected feature: 32-bit EL0 Support
>> [0.361781] Insufficient stack space to handle exception!

[...]

>> [0.367382] Task stack: [0xff8008e8..0xff8008e84000]
>> [0.367519] IRQ stack:  [0xffc03bf62000..0xffc03bf66000]
> 
> The IRQ stack is not 16K aligned ...

>> [0.367687] ESR: 0x -- Unknown/Uncategorized
>> [0.367868] FAR: 0x
>> [0.368059] Kernel panic - not syncing: kernel stack overflow
>> [0.368252] CPU: 4 PID: 0 Comm: swapper/4 Not tainted 
>> 4.12.0-00018-ge9cf49d604ef-dirty #23
>> [0.368427] Hardware name: linux,dummy-virt (DT)
>> [0.368612] Call trace:
>> [0.368774] [] dump_backtrace+0x0/0x228
>> [0.368979] [] show_stack+0x10/0x20
>> [0.369270] [] dump_stack+0x88/0xac
>> [0.369459] [] panic+0x120/0x278
>> [0.369582] [] handle_bad_stack+0xd0/0xd8
>> [0.369799] [] __do_softirq+0x74/0x210
>> [0.370560] SMP: stopping secondary CPUs
>> [0.384269] Rebooting in 5 seconds..
>>
>> The config is based on what I use for booting my Hikey android
>> board. I haven't been able to narrow down exactly which
>> set of configs set this off.
>>
> 
> ... so for some reason, the percpu atom size change fails to take effect here.

I'm not completely up to speed with these series, so this may be noise:

When we added the IRQ stack Jungseok Lee discovered that alignment greater than
PAGE_SIZE only applies to CPU0. Secondary CPUs read the per-cpu init data into a
page-aligned area, but any greater alignment requirement is lost.

Because of this the irqstack was only 16byte aligned, and struct thread_info had
to be be discovered without depending on stack alignment.


Thanks,

James


Re: [PATCH] ptrace: Add compat PTRACE_{G,S}ETSIGMASK handlers

2017-07-17 Thread James Morse
Hi Michael,

On 17/07/17 11:17, Michael Ellerman wrote:
> James Morse  writes:
>> compat_ptrace_request() lacks handlers for PTRACE_{G,S}ETSIGMASK,
>> instead using those in ptrace_request(). The compat variant should
>> read a compat_sigset_t from userspace instead of ptrace_request()s
>> sigset_t.
>>
>> While compat_sigset_t is the same size as sigset_t, it is defined as
>> 2xu32, instead of a single u64. On a big-endian CPU this means that
>> compat_sigset_t is passed to user-space using middle-endianness,
>> where the least-significant u32 is written most significant byte
>> first.
>>
>> If ptrace_request()s code is used userspace will read the most
>> significant u32 where it expected the least significant.
> 
> But that's what the code has done since 2013.

> So won't changing this break userspace that has been written to work
> around that bug?

Wouldn't the same program then be broken when run natively instead? To work
around it userspace would have to know it was running under compat instead of
natively.
This only affects this exotic ptrace API for big-endian compat users. I think
there are so few users that no-one has noticed its broken.

I'm only aware of CRIU using this[0], and it doesn't look like powerpc has to
support compat-criu users:
https://github.com/xemul/criu/tree/master/compel/arch
only has a ppc64 entry, for which
https://github.com/xemul/criu/blob/master/compel/arch/ppc64/plugins/include/asm/syscall-types.h
puts 'bits per word' as 64, I don't think it supports ppc32, which is where this
bug would be seen.


> Or do we think nothing actually uses it in the wild and
> we can get away with it?

I think only Zhou Chengming has hit this, and there is no 'in the wild' code
that actually inspects the buffer returned by the call.

Zhou, were you using criu on big-endian ilp32 when you found this? Or was it
some other project that uses this API..
(ilp32 is a second user of compat on arm64)


Thanks,

James



[0]
The commit message that added this code points to CRIU and GDB. GDB doesn't use
this API. Debian's codesearch (which obviously isn't exhaustive) only finds CRIU
and systemtap making these calls.

It looks like criu just save/restores the sigset_t as a blob:
https://sources.debian.net/src/criu/2.12.1-2/criu/parasite-syscall.c/?hl=92#L92

It's sigset_t helpers aren't aware of this bug:
https://github.com/xemul/criu/blob/master/compel/include/uapi/ksigset.h

Systemtap just makes some calls as part of a self test:
https://sources.debian.net/src/systemtap/3.1-2/testsuite/systemtap.syscall/ptrace.c/?hl=198#L198



Re: [RFC PATCH 2/6] arm64: avoid open-coding THREAD_SIZE{,_ORDER}

2017-07-13 Thread James Morse
Hi Mark,

On 12/07/17 23:32, Mark Rutland wrote:
> Currently we define THREAD_SIZE_ORDER dependent on which arm64-specific
> page size kconfig symbol was selected. This is unfortunate, as it hides
> the relationship between THREAD_SIZE_ORDER and THREAD_SIZE, and makes it
> painful more painful than necessary to modify the thread size as we will
> need to do for some debug configurations.
> 
> This patch follows arch/metag's approach of consistently defining
> THREAD_SIZE in terms of THREAD_SIZE_ORDER. This avoids having ifdefs for
> particular page size configurations, and allows us to change a single
> definition to change the thread size.

I think this has unintended side effects for 64K page systems.  (or at least not
yet intended)

Today:
> #ifdef CONFIG_ARM64_4K_PAGES
> #define THREAD_SIZE_ORDER 2
> #elif defined(CONFIG_ARM64_16K_PAGES)
> #define THREAD_SIZE_ORDER 0
> #endif

Means THREAD_SIZE_ORDER is unset on 64K, and THREAD_SIZE is always:
> #define THREAD_SIZE   16384

/kernel/fork.c matches this with its:
> # if THREAD_SIZE >= PAGE_SIZE || defined(CONFIG_VMAP_STACK)
[...]
> #else
[...]
> void thread_stack_cache_init(void)
> {
>   thread_stack_cache = kmem_cache_create("thread_stack", THREAD_SIZE,
> THREAD_SIZE, 0, NULL);
>   BUG_ON(thread_stack_cache == NULL);
> }
> #endif

To create a kmemcache to share 64K pages as 16K stacks.


After this patch:
> #define THREAD_SHIFT  14
>
> #if THREAD_SHIFT >= PAGE_SHIFT
> #define THREAD_SIZE_ORDER (THREAD_SHIFT - PAGE_SHIFT)
> #else
> #define THREAD_SIZE_ORDER 0
> #endif

Means THREAD_SIZE_ORDER is 0, and:
> #define THREAD_SIZE   (PAGE_SIZE << THREAD_SIZE_ORDER)

gives us a 64K THREAD_SIZE.



Thanks,

James





> diff --git a/arch/arm64/include/asm/thread_info.h 
> b/arch/arm64/include/asm/thread_info.h
> index 141f13e9..6d0c59a 100644
> --- a/arch/arm64/include/asm/thread_info.h
> +++ b/arch/arm64/include/asm/thread_info.h
> @@ -23,13 +23,17 @@
>  
>  #include 
>  
> -#ifdef CONFIG_ARM64_4K_PAGES
> -#define THREAD_SIZE_ORDER2
> -#elif defined(CONFIG_ARM64_16K_PAGES)
> +#include 
> +
> +#define THREAD_SHIFT 14
> +
> +#if THREAD_SHIFT >= PAGE_SHIFT
> +#define THREAD_SIZE_ORDER(THREAD_SHIFT - PAGE_SHIFT)
> +#else
>  #define THREAD_SIZE_ORDER0
>  #endif
>  
> -#define THREAD_SIZE  16384
> +#define THREAD_SIZE  (PAGE_SIZE << THREAD_SIZE_ORDER)
>  #define THREAD_START_SP  (THREAD_SIZE - 16)
>  
>  #ifndef __ASSEMBLY__
> 



Re: [PATCH 1/2] arm64: mm: abort uaccess retries upon fatal signal

2017-07-12 Thread James Morse
Hi Mark,

On 11/07/17 15:19, Mark Rutland wrote:
> When there's a fatal signal pending, arm64's do_page_fault()
> implementation returns 0. The intent is that we'll return to the
> faulting userspace instruction, delivering the signal on the way.
> 
> However, if we take a fatal signal during fixing up a uaccess, this
> results in a return to the faulting kernel instruction, which will be
> instantly retried, resulting in the same fault being taken forever. As
> the task never reaches userspace, the signal is not delivered, and the
> task is left unkillable. While the task is stuck in this state, it can
> inhibit the forward progress of the system.
> 
> To avoid this, we must ensure that when a fatal signal is pending, we
> apply any necessary fixup for a faulting kernel instruction. Thus we
> will return to an error path, and it is up to that code to make forward
> progress towards delivering the fatal signal.

VM_FAULT_RETRY's 'I released your locks' behaviour is pretty nasty, but this
looks right. FWIW:
Reviewed-by: James Morse 

I also gave this a spin through LTP on Juno, based on v4.12-defconfig:
Tested-by: James Morse 


Thanks,

James



Re: [PATCH v4 3/3] arm64: kvm: inject SError with user space specified syndrome

2017-07-06 Thread James Morse
Hi gengdongjiu,

On 05/07/17 09:14, gengdongjiu wrote:
> On 2017/7/4 18:14, James Morse wrote:
>> Can you give us a specific example of an error you are trying to handle?

> For example:
> guest OS user space accesses device type memory, but happen SError. because 
> the
> SError is asynchronous faults, it does not take immediately. when guest OS 
> call "SVC" to enter guest os
> kernel space, the ESB instruction(Error Synchronization Barrier) will defter 
> this SError. so the SError happen immediately.

Ah, this isn't necessarily a 'RAS notification' SError/SEI, it may be a
'vanilla', v8.0 SError.

You've given a guest access to a physical device (how?), the guest has done
something, which caused the device to respond with SError.

Do you have a specific use-case for this? What is the ESR? What kinds of CPER
records does firmware generate? (if any)


We have to be careful here as devices can still generate asynchronous-interrupts
using SError, these aren't contained by ESB barriers. For these we should fall
back to KVM's v8.0 SError behaviour. KVM can tell them apart as the APEI code
doesn't claim the SError as an SEI notification, and with the RAS extensions the
ESR has the 'IDS' bit set.


>> How would a non-KVM user space process handle the error?

> it is indeed, non-KVM user space can not get the notification from hypervisor 
> or host kernel. thanks for the pointing out
> do you mean still Signal SIGBUS from memory_failure?

No, I was assuming this was a RAS notification SEI, (because your patch 1/3 of
touched the RAS cpu-features) being given to user space to handle.

Instead, can I ask how the host should handle this SError if it had accessed the
device itself?


I agree device pass-through is going to be a special case for KVM, but before
the host can deliver a device RAS error into the guest that was using the
device, it needs to fully understand what the error means:

The error may mean that the careful configuration that makes device-passthrough
safe no longer works, letting the guest continue to access the device may let it
damage another guest or the hyper visor.


We may need a way for the host RAS code to identify the driver responsible, to
handle the device error, or delegate it if that's appropriate.


[...]

>> So (a): a physical-CPU hardware error occurs, and then (c) we tell 
>> Qemu/kvmtool
>> via a KVM-specific API.
>>
>> Don't do this, it doesn't work for non-KVM users. You are exposing 
>> host-specific
>> implementation details to user space. What if I discover the same error via a
>> Polling GHES, or one of the IRQ flavours?

> James, you mainly concern the way that "tell Qemu/kvmtool via a KVM-specific 
> API", right?
> so how about still delivered SIGBUS same as the SEA(Synchronous External 
> Abort)?

> by the way, what is your meaning of below words?
>  >"What if I discover the same error via a Polling GHES, or one of the IRQ 
> flavours?"

This was my mistaken assumption that you were passing an APEI RAS SEI
notification to user space via a KVM specific API. This wouldn't work for
applications not using KVM, or notifications not using SEI.

Here I was asking what happens if the notification used NOTIFY_POLL or
NOTIFY_IRQ (instead of NOTIFY_SEI) in the GHES, but this isn't relevant as it
doesn't look like this is a APEI RAS notification.

[...]


>> If there is another type of CPER record where we should notify userspace, 
>> please
>> do it from mm/memory-failure.c, drivers/acpi/apei/ghes.c or
>> drivers/firmware/efi/cper.c. These should consider all user-space 
>> applications,
>> not just users of KVM, and not just on arm64.
> 
> here I have a question, in the "drivers/acpi/apei/ghes.c" code, it only 
> handle the memory section of CPER.

Yes, we are certainly missing processing for the other record types.


> if the section type of CPER is processor, it will not notify user-space. so 
> only let userspace handle the memory section is reasonable?

I think the only errors that user-space can know more than the kernel are memory
errors. These are the only RAS errors we should expect user space to handle. All
the others fall into either 'corrected by the kernel' or 'fatal for userspace -
SIGKILL'.


>> For memory errors we already have BUS_MCEERR_AR - action-required, and
>> BUS_MCEERR_AO - action-optional.
>>
>> For a TLB error, (Table 250 of UEFI 2.6), what is Qemu expected to do? Linux 
>> has
>> to classify the error and handle it as far as possible. In most cases the 
>> error
>> is either handled (no notification required), or fatal. Memory errors are the
>> only example I've found so far where an appl

Re: [PATCH v4 3/3] arm64: kvm: inject SError with user space specified syndrome

2017-07-04 Thread James Morse
Hi gengdongjiu,

Can you give us a specific example of an error you are trying to handle?
How would a non-KVM user space process handle the error?

KVM-users should be regular user space processes, we should not have a KVM-way
and everyone-else-way of handling errors.


On 04/07/17 05:46, gengdongjiu wrote:
> On 2017/7/3 16:39, Christoffer Dall wrote:
>> On Mon, Jun 26, 2017 at 08:46:39PM +0800, Dongjiu Geng wrote:
>>> when SError happen, kvm notifies user space to record the CPER,
>>> user space specifies and passes the contents of ESR_EL1 on taking
>>> a virtual SError interrupt to KVM, KVM enables virtual system
>>> error or asynchronous abort with this specifies syndrome. This
>>> patch modify the world-switch to restore VSESR_EL2, VSESR_EL2
>>> saves the virtual SError syndrome, it becomes the ESR_EL1 value when
>>> HCR_EL2.VSE injects an SError. This register is added by the
>>> RAS Extensions.
>>
>> This commit message is confusing and doesn't help me understand the
>> patch.
> (1) what is the rationale for the guest OS SError interrupt(SEI) handling in 
> the RAS solution?

>   a). In the firmware-first RAS solution, when guest OS happen a SError 
> interrupt (SEI), it will firstly trap to EL3(SCR_EL3.EA = 1);
>   b). The firmware logs, triages, and delegates the error exception to the 
> hypervisor. As the error came from guest OS  EL1, firmware
>   does by faking an SError interrupt exception entry to EL2.
>   c). Control transfers to the hypervisor's delegated error recovery 
> agent.Because HCR_EL2.AMO is set to 1, the hypervisor can use a
>   Virtual SError interrupt to delegate an asynchronous abort to EL1, by 
> setting HCR_EL2.VSE to 1 and using VESR_EL2 to pass syndrome.

So (a): a physical-CPU hardware error occurs, and then (c) we tell Qemu/kvmtool
via a KVM-specific API.

Don't do this, it doesn't work for non-KVM users. You are exposing host-specific
implementation details to user space. What if I discover the same error via a
Polling GHES, or one of the IRQ flavours?

User space should not have to know, or care, how linux is notified about APEI
RAS errors.


> (2) what is this patch mainly do?
>   As mentioned above, the hypervisor needs to enable virtual SError and pass 
> the virtual syndrome to the guest OS.
> 
>   a). when Control transfers to the hypervisor from firmware by faking an 
> SError interrupt, the hypervisor delivered the syndrome_info(esr_el2) and
>   host VA address( Qemu translate this VA address to the virtual machine 
> physical address(IPA)) using below new added "serror_intr" struct.
>   /* KVM_EXIT_SERROR_INTR */
>   struct {
>   __u32 syndrome_info;
>   __u64 address;
>   } serror_intr;

This is for a guest exit to host user-space. Here you are telling Qemu that a
physical CPU hardware error occurred. Qemu/kvmtool should not be expected to
parse the ESR, this is the job of the operating system.

When you're using ACPI firmware-first, SError/SEI is just a notification, the
important data is in the CPER records, which Qemu can't access, (and should be
processed by Linux APEI code).


It looks like you've calculated an address from FAR_EL2/HPFAR_EL2. For an
SError, these are meaningless.

(These registers hold real values for Synchronous External Abort, but for
 firmware-first we should prefer the CPER records.)


>   b). Qemu gets the address(host VA) delivered by KVM, translate this host VA 
> address to virtual machine physical address(IPA), and runtime record this 
> virtual
>  machine physical address(IPA) to the guest OS's APEI table.

I agree with this step, but you're acting on the wrong data. (You're converting
fault_ipa -> virtual address -> fault_ipa, something isn't right ...)

Qemu should react to a signal like BUS_MCEERR_A{R,O} from memory_failure(). This
mechanism serves all user space processes, not just kvm users. This is where the
user-space virtual address should come from. Qemu/kvmtool have to generate the
guest IPA once they discover the affected memory was presented to the guest
through KVM.


Your KVM-specific mechanism exposes too much raw information (raw ESR values to
user space), and only serves applications using KVM.

If there is another type of CPER record where we should notify userspace, please
do it from mm/memory-failure.c, drivers/acpi/apei/ghes.c or
drivers/firmware/efi/cper.c. These should consider all user-space applications,
not just users of KVM, and not just on arm64.


>   c). Qemu gets the syndrome_info delivered by KVM, it refers to this 
> syndrome value(but can be different from it) to specify the virtual SError 
> interrupt's syndrome through setting VESR_EL2.

'but can be different from it' is because a classification step is required, the
operating system should do this. We should only signal Qemu/kvmtool for errors
that can actually be handled. Some APEI notifications may be for corrected
errors, (I would hope these always 

[PATCH] ptrace: Add compat PTRACE_{G,S}ETSIGMASK handlers

2017-06-29 Thread James Morse
compat_ptrace_request() lacks handlers for PTRACE_{G,S}ETSIGMASK,
instead using those in ptrace_request(). The compat variant should
read a compat_sigset_t from userspace instead of ptrace_request()s
sigset_t.

While compat_sigset_t is the same size as sigset_t, it is defined as
2xu32, instead of a single u64. On a big-endian CPU this means that
compat_sigset_t is passed to user-space using middle-endianness,
where the least-significant u32 is written most significant byte
first.

If ptrace_request()s code is used userspace will read the most
significant u32 where it expected the least significant.

Instead of duplicating ptrace_request()s code as a special case in
the arch code, handle it here.

CC: Yury Norov 
CC: Andrey Vagin 
Reported-by: Zhou Chengming 
Signed-off-by: James Morse 
Fixes: 29000caecbe87 ("ptrace: add ability to get/set signal-blocked mask")
---
LTP test case here:
https://lists.linux.it/pipermail/ltp/2017-June/004932.html

 kernel/ptrace.c | 52 
 1 file changed, 40 insertions(+), 12 deletions(-)

diff --git a/kernel/ptrace.c b/kernel/ptrace.c
index 8d2c10714530..a5bebb6713e8 100644
--- a/kernel/ptrace.c
+++ b/kernel/ptrace.c
@@ -843,6 +843,22 @@ static int ptrace_regset(struct task_struct *task, int 
req, unsigned int type,
 EXPORT_SYMBOL_GPL(task_user_regset_view);
 #endif
 
+static int ptrace_setsigmask(struct task_struct *child, sigset_t *new_set)
+{
+   sigdelsetmask(new_set, sigmask(SIGKILL)|sigmask(SIGSTOP));
+
+   /*
+* Every thread does recalc_sigpending() after resume, so
+* retarget_shared_pending() and recalc_sigpending() are not
+* called here.
+*/
+   spin_lock_irq(&child->sighand->siglock);
+   child->blocked = *new_set;
+   spin_unlock_irq(&child->sighand->siglock);
+
+   return 0;
+}
+
 int ptrace_request(struct task_struct *child, long request,
   unsigned long addr, unsigned long data)
 {
@@ -914,18 +930,7 @@ int ptrace_request(struct task_struct *child, long request,
break;
}
 
-   sigdelsetmask(&new_set, sigmask(SIGKILL)|sigmask(SIGSTOP));
-
-   /*
-* Every thread does recalc_sigpending() after resume, so
-* retarget_shared_pending() and recalc_sigpending() are not
-* called here.
-*/
-   spin_lock_irq(&child->sighand->siglock);
-   child->blocked = new_set;
-   spin_unlock_irq(&child->sighand->siglock);
-
-   ret = 0;
+   ret = ptrace_setsigmask(child, &new_set);
break;
}
 
@@ -1149,7 +1154,9 @@ int compat_ptrace_request(struct task_struct *child, 
compat_long_t request,
  compat_ulong_t addr, compat_ulong_t data)
 {
compat_ulong_t __user *datap = compat_ptr(data);
+   compat_sigset_t set32;
compat_ulong_t word;
+   sigset_t new_set;
siginfo_t siginfo;
int ret;
 
@@ -1189,6 +1196,27 @@ int compat_ptrace_request(struct task_struct *child, 
compat_long_t request,
else
ret = ptrace_setsiginfo(child, &siginfo);
break;
+   case PTRACE_GETSIGMASK:
+   if (addr != sizeof(compat_sigset_t))
+   return -EINVAL;
+
+   sigset_to_compat(&set32, &child->blocked);
+
+   if (copy_to_user(datap, &set32, sizeof(set32)))
+   return -EFAULT;
+
+   ret = 0;
+   break;
+   case PTRACE_SETSIGMASK:
+   if (addr != sizeof(compat_sigset_t))
+   return -EINVAL;
+
+   if (copy_from_user(&set32, datap, sizeof(compat_sigset_t)))
+   return -EFAULT;
+
+   sigset_from_compat(&new_set, &set32);
+   ret = ptrace_setsigmask(child, &new_set);
+   break;
 #ifdef CONFIG_HAVE_ARCH_TRACEHOOK
case PTRACE_GETREGSET:
case PTRACE_SETREGSET:
-- 
2.11.0



Re: [PATCH v5] trace: ras: add ARM processor error information trace event

2017-06-28 Thread James Morse
Hi guys,

(CC: Punit)

On 26/06/17 15:06, Borislav Petkov wrote:
> On Sat, Jun 24, 2017 at 11:38:23AM +0800, Xie XiuQi wrote:
>> Add a new trace event for ARM processor error information, so that
>> the user will know what error occurred. With this information the
>> user may take appropriate action.
>>
>> These trace events are consistent with the ARM processor error
>> information table which defined in UEFI 2.6 spec section N.2.4.4.1.

Sorry I've been quiet on this - I'm not familiar with how the trace event stuff
works, or what picks it up in user space.


>> diff --git a/drivers/ras/ras.c b/drivers/ras/ras.c
>> index 39701a5..f76ab0f 100644
>> --- a/drivers/ras/ras.c
>> +++ b/drivers/ras/ras.c
>> @@ -22,7 +22,17 @@ void log_non_standard_event(const uuid_le *sec_type, 
>> const uuid_le *fru_id,
>>  
>>  void log_arm_hw_error(struct cper_sec_proc_arm *err)
>>  {
>> +int i;
>> +struct cper_arm_err_info *err_info;
>> +
>>  trace_arm_event(err);
>> +
>> +if (!trace_arm_err_info_event_enabled())
>> +return;
> 
> If we're going to check whether the tracepoint is enabled, you need
> to do that for arm_event TP too. Because from looking at the spec,
> arm_event dumps
> 
> Table 260. ARM Processor Error Section
> 
> and you're dumping
> 
> Table 261. ARM Processor Error Information Structure
> 
> which is embedded in the previous table.
> 
> So this is basically a single error event and the error info structures
> can describe different incarnations to that error event.
> 
> And you need to mirror exactly that behavior.
> 
> Then, when you do that, you need to document somewhere so that userspace
> knows to open *both* TPs in order to get the full error information.

I don't see how the data in Table 261 is usable without the corresponding
information in Table 260.

For example 261's 'Type: cache error' has to be interpreted with 260's 'Error
affinity level: 0', 'MPIDR_EL1: 0x100' to know that this cache error only
affected that specific CPU.

While I like the idea of just spitting out the CPER records (as we don't need to
invent a new format to pass the information to user space), I don't think we can
easily do this through tracepoints.

One tracepoint per cper header/table would be tricky to parse, wouldn't we have
to rely on the cpu-id and timestamps to pair the records back up?


> Alternatively, you can extend arm_event to get issued with *each*
> cper_arm_err_info but that would mean a lot of redundant information
> being shuffled out to userspace.

I think this is what we should do, but only keep the number of fields we export
to a minimum. If we always use the names in the spec, and user-space always
parses the 'format' file, we should be able to add more fields when they turn
out to be necessary. (looks like the trace infrastructure makes inventing a new
format easy!)

On that note, how does user space get the 'error severity' from Table 247,
'section severity', 'flags' and the other stringy bits of table 249?
Does it need them?


Thanks,

James





Re: [PATCH 18/20] arm64: ptrace: handle ptrace_request differently for aarch32 and ilp32

2017-06-27 Thread James Morse
Hi Yury, Zhou,

On 23/06/17 23:28, Yury Norov wrote:
> On Fri, Jun 23, 2017 at 06:03:37PM +0100, James Morse wrote:
>> Hi Yury,
>>
>> On 04/06/17 13:00, Yury Norov wrote:
>>> ILP32 has context-related structures different from both aarch32 and
>>> aarch64/lp64. In this patch compat_arch_ptrace() renamed to
>>> compat_a32_ptrace(), and compat_arch_ptrace() only makes choice between
>>> compat_a32_ptrace() and new compat_ilp32_ptrace() handler.
>>>
>>> compat_ilp32_ptrace() calls generic compat_ptrace_request() for all
>>> requests except PTRACE_GETSIGMASK and PTRACE_SETSIGMASK, which need
>>> special handling.
>>
>> Can you elaborate on this special handling?
>>
>> How come we don't need to wrap PTRACE_{G,S}ETSIGMASK for aarch32 compat?
>> >From kernel/signal32.c that uses compat_sigset_t too.
>>
>> It looks like aarch64, ilp32 and aarch32 all use the same size sigset_t,
>> so doesn't compat_ptrace_request() already do everything we need?
>>
>> ...
>>
>> Is this fixing an endian problem? If so, can we document it as such. Do we
>> already have the same bug for aarch32 compat?
> 
> Originally, the problem was found by Zhou Chengming: 
> https://lkml.org/lkml/2016/6/27/18
> But I think you right, this is the fix for endian.
> 
> It lookd like aarch32 is buggy, but IIUC to confirm it, the BE arm64
> machine is needed. I use qemu and AFAIR it has no BE support.
> 
> Zhou, can you test it on your machine and if the bug will be reproduced,
> send the patch for aarch32?

I've reproduced this on big endian compat-aarch32: yes its broken. I will respin
Zhou's patch as a fix.


Thanks,

James



Re: [PATCH 18/20] arm64: ptrace: handle ptrace_request differently for aarch32 and ilp32

2017-06-23 Thread James Morse
Hi Yury,

On 04/06/17 13:00, Yury Norov wrote:
> ILP32 has context-related structures different from both aarch32 and
> aarch64/lp64. In this patch compat_arch_ptrace() renamed to
> compat_a32_ptrace(), and compat_arch_ptrace() only makes choice between
> compat_a32_ptrace() and new compat_ilp32_ptrace() handler.
> 
> compat_ilp32_ptrace() calls generic compat_ptrace_request() for all
> requests except PTRACE_GETSIGMASK and PTRACE_SETSIGMASK, which need
> special handling.

Can you elaborate on this special handling?

How come we don't need to wrap PTRACE_{G,S}ETSIGMASK for aarch32 compat?
>From kernel/signal32.c that uses compat_sigset_t too.

It looks like aarch64, ilp32 and aarch32 all use the same size sigset_t,
so doesn't compat_ptrace_request() already do everything we need?

...

Is this fixing an endian problem? If so, can we document it as such. Do we
already have the same bug for aarch32 compat?


Thanks,

James


Re: [PATCH 16/20] arm64: signal32: move ilp32 and aarch32 common code to separated file

2017-06-19 Thread James Morse
Hi Yury,

On 04/06/17 13:00, Yury Norov wrote:
> Signed-off-by: Yury Norov 

Can I offer a body for the commit message:
ILP32 needs to mix 32bit struct siginfo and 64bit sigframe for its signal
handlers. Move the existing compat code for copying siginfo to user space and
manipulating signal masks into signal32_common.c so it can be used to deliver
aarch32 and ilp32 signals.


> diff --git a/arch/arm64/include/asm/signal32.h 
> b/arch/arm64/include/asm/signal32.h
> index e68fcce538e1..1c4ede717bd2 100644
> --- a/arch/arm64/include/asm/signal32.h
> +++ b/arch/arm64/include/asm/signal32.h
> @@ -13,6 +13,9 @@
>   * You should have received a copy of the GNU General Public License
>   * along with this program.  If not, see .
>   */
> +
> +#include 
> +
>  #ifndef __ASM_SIGNAL32_H
>  #define __ASM_SIGNAL32_H

Nit: This should go inside the guard.


> diff --git a/arch/arm64/kernel/signal32_common.c 
> b/arch/arm64/kernel/signal32_common.c
> new file mode 100644
> index ..5bddc25dca12
> --- /dev/null
> +++ b/arch/arm64/kernel/signal32_common.c
> @@ -0,0 +1,135 @@
[...]
> +#include 
> +#include 
> +#include 

What do you need ratelimit.h for?


> +#include 
> +
> +#include 

I can't see anything using these ESR_ macros in here...


> +#include 

This was for the VFP save/restore code, which you didn't move...


> +#include 
> +#include 

[...]


> +int copy_siginfo_to_user32(compat_siginfo_t __user *to, const siginfo_t 
> *from)
[...]
> + case __SI_FAULT:
> + err |= __put_user((compat_uptr_t)(unsigned long)from->si_addr,
> +   &to->si_addr);

This looks tricky. si_addr comes from FAR_EL1 when user-space touches something
it shouldn't. This could be a 64bit value as ilp32 processes can still branch to
64bit addresses in registers and generate loads that cross the invisible 4GB
boundary. Here you truncate the 64bit address.
Obviously this can't happen at all with aarch32, and for C programs its into
undefined-behaviour territory, but it doesn't feel right to pass an address to
user-space that we know is wrong... but we don't have an alternative.

This looks like a class of problem particular to ilp32/x32: 'accessed an address
you can't encode with a signal'. After a quick dig in x86's x32 code, it looks
like they only pass the first 32bits of si_addr too.

One option is to mint a new si_code to go with SIGBUS meaning something like
'address overflowed si_addr'. Alternatively we could just kill tasks that do 
this.


Thanks,

James


Re: [PATCH 05/20] arm64: rename COMPAT to AARCH32_EL0 in Kconfig

2017-06-19 Thread James Morse
Hi Yury,

On 04/06/17 12:59, Yury Norov wrote:
> From: Andrew Pinski 
> 
> In this patchset  ILP32 ABI support is added. Additionally to AARCH32,
> which is binary-compatible with ARM, ILP32 is (mostly) ABI-compatible.
> 
> From now, AARCH32_EL0 (former COMPAT) config option means the support of
> AARCH32 userspace, ARM64_ILP32 - support of ILP32 ABI (see next patches),
> and COMPAT indicates that one of them, or both, is enabled.
> 
> Where needed, CONFIG_COMPAT is changed over to use CONFIG_AARCH32_EL0 instead

Nit: You have 'COMPAT' around compat_hwcap_str's definition, but its only user
is wrapped in 'AARCH32_EL0'.


After this patch
arch/arm64/kernel/perf_callchain.c::perf_callchain_user() still has:
>   if (!compat_user_mode(regs)) {
>   /* AARCH64 mode */
...
>   } else {
> #ifdef CONFIG_COMPAT
>   /* AARCH32 compat mode */
...
> #endif
>   }

I think this one should become CONFIG_AARCH32_EL0. compat to this code means the
fp is 'compat_fp' in x11, and it should read a 32bit call chain from user-space.

This is confusing as 'is_compat_task()' matches one of aarch32 or ilp32, but
compat_user_mode(regs) only matches aarch32 as it checks the saved spsr. I can't
see any problem caused by this today, but its going to bite someone in the
future. Can this be renamed aarch32_user_mode()? (turns out 'a32' is the name of
just one of aarch32's instruction sets[0].)


Thanks,

James

[0] 
http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.faqs/ka16137.html



Re: [PATCH 14/20] arm64: ilp32: add sys_ilp32.c and a separate table (in entry.S) to use it

2017-06-08 Thread James Morse
Hi Yury,

On 04/06/17 13:00, Yury Norov wrote:
> From: Andrew Pinski 
> 
> Add a separate syscall-table for ILP32, which dispatches either to native
> LP64 system call implementation or to compat-syscalls, as appropriate.

(I'm still reading through this series trying to understand it, but spotted 
this: )

> diff --git a/arch/arm64/kernel/entry.S b/arch/arm64/kernel/entry.S
> index 279bc2ab10c3..7d52fe1ec6bd 100644
> --- a/arch/arm64/kernel/entry.S
> +++ b/arch/arm64/kernel/entry.S
> @@ -577,6 +594,7 @@ el0_svc_compat:
>* AArch32 syscall handling
>*/
>   adrpstbl, compat_sys_call_table // load compat syscall table 
> pointer
> + ldr x16, [tsk, #TSK_TI_FLAGS]
>   uxtwscno, w7// syscall number in w7 (r7)
>   mov sc_nr, #__NR_compat_syscalls
>   b   el0_svc_naked
> @@ -798,15 +816,21 @@ ENDPROC(ret_from_fork)
>   .align  6
>  el0_svc:
>   adrpstbl, sys_call_table// load syscall table pointer
> + ldr x16, [tsk, #TSK_TI_FLAGS]
>   uxtwscno, w8// syscall number in w8
>   mov sc_nr, #__NR_syscalls
> +#ifdef CONFIG_ARM64_ILP32
> + tst x16, #_TIF_32BIT_AARCH64
> + b.eqel0_svc_naked   // We are using LP64  syscall 
> table
> + adrpstbl, sys_call_ilp32_table  // load ilp32 syscall table 
> pointer
> + delouse_input_regs
> +#endif
>  el0_svc_naked:   // compat entry point
>   stp x0, scno, [sp, #S_ORIG_X0]  // save the original x0 and 
> syscall number
>   enable_dbg_and_irq
>   ct_user_exit 1
>  

> - ldr x16, [tsk, #TSK_TI_FLAGS]   // check for syscall hooks

If built with CONFIG_CONTEXT_TRACKING, ct_user_exit will call
context_tracking_user_exit(), this will clobber x16 which you depend on not
changing below:


> - tst x16, #_TIF_SYSCALL_WORK
> + tst x16, #_TIF_SYSCALL_WORK // check for syscall hooks

>   b.ne__sys_trace
>   cmp scno, sc_nr // check upper syscall limit
>   b.hsni_sys


Thanks,

James


Re: [PATCH v3 3/3] arm/arm64: signal SIBGUS and inject SEA Error

2017-05-12 Thread James Morse
Hi gengdongjiu,

On 05/05/17 13:31, gengdongjiu wrote:
> when guest OS happen an SEA, My current solution is shown below:
> 
> (1) host EL3 firmware firstly handle the SEA error and generate the CPER 
> record.
> (2) EL3 firmware separately copy the esr_el3, elr_el3, SPSR_el3,
> far_el3 to the esr_el2, elr_el2, SPSR_el2, far_el2.

Copying {ELR,SPSR,FAR}_EL3 to the EL2 registers rings some alarm bells: I'm sure
you exclude values from EL3 or the secure-world, we should never hand those to
the normal world.


> (3) then jump the EL2 hypervisor

> so the EL2 hypervisor uses the ESR that come from esr_el3,  here the
> ESR(esr_el3) value may be different with the exist KVM API's ESR.

The ESR may be different between EL3 and EL2. The ESR contains the severity of
the event, the CPU will choose this when it takes the SError to EL3. If it had
taken the SError to EL2, the CPU may have classified the error differently.

Firmware may need to generate a more severe ESR if it receives an error that
would be propagated by delivering SEI to a lower exception level, for example if
an EL2 system register is 'infected'.

This is the same for Qemu/kvmtool. A contained error at EL2 may be an
uncontained error if we hand it to guest EL1. Linux's RAS code will decide this
with its choice of signal to send, (and possibly which code to set).
Qemu/kvmtool need to choose an appropriate APEI notification, which may involve
generating a relevant ESR.

Also relevant is the problem we discussed earlier with trying to deliver fake
Physical-SError from software at EL3: If the SError is routed to EL2, and EL2
has PSTATE.A masked, EL3 has to wait and try again later. This is another case
where firmware may have to upgrade the classification of an error to 
uncontainable.


Thanks,

James


Re: [PATCH v3 3/3] arm/arm64: signal SIBGUS and inject SEA Error

2017-05-12 Thread James Morse
Hi gengdongjiu,

On 10/05/17 09:44, gengdongjiu wrote:
> On 2017/5/9 1:28, James Morse wrote:
>>>> (hwpoison for KVM is a corner case as Qemu's memory effectively has two 
>>>> users,
>>>> Qemu and KVM. This isn't the example of how user-space gets signalled.)
>>
>> KVM creates guests as if they were additional users of Qemu's memory. The 
>> code
>> in mm/memory-failure.c may find that Qemu didn't have the affected page 
>> mapped
>> to user-space - but it may have been in use by stage2.
>>
>> The KVM+SIGBUS patch hides this difference, meaning Qemu gets a signal when 
>> the
>> guest touches the hwpoison page as if Qemu had touched the page itself.
>>
>> Signals from KVM is a corner case, for firmware-first decisions should 
>> happen in
>> the APEI code based on CPER records.

>>> If so, how the KVM handle the SEA type other than hwpoison?

>> To deliver to a guest? It shouldn't have to know, user space should use a KVM
>> API to drive this.
>>
>> When received from hardware? It shouldn't have to care, these things should 
>> be
>> passed into the APEI code for handling. KVM just needs to put the host 
>> registers
>> back.

> Recently I confirmed with the hardware team. they said almost all the SEA 
> errors have the
> Poison flag, so may be there is no need to consider other SEA errors other 
> than  hwPoison.
> only consider SEA hwpoison errors can be enough.

We should be careful here, by hwpoison I meant the Linux feature.
>From Documentation/vm/hwpoison.txt:
> Upcoming Intel CPUs have support for recovering from some memory errors
> (``MCA recovery''). This requires the OS to declare a page "poisoned",
> kill the processes associated with it and avoid using it in the future.

We were talking about KVM's reaction to 'the OS declaring a page poisoned'.
Lets try to call this one memory-failure, as that is its Kconfig name. (now I
understand why we've been confusing each other!)

Your hwpoison looks like something the CPU reports in the ERRSTATUS registers
(4.6.10 of DDI0587). This is something firmware should read, then describe to
the OS via CPER records. Depending on these CPER records linux may invoke its
memory-failure code.


>>> injection a SEA is no more than setting some registers: elr_el1, PC,
>>> PSTATE, SPSR_el1, far_el1, esr_el1
>>> I seen this KVM API do the same thing as Qemu.  do you found call this
>>> API will have issue and necessary to choose another ESR value?
>>
>> Should we let user-space pick the ESR to deliver to the guest? Yes, letting
>> user-space specify the ESR gives the most flexibility to do something clever 
>> in
>> the future. An obvious choice for SEA is between the external-abort and 
>> 'parity
>> or ECC error' codes. If we tell user-space which of these happened (I don't
>> think Linux does today) then Qemu can relay that information to the guest.

> may be the ESR is delivered by the KVM.
> (1) guest OS EL0 happen SEA due to hwpoison
> (2) CPU traps to EL3 firmware, and update the ESR_EL3
> (3) the EL3 firmware copies the ESR_EL3 to ESR_EL2
> (4) then jump to EL2 hypervisor, hypervisor uses the ESR_EL2 to inject the 
> SEA.
> 
> May be the esr_el2 can provide the accurate error information.
> or do you think user-space specify the ESR instead of esr_el2 is better?

I think the severity needs to be considered as the notification is handled by
each exception level. There are cases where it will need to be upgraded from
'contained' to 'uncontained'. (more discussion on another part of the thread).


Thanks,

James


Re: [PATCH V15 06/11] acpi: apei: handle SEA notification type for ARMv8

2017-05-12 Thread James Morse
Hi Tyler,

On 08/05/17 20:59, Baicar, Tyler wrote:
> On 5/8/2017 11:28 AM, James Morse wrote:
>> I was tidying up the masking/unmasking in entry.S, something I wasn't aware 
>> of
>> that leads to a bug:
>> entry.S will unmask interrupts for instruction/data aborts that came from a
>> context with interrupts enabled. This makes sense for get_user() and 
>> friends...
>> For do_sea() we pull nmi_enter() as this can interrupt interrupts-masked 
>> code,
>> such as APEI, but if we end up in here with interrupts unmasked we can take 
>> an
>> IRQ from this 'NMI' context, which will inherit the in_nmi() and could lead 
>> to
>> the deadlock we were originally trying to avoid.
>>
>> Teaching entry.S to spot external aborts is messy. I think the two choices 
>> are
>> to either mask interrupts when calling nmi_enter() (as these things should be
>> mutually exclusive), or to conditionally call nmi_enter() based on
>> interrupts_enabled(regs). I prefer the second one as it matches the 
>> notify_sea()
>> while interruptible that happens when KVM takes one of these.

Thinking about this some more: the KVM case is different as we know it was a
guest that triggered the external abort. Nothing the host kernel does is likely
to trigger either the same error or a related one.

But I can't think of a way this would trip twice on the host... yes your
suggestion looks fine.

(When we add SError/SEI support too we will need to change it as SEA may
interrupt SEI, and nmi_enter() has a BUG_ON(in_nmi()), so this nesting will need
explicitly checking.)


Thanks,

James



Re: [PATCH v3 3/3] arm/arm64: signal SIBGUS and inject SEA Error

2017-05-09 Thread James Morse
Hi Christoffer,

On 08/05/17 18:54, Christoffer Dall wrote:
> On Mon, May 08, 2017 at 06:28:02PM +0100, James Morse wrote:
> I must admit I am losing track of exactly what this proposed API was
> supposed to do.

There are two, and we keep jumping between them!
This is about two notification methods APEI has for arm64, 'SEA' and 'SEI'.

SEA is synchronous and looks like a data abort. Qemu/kvmtool can inject these
today using the KVM_GET/SET_ONE_REG API whenever it wants to.

SEI uses SError, is asynchronous and can be masked. In addition these need to be
consumed/synchronised by the ESB instruction, even when executed by a guest.
Hardware has the necessary bits to drive all this, we need to expose an API to
drive it.

(I try to spell them out each time so I don't confuse SEI with something
synchronous!)


This patch was about SEA. I think you've answered our question:

> However, if it's a question about setting up VCPU registers to a certain
> state and potentially modifying memory, then I think experience has
> shown us (psci) that emulating something in the kernel that userspace
> can have fine-grained control over is a bad idea, and should be left to
> userspace using as generic APIs as possible.
> 
> Furthermore, if I understand what injecting a SEA requires, it is very
> similar to resetting the CPU and loading data into guest memory, which
> QEMU already does today, and there is no reason to introduce additional
> APIs if it can be done using KVM_GET/SET_ONE_REG ioctls.


Thanks,

James


Re: [PATCH v3 1/3] arm64: kvm: support kvmtool to detect RAS extension feature

2017-05-08 Thread James Morse
Hi gengdongjiu,

On 04/05/17 18:20, gengdongjiu wrote:
>> On 30/04/17 06:37, Dongjiu Geng wrote:
>>> Handle kvmtool's detection for RAS extension, because sometimes
>>> the APP needs to know the CPU's capacity
>>
>>> diff --git a/arch/arm64/kvm/reset.c b/arch/arm64/kvm/reset.c
>>> index d9e9697..1004039 100644
>>> --- a/arch/arm64/kvm/reset.c
>>> +++ b/arch/arm64/kvm/reset.c
>>> @@ -64,6 +64,14 @@ static bool cpu_has_32bit_el1(void)
>>>   return !!(pfr0 & 0x20);
>>>  }
>>>
>>> +static bool kvm_arm_support_ras_extension(void)
>>> +{
>>> + u64 pfr0;
>>> +
>>> + pfr0 = read_system_reg(SYS_ID_AA64PFR0_EL1);
>>> + return !!(pfr0 & 0x1000);
>>> +}
>>
>> Why are we telling user-space that the CPU has RAS extensions? EL0 can't do
>> anything with this and the guest EL1 can detect it from the id registers.
>>
>>
>> Are you using this to decide whether or not to generate a HEST for the guest?
> 
> James, yes, it is.  my current user-space qemu EL0 patches indeed will
> check the RAS  extensions.
> if has the RAS extensions. for SEA, userspace qemu will generate the
> CPER and inject the SEA to guest;
> for SEI,  userspace qemu sets the virtual SEI with the specified
> Syndrome(set the HCR_EL2.VSE and vsesr_el2 );
> if not have RAS extensions, Qemu does nothing

But you can use APEI in a guest on CPUs without the RAS extensions: the host may
signal memory errors to Qemu for any number of reasons, user-space shouldn't
care how it knows. Examples are PCI-AER, any APEI event notified by polling or
one of the flavours of irq.

I would expect Qemu to generate a HEST based on its abilities, i.e. if it
supports any mechanism of notifying the guest about errors. Choosing the
mechanism then depends on the type of error.

Ideally the Qemu code for HEST/GHES/CPER generation code using some of the irqs
and polling could be shared with x86, as these should be possible using common
KVM APIs.


>> If Qemu/kvmtool supports handling memory-failure notifications from signals 
>> you
>> should always generate a HEST. The GHES notification method could be anything
>> Qemu can deliver to the guest using the KVM APIs. Notifications from Qemu to 
>> the
>> guest don't depend on the RAS extensions. KVM has APIs for IRQ and SEA (you 
>> can
>> use KVM_SET_ONE_REG).
> 
> I will consider your suggestion to  always generate a CPER instead of

(generate a HEST, CPER are the runtime records. There are too many acronyms in
this space!)

> relying on the RAS extensions, thanks


Thanks,

James



Re: [PATCH V15 06/11] acpi: apei: handle SEA notification type for ARMv8

2017-05-08 Thread James Morse
Hi Tyler,

On 19/04/17 00:05, Tyler Baicar wrote:
> ARM APEI extension proposal added SEA (Synchronous External Abort)
> notification type for ARMv8.
> Add a new GHES error source handling function for SEA. If an error
> source's notification type is SEA, then this function can be registered
> into the SEA exception handler. That way GHES will parse and report
> SEA exceptions when they occur.
> An SEA can interrupt code that had interrupts masked and is treated as
> an NMI. To aid this the page of address space for mapping APEI buffers
> while in_nmi() is always reserved, and ghes_ioremap_pfn_nmi() is
> changed to use the helper methods to find the prot_t to map with in
> the same way as ghes_ioremap_pfn_irq().

> diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
> index b74d8b7..10013ff 100644
> --- a/arch/arm64/mm/fault.c
> +++ b/arch/arm64/mm/fault.c
> @@ -518,6 +520,17 @@ static int do_sea(unsigned long addr, unsigned int esr, 
> struct pt_regs *regs)
>   pr_err("Synchronous External Abort: %s (0x%08x) at 0x%016lx\n",
>   inf->name, esr, addr);
>  
> + /*
> +  * Synchronous aborts may interrupt code which had interrupts masked.
> +  * Before calling out into the wider kernel tell the interested
> +  * subsystems.
> +  */
> + if (IS_ENABLED(CONFIG_ACPI_APEI_SEA)) {
> + nmi_enter();
> + ghes_notify_sea();
> + nmi_exit();
> + }
> +
>   info.si_signo = SIGBUS;
>   info.si_errno = 0;
>   info.si_code  = 0;


I was tidying up the masking/unmasking in entry.S, something I wasn't aware of
that leads to a bug:
entry.S will unmask interrupts for instruction/data aborts that came from a
context with interrupts enabled. This makes sense for get_user() and friends...
For do_sea() we pull nmi_enter() as this can interrupt interrupts-masked code,
such as APEI, but if we end up in here with interrupts unmasked we can take an
IRQ from this 'NMI' context, which will inherit the in_nmi() and could lead to
the deadlock we were originally trying to avoid.

Teaching entry.S to spot external aborts is messy. I think the two choices are
to either mask interrupts when calling nmi_enter() (as these things should be
mutually exclusive), or to conditionally call nmi_enter() based on
interrupts_enabled(regs). I prefer the second one as it matches the notify_sea()
while interruptible that happens when KVM takes one of these.



Thanks,

James


Re: [PATCH v3 3/3] arm/arm64: signal SIBGUS and inject SEA Error

2017-05-08 Thread James Morse
Hi gengdongjiu,

On 04/05/17 17:52, gengdongjiu wrote:
> 2017-05-04 23:42 GMT+08:00 gengdongjiu :
>> On 30/04/17 06:37, Dongjiu Geng wrote:
>>> diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
>>> index 105b6ab..a96594f 100644
>>> --- a/arch/arm/kvm/mmu.c
>>> +++ b/arch/arm/kvm/mmu.c
>>> +static void kvm_handle_bad_page(unsigned long address,
>>> + bool hugetlb, bool hwpoison)
>>> +{
>>> + /* handle both hwpoison and other synchronous external Abort */
>>> + if (hwpoison)
>>> + kvm_send_signal(address, hugetlb, true);
>>> + else
>>> + kvm_send_signal(address, hugetlb, false);
>>> +}
>>
>> Why the extra level of indirection? We only want to signal userspace like 
>> this
>> from KVM for hwpoison. Signals for RAS related reasons should come from the 
>> bits
>> of the kernel that decoded the error.
> 
> For the SEA, the are maily two types:
> 0b01 Synchronous External Abort on memory access.
> 0b0101xx Synchronous External Abort on page table walk. DFSC[1:0]
> encode the level.

(KVM shouldn't have to make decisions about this)


> hwpoison should belong to the  "Synchronous External Abort on memory access"
> if the SEA type is not hwpoison, such as page table walk, do you mean
> KVM do not deliver the SIGBUS?


The flow of events should be SEI/SEA from firmware to the hosts's APEI code. KVM
should only be involved to get us back to the host if we were running a guest.
The APEI/hwpoison code may cause a set of processes to be sent signals. The code
in mm/memory-failure.c does this by walking the process rmaps using the physical
addresses in the CPER records.

We want user space to be sent signals as this can (and should) work in exactly
the same way on arm64 as it does on x86 or any other architecture. If a
web-browser can handle SIGBUS notifications for memory-corruption, it shouldn't
have to care what architecture it is running on.

So what is that KVM+SIGBUS patch about?...

>> (hwpoison for KVM is a corner case as Qemu's memory effectively has two 
>> users,
>> Qemu and KVM. This isn't the example of how user-space gets signalled.)

KVM creates guests as if they were additional users of Qemu's memory. The code
in mm/memory-failure.c may find that Qemu didn't have the affected page mapped
to user-space - but it may have been in use by stage2.

The KVM+SIGBUS patch hides this difference, meaning Qemu gets a signal when the
guest touches the hwpoison page as if Qemu had touched the page itself.

Signals from KVM is a corner case, for firmware-first decisions should happen in
the APEI code based on CPER records.


> If so, how the KVM handle the SEA type other than hwpoison?

To deliver to a guest? It shouldn't have to know, user space should use a KVM
API to drive this.

When received from hardware? It shouldn't have to care, these things should be
passed into the APEI code for handling. KVM just needs to put the host registers
back.


>>> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
>>> index bb02909..1d2e2e7 100644
>>> --- a/include/uapi/linux/kvm.h
>>> +++ b/include/uapi/linux/kvm.h
>>> @@ -1306,6 +1306,7 @@ struct kvm_s390_ucas_mapping {
>>>  #define KVM_S390_GET_IRQ_STATE  _IOW(KVMIO, 0xb6, struct 
>>> kvm_s390_irq_state)
>>>  /* Available with KVM_CAP_X86_SMM */
>>>  #define KVM_SMI   _IO(KVMIO,   0xb7)
>>> +#define KVM_ARM_SEA   _IO(KVMIO,   0xb8)
>>>
>>>  #define KVM_DEV_ASSIGN_ENABLE_IOMMU (1 << 0)
>>>  #define KVM_DEV_ASSIGN_PCI_2_3 (1 << 1)
>>>
>>
>> Why do we need a userspace API for SEA? It can also be done by using
>> KVM_{G,S}ET_ONE_REG to change the vcpu registers. The advantage of doing it 
>> this
>> way is you can choose which ESR value to use.
>>
>> Adding a new API call to do something you could do with an old one doesn't 
>> look
>> right.
> 
> James, I considered your suggestion before that use the
> KVM_{G,S}ET_ONE_REG to change the vcpu registers. but I found it does
> not have difference to use the alread existed KVM API. 

(Only that is an in-kernel helper, not a published API)


> so may be
> changing the vcpu registers in qemu will duplicate with the KVM APIs.

That is true, but the alternative is a new API that doesn't do anything new, its
just more convenient.

Marc and Christoffer are the people to convince.
I argue the existing API is sufficient.


> injection a SEA is no more than setting some registers: elr_el1, PC,
> PSTATE, SPSR_el1, far_el1, esr_el1
> I seen this KVM API do the same thing as Qemu.  do you found call this
> API will have issue and necessary to choose another ESR value?

Should we let user-space pick the ESR to deliver to the guest? Yes, letting
user-space specify the ESR gives the most flexibility to do something clever in
the future. An obvious choice for SEA is between the external-abort and 'parity
or ECC error' codes. If we tell user-space which of these happened (I don't
think Linux does today) then Qemu can relay that information to the guest.



Thanks,

James


Re: [PATCH v3 7/8] arm64: exception: handle asynchronous SError interrupt

2017-05-08 Thread James Morse
Hi Xiongfeng Wang,

On 28/04/17 03:55, Xiongfeng Wang wrote:
>>> >> It is ok to just ignore the process following the ESB instruction in 
>>> >> el0_sync, because the process will be sent SIGBUS signal.
>> > 
>> > I don't understand. How will Linux know the process caused an error if we
>> > neither take an SError nor read DISR_EL1 after an ESB?

> I think there may be some misunderstanding here. The ESB instruction is 
> placed in kernel_entry
> of el0_sync and el0_irq. For the el0_sync, such as an syscall from userspace, 
> after ESB is executed,
> we check whether DISR.A is set. If it is not set, we go on to process the 
> syscall. If it is set, we
> jump to sError vector and then just eret.

Ah, this looks like an early optimisation!

We can't assume that the SError will result in the processing being killed, the
AET bits of the SError ISS Encoding (page D7-2284 of ARM-ARM DDI0487B.a), has a
'corrected' error encoding.
For these I would expect the SError-vector C code to do nothing and return to
where it came from. In this case the syscall should still be run.


Thanks,

James




Re: [PATCH v3 7/8] arm64: exception: handle asynchronous SError interrupt

2017-04-24 Thread James Morse
Hi Wang Xiongfeng,

On 21/04/17 12:33, Xiongfeng Wang wrote:
> On 2017/4/20 16:52, James Morse wrote:
>> On 19/04/17 03:37, Xiongfeng Wang wrote:
>>> On 2017/4/18 18:51, James Morse wrote:
>>>> The host expects to receive physical SError Interrupts. The ARM-ARM doesn't
>>>> describe a way to inject these as they are generated by the CPU.
>>>>
>>>> Am I right in thinking you want this to use SError Interrupts as an APEI
>>>> notification? (This isn't a CPU thing so the RAS spec doesn't cover this 
>>>> use)
>>>
>>> Yes, using sei as an APEI notification is one part of my consideration. 
>>> Another use is for ESB.
>>> RAS spec 6.5.3 'Example software sequences: Variant: asynchronous External 
>>> Abort with ESB'
>>> describes the SEI recovery process when ESB is implemented.
>>>
>>> In this situation, SEI is routed to EL3 (SCR_EL3.EA = 1). When an SEI 
>>> occurs in EL0 and not been taken immediately,
>>> and then an ESB instruction at SVC entry is executed, SEI is taken to EL3. 
>>> The ESB at SVC entry is
>>> used for preventing the error propagating from user space to kernel space. 
>>> The EL3 SEI handler collects
>>
>>> the errors and fills in the APEI table, and then jump to EL2 SEI handler. 
>>> EL2 SEI handler inject
>>> an vSEI into EL1 by setting HCR_EL2.VSE = 1, so that when returned to OS, 
>>> an SEI is pending.
>>
>> This step has confused me. How would this work with VHE where the host runs 
>> at
>> EL2 and there is nothing at Host:EL1?
> 
> RAS spec 6.5.3 'Example software sequences: Variant: asynchronous External 
> Abort with ESB'
> I don't know whether the step described in the section is designed for guest 
> os or host os or both.
> Yes, it doesn't work with VHE where the host runs at EL2.

If it uses Virtual SError, it must be for an OS running at EL1, as the code
running at EL2 (VHE:Linux, firmware or Xen) must set HCR_EL2.AMO to make this 
work.

I can't find the section you describe in this RAS spec:
https://developer.arm.com/docs/ddi0587/latest/armreliabilityavailabilityandserviceabilityrasspecificationarmv8forthearmv8aarchitectureprofile


>> >From your description I assume you have some firmware resident at EL2.

> Our actual SEI step is planned as follows:

Ah, okay. You can't use Virtual SError at all then, as firmware doesn't own EL2:
KVM does. KVM will save/restore the HCR_EL2 register as part of its world
switch. Firmware must not modify the hyp registers behind the hyper-visors back.


> Host OS:  EL0/EL1 -> EL3 -> EL0/EL1

i.e. While the host was running, so EL3 sees HCR_EL2.AMO clear, and directs the
SEI to EL1.

> Guest OS:  EL0/EL1 -> EL3 -> EL2 -> EL0/EL1

i.e. While the guest was running, so EL3 sees HCR_EL2.AMO set, directs the SEI
to EL2 where KVM performs the world-switch and returns to host EL1.

Looks sensible.


> In guest os situation, we can inject an vSEI and return to where the SEI is 
> taken from.

An SEI from firmware should always end up in the host OS. If a guest was running
KVM is responsible for doing the world switch to restore host EL1 and return
there. This should all work today.


> But in host os situation, we can't inject an vSEI (if we don't set 
> HCR_EL2.AMO), so we have to jump to EL1 SEI vector.

Because your firmware is at EL3 you have to check PSTATE.A and jump into the
SError vector in both cases, you just choose if its VBAR_EL2 or VBAR_EL1 based
on HCR_EL2.AMO.


> Then the process following ESB won't be executed becuase SEI is taken to EL3 
> from the ESB instruction in EL1, and when control
> is returned to OS, we are in EL1 SEI vector rather than the ESB instruction.

Firmware should set ELR_EL1 so that an eret from the EL1 SError vector takes you
back to the ESB instruction that took us to EL3 in the first place.


There is a problem here mixing SError as the CPU->Software notification of RAS
errors and as APEI's SEI Software->Software notification that a firmware-first
error has been handled by EL3.

To elaborate:
The routine in this patch was something like:
1 ESB
2 Read DISR_EL1
3 If set, branch to the SError handler.

If we have firmware-first ESB will generate an SError as PSTATE.A at EL{1,2}
doesn't mask SError for EL3. Firmware can then handle the RAS error. With this
the CPU->Software story is done. Firmware notifying the OS via APEI is a
different problem.

If the error doesn't need to be notified to the OS via SEI, firmware can return
to step 1 above with DISR_EL1 clear. The OS may be told via another mechanism
such as polling or an irq.

If PSTATE.A was clear, firmware can return to t

Re: [PATCH v3 7/8] arm64: exception: handle asynchronous SError interrupt

2017-04-20 Thread James Morse
Hi Wang Xiongfeng,

On 19/04/17 03:37, Xiongfeng Wang wrote:
> On 2017/4/18 18:51, James Morse wrote:
>> The host expects to receive physical SError Interrupts. The ARM-ARM doesn't
>> describe a way to inject these as they are generated by the CPU.
>>
>> Am I right in thinking you want this to use SError Interrupts as an APEI
>> notification? (This isn't a CPU thing so the RAS spec doesn't cover this use)
> 
> Yes, using sei as an APEI notification is one part of my consideration. 
> Another use is for ESB.
> RAS spec 6.5.3 'Example software sequences: Variant: asynchronous External 
> Abort with ESB'
> describes the SEI recovery process when ESB is implemented.
> 
> In this situation, SEI is routed to EL3 (SCR_EL3.EA = 1). When an SEI occurs 
> in EL0 and not been taken immediately,
> and then an ESB instruction at SVC entry is executed, SEI is taken to EL3. 
> The ESB at SVC entry is
> used for preventing the error propagating from user space to kernel space. 
> The EL3 SEI handler collects

> the errors and fills in the APEI table, and then jump to EL2 SEI handler. EL2 
> SEI handler inject
> an vSEI into EL1 by setting HCR_EL2.VSE = 1, so that when returned to OS, an 
> SEI is pending.

This step has confused me. How would this work with VHE where the host runs at
EL2 and there is nothing at Host:EL1?
>From your description I assume you have some firmware resident at EL2.


> Then ESB is executed again, and DISR_EL1.A is set by hardware (2.4.4 ESB and 
> virtual errors), so that
> the following process can be executed.


> So we want to inject a vSEI into OS, when control is returned from EL3/2 to 
> OS, no matter whether
> it is on host OS or guest OS.

I disagree. With Linux/KVM you can't use Virtual SError to notify the host OS.
The host OS expects to receive Physical SError, these shouldn't be taken to EL2
unless a guest is running. Notifications from firmware that use SEA or SEI
should follow the routing rules in the ARM-ARM, which means they should never
reach a guest OS.

For VHE the host runs at EL2 and sets HCR_EL2.AMO. Any firmware notification
should come from EL3 to the host at EL2. The host may choose to notify the
guest, but this should always go via Qemu.

For non-VHE systems the host runs at EL1 and KVM lives at EL2 to do the world
switch. When KVM is running a guest it sets HCR_EL2.AMO, when it has switched
back to the host it clears it.

Notifications from EL3 that pretend to be SError should route SError to EL2 or
EL1 depending on HCR_EL2.AMO.
When KVM gets a Synchronous External Abort or an SError while a guest is running
it will switch back to the host and tell the handle_exit() code.  Again the host
may choose to notify the guest, but this should always go via Qemu.

The ARM-ARM pseudo code for the routing rules is in
AArch64.TakePhysicalSErrorException(). Firmware injecting fake SError should
behave exactly like this.


Newly appeared in the ARM-ARM is HCR_EL2.TEA, which takes Synchronous External
Aborts to EL2 (if SCR_EL3 hasn't claimed them). We should set/clear this bit
like we do HCR_EL2.AMO if the CPU has the RAS extensions.


>> You cant use SError to cover all the possible RAS exceptions. We already have
>> this problem using SEI if PSTATE.A was set and the exception was an imprecise
>> abort from EL2. We can't return to the interrupted context and we can't 
>> deliver
>> an SError to EL2 either.
> 
> SEI came from EL2 and PSTATE.A is set. Is it the situation where VHE is 
> enabled and CPU is running
> in kernel space. If SEI occurs in kernel space, can we just panic or shutdown.

firmware-panic? We can do a little better than that:
If the IESB bit is set in the ESR we can behave as if this were an ESB and have
firmware write an appropriate ESR to DISR_EL1 if PSTATE.A is set and the
exception came from EL2.

Linux should have an ESB in its __switch_to(), I've re-worked the daif masking
in entry.S so that PSTATE.A will always be unmasked here.

Other than these two cases, yes, this CPU really is wrecked. Firmware can power
it off via PSCI, and could notify another CPU about what happened. UEFI's table
250 'Processor Generic Error Section' has a 'Processor ID' field, with a note
that on ARM this is the MPIDR_EL1. Table 260 'ARM Processor Error Section' has
something similar.


Thanks,

James


Re: [PATCH v3 7/8] arm64: exception: handle asynchronous SError interrupt

2017-04-18 Thread James Morse
Hi Wang Xiongfeng,

On 18/04/17 02:09, Xiongfeng Wang wrote:
> I have some confusion about the RAS feature when VHE is enabled. Does RAS 
> spec support
> the situation when VHE is enabled. When VHE is disabled, the hyperviosr 
> delegates the error
> exception to EL1 by setting HCR_EL2.VSE to 1, and this will inject a virtual 
> SEI into OS.

(The ARM-ARM also requires the HCR_EL2.AMO to be set so that physical SError
 Interrupts are taken to EL2, meaning EL1 can never receive a physical SError)


> My understanding is that HCR_EL2.VSE is only used to inject a virtual SEI 
> into EL1.

... mine too ...

> But when VHE is enabled, the host OS will run at EL2. We can't inject a 
> virtual SEI into
> host OS. I don't know if RAS spec can handle this situation.

The host expects to receive physical SError Interrupts. The ARM-ARM doesn't
describe a way to inject these as they are generated by the CPU.

Am I right in thinking you want this to use SError Interrupts as an APEI
notification? (This isn't a CPU thing so the RAS spec doesn't cover this use)

This is straightforward for the hyper-visor to implement using Virtual SError.
I don't think its not always feasible for the host as Physical SError is routed
to EL3 by SCR_EL3.EA, meaning there is no hardware generated SError that can
reach EL2. Another APEI notification mechanism may be more appropriate.

EL3 may be able to 'fake' an SError by returning into the appropriate EL2 vector
if the exception came from EL{0,1}, or from EL2 and PSTATE.A is clear.
If the SError came from EL2 and the ESR_EL3.IESB bit is set, we can write an
appropriate ESR into DISR.
You cant use SError to cover all the possible RAS exceptions. We already have
this problem using SEI if PSTATE.A was set and the exception was an imprecise
abort from EL2. We can't return to the interrupted context and we can't deliver
an SError to EL2 either.

Setting SCR_EL3.EA allows firmware to handle these ugly corner cases. Notifying
the OS is a separate problem where APEI's SEI may not always be the best choice.


Thanks,

James


Re: [PATCH v3 8/8] arm64: exception: check shared writable page in SEI handler

2017-04-07 Thread James Morse
Hi Xie XiuQi,

On 30/03/17 11:31, Xie XiuQi wrote:
> From: Wang Xiongfeng 
> 
> Since SEI is asynchronous, the error data has been consumed. So we must
> suppose that all the memory data current process can write are
> contaminated. If the process doesn't have shared writable pages, the
> process will be killed, and the system will continue running normally.
> Otherwise, the system must be terminated, because the error has been
> propagated to other processes running on other cores, and recursively
> the error may be propagated to several another processes.

This is pretty complicated. We can't guarantee that another CPU hasn't modified
the page tables while we do this, (so its racy). We can't guarantee that the
corrupt data hasn't been sent over the network or written to disk in the mean
time (so its not enough).

The scenario you have is a write of corrupt data to memory where another CPU
reading it doesn't know the value is corrupt.

The hardware gives us quite a lot of help containing errors. The RAS
specification (DDI 0587A) describes your scenario as error propagation in '2.1.2
Architectural error propagation', and then classifies it in '2.1.3
Architecturally infected, containable and uncontainable' as uncontained because
the value is no longer in the general-purpose registers. For uncontained errors
we should panic().

We shouldn't need to try to track errors after we get a notification as the
hardware has done this for us.


Firmware-first does complicate this if events like this are not delivered using
a synchronous external abort, as Linux may have PSTATE.A masked preventing
SError Interrupts from being taken. It looks like PSTATE.A is masked much more
often than is necessary. I will look into cleaning this up.


Thanks,

James


Re: [PATCH v3] arm64: print a fault message when attempting to write RO memory

2017-04-04 Thread James Morse
Hi Stephen,

On 04/04/17 07:58, Stephen Boyd wrote:
> If a page is marked read only we should print out that fact,
> instead of printing out that there was a page fault. Right now we
> get a cryptic error message that something went wrong with an
> unhandled fault, but we don't evaluate the esr to figure out that
> it was a read/write permission fault.
> 
> Instead of seeing:
> 
>   Unable to handle kernel paging request at virtual address 08e460d8
>   pgd = 83504000
>   [08e460d8] *pgd=83473003, *pud=83503003, 
> *pmd=
>   Internal error: Oops: 964f [#1] PREEMPT SMP
> 
> we'll see:
> 
>   Unable to handle kernel write to read-only memory at virtual address 
> 08e760d8
>   pgd = 80003d3de000
>   [08e760d8] *pgd=83472003, *pud=83435003, 
> *pmd=
>   Internal error: Oops: 964f [#1] PREEMPT SMP
> 
> We also fold the userspace address check into is_permission_fault()
> instead of at the current callsite so that the function can't be
> abused with software PAN and a kernel space address.


> diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
> index 156169c6981b..c6560cb4ef50 100644
> --- a/arch/arm64/mm/fault.c
> +++ b/arch/arm64/mm/fault.c
> @@ -321,7 +337,7 @@ static int __kprobes do_page_fault(unsigned long addr, 
> unsigned int esr,
>   mm_flags |= FAULT_FLAG_WRITE;
>   }
>  
> - if (addr < USER_DS && is_permission_fault(esr, regs)) {
> + if (is_permission_fault(esr, regs, addr)) {
>   /* regs->orig_addr_limit may be 0 if we entered from EL0 */
>   if (regs->orig_addr_limit == KERNEL_DS)
>   die("Accessing user space memory with fs=KERNEL_DS", 
> regs, esr);
> 

This change means the PAN checks claim permission faults on kernel addresses
too, we need to keep the addr check for these. (sorry, I missed this detail
first time round)

When I tried lkdtm's 'WRITE_RO' test it gave:
> [ 2114.718807] Internal error: Accessing user space memory outside uaccess.h
> routines: 964e [#1] PREEMPT SMP

With this hunk omitted I got the expected:
> [ 1476.243296] Unable to handle kernel write to read-only memory at virtual
> address 08a11f10

I also gave this a spin on software-models with PAN and PAN+UAO, and TTBR0-PAN
on Juno.


With that hunk omitted:
Reviewed-by: James Morse 
Tested-by: James Morse 


Thanks,

James






Re: [RFC][CFT][PATCHSET v1] uaccess unification

2017-04-03 Thread James Morse
On 29/03/17 06:57, Al Viro wrote:
> The series lives in git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs.git
> in #work.uaccess.  It's based at 4.11-rc1.  Infrastructure is in
> #uaccess.stem, then it splits into per-architecture branches (uaccess.),

> Comments, review, testing, replacement patches, etc. are very welcome.

For the two "arm64: " patches:
Reviewed-by: James Morse 


Thanks,

James


Re: [PATCH v3 4/8] APEI: GHES: reserve a virtual page for SEI context

2017-03-31 Thread James Morse
Hi Xie XiuQi,

On 30/03/17 11:31, Xie XiuQi wrote:
> On arm64 platform, SEI may interrupt code which had interrupts masked.
> But SEI could be masked, so it's not treated as NMI, however SEA is
> treated as NMI.
> 
> So, the  memory area used to transfer hardware error information from
> BIOS to Linux can be determined only in NMI, SEI(arm64), IRQ or timer
> handler.
> 
> In this patch, we add a virtual page for SEI context.

I don't think this is the best way of solving the interaction problem. If we
ever need to add another notification method this gets even more complicated,
and the ioremap code has to know how these methods can interact.


> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
> index 045d101..b1f9b1f 100644
> --- a/drivers/acpi/apei/ghes.c
> +++ b/drivers/acpi/apei/ghes.c
> @@ -155,54 +162,55 @@ static void ghes_ioremap_exit(void)

> -static void __iomem *ghes_ioremap_pfn_irq(u64 pfn)
> -{
> - unsigned long vaddr, paddr;
> - pgprot_t prot;
> -
> - vaddr = (unsigned long)GHES_IOREMAP_IRQ_PAGE(ghes_ioremap_area->addr);
> + if (in_nmi()) {
> + raw_spin_lock(&ghes_ioremap_lock_nmi);
> + vaddr = (unsigned 
> long)GHES_IOREMAP_NMI_PAGE(ghes_ioremap_area->addr);
> + } else if (this_cpu_read(sei_in_process)) {

> + spin_lock_irqsave(&ghes_ioremap_lock_sei, flags);

I think this one should be a raw_spin_lock. I'm pretty sure this is for RT-linux
where spin_lock() on a contented lock will call schedule() in the same way
mutex_lock() does. If interrupts were masked by arch code then you need to use
raw_spin_lock.
So now we need to know how we got in here, we interrupted the SError handler so
this should only be Synchronous External Abort. Having to know how we got here
is another problem with this approach.


As suggested earlier[0], I think the best way is to allocate one page of virtual
address space per struct ghes, and move the locks out to the notify calls, which
can know how they are called.

I've pushed a branch to:
http://www.linux-arm.org/git?p=linux-jm.git;a=commit;h=refs/heads/apei_ioremap_rework/v1

I intend to post those patches as an RFC series later in the cycle, I've only
build tested it so far.


Thanks,

James

> + vaddr = (unsigned 
> long)GHES_IOREMAP_SEI_PAGE(ghes_ioremap_area->addr);
> + } else {
> + spin_lock_irqsave(&ghes_ioremap_lock_irq, flags);
> + vaddr = (unsigned 
> long)GHES_IOREMAP_IRQ_PAGE(ghes_ioremap_area->addr);
> + }


[0] https://lkml.org/lkml/2017/3/20/434



Re: [PATCH v3 2/8] acpi: apei: handle SEI notification type for ARMv8

2017-03-31 Thread James Morse
Hi Xie XiuQi,

On 30/03/17 11:31, Xie XiuQi wrote:
> ARM APEI extension proposal added SEI (asynchronous SError interrupt)
> notification type for ARMv8.
> 
> Add a new GHES error source handling function for SEI. In firmware
> first mode, if an error source's notification type is SEI. Then GHES
> could parse and report the detail error information.

The APEI additions are unsafe until patch 4 as SEA can interrupt SEI and
deadlock while trying to take the same set of locks. This patch needs to be
after that interaction is fixed/prevented, or we should prevent it by adding a
depends-on-not to the Kconfig to prevent SEI and SEA being registered at the
same time. (as a short term fix).

(more comments on this on that later patch)


> diff --git a/arch/arm64/kernel/traps.c b/arch/arm64/kernel/traps.c
> index e52be6a..53710a2 100644
> --- a/arch/arm64/kernel/traps.c
> +++ b/arch/arm64/kernel/traps.c

> @@ -625,6 +627,14 @@ asmlinkage void bad_mode(struct pt_regs *regs, int 
> reason, unsigned int esr)

bad_mode() is called in other scenarios too, for example executing an undefined
instruction at EL1. You split the SError path out of the vectors in patch 7, I
think we should do that here.


>   handler[reason], smp_processor_id(), esr,
>   esr_get_class_string(esr));
>  
> + /*
> +  * In firmware first mode, we could assume firmware will only generate 
> one
> +  * of cper records at a time. There is no risk for one cpu to parse 
> ghes table.
> +  */

I don't follow this comment, is this saying SError can't interrupt SError? We
already get this guarantee as the CPU masks SError when it takes an exception.

Firmware can generate multiple CPER records for a single 'event'. The CPER
records are the 'Data' in ACPI:Table 18-343 Generic Error Data Entry, and there
are 'zero or more' of these with a 'Generic Error Status Block' header that
describes the overall event. (Table 18-342).

I don't think we need this comment.


> + if (IS_ENABLED(CONFIG_ACPI_APEI_SEI) && ESR_ELx_EC(esr) == 
> ESR_ELx_EC_SERROR) {
> + ghes_notify_sei();
> + }

>   die("Oops - bad mode", regs, 0);
>   local_irq_disable();
>   panic("bad mode");

Thanks,

James



Re: [PATCH V2] acpi: apei: check for pending errors when probing HED type GHES entries

2017-03-30 Thread James Morse
Hi Tyler,

On 29/03/17 16:54, Tyler Baicar wrote:
> If a HED type error occurs prior to GHES probing, the kernel will
> never report the error. The HED driver will see that no notifiers
> are registered, and clear the interrupt.
> 
> This becomes a more serious problem with firmware that supports
> GHESv2 acknowledgements from the kernel. The firmware will populate
> the error and wait for the kernel ack. But since the kernel will
> never process the error we get into a state that the firmware will
> not send any more errors and the kernel will never see or ack the
> original error.
> 
> Check for pending errors when probing HED type GHES entries to
> avoid the above situation.

Isn't this a problem for the other notification types too?

It looks like SEI can indicate the notification is non-fatal even if we haven't
done the ghes_probe() yet and fail to find the CPER records.

Would moving the OSC call to set the APEI bit later solve this, or is it
specific to the way AMLs Notify() works?


Thanks,

James


> 
> This patch is based on Shiju's patch that adds support for GSIV
> and GPIO notification types:
> https://patchwork.kernel.org/patch/9628817/
> 
> Signed-off-by: Tyler Baicar 
> ---
>  drivers/acpi/apei/ghes.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
> index fd39929..cf5e938 100644
> --- a/drivers/acpi/apei/ghes.c
> +++ b/drivers/acpi/apei/ghes.c
> @@ -1035,6 +1035,7 @@ static int ghes_probe(struct platform_device *ghes_dev)
>   register_acpi_hed_notifier(&ghes_notifier_hed);
>   list_add_rcu(&ghes->list, &ghes_hed);
>   mutex_unlock(&ghes_list_mutex);
> + ghes_proc(ghes);
>   break;
>   case ACPI_HEST_NOTIFY_NMI:
>   ghes_nmi_add(ghes);
> 



Re: [PATCH] kvm: pass the virtual SEI syndrome to guest OS

2017-03-28 Thread James Morse
Hi gengdongjiu,

On 28/03/17 13:16, gengdongjiu wrote:
> On 2017/3/28 19:54, Achin Gupta wrote:
>> On Tue, Mar 28, 2017 at 01:23:28PM +0200, Christoffer Dall wrote:
>>> On Tue, Mar 28, 2017 at 11:48:08AM +0100, James Morse wrote:
>>>> On the host, part of UEFI is involved to generate the CPER records.
>>>> In a guest?, I don't know.
>>>> Qemu could generate the records, or drive some other component to do it.
>>>
>>> I think I am beginning to understand this a bit.  Since the guet UEFI
>>> instance is specifically built for the machine it runs on, QEMU's virt
>>> machine in this case, they could simply agree (by some contract) to
>>> place the records at some specific location in memory, and if the guest
>>> kernel asks its guest UEFI for that location, things should just work by
>>> having logic in QEMU to process error reports and populate guest memory.
>>>
>>> Is this how others see the world too?
>>
>> I think so!
>>
>> AFAIU, the memory where CPERs will reside should be specified in a GHES 
>> entry in
>> the HEST. Is this not the case with a guest kernel i.e. the guest UEFI 
>> creates a
>> HEST for the guest Kernel?
>>
>> If so, then the question is how the guest UEFI finds out where QEMU (acting 
>> as
>> EL3 firmware) will populate the CPERs. This could either be a contract 
>> between
>> the two or a guest DXE driver uses the MM_COMMUNICATE call (see [1]) to ask 
>> QEMU
>> where the memory is.
> 
> whether invoke the guest UEFI will be complex? not see the advantage. it 
> seems x86 Qemu
> directly generate the ACPI table, but I am not sure, we are checking the qemu
logical.
> let Qemu generate CPER record may be clear.

At boot UEFI in the guest will need to make sure the areas of memory that may be
used for CPER records are reserved. Whether UEFI or Qemu decides where these are
needs deciding, (but probably not here)...

At runtime, when an error has occurred, I agree it would be simpler (fewer
components involved) if Qemu generates the CPER records. But if UEFI made the
memory choice above they need to interact and it gets complicated again. The
CPER records are defined in the UEFI spec, so I would expect UEFI to contain
code to generate/parse them.


Thanks,

James



Re: [PATCH] kvm: pass the virtual SEI syndrome to guest OS

2017-03-28 Thread James Morse
Hi Peter,

On 28/03/17 12:33, Peter Maydell wrote:
> On 28 March 2017 at 12:23, Christoffer Dall  wrote:
>> On Tue, Mar 28, 2017 at 11:48:08AM +0100, James Morse wrote:
>>> On the host, part of UEFI is involved to generate the CPER records.
>>> In a guest?, I don't know.
>>> Qemu could generate the records, or drive some other component to do it.
>>
>> I think I am beginning to understand this a bit.  Since the guet UEFI
>> instance is specifically built for the machine it runs on, QEMU's virt
>> machine in this case, they could simply agree (by some contract) to
>> place the records at some specific location in memory, and if the guest
>> kernel asks its guest UEFI for that location, things should just work by
>> having logic in QEMU to process error reports and populate guest memory.
> 
> Is "write direct to guest memory" the best ABI here or would
> it be preferable to use the fw_cfg interface for the guest UEFI
> to retrieve the data items on demand?

As far as I understand the interaction between Qemu and UEFI isn't defined. The
eventual aim is to emulate ACPI's firmware first error handling for a guest.
This way the RAS behaviour for a host and the guest is the same.

The ABI is the guest OS gets a 'notification' (there is a list in acpi: 18.3.2.9
Hardware Error Notification), and then finds a pointer to the CPER records
(defined in the UEFI Spec Appendix N) at the address advertised by one of the
Generic Hardware Error Source (GHES) entries of the ACPI Hardware Error Source
Table (HEST).

How Qemu and UEFI conspire to make this happen is up for discussion.
My suggestion would be to try and closely mirror whatever happens on a physical
system so that UEFI only needs to test one set of code.


> Is there a pre-existing "this is how it works on x86" implementation?

I found [0], which on page 16 talks about Qemu injecting a Pseudo 'Software
Recoverable Action Required', which I assume is a flavour of NMI.

The ACPI/CPER stuff is arch agnostic and given Qemu is only ever likely to have
to handle memory errors it should be possible to use the same code for both x86
and arm64.


Thanks,

James

[0] https://events.linuxfoundation.org/images/stories/slides/lfcs2013_tanino.pdf


Re: [PATCH V13 10/10] arm/arm64: KVM: add guest SEA support

2017-03-28 Thread James Morse
Hi Tyler,

On 21/03/17 22:47, Tyler Baicar wrote:
> Currently external aborts are unsupported by the guest abort
> handling. Add handling for SEAs so that the host kernel reports
> SEAs which occur in the guest kernel.

Looks good,

Can we squash the APEI changes into the patch that added them? This would be one
fewer patches that then need the ACPI maintainer to review.


> diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
> index 962616f..105b6ab 100644
> --- a/arch/arm/kvm/mmu.c
> +++ b/arch/arm/kvm/mmu.c
> @@ -1406,6 +1407,24 @@ static void handle_access_fault(struct kvm_vcpu *vcpu, 
> phys_addr_t fault_ipa)
>   kvm_set_pfn_accessed(pfn);
>  }
>  
> +static bool is_abort_synchronous(unsigned long fault_status) {

I missed kvm_vcpu_dabt_isextabt() when I suggested we would need a helper (my
fault). Can we use that instead?

(my argument that the unused encodings are reserved doesn't hold if KVM is
already doing this... )


> @@ -1426,23 +1445,31 @@ int kvm_handle_guest_abort(struct kvm_vcpu *vcpu, 
> struct kvm_run *run)
>   unsigned long hva;
>   bool is_iabt, write_fault, writable;
>   gfn_t gfn;
> - int ret, idx;
> + int ret, idx, sea_status = 1;
> +
> + /* Check the stage-2 fault is trans. fault or write fault */
> + fault_status = kvm_vcpu_trap_get_fault_type(vcpu);
> +
> + fault_ipa = kvm_vcpu_get_fault_ipa(vcpu);
> +
> + /* The host kernel will handle the synchronous external abort. There
> +  * is no need to pass the error into the guest.
> +  */
> + if (is_abort_synchronous(fault_status))
> + sea_status = handle_guest_sea((unsigned long)fault_ipa,
> + kvm_vcpu_get_hsr(vcpu));


Why not return from here if the error has been handled?

You use sea_status to skip the next two things that KVM might do, but it goes on
to try and process this, possibly calling user_mem_abort(), surely all this is
unnecessary?


>  
>   is_iabt = kvm_vcpu_trap_is_iabt(vcpu);
> - if (unlikely(!is_iabt && kvm_vcpu_dabt_isextabt(vcpu))) {
> + if (unlikely(!is_iabt && kvm_vcpu_dabt_isextabt(vcpu)) && sea_status) {
>   kvm_inject_vabt(vcpu);
>   return 1;
>   }
>  
> - fault_ipa = kvm_vcpu_get_fault_ipa(vcpu);
> -
>   trace_kvm_guest_fault(*vcpu_pc(vcpu), kvm_vcpu_get_hsr(vcpu),
> kvm_vcpu_get_hfar(vcpu), fault_ipa);
>  
> - /* Check the stage-2 fault is trans. fault or write fault */
> - fault_status = kvm_vcpu_trap_get_fault_type(vcpu);
>   if (fault_status != FSC_FAULT && fault_status != FSC_PERM &&
> - fault_status != FSC_ACCESS) {
> + fault_status != FSC_ACCESS && sea_status) {
>   kvm_err("Unsupported FSC: EC=%#x xFSC=%#lx ESR_EL2=%#lx\n",
>   kvm_vcpu_trap_get_class(vcpu),
>   (unsigned long)kvm_vcpu_trap_get_fault(vcpu),


Thanks,

James



Re: [PATCH] kvm: pass the virtual SEI syndrome to guest OS

2017-03-28 Thread James Morse
Hi Christoffer,

(CC: Leif and Achin who know more about how UEFI fits into this picture)

On 21/03/17 19:39, Christoffer Dall wrote:
> On Tue, Mar 21, 2017 at 07:11:44PM +0000, James Morse wrote:
>> On 21/03/17 11:34, Christoffer Dall wrote:
>>> On Tue, Mar 21, 2017 at 02:32:29PM +0800, gengdongjiu wrote:
>>>> On 2017/3/20 23:08, James Morse wrote:
>>>>>>>> On 20/03/17 07:55, Dongjiu Geng wrote:
>>>>>>>>> In the RAS implementation, hardware pass the virtual SEI
>>>>>>>>> syndrome information through the VSESR_EL2, so set the virtual
>>>>>>>>> SEI syndrome using physical SEI syndrome el2_elr to pass to
>>>>>>>>> the guest OS
>>>>>
>>>>> How does this work with firmware first?
>>>>
>>>> I explained it in previous mail about the work flow.
>>>
>>> When delivering and reporting SEIs to the VM, should this happen
>>> directly to the OS running in the VM, or to the guest firmware (e.g.
>>> UEFI) running in the VM as well?
>>
>> 'firmware first' is the ACPI specs name for x86's BIOS or management-mode
>> handling the error. On arm64 we have multiple things called firmware, so the
>> name might be more confusing than helpful.
>>
>> As far as I understand it, firmware here refers to the secure-world and EL3.
>> Something like ATF can use SCR_EL3.EA to claim SErrors and external aborts,
>> routing them to EL3 where secure platform specific firmware generates CPER 
>> records.
>> For a guest, Qemu takes the role of this EL3-firmware.
>>
> Thanks for the clarification.  So UEFI in the VM would not be involved
> in this at all?

On the host, part of UEFI is involved to generate the CPER records.
In a guest?, I don't know.
Qemu could generate the records, or drive some other component to do it.

Leif and Achin are the people with the UEFI/bigger picture.


> My confusion here comes from not thinking about QEMU or KVM as firmware,
> but as the machine, so it would be sort of like the functionality is
> baked into hardware rather than firmware.
> 
> Note that to the VM, the environment will look like hardware without EL3
> and without a secure world, so any software assuming there's something
> 'hidden' behind the available non-secure modes must not decide to
> disable features if discovering the lack of a secure world.


Thanks,

James



Re: [PATCH] arm/arm64: KVM: send SIGBUS error to qemu

2017-03-23 Thread James Morse
Hi Dongjiu Geng,

On 23/03/17 13:01, Dongjiu Geng wrote:
> when the pfn is KVM_PFN_ERR_HWPOISON, it indicates to send
> SIGBUS signal from KVM's fault-handling code to qemu, qemu
> can handle this signal according to the fault address.

I'm afraid I beat you to it on this one:
https://www.spinics.net/lists/arm-kernel/msg568919.html

(Are you the same gengdj who ask me to post that patch?:
 https://lkml.org/lkml/2017/3/5/187 )

We don't need upstream KVM to do this until either arm or arm64 has
ARCH_SUPPORTS_MEMORY_FAILURE. Punit and Tyler have discovered problems with the
way arm64's hugepage and hwpoison interact:
https://www.spinics.net/lists/arm-kernel/msg568995.html


Some comments on the differences:

> diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
> index 962616fd4ddd..1307ec400de3 100644
> --- a/arch/arm/kvm/mmu.c
> +++ b/arch/arm/kvm/mmu.c
> @@ -1237,6 +1237,20 @@ static void coherent_cache_guest_page(struct kvm_vcpu 
> *vcpu, kvm_pfn_t pfn,
>   __coherent_cache_guest_page(vcpu, pfn, size);
>  }
>  
> +static void kvm_send_hwpoison_signal(unsigned long address,
> + struct task_struct *tsk)
> +{
> + siginfo_t info;
> +
> + info.si_signo   = SIGBUS;
> + info.si_errno   = 0;
> + info.si_code= BUS_MCEERR_AR;
> + info.si_addr= (void __user *)address;
> + info.si_addr_lsb = PAGE_SHIFT;

Any version of this patch should handle hugepage for the sizes KVM uses in its
stage2 mappings. By just passing PAGE_SHIFT you let the guest fault for each
page that makes up the hugepage.


> +
> + send_sig_info(SIGBUS, &info, tsk);
> +}
> +
>  static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
> struct kvm_memory_slot *memslot, unsigned long hva,
> unsigned long fault_status)
> @@ -1309,6 +1323,12 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, 
> phys_addr_t fault_ipa,
>   if (is_error_noslot_pfn(pfn))
>   return -EFAULT;
>  
> + if (is_error_hwpoison_pfn(pfn)) {
> + kvm_send_hwpoison_signal(kvm_vcpu_gfn_to_hva(vcpu, gfn),
> + current);
> + return -EFAULT;

This will return -EFAULT from the KVM_RUN ioctl(). Is Qemu expected to know it
should try again? This is indistinguishable from the is_error_noslot_pfn() error
above.

x86 returns 0 from this path, kvm_handle_bad_page() in arch/x86/kvm/mmu.c as the
SIGBUS should arrive first. If the SIGBUS is handled the error has been resolved
and Qemu can call KVM_RUN again. Returning an error and sending SIGBUS suggests
there are two problems.


> + }
> +
>   if (kvm_is_device_pfn(pfn)) {
>   mem_type = PAGE_S2_DEVICE;
>   flags |= KVM_S2PTE_FLAG_IS_IOMAP;



Thanks,

James



Re: [PATCH] arm64: kconfig: allow support for memory failure handling

2017-03-23 Thread James Morse
Hi Punit,

On 01/02/17 21:38, Tyler Baicar wrote:
> From: "Jonathan (Zhixiong) Zhang" 
> 
> If ACPI_APEI and MEMORY_FAILURE is configured, select
> ACPI_APEI_MEMORY_FAILURE. This enables memory failure recovery
> when such memory failure is reported through ACPI APEI. APEI
> (ACPI Platform Error Interfaces) provides a means for the
> platform to convey error information to the kernel.
> 
> Declare ARCH_SUPPORTS_MEMORY_FAILURE, as arm64 does support
> memory failure recovery attempt.

Am I right in thinking we should wait for the hugepage issue you found with
hwpoison [0] to be fixed before arm64 can have ARCH_SUPPORTS_MEMORY_FAILURE?

(If so, can this patch become part of that series to they are obviously 
related!)

Thanks,

James

[0] https://www.spinics.net/lists/arm-kernel/msg568995.html




> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> index f92778d..4cd12a0 100644
> --- a/arch/arm64/Kconfig
> +++ b/arch/arm64/Kconfig
> @@ -15,6 +15,8 @@ config ARM64
>   select ARCH_HAS_SG_CHAIN
>   select ARCH_HAS_TICK_BROADCAST if GENERIC_CLOCKEVENTS_BROADCAST
>   select ARCH_USE_CMPXCHG_LOCKREF
> + select ACPI_APEI_MEMORY_FAILURE if ACPI_APEI && MEMORY_FAILURE
> + select ARCH_SUPPORTS_MEMORY_FAILURE
>   select ARCH_SUPPORTS_ATOMIC_RMW
>   select ARCH_SUPPORTS_NUMA_BALANCING
>   select ARCH_WANT_COMPAT_IPC_PARSE_VERSION
> 





Re: [PATCH] kvm: pass the virtual SEI syndrome to guest OS

2017-03-22 Thread James Morse
Hi gengdongjiu

On 22/03/17 13:37, gengdongjiu wrote:
> On 2017/3/21 21:10, James Morse wrote:
>> On 21/03/17 06:32, gengdongjiu wrote:
>>> so for both SEA and SEI, do you prefer to below steps?
>>> EL0/EL1 SEI/SEA ---> EL3 firmware first handle --> EL2 hypervisor 
>>> notify >
>> the Qemu to inject SEI/SEA-->Qemu call KVM API to inject SEA/SEI>KVM 
>> >
>> inject SEA/SEI to guest OS
>>
>> Yes, to expand your EL2 hypervisor notify Qemu step:
>> 1 The host should call its APEI code to parse the CPER records.
>> 2 User space processes are then notified via SIGBUS (or for rasdaemon, trace
>>   points).
>> 3 Qemu can take the address delivered via SIGBUS and generate CPER records 
>> for
>>   the guest. It knows how to convert host addresses to guest IPAs, and it 
>> knows
>>   where in guest memory to write the CPER records.
>> 4 Qemu can then notify the guest via whatever mechanism it advertised via the
>>   HEST/GHES table. It might not be the same mechanism that the host received
>>   the notification through.
>>
>> Steps 1 and 2 are the same even if no guest is running, so we don't have to 
>> add
>> any special case for KVM. This is existing code that x86 uses.
>> We can test the Qemu parts without any firmware support and the APEI path in 
>> the
>> host and guest is the same.

>here do you mean map host APEI table to guest for steps 1 and 2 test? so 
> that the APEI path in the
>   host and guest is the same.

No, the hosts ACPI/APEI tables describe host physical addresses, the guest can't
access these.

Instead we can use Linux's hwpoison mechanism to call memory_failure() and if we
pick the address carefully, signal Qemu. From there we can test Qemu's
generation of CPER records and signalling the guest.

When a host and a guest both use APEI the HEST tables will be different because
the memory layout is different, but the path through APEI and the kernel's error
handling code would be the same.


>>>> How does this work with firmware first?
>>
>>> when the Guest OS triggers an SEI, it will firstly trap to EL3 firmware, 
>>> El3 firmware records the error
>>> info to the APEI table, 
>>
>> These are CPER records in a memory area pointed to by one of HEST's GHES 
>> entries?
>>
>>
>>> then copy the ESR_EL3 ELR_EL3 to ESR_EL2 ELR_EL2 and transfers control to 
>>> the
>>> hypervisor, hypervisor delegates the error exception to EL1 guest
>>
>> This is a problem, just because the error occurred while the guest was 
>> running
>> doesn't mean we should deliver it directly to the guest. Some of these errors
>> will be fatal for the CPU and the host should try and power it off to contain

> yes, some of error does not need to deliver to guest OS directly. for example 
> if the error is guest kernel fault error,
> hypervisor can directly power off the whole guest OS

I agree, Qemu should make that decision, depending on the users choice it can
print a helpful error message and exit, or try and restart the guest.


>> the fault. For example: CPER's 'micro-architectural error', should the guest
>> power-off the vCPU? All that really does is return to the hypervisor, the 
>> error

> for this example, I think it is better hypervisor directly close the whole 
> guest OS, instead of
> guest power-off the vCPU.

I picked this as an example because its not clear what it means and it probably
affects the host as well as the guest. We need to do the host error containment
first.


>>>> If we took a Physical SError Interrupt the CPER records are in the hosts 
>>>> memory.
>>>> To deliver a RAS event to the guest something needs to generate CPER 
>>>> records and
>>>> put them in the guest memory. Only Qemu knows where these memory regions 
>>>> are.
>>>>
>>>> Put another way, what is the guest expected to do with this SError 
>>>> interrupt?
>>>
>>> No, we do not only panic,if it is EL0 application SEI. the OS error recovery
>>> agent will terminate the EL0 application to isolate the error; If it is EL1 
>>> guest
>>> OS SError, guest OS can see whether it can recover. if the error was in a 
>>> read-only file cache buffer, guest OS
>>> can invalidate the page and reload the data from disk.
>>
>> How do we get an address for memory failure? SError is asynchronous, I don't
>> think it sets the FAR. (SEA is synchronous and its not guaranteed to set the

> Thank you to point that. sorry, my answe

Re: [PATCH V11 10/10] arm/arm64: KVM: add guest SEA support

2017-03-22 Thread James Morse
Hi Wang Xiongfeng,

On 22/03/17 02:46, Xiongfeng Wang wrote:
>> Guests are a special case as QEMU may never access the faulty memory itself, 
>> so
>> it won't receive the 'late' signal. It looks like ARM/arm64 KVM lacks support
>> for KVM_PFN_ERR_HWPOISON which sends SIGBUS from KVM's fault-handling code. I
>> have patches to add support for this which I intend to send at rc1.
>>
>> [0] suggests 'KVM qemu' sets these MCE flags to take the 'early' path, but 
>> given
>> x86s KVM_PFN_ERR_HWPOISON, this may be out of date.
>>
>>
>> Either way, once QEMU gets a signal indicating the virtual address, it can
>> generate its own APEI CPER records and use the KVM APIs to mock up an
>> Synchronous External Abort, (or inject an IRQ or run the vcpu waiting for the
>> guest's polling thread to come round, whichever was described to the guest 
>> via
>> the HEST/GHES tables).
>>
> 
> I have another confusion about the SIGBUS signal. Can QEMU always get a 
> SIGBUS when needed.
> I know one circumstance which will send SIGBUS. The 
> ghes_handle_memory_failure() in
> ghes_do_proc() will send SIGBUS to QEMU, but this only happens when there 
> exists memory section
> in ghes, that is the section type is CPER_SEC_PLATFORM_MEM.
> Suppose this case, an load  error in guest application causes an SEA, and the 
> firmware take it.
> The firmware begin to scan the error record and fill the ghes. But the error 
> record in memory node
> has been read by other handler.

(this looks like a race)

> The firmware won't add memory section in ghes, so 
> ghes_handle_memory_failure() won't be called.

I think this would be a firmware bug. Firmware can reserve as much memory as it
needs for writing CPER records, there should not be a case where 'the memory' is
currently being processed by another handler.

The memory firmware uses to write CPER records too shouldn't be published to the
OS until it has finished. Once firmware has finished writing the CPER records it
can update the memory pointed to by GHES->ErrorStatusAddress with the location
of the CPER records and invoke the Notification method for this GHES. (SEI, SEA,
IRQ etc). We should always get a complete set of CPER records to describe the 
error.

It firmware uses GHESv2 it can get an 'ack' write from APEI once it has finished
processing the records. Once it gets this firmware knows it can re-use the 
memory.

(Obviously each GHES entry can only process one error at a time. Firmware should
either handle this, or have one entry for each Error Source that can happen
independently)


> I mean that we may not rely on ghes_handle_memory_failure() to send SIGBUS to 
> QEMU. Whether we should
> add some other code to send SIGBUS in handle_guest_abort(). I don't know 
> whether the ARM/arm64
>  KVM_PFN_ERR_HWPOISON you mentioned above will cover all the cases.

The SIGBUS routine is part of the kernel's recovery method for memory errors. It
should cover all the errors reported with this CPER_SEC_PLATFORM_MEM.

Back to the race you describe. It shouldn't matter if one CPU processes an error
for guest memory while a vcpu is running on another. This may happen if the
error was detected by DRAM's background scrub.
If we don't treat KVM/Qemu as anything special the memory_failure()->SIGBUS path
will happen regardless of whether the fault interrupted the guest or not.


There are other types of error such as PCIe, CPU, BUS error etc. If it's
possible to recover from these we may need additional code in the kernel. This
shouldn't necessarily treat KVM as a special case.


Thanks,

James



Re: [PATCH] kvm: pass the virtual SEI syndrome to guest OS

2017-03-21 Thread James Morse
Hi Christoffer,

On 21/03/17 11:34, Christoffer Dall wrote:
> On Tue, Mar 21, 2017 at 02:32:29PM +0800, gengdongjiu wrote:
>> On 2017/3/20 23:08, James Morse wrote:
>>>>>> On 20/03/17 07:55, Dongjiu Geng wrote:
>>>>>>> In the RAS implementation, hardware pass the virtual SEI
>>>>>>> syndrome information through the VSESR_EL2, so set the virtual
>>>>>>> SEI syndrome using physical SEI syndrome el2_elr to pass to
>>>>>>> the guest OS
>>>
>>> How does this work with firmware first?
>>
>> I explained it in previous mail about the work flow.
> 
> When delivering and reporting SEIs to the VM, should this happen
> directly to the OS running in the VM, or to the guest firmware (e.g.
> UEFI) running in the VM as well?

'firmware first' is the ACPI specs name for x86's BIOS or management-mode
handling the error. On arm64 we have multiple things called firmware, so the
name might be more confusing than helpful.

As far as I understand it, firmware here refers to the secure-world and EL3.
Something like ATF can use SCR_EL3.EA to claim SErrors and external aborts,
routing them to EL3 where secure platform specific firmware generates CPER 
records.
For a guest, Qemu takes the role of this EL3-firmware.



Thanks,

James


Re: [PATCH] kvm: pass the virtual SEI syndrome to guest OS

2017-03-21 Thread James Morse
Hi,

On 21/03/17 06:32, gengdongjiu wrote:
> On 2017/3/20 23:08, James Morse wrote:
>> On 20/03/17 13:58, Marc Zyngier wrote:
>>> On 20/03/17 12:28, gengdongjiu wrote:
>>>> On 2017/3/20 19:24, Marc Zyngier wrote:
>>>>> On 20/03/17 07:55, Dongjiu Geng wrote:
>>>>>> In the RAS implementation, hardware pass the virtual SEI
>>>>>> syndrome information through the VSESR_EL2, so set the virtual
>>>>>> SEI syndrome using physical SEI syndrome el2_elr to pass to
>>>>>> the guest OS

(I've juggled the order of your replies:)

> so for both SEA and SEI, do you prefer to below steps?
> EL0/EL1 SEI/SEA ---> EL3 firmware first handle --> EL2 hypervisor notify >
the Qemu to inject SEI/SEA-->Qemu call KVM API to inject SEA/SEI>KVM >
inject SEA/SEI to guest OS

Yes, to expand your EL2 hypervisor notify Qemu step:
1 The host should call its APEI code to parse the CPER records.
2 User space processes are then notified via SIGBUS (or for rasdaemon, trace
  points).
3 Qemu can take the address delivered via SIGBUS and generate CPER records for
  the guest. It knows how to convert host addresses to guest IPAs, and it knows
  where in guest memory to write the CPER records.
4 Qemu can then notify the guest via whatever mechanism it advertised via the
  HEST/GHES table. It might not be the same mechanism that the host received
  the notification through.

Steps 1 and 2 are the same even if no guest is running, so we don't have to add
any special case for KVM. This is existing code that x86 uses.
We can test the Qemu parts without any firmware support and the APEI path in the
host and guest is the same.


>> Is anyone from Huawei looking at adding RAS support for Qemu?
>  yes, I am looking at Qemu and want to add RAS support.

Great, support in Qemu is one of the missing pieces. On x86 it looks like it
emulates machine-check-exceptions, which is how x86 did this before
firmware-first and APEI became the standard.


>  do you mean let Qemu inject both the SEA and SEI?

To do the notification, yes. It needs to happen after the CPER records have been
written, and the mechanism and CPER memory location need to match what the guest
was told via the HEST/GHES table.

If Qemu didn't tell the guest about firmware-first, it can still deliver the
guest an SError Interrupt.


SEA should be possible to do with the KVM_SET_REG API, GPIO/GSIV and the other
kind of interrupts can use irqfd. For SEI we may need to add an API call to KVM
to let it pend SError with a specific ESR.



>> How does this work with firmware first?

> when the Guest OS triggers an SEI, it will firstly trap to EL3 firmware, El3 
> firmware records the error
> info to the APEI table, 

These are CPER records in a memory area pointed to by one of HEST's GHES 
entries?


> then copy the ESR_EL3 ELR_EL3 to ESR_EL2 ELR_EL2 and transfers control to the
> hypervisor, hypervisor delegates the error exception to EL1 guest

This is a problem, just because the error occurred while the guest was running
doesn't mean we should deliver it directly to the guest. Some of these errors
will be fatal for the CPU and the host should try and power it off to contain
the fault. For example: CPER's 'micro-architectural error', should the guest
power-off the vCPU? All that really does is return to the hypervisor, the error
hasn't been contained.

Firmware should handle the error first, then the host, finally the guest via 
Qemu.


> OS by setting HCR_EL2.VSE to 1 and pass the virtual SEI syndrome through 
> vsesr_el2. 
> The EL1 guest OS check the DISR_EL1 syndrome information to decide to
> terminate the application, or do some other recovery action. because the 
> HCR_EL2.AMO is set, so in fact, read
> DISR_EL1, it returns the VDISR_EL2. and VDISR_EL2 is loaded from VSESR_EL2, 
> so here I pass the virtual SEI
> syndrome vsesr_el2.

So this is how an SError Interrupt's ESR gets into a guest. How does it get hold
of the CPER records?


>> If we took a Physical SError Interrupt the CPER records are in the hosts 
>> memory.
>> To deliver a RAS event to the guest something needs to generate CPER records 
>> and
>> put them in the guest memory. Only Qemu knows where these memory regions are.
>>
>> Put another way, what is the guest expected to do with this SError interrupt?
>
> No, we do not only panic,if it is EL0 application SEI. the OS error recovery
> agent will terminate the EL0 application to isolate the error; If it is EL1 
> guest
> OS SError, guest OS can see whether it can recover. if the error was in a 
> read-only file cache buffer, guest OS
> can invalidate the page and reload the data from disk.

How do we get an address for memory failure? SError is asynchronous,

Re: [PATCH] kvm: pass the virtual SEI syndrome to guest OS

2017-03-20 Thread James Morse
Hi Dongjiu Geng,

On 20/03/17 13:58, Marc Zyngier wrote:
> On 20/03/17 12:28, gengdongjiu wrote:
>> On 2017/3/20 19:24, Marc Zyngier wrote:
>>> Please include James Morse on anything RAS related, as he's already
>>> looking at related patches.

(Thanks Marc,)

>>> On 20/03/17 07:55, Dongjiu Geng wrote:
>>>> In the RAS implementation, hardware pass the virtual SEI
>>>> syndrome information through the VSESR_EL2, so set the virtual
>>>> SEI syndrome using physical SEI syndrome el2_elr to pass to
>>>> the guest OS

How does this work with firmware first?
If we took a Physical SError Interrupt the CPER records are in the hosts memory.
To deliver a RAS event to the guest something needs to generate CPER records and
put them in the guest memory. Only Qemu knows where these memory regions are.

Put another way, what is the guest expected to do with this SError interrupt?
The only choice is panic(). We should send this via Qemu so that we can add
proper guest RAS support later. Once Qemu has written the CPER records into
guest memory, it can notify the guest.

Is anyone from Huawei looking at adding RAS support for Qemu?


It looks like we should save/restore VSESR_EL2 as part of the guest CPU state,
but this needs doing with the cpufeature framework so that the single-image
kernel works on platforms with and without these features.

Xie XiuQi's series for SEI also touches the cpufeature framework.


>>>> diff --git a/arch/arm64/kvm/hyp/switch.c b/arch/arm64/kvm/hyp/switch.c
>>>> index aede1658aeda..770a153fb6ba 100644
>>>> --- a/arch/arm64/kvm/hyp/switch.c
>>>> +++ b/arch/arm64/kvm/hyp/switch.c
>>>> @@ -86,6 +86,13 @@ static void __hyp_text __activate_traps(struct kvm_vcpu 
>>>> *vcpu)
>>>>isb();
>>>>}
>>>>write_sysreg(val, hcr_el2);
>>>> +#ifdef CONFIG_HAS_RAS_EXTENSION
>>>> +  /* If virtual System Error or Asynchronous Abort is pending. set
>>>> +   * the virtual exception syndrome information
>>>> +   */
>>>> +  if (vcpu->arch.hcr_el2 & HCR_VSE)

>>>> +  write_sysreg(vcpu->arch.fault.vsesr_el2, vsesr_el2);

This won't build with versions of binutils that don't recognise vsesr_el2.
Is there another patch out there that adds a sysreg definition for vsesr_el2?


>>>> diff --git a/arch/arm64/kvm/inject_fault.c b/arch/arm64/kvm/inject_fault.c
>>>> index da6a8cfa54a0..08a13dfe28a8 100644
>>>> --- a/arch/arm64/kvm/inject_fault.c
>>>> +++ b/arch/arm64/kvm/inject_fault.c
>>>> @@ -242,4 +242,14 @@ void kvm_inject_undefined(struct kvm_vcpu *vcpu)
>>>>  void kvm_inject_vabt(struct kvm_vcpu *vcpu)
>>>>  {
>>>>vcpu_set_hcr(vcpu, vcpu_get_hcr(vcpu) | HCR_VSE);
>>>> +#ifdef CONFIG_HAS_RAS_EXTENSION
>>>> +  /* If virtual System Error or Asynchronous Abort is set. set
>>>> +   * the virtual exception syndrome information
>>>> +   */
>>>> +  kvm_vcpu_set_vsesr(vcpu, ((kvm_vcpu_get_vsesr(vcpu)
>>>> +  & (~VSESR_ELx_IDS_ISS_MASK))
>>>> +  | (kvm_vcpu_get_hsr(vcpu)
>>>> +  & VSESR_ELx_IDS_ISS_MASK)));
>>>
>>> What is the rational for setting VSESR_EL2 with the EL1 syndrome
>>> information? That doesn't make any sense to me.
>> thanks, I set the VSESR_EL2 using the  EL2 syndrome information, 
>> "kvm_vcpu_get_hsr"
>> return the value of esr_el2, not EL1 syndrome information
> 
> Ah, good point. But that doesn't make it more valid. I still don't see
> anything in the spec that supports this behaviour, and I still propose
> that when RAS is enabled, the VSError injection is mediated by userspace.

I agree, we should be handling RAS errors as firmware-first, and Qemu plays the
part of firmware for a guest. We will probably need to have a KVM API for Qemu
to pend an SError with a specific ESR value.

If this isn't a firmware-first RAS error the existing code will pend an SError
for the guest.


Thanks,

James



Re: [PATCH v2 3/3] arm64: KVM: add guest SEI support

2017-03-20 Thread James Morse
Hi Xie XiuQi,

On 20/03/17 07:48, Xie XiuQi wrote:
> On 2017/3/14 17:45, James Morse wrote:
>> On 08/03/17 04:09, Xie XiuQi wrote:
>>> Add ghes handling for SEI so that the host kernel could parse and
>>> report detailed error information for SEI which occur in the guest
>>> kernel.
>>
>> How does this interact with Synchronous External Abort as a notify method?
>> Both of these take the in_nmi() path through APEI.
>>
>> SError Interrupts are masked during exception processing, so we don't have to
>> worry about them becoming recursive.
> 
> If we use firmware first mode, SEI will be routed to EL3 first, in which mode
> the interrupt cannot be masked by the PSTATE.{A,I,F}.
> 
>> For SEA the firmware has to promise not to invoke another SEA while we are 
>> still
>> processing the first, and SEI will be masked if we took it as an exception.
>>
> 
> Yes, for SEI the firmware should also promise not to invoke another SEI while 
> the
> first SEI processing.

Because the OS can mask the exception while it does the work this should be 
easy.


> But I have a question here, how to handle this case: on the same cpu, another 
> SEA
> is taken while we are processing the first SEA. Should firmware detect this 
> case and
> reset the system directly?

For SEA firmware has to only deliver one at a time. Tyler's comment[0] on this 
was:
Tyler Baicar wrote:
> Firmware that supports the new specs should only generate one of these at a
> time, it will wait for the ack from kernel before sending a second error
> (patch 1 of this series).

I think this is what the read ack register in GHESv2 is for.

What should happen here is up to firmware. System reset sounds sensible, if
possible it would be good if any such firmware could write both sets of error
records somewhere persistent and hand them to the OS via the BERT on the next 
boot.


> The same question is also for SEI.

I think SEI is different because it can be masked. For KVM we already have
kvm_inject_vabt() which sets the VSE bit in HCR_EL2. The hardware will deliver
an SError Interrupt to the guest when it next runs with SError unmasked.

If the guest was already running the APEI SEI code it should have SError masked
until its finished.

This should be the same for firmware, I don't know enough about how physical
SError is triggered.


>> What happens if we take an SEA while processing another event notified via 
>> SEI?
>> Can this happen on your platform? Can someone else build a platform where 
>> this
>> happens? Does the GHES APEI code need to be able to handle this?
> 
> IMO, the system should be panic if we take an SEA while processing another 
> event
> notified via SEI on the same cpu, and it's not necessary to parse the GHES 
> for the
> second SEA. However, if on different cpu, it might be taken simultaneously.

For a different CPU we will spin waiting for the APEI locks, this should all
work properly today.

How can we know that SEA interrupted a CPU that was running the APEI SEI code?

The CPU masks SError when we take an exception so we can't use PSTATE.A to tell.
Judging from the range of PC values or setting some per-cpu variable is likely
to get messy.

I think the cleanest thing is to initially make SEI and SEA mutually exclusive
using Kconfig, then refactor the APEI GHES code to allow interactions like this:

>> If we need to support both at the same time we will need to change Linux's 
>> APEI
>> code to reserve a page of virtual address space per GHES entry, instead of 
>> one
>> for NMI and one for IRQ.

This way it doesn't matter if SEA interrupts SEI. I will have a go at writing 
this.


> We cannot assume that firmware could prevent the SEA notify to OS while SEI is
> processing on different cpu. Because firmware use two different GHES for SEA 
> and SEI.

I agree. We should handle any sequence of APEI notify methods that the hardware
allows to happen.



Thanks,

James

[0] https://www.spinics.net/lists/arm-kernel/msg567837.html


Re: [PATCH V12 05/10] acpi: apei: handle SEA notification type for ARMv8

2017-03-17 Thread James Morse
Hi Tyler,

On 06/03/17 20:44, Tyler Baicar wrote:
> ARM APEI extension proposal added SEA (Synchronous External Abort)
> notification type for ARMv8.
> Add a new GHES error source handling function for SEA. If an error
> source's notification type is SEA, then this function can be registered
> into the SEA exception handler. That way GHES will parse and report
> SEA exceptions when they occur.
> An SEA can interrupt code that had interrupts masked and is treated as
> an NMI. To aid this the page of address space for mapping APEI buffers
> while in_nmi() is always reserved, and ghes_ioremap_pfn_nmi() is
> changed to use the helper methods to find the prot_t to map with in
> the same way as ghes_ioremap_pfn_irq().

> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
> index b25e7cf..b0596ba 100644
> --- a/drivers/acpi/apei/ghes.c
> +++ b/drivers/acpi/apei/ghes.c
> @@ -1023,6 +1075,13 @@ static int ghes_probe(struct platform_device *ghes_dev)
>   pr_warning(GHES_PFX "Generic hardware error source: %d notified 
> via local interrupt is not supported!\n",
>  generic->header.source_id);
>   goto err;
> + case ACPI_HEST_NOTIFY_GPIO:
> + case ACPI_HEST_NOTIFY_SEI:
> + case ACPI_HEST_NOTIFY_GSIV:
> + pr_warn(GHES_PFX "Generic hardware error source: %d notified 
> via notification type %u is not supported\n",
> + generic->header.source_id, generic->header.source_id);
> + rc = -ENOTSUPP;
> + goto err;
>   default:
>   pr_warning(FW_WARN GHES_PFX "Unknown notification type: %u for 
> generic hardware error source: %d\n",
>  generic->notify.type, generic->header.source_id);


This hunk will conflict with Shiju Jose's patch[0] that adds GPIO and GSIV
support. Can we remove it?


Thanks,

James


[0] https://www.spinics.net/lists/linux-acpi/msg72654.html


<    1   2   3   4   5   6   7   8   >