Re: [PATCH v4 4/4] edac: Add support for Amazon's Annapurna Labs L2 EDAC

2019-08-02 Thread James Morse


> + }
> +
> + if (cpumask_empty(&al_l2->cluster_cpus)) {
> + dev_err(dev, "CPU mask is empty for this L2 cache\n");
> + ret = -EINVAL;
> + goto err;
> + }
> +
> + ret = edac_device_add_device(edac_dev);
> + if (ret) {
> + dev_err(dev, "Failed to add L2 edac device\n");
> + goto err;
> + }
> +
> + return 0;
> +
> +err:
> + edac_device_free_ctl_info(edac_dev);
> +
> + return ret;
> +}

With the of_node_put()ing:
Reviewed-by: James Morse 


Thanks,

James


Re: [PATCH] Function stack size and its name mismatch in arm64

2019-07-31 Thread James Morse
Hi Jiping,

(CC: +linux-arm-kernel)

On 31/07/2019 11:57, Steven Rostedt wrote:
> On Wed, 31 Jul 2019 17:04:37 +0800
> Jiping Ma  wrote:

> Note, the subject is not properly written, as it is missing the
> subsystem. In this case, it should start with "tracing: "
> 
> 
>> The PC of one the frame is matched to the next frame function, rather
>> than the function of his frame.
> 
> The above change log doesn't make sense. I have no idea what the actual
> problem is here. Why is this different for arm64 and no one else? Seems
> the bug is with the stack logic code in arm64 not here.

Please copy the mailing list for the arm64 arch code too.

Is this thing a recent change? arm64's stacktrace code gained some better 
protection for
loops at -rc2.


Thanks,

James


>> diff --git a/kernel/trace/trace_stack.c b/kernel/trace/trace_stack.c
>> index 5d16f73898db..ed80b95abf06 100644
>> --- a/kernel/trace/trace_stack.c
>> +++ b/kernel/trace/trace_stack.c
>> @@ -40,16 +40,28 @@ static void print_max_stack(void)
>>  
>>  pr_emerg("DepthSize   Location(%d entries)\n"
>> "-   \n",
>> +#ifdef CONFIG_ARM64
> 
> We do not allow arch specific defines in generic code. Otherwise this
> would blow up and become unmaintainable. Not to mention it makes the
> code ugly and hard to follow.
> 
> Please explain the problem better. I'm sure there's much better ways to
> solve this than this patch.
> 
> Thanks,
> 
> -- Steve
> 
> 
> 
>> +   stack_trace_nr_entries - 1);
>> +#else
>> stack_trace_nr_entries);
>> -
>> +#endif
>> +#ifdef CONFIG_ARM64
>> +for (i = 1; i < stack_trace_nr_entries; i++) {
>> +#else
>>  for (i = 0; i < stack_trace_nr_entries; i++) {
>> +#endif
>>  if (i + 1 == stack_trace_nr_entries)
>>  size = stack_trace_index[i];
>>  else
>>  size = stack_trace_index[i] - stack_trace_index[i+1];
>>  
>> +#ifdef CONFIG_ARM64
>> +pr_emerg("%3ld) %8d   %5d   %pS\n", i-1, stack_trace_index[i],
>> +size, (void *)stack_dump_trace[i-1]);
>> +#else
>>  pr_emerg("%3ld) %8d   %5d   %pS\n", i, stack_trace_index[i],
>>  size, (void *)stack_dump_trace[i]);
>> +#endif
>>  }
>>  }
>>  
>> @@ -324,8 +336,11 @@ static int t_show(struct seq_file *m, void *v)
>>  seq_printf(m, "DepthSize   Location"
>> "(%d entries)\n"
>> "-   \n",
>> +#ifdef CONFIG_ARM64
>> +   stack_trace_nr_entries - 1);
>> +#else
>> stack_trace_nr_entries);
>> -
>> +#endif
>>  if (!stack_tracer_enabled && !stack_trace_max_size)
>>  print_disabled(m);
>>  
>> @@ -334,6 +349,10 @@ static int t_show(struct seq_file *m, void *v)
>>  
>>  i = *(long *)v;
>>  
>> +#ifdef CONFIG_ARM64
>> +if (i == 0)
>> +return 0;
>> +#endif
>>  if (i >= stack_trace_nr_entries)
>>  return 0;
>>  
>> @@ -342,9 +361,14 @@ static int t_show(struct seq_file *m, void *v)
>>  else
>>  size = stack_trace_index[i] - stack_trace_index[i+1];
>>  
>> +#ifdef CONFIG_ARM64
>> +seq_printf(m, "%3ld) %8d   %5d   ", i-1, stack_trace_index[i], size);
>> +trace_lookup_stack(m, i-1);
>> +#else
>>  seq_printf(m, "%3ld) %8d   %5d   ", i, stack_trace_index[i], size);
>>  
>>  trace_lookup_stack(m, i);
>> +#endif
>>  
>>  return 0;
>>  }
> 



Re: [PATCH] arm64/kexec: Use consistent convention of initializing 'kxec_buf.mem' with KEXEC_BUF_MEM_UNKNOWN

2019-07-26 Thread James Morse
Hi Bhupesh,

On 11/07/2019 12:57, Bhupesh Sharma wrote:
> With commit b6664ba42f14 ("s390, kexec_file: drop arch_kexec_mem_walk()"),
> we introduced the KEXEC_BUF_MEM_UNKNOWN macro. If kexec_buf.mem is set
> to this value, kexec_locate_mem_hole() will try to allocate free memory.
> 
> While other arch(s) like s390 and x86_64 already use this macro to
> initialize kexec_buf.mem with, arm64 uses an equivalent value of 0.
> Replace it with KEXEC_BUF_MEM_UNKNOWN, to keep the convention of
> initializing 'kxec_buf.mem' consistent across various archs.

Reviewed-by: James Morse 


Thanks,

James


Re: [PATCH v3 2/4] edac: Add support for Amazon's Annapurna Labs L1 EDAC

2019-07-26 Thread James Morse
Hi Hanna,

On 15/07/2019 14:24, Hanna Hawa wrote:
> Adds support for Amazon's Annapurna Labs L1 EDAC driver to detect and
> report L1 errors.

> diff --git a/drivers/edac/al_l1_edac.c b/drivers/edac/al_l1_edac.c
> new file mode 100644
> index 000..70510ea
> --- /dev/null
> +++ b/drivers/edac/al_l1_edac.c
> @@ -0,0 +1,156 @@

> +#include 

You need <linux/smp.h> for on_each_cpu().

> +#include "edac_device.h"
> +#include "edac_module.h"

You need <asm/sysreg.h> for the sys_reg() macro. The ARCH_ALPINE dependency 
doesn't stop
this from being built on 32bit arm, where this sys_reg() won't work/exist.

[...]

> +static void al_l1_edac_cpumerrsr(void *arg)
> +{
> + struct edac_device_ctl_info *edac_dev = arg;
> + int cpu, i;
> + u32 ramid, repeat, other, fatal;
> + u64 val = read_sysreg_s(ARM_CA57_CPUMERRSR_EL1);
> + char msg[AL_L1_EDAC_MSG_MAX];
> + int space, count;
> + char *p;
> + if (!(FIELD_GET(ARM_CA57_CPUMERRSR_VALID, val)))
> + return;
> + space = sizeof(msg);
> + p = msg;
> + count = snprintf(p, space, "CPU%d L1 %serror detected", cpu,
> +  (fatal) ? "Fatal " : "");
> + p += count;
> + space -= count;

snprintf() will return the number of characters it would have generated, even 
if that is
more than space. If this happens, space becomes negative, p points outside msg[] 
and msg[]
isn't NULL terminated...

It looks like you want scnprintf(), which returns the number of characters 
written to buf
instead. (I don't see how 256 characters would be printed by this code)
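
(Untested illustration of the difference, with a deliberately tiny buffer:)

| char msg[8];
| int space = sizeof(msg);
| int count;
|
| /* "CPU12 L1 error" needs 14 characters, only 7 plus the NUL fit */
| count = snprintf(msg, space, "CPU%d L1 error", 12);   /* returns 14 */
| count = scnprintf(msg, space, "CPU%d L1 error", 12);  /* returns 7 */

With snprintf() the 'p += count; space -= count;' arithmetic walks past the end
of msg[] and space goes negative; with scnprintf() it stays inside the buffer.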


> + switch (ramid) {
> + case ARM_CA57_L1_I_TAG_RAM:
> + count = snprintf(p, space, " RAMID='L1-I Tag RAM'");
> + break;
> + case ARM_CA57_L1_I_DATA_RAM:
> + count = snprintf(p, space, " RAMID='L1-I Data RAM'");
> + break;
> + case ARM_CA57_L1_D_TAG_RAM:
> + count = snprintf(p, space, " RAMID='L1-D Tag RAM'");
> + break;
> + case ARM_CA57_L1_D_DATA_RAM:
> + count = snprintf(p, space, " RAMID='L1-D Data RAM'");
> + break;
> + case ARM_CA57_L2_TLB_RAM:
> + count = snprintf(p, space, " RAMID='L2 TLB RAM'");
> + break;
> + default:
> + count = snprintf(p, space, " RAMID='unknown'");
> + break;
> + }
> +
> + p += count;
> + space -= count;
> + count = snprintf(p, space,
> +  " repeat=%d, other=%d (CPUMERRSR_EL1=0x%llx)",
> +  repeat, other, val);
> +
> + for (i = 0; i < repeat; i++) {
> + if (fatal)
> + edac_device_handle_ue(edac_dev, 0, 0, msg);
> + else
> + edac_device_handle_ce(edac_dev, 0, 0, msg);
> + }
> +
> + write_sysreg_s(0, ARM_CA57_CPUMERRSR_EL1);

Writing 0 just after you've read the value would minimise the window where 
repeat could
have increased behind your back, or another error was counted as other, when it 
could have
been reported more accurately.
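
(Roughly what I mean, untested, field names taken from your patch: read the
register, clear it straight away, then decode and report from the saved copy:)

| u64 val = read_sysreg_s(ARM_CA57_CPUMERRSR_EL1);
|
| if (!(FIELD_GET(ARM_CA57_CPUMERRSR_VALID, val)))
|         return;
|
| /* Clear immediately so a new error isn't merged into this record */
| write_sysreg_s(0, ARM_CA57_CPUMERRSR_EL1);
|
| /* ... decode ramid/repeat/other/fatal from 'val' and report as before ... */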


> +}


> +static int al_l1_edac_probe(struct platform_device *pdev)
> +{
> + struct edac_device_ctl_info *edac_dev;
> + struct device *dev = &pdev->dev;
> + int ret;
> +
> + edac_dev = edac_device_alloc_ctl_info(0, (char *)dev_name(dev), 1, "L",
> +   1, 1, NULL, 0,
> +   edac_device_alloc_index());
> + if (IS_ERR(edac_dev))

edac_device_alloc_ctl_info() returns NULL, or dev_ctl, which comes from 
kzalloc(). I think
you need to check for NULL here, IS_ERR() only catches the -errno range. (there 
is an
IS_ERR_OR_NULL() if you really needed both)


> + return -ENOMEM;
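
Something like (untested):

| edac_dev = edac_device_alloc_ctl_info(0, (char *)dev_name(dev), 1, "L",
|                                       1, 1, NULL, 0,
|                                       edac_device_alloc_index());
| if (!edac_dev)
|         return -ENOMEM;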


With the header-includes and edac_device_alloc_ctl_info() NULL check:
Reviewed-by: James Morse 


Thanks,

James


Re: [PATCH v9 0/8] EDAC drivers for Armada XP L2 and DDR

2019-07-26 Thread James Morse
Hi Chris,

On 12/07/2019 04:48, Chris Packham wrote:
> I still seem to be struggling to get this on anyone's radar.

Whose radar does it need to cross?


> The Reviews/Acks have been given so this should be good to go in via the ARM
> tree as planned.
> 
> http://lists.infradead.org/pipermail/linux-arm-kernel/2017-August/525561.html

For your v8 I took this to mean this series was done!

If nothing has changed with Boris and Russell's decision (it was two years 
ago),
details of the patch system are here:

https://lore.kernel.org/linux-arm-kernel/20190624142346.pxljv3m4npatd...@shell.armlinux.org.uk/


Thanks,

James


Re: [PATCH v9 8/8] EDAC: armada_xp: Add support for more SoCs

2019-07-26 Thread James Morse
Hi Chris,

On 12/07/2019 04:49, Chris Packham wrote:
> The Armada 38x and other integrated SoCs use a reduced pin count so the
> width of the SDRAM interface is smaller than the Armada XP SoCs. This
> means that the definition of "full" and "half" width is reduced from
> 64/32 to 32/16.

> diff --git a/drivers/edac/armada_xp_edac.c b/drivers/edac/armada_xp_edac.c
> index 3759a4fbbdee..7f227bdcbc84 100644
> --- a/drivers/edac/armada_xp_edac.c
> +++ b/drivers/edac/armada_xp_edac.c
> @@ -332,6 +332,11 @@ static int axp_mc_probe(struct platform_device *pdev)
>  
>   axp_mc_read_config(mci);
>  
> + /* These SoCs have a reduced width bus */
> + if (of_machine_is_compatible("marvell,armada380") ||
> + of_machine_is_compatible("marvell,armadaxp-98dx3236"))
> + drvdata->width /= 2;

So the hardware's SDRAM_CONFIG_BUS_WIDTH value is wrong? Yuck.

Is it too late for the DTs on these two systems to provide a DT version of the 
'bus_width'
to override the hardware's mis-advertised value?

This way you don't need to grow this list.
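
(The override could be as simple as this, untested, and the property name here
is only an example that would need a binding update:)

| u32 width;
|
| /* Let the DT override the value mis-advertised by SDRAM_CONFIG_BUS_WIDTH */
| if (!of_property_read_u32(pdev->dev.of_node, "bus-width", &width))
|         drvdata->width = width;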

Acked-by: James Morse 


Thanks,

James


Re: [RFC v1 0/4] arm64: MMU enabled kexec kernel relocation

2019-07-26 Thread James Morse
Hi Pavel,

On 17/07/2019 20:13, Pavel Tatashin wrote:
>> After a quick skim:
>>
>> This will map 'nomap' regions of memory with cacheable attributes. This is a 
>> non-starter.
>> These regions were described by firmware as having content that was/is 
>> written with
>> different attributes. The attributes must match whenever it is mapped, 
>> otherwise we have a
>> loss of coherency. Mapping this stuff as cacheable means the CPU can 
>> prefetch it into the
>> cache whenever it likes.
> 
>> It may be important that we do not ever map some of these regions, even 
>> though its
>> described as memory. On AMD-Seattle the bottom page of memory is reserved by 
>> firmware for
>> its own use; it is made secure-only, and any access causes an
>> external-abort/machine-check. UEFI describes this as 'Reserved', and we 
>> preserve this in
>> the kernel as 'nomap'. The equivalent DT support uses memreserve, possibly 
>> with the
>> 'nomap' attribute.
>>
>> Mapping a 'new'/unknown region with cacheable attributes can never be safe, 
>> even if we
>> trusted kexec-tool to only write the kernel to memory. The host may be using 
>> a bigger page
>> size causing more memory to become cacheable than was intended.
>> Linux's EFI support rounds the UEFI memory map to the largest support page 
>> size, (and
>> winges about firmware bugs).
>> If we're allowing kexec to load images in a region not described as 
>> IORESOURCE_SYSTEM_RAM,
>> that is a bug we should fix.
> 
> We are allowing this. If you consider this to be a bug, I will fix it,
> and this will actually simplify the idmap page table. User will
> receive an error during kexec load if a request is made to load into
> !IORESOURCE_SYSTEM_RAM region.

I consider this a bug, but we can see what others think.
This suggests kexec-tools can open /proc/iomem, find a likely looking gap, and 
try to load
the new kernel between two platform devices.


>> The only way to do this properly is to copy the linear mapping. The arch 
>> code has lots of
>> complex code to generate it correctly at boot, we do not want to duplicate 
>> it.
>> (this is why hibernate copies the linear mapping)
> 
> As I understand, you would like to take a copy of idmap page table,
> and add entries only for segment
> sources and destinations into the new page table?

I don't think there is a need to idmap memory at all. We should copy the linear 
map so you
know you won't overwrite its page tables as part of loading the new kernel.


> If so, there is a slight problem: arch hook machine_kexec_prepare() is
> called prior to loading segments from userland. We can solve this by
> adding another hook that is called after kimage_terminate().

Yes, all this would need doing as machine_kexec() runs. We preferably need to 
allocate
memory in this path, or at least have a bitmap of what we can/can't overwrite.


>> These patches do not remove the running page tables from TTBR1. As you 
>> overwrite the live
>> page tables you will corrupt the state of the CPU. The page-table walker may 
>> access things
>> that aren't memory, cache memory that shouldn't be cached (see above), and 
>> allocate
>> conflicting entries in the TLB.
> 
> Indeed. However, I was following what is done in create_safe_exec_page():
> https://soleen.com/source/xref/linux/arch/arm64/kernel/hibernate.c?r=af873fce#263
> 
> ttbr1 is not removed there. Am I missing something, or is not yet
> configured there?

Hibernate maps a single executable page in ttbr0_el1 that holds its relocation 
code.
The relocation code then switches ttbr1_el1 to point to the copy of the linear 
map. See
the 'break_before_make_ttbr_switch' macro in swsusp_arch_suspend_exit().


> I will set ttbr1 to zero page.
> 
>> You cannot use the mm page table helpers to build an idmap on arm64. The mm 
>> page table
>> helpers have a compile-time VA_BITS, and we support systems where there is 
>> no memory below
>> 1<<(VA_BITS - 1). If you boot one of these with a 39bit VA kernel,
>> the idmap will have more page table levels than the page table helpers can 
>> build. This is
>> why there are special helpers to load the idmap, and twiddle TCR_EL1.T0SZ.
>> You already need to copy the linear-map, so using an idmap is extra work. 
>> You want to work
>> with linear-map addresses, you probably need to add the field to the 
>> appropriate structure.
> 
> OK. Makes sense. I will do the way hibernate setup this table. I was
> indeed following x86, hoping that eventually it would be possible to
> unite: kasan, hibernate, and kexec implementation of this page table.

Our kasan and hibernate code already went a different way. I doubt we can bring 
them back
in to look like x86, they have different problems to solve.


>> The kexec relocation code still runs at EL2. You can't use a copy of the 
>> linear map here
>> as there is only one TTBR on v8.0, and you'd need to setup EL2 as its been 
>> torn back to
>> the hyp-stub.
> 
> As I understand normally on baremetal kexec runs at EL1 not EL2. On my
> machine is_kernel_in_hyp_mode() 

Re: [PATCHv2] EDAC, altera: Move Stratix10 SDRAM ECC to peripheral

2019-07-25 Thread James Morse
Hi Thor,

On 12/07/2019 19:28, thor.tha...@linux.intel.com wrote:
> From: Thor Thayer 
> 
> ARM32 SoCFPGAs had separate IRQs for SDRAM. ARM64 SoCFPGAs
> send all DBEs to SError so filtering by source is necessary.
> 
> The Stratix10 SDRAM ECC is a better match with the generic
> Altera peripheral ECC framework because the linked list can
> be searched to find the ECC block offset and printout
> the DBE Address.


> diff --git a/drivers/edac/altera_edac.c b/drivers/edac/altera_edac.c
> index c2e693e34d43..09a80b53acea 100644
> --- a/drivers/edac/altera_edac.c
> +++ b/drivers/edac/altera_edac.c

> @@ -2231,13 +2275,15 @@ static int altr_edac_a10_probe(struct platform_device 
> *pdev)
>   of_device_is_compatible(child, "altr,socfpga-dma-ecc") ||
>   of_device_is_compatible(child, "altr,socfpga-usb-ecc") ||
>   of_device_is_compatible(child, "altr,socfpga-qspi-ecc") ||
> +#ifdef CONFIG_EDAC_ALTERA_SDRAM
> + of_device_is_compatible(child, "altr,sdram-edac-s10") ||
> +#endif
>   of_device_is_compatible(child, "altr,socfpga-sdmmc-ecc"))

I'm just curious: This list looks suspiciously like the 
altr_edac_a10_device_of_match[]
list. Is there a reason it can't use of_match_device() here?

>  
>   altr_edac_a10_device_add(edac, child);
>  
>  #ifdef CONFIG_EDAC_ALTERA_SDRAM
> - else if ((of_device_is_compatible(child, 
> "altr,sdram-edac-a10")) ||
> -  (of_device_is_compatible(child, 
> "altr,sdram-edac-s10")))
> + else if (of_device_is_compatible(child, "altr,sdram-edac-a10"))
>   of_platform_populate(pdev->dev.of_node,
>altr_sdram_ctrl_of_match,
>NULL, >dev);


Acked-by: James Morse 


Thanks,

James


Re: [PATCH 1/1] efi: cper: print AER info of PCIe fatal error

2019-07-25 Thread James Morse
Hi,

On 12/07/2019 03:20, Xiaofei Tan wrote:
> AER info of PCIe fatal error is not printed in the current driver.
> Because APEI driver will panic directly for fatal error, and can't
> run to the place of printing AER info.
> 
> An example log is as following:
> [ 3157.655028] {763}[Hardware Error]: Hardware error from APEI Generic 
> Hardware Error Source: 11
> [ 3157.663610] {763}[Hardware Error]: event severity: fatal
> [ 3157.663612] {763}[Hardware Error]:  Error 0, type: fatal
> [ 3157.663614] {763}[Hardware Error]:   section_type: PCIe error
> [ 3157.680328] {763}[Hardware Error]:   port_type: 0, PCIe end point
> [ 3157.680329] {763}[Hardware Error]:   version: 4.0
> [ 3157.680332] {763}[Hardware Error]:   command: 0x, status: 0x0010
> [ 3157.698757] {763}[Hardware Error]:   device_id: :82:00.0
> [ 3157.698758] {763}[Hardware Error]:   slot: 0
> [ 3157.698759] {763}[Hardware Error]:   secondary_bus: 0x00
> [ 3157.698760] {763}[Hardware Error]:   vendor_id: 0x8086, device_id: 0x10fb
> [ 3157.698761] {763}[Hardware Error]:   class_code: 02
> [ 3157.698825] Kernel panic - not syncing: Fatal hardware error!
> 
> This issue was imported by the patch, '37448adfc7ce ("aerdrv: Move
> cper_print_aer() call out of interrupt context")'. To fix this issue,
> this patch adds print of AER info in cper_print_pcie() for fatal error.
> 
> Here is the example log after this patch applied:
> [ 7032.893566] {24}[Hardware Error]: Hardware error from APEI Generic 
> Hardware Error Source: 10
> [ 7032.901965] {24}[Hardware Error]: event severity: fatal
> [ 7032.907166] {24}[Hardware Error]:  Error 0, type: fatal
> [ 7032.912366] {24}[Hardware Error]:   section_type: PCIe error
> [ 7032.917998] {24}[Hardware Error]:   port_type: 0, PCIe end point
> [ 7032.923974] {24}[Hardware Error]:   version: 4.0
> [ 7032.928569] {24}[Hardware Error]:   command: 0x0546, status: 0x4010
> [ 7032.934806] {24}[Hardware Error]:   device_id: :01:00.0
> [ 7032.940352] {24}[Hardware Error]:   slot: 0
> [ 7032.944514] {24}[Hardware Error]:   secondary_bus: 0x00
> [ 7032.949714] {24}[Hardware Error]:   vendor_id: 0x15b3, device_id: 0x1019
> [ 7032.956381] {24}[Hardware Error]:   class_code: 02
> [ 7032.961495] {24}[Hardware Error]:   aer_uncor_status: 0x0004, 
> aer_uncor_mask: 0x
> [ 7032.969891] {24}[Hardware Error]:   aer_uncor_severity: 0x00062010
> [ 7032.976042] {24}[Hardware Error]:   TLP Header: 00c0 0101 0001 
> 
> [ 7032.983663] Kernel panic - not syncing: Fatal hardware error!

> Fixes: 37448adfc7ce ("aerdrv: Move cper_print_aer() call out of
> interrupt context")

(Please put this all on one line)


> diff --git a/drivers/firmware/efi/cper.c b/drivers/firmware/efi/cper.c
> index 8fa977c..bf8600d 100644
> --- a/drivers/firmware/efi/cper.c
> +++ b/drivers/firmware/efi/cper.c
> @@ -390,6 +390,19 @@ static void cper_print_pcie(const char *pfx, const 
> struct cper_sec_pcie *pcie,
>   printk(
>   "%s""bridge: secondary_status: 0x%04x, control: 0x%04x\n",
>   pfx, pcie->bridge.secondary_status, pcie->bridge.control);

It may be worth a comment explaining why we only do this for fatal errors. 
Something like:
| /* Fatal errors call __ghes_panic() before the AER handler gets to print this 
*/


> + if (pcie->validation_bits & CPER_PCIE_VALID_AER_INFO &&
> + gdata->error_severity & CPER_SEV_FATAL) {
> + struct aer_capability_regs *aer;
> +
> + aer = (struct aer_capability_regs *)pcie->aer_info;
> + printk("%saer_uncor_status: 0x%08x, aer_uncor_mask: 0x%08x\n",

The convention in the rest of the file is for the prefix format string to be 
separate. i.e:
| "%s""aer_uncor_status: ..."

Could it be the same for consistency?
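
(Untested, just with the prefix split out as elsewhere in this file:)

| printk("%s""aer_uncor_status: 0x%08x, aer_uncor_mask: 0x%08x\n",
|        pfx, aer->uncor_status, aer->uncor_mask);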

> +pfx, aer->uncor_status, aer->uncor_mask);
> + printk("%saer_uncor_severity: 0x%08x\n",
> +pfx, aer->uncor_severity);
> + printk("%sTLP Header: %08x %08x %08x %08x\n", pfx,
> +aer->header_log.dw0, aer->header_log.dw1,
> +aer->header_log.dw2, aer->header_log.dw3);
> + }
>  }

Regardless,
Reviewed-by: James Morse 


Thanks,

James


Re: [PATCH v2 3/4] arm64: Make debug exception handlers visible from RCU

2019-07-23 Thread James Morse
Hi,

On 22/07/2019 08:48, Masami Hiramatsu wrote:
> Make debug exceptions visible from RCU so that synchronize_rcu()
> correctly track the debug exception handler.
> 
> This also introduces sanity checks for user-mode exceptions as same
> as x86's ist_enter()/ist_exit().
> 
> The debug exception can interrupt in idle task. For example, it warns
> if we put a kprobe on a function called from idle task as below.
> The warning message showed that the rcu_read_lock() caused this
> problem. But actually, this means the RCU is lost the context which
> is already in NMI/IRQ.

> So make debug exception visible to RCU can fix this warning.

> diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
> index 9568c116ac7f..a6b244240db6 100644
> --- a/arch/arm64/mm/fault.c
> +++ b/arch/arm64/mm/fault.c
> @@ -777,6 +777,42 @@ void __init hook_debug_fault_code(int nr,
>   debug_fault_info[nr].name   = name;
>  }
>  
> +/*
> + * In debug exception context, we explicitly disable preemption.
> + * This serves two purposes: it makes it much less likely that we would
> + * accidentally schedule in exception context and it will force a warning
> + * if we somehow manage to schedule by accident.
> + */
> +static void debug_exception_enter(struct pt_regs *regs)
> +{
> + if (user_mode(regs)) {
> + RCU_LOCKDEP_WARN(!rcu_is_watching(), "entry code didn't wake 
> RCU");

Would moving entry.S's context_tracking_user_exit() call to be before 
do_debug_exception()
also fix this?

I don't know the reason it's done 'after' debug exception handling. It's always 
been like
this: commit 6c81fe7925cc4c42 ("arm64: enable context tracking").


> + } else {
> + /*
> +  * We might have interrupted pretty much anything.  In
> +  * fact, if we're a debug exception, we can even interrupt
> +  * NMI processing.

> +  * We don't want in_nmi() to return true,
> +  * but we need to notify RCU.

How come? If you interrupted an SError or pseudo-nmi, it already is. Those 
paths should
all be painted no-kprobe, but I'm sure there are gaps. The hw-breakpoints can 
almost
certainly hook them.


> +  */
> + rcu_nmi_enter();

Can we interrupt printk()? Do we need printk_nmi_enter()? ... What about ftrace?

Because SError and pseudo-nmi can interrupt interrupt-masked code, we describe 
them as
NMI. The only difference here is these exceptions are synchronous.


I suspect we should make these debug exceptions nmi for EL1. We can then use 
this for the
kprobe-re-entrance stuff so the pre/post hooks don't get run if they 
interrupted something
also described as NMI.


> + }
> +
> + preempt_disable();
> +
> + /* This code is a bit fragile.  Test it. */
> + RCU_LOCKDEP_WARN(!rcu_is_watching(), "exception_enter didn't work");
> +}
> +NOKPROBE_SYMBOL(debug_exception_enter);


Thanks,

James


Re: [PATCH v2 2/4] arm64: unwind: Prohibit probing on return_address()

2019-07-23 Thread James Morse
Hi,

On 22/07/2019 08:48, Masami Hiramatsu wrote:
> Prohibit probing on return_address() and subroutines which
> is called from return_address(), since the it is invoked from
> trace_hardirqs_off() which is also kprobe blacklisted.

(Nits: "which are called" and "since it is")


> diff --git a/arch/arm64/kernel/return_address.c 
> b/arch/arm64/kernel/return_address.c
> index b21cba90f82d..7f8a143268b0 100644
> --- a/arch/arm64/kernel/return_address.c
> +++ b/arch/arm64/kernel/return_address.c
> @@ -8,6 +8,7 @@
>  
>  #include 
>  #include 
> +#include 
>  
>  #include 
>  #include 
> @@ -17,7 +18,7 @@ struct return_address_data {
>   void *addr;
>  };
>  
> -static int save_return_addr(struct stackframe *frame, void *d)
> +static nokprobe_inline int save_return_addr(struct stackframe *frame, void 
> *d)

This nokprobe_inline ends up as __always_inline if kprobes is enabled.
What do we expect the compiler to do with this? save_return_addr is passed as a
function-pointer to walk_stackframe()... I don't see how the compiler can 
inline it!
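
(If save_return_addr() can't be inlined because its address is taken, I'd have
expected it to need the symbol blacklisting instead, e.g., untested:)

| static int save_return_addr(struct stackframe *frame, void *d)
| {
|         ...
| }
| NOKPROBE_SYMBOL(save_return_addr);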

This would be needed for on_accessible_stack().
Should we cover ftrace_graph_get_ret_stack(), or is that already in hand?


>  {
>   struct return_address_data *data = d;
>  
> @@ -52,3 +53,4 @@ void *return_address(unsigned int level)
>   return NULL;
>  }
>  EXPORT_SYMBOL_GPL(return_address);
> +NOKPROBE_SYMBOL(return_address);


Thanks,

James


Re: [PATCH v2 1/4] arm64: kprobes: Recover pstate.D in single-step exception handler

2019-07-23 Thread James Morse
Hi!

On 22/07/2019 08:48, Masami Hiramatsu wrote:
> On arm64, if a nested kprobes hit, it can crash the kernel with below
> error message.
> 
> [  152.118921] Unexpected kernel single-step exception at EL1
> 
> This is because commit 7419333fa15e ("arm64: kprobe: Always clear
> pstate.D in breakpoint exception handler") unmask pstate.D for
> doing single step but does not recover it after single step in
> the nested kprobes.

> That is correct *unless* any nested kprobes
> (single-stepping) runs inside other kprobes user handler.

(I don't think this is correct, it's just usually invisible as PSTATE.D is 
normally clear)


> When the 1st kprobe hits, do_debug_exception() will be called. At this
> point, debug exception (= pstate.D) must be masked (=1). When the 2nd
>  (nested) kprobe is hit before single-step of the first kprobe, it
> unmask debug exception (pstate.D = 0) and return.
> Then, when the 1st kprobe setting up single-step, it saves current
> DAIF, mask DAIF, enable single-step, and restore DAIF.
> However, since "D" flag in DAIF is cleared by the 2nd kprobe, the
> single-step exception happens soon after restoring DAIF.

This is pretty complicated. Just to check I've understood this properly:
Stepping on a kprobe in a kprobe-user's pre_handler will cause the remainder of 
the
handler (the first one) to run with PSTATE.D clear. Once we enable single-step, 
we start
stepping the debug handler, and will never step the original kprobe'd 
instruction.

This is describing the most complicated way that this problem shows up! (I 
agree it's also
the worst)

I can get this to show up with just one kprobe. (function/file names here are 
meaningless):

| static int wibble(struct seq_file *m, void *discard)
| {
|unsigned long d, flags;
|
|flags = local_daif_save();
|
|kprobe_me();
|d = read_sysreg(daif);
|local_daif_restore(flags);
|
|seq_printf(m, "%lx\n", d);
|
|return 0;
| }

plumbed into debugfs, then kicked using the kprobe_example module:
| root@adam:/sys/kernel/debug# cat wibble
| 3c0

| root@adam:/sys/kernel/debug# insmod ~morse/kprobe_example.ko symbol=kprobe_me
| [   69.478098] Planted kprobe at [..]
| root@adam:/sys/kernel/debug# cat wibble
| [   71.478935]  pre_handler: p->addr = [..], pc = [..], pstate = 
0x63c5
| [   71.488942]  post_handler: p->addr = [..], pstate = 0x61c5
| 1c0

| root@adam:/sys/kernel/debug#

This is problem for any code that had debug masked, not just kprobes.

Can we start the commit-message with the simplest description of the problem: 
kprobes
manipulates the interrupted PSTATE for single step, and doesn't restore it.

(trying to understand this bug through kprobe's interaction with itself is 
hard!)


> To solve this issue, this stores all DAIF bits and restore it
> after single stepping.


> diff --git a/arch/arm64/kernel/probes/kprobes.c 
> b/arch/arm64/kernel/probes/kprobes.c
> index bd5dfffca272..348e02b799a2 100644
> --- a/arch/arm64/kernel/probes/kprobes.c
> +++ b/arch/arm64/kernel/probes/kprobes.c
> @@ -29,6 +29,8 @@
>  
>  #include "decode-insn.h"
>  
> +#define PSR_DAIF_MASK (PSR_D_BIT | PSR_A_BIT | PSR_I_BIT | PSR_F_BIT)

We should probably move this to daifflags.h. Its going to be useful to other 
series too.


Patch looks good!
Reviewed-by: James Morse 
Tested-by: James Morse 

(I haven't tried to test the nested kprobes case...)


Thanks,

James


Re: [PATCH 1/3] arm64: kprobes: Recover pstate.D in single-step exception handler

2019-07-19 Thread James Morse
Hi!

On 18/07/2019 06:43, Masami Hiramatsu wrote:
> On arm64, if a nested kprobes hit, it can crash the kernel with below
> error message.
> 
> [  152.118921] Unexpected kernel single-step exception at EL1
> 
> This is because commit 7419333fa15e ("arm64: kprobe: Always clear
> pstate.D in breakpoint exception handler") clears pstate.D always in
> the nested kprobes. That is correct *unless* any nested kprobes
> (single-stepping) runs inside other kprobes (including kprobes in
>  user handler).

kprobes probing kprobes!? ... why do we support this?

We treat 'debug' as our highest exception level, it can interrupt pNMI and 
RAS-errors.
Letting it loop doesn't sound like a good idea.


> When the 1st kprobe hits, do_debug_exception() will be called. At this
> point, debug exception (= pstate.D) must be masked (=1).

> When the 2nd (nested) kprobe is hit before single-step of the first kprobe,

How does this happen?
I guess the kprobe-helper-function gets called in debug context, but surely you 
can't
kprobe a kprobe-helper-function? What stops this going in a loop?


> it modifies debug exception clear (pstate.D = 0).

After taking the first BRK, DAIF=0xf, everything is masked. When you take the 
second BRK
this shouldn't change.

Those spsr_set_debug_flag() calls are modifying the spsr in the regs structure, 
they only
become PSTATE when we eret for single-step.


> Then, when the 1st kprobe setting up single-step, it saves current
> DAIF, mask DAIF, enable single-step, and restore DAIF.

> However, since "D" flag in DAIF is cleared by the 2nd kprobe, the
> single-step exception happens soon after restoring DAIF.

PSTATE.D bit clearing should only be effective for the duration of the 
single-step.


> To solve this issue, this refers saved pstate register to check the
> previous pstate.D and recover it if needed.

(This sounds like undoing something that shouldn't have happened in the first 
place)


> diff --git a/arch/arm64/kernel/probes/kprobes.c 
> b/arch/arm64/kernel/probes/kprobes.c
> index bd5dfffca272..6e1dc0bb4c82 100644
> --- a/arch/arm64/kernel/probes/kprobes.c
> +++ b/arch/arm64/kernel/probes/kprobes.c
> @@ -201,12 +201,14 @@ spsr_set_debug_flag(struct pt_regs *regs, int mask)
>   * interrupt occurrence in the period of exception return and  start of
>   * out-of-line single-step, that result in wrongly single stepping
>   * into the interrupt handler.
> + * This also controls debug flag, so that we can refer the saved pstate.
>   */
>  static void __kprobes kprobes_save_local_irqflag(struct kprobe_ctlblk *kcb,
>   struct pt_regs *regs)
>  {
>   kcb->saved_irqflag = regs->pstate;
>   regs->pstate |= PSR_I_BIT;
> + spsr_set_debug_flag(regs, 0);

(Nit: this is the only caller of spsr_set_debug_flag(), as we're modifying 
regs->pstate
directly here, can we lose the helper and just manipulate regs->pstate?)
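
(i.e., assuming spsr_set_debug_flag(regs, 0) just clears the bit, something like:)
| regs->pstate &= ~PSR_D_BIT;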

>  }
>  
>  static void __kprobes kprobes_restore_local_irqflag(struct kprobe_ctlblk 
> *kcb,
> @@ -245,15 +251,12 @@ static void __kprobes setup_singlestep(struct kprobe *p,
>   kcb->kprobe_status = KPROBE_HIT_SS;
>   }
>
> -
>   if (p->ainsn.api.insn) {
>   /* prepare for single stepping */
>   slot = (unsigned long)p->ainsn.api.insn;
>
>   set_ss_context(kcb, slot);  /* mark pending ss */
>
> - spsr_set_debug_flag(regs, 0);
> -
>   /* IRQs and single stepping do not mix well. */
>   kprobes_save_local_irqflag(kcb, regs);
>   kernel_enable_single_step(regs);

These two hunks look like cleanup, could we do this separately from a fix for 
stable?



> @@ -216,6 +218,10 @@ static void __kprobes 
> kprobes_restore_local_irqflag(struct kprobe_ctlblk *kcb,
>   regs->pstate |= PSR_I_BIT;
>   else
>   regs->pstate &= ~PSR_I_BIT;
> +
> + /* Recover pstate.D mask if needed */
> + if (kcb->saved_irqflag & PSR_D_BIT)
> + spsr_set_debug_flag(regs, 1);
>  }

Ugh. .. I get it ..

I think the simplest summary of the problem is:
Kprobes unmasks debug exceptions for single-step, then leaves them unmasked 
when the
probed function is restarted.

I'd like to know more about this nested case, but I don't think it's the 
simplest example
of this problem.
The commit message is describing both the interrupted and running PSTATE as 
PSTATE. I
think it would be clearer if you called the interrupted one SPSR (saved pstate 
register).
That's the value in the regs structure.


Please don't re-manipulate the flags, it's overly verbose and we've already got 
this wrong
once! We should just blindly restore the DAIF setting we had before as it's 
simpler.

Could we change kprobes_save_local_irqflag() to save the DAIF bits of pstate:
| kcb->saved_irqflag = regs->pstate & DAIF_MASK;
(DAIF_MASK is all four PSR bits)

So that we can then fix this in kprobes_restore_local_irqflag() with:
| regs->pstate &= ~DAIF_MASK;
| regs->pstate |= kcb->saved_irqflag;
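
(Pulling that together, the pair would look something like this, untested:)

| static void __kprobes kprobes_save_local_irqflag(struct kprobe_ctlblk *kcb,
|                                                  struct pt_regs *regs)
| {
|         kcb->saved_irqflag = regs->pstate & DAIF_MASK;
|         regs->pstate |= PSR_I_BIT;
| }
|
| static void __kprobes kprobes_restore_local_irqflag(struct kprobe_ctlblk *kcb,
|                                                      struct pt_regs *regs)
| {
|         regs->pstate &= ~DAIF_MASK;
|         regs->pstate |= kcb->saved_irqflag;
| }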

Re: [PATCH 3/3] arm64: debug: Remove rcu_read_lock from debug exception

2019-07-19 Thread James Morse

Hi,

On 7/18/19 3:31 PM, Masami Hiramatsu wrote:

On Thu, 18 Jul 2019 10:20:23 +0100
Mark Rutland  wrote:


On Wed, Jul 17, 2019 at 11:22:15PM -0700, Paul E. McKenney wrote:

On Thu, Jul 18, 2019 at 02:43:58PM +0900, Masami Hiramatsu wrote:

Remove rcu_read_lock()/rcu_read_unlock() from debug exception
handlers since the software breakpoint can be hit on idle task.


Why precisely do we need to elide these? Are we seeing warnings today?


Yes, unfortunately, or fortunately. Naresh reported that warns when
ftracetest ran. I confirmed that happens if I probe on default_idle_call too.

/sys/kernel/debug/tracing # echo p default_idle_call >> kprobe_events
/sys/kernel/debug/tracing # echo 1 > events/kprobes/enable
/sys/kernel/debug/tracing # [  135.122237]
[  135.125035] =
[  135.125310] WARNING: suspicious RCU usage



[  135.132224] Call trace:
[  135.132491]  dump_backtrace+0x0/0x140
[  135.132806]  show_stack+0x24/0x30
[  135.133133]  dump_stack+0xc4/0x10c
[  135.133726]  lockdep_rcu_suspicious+0xf8/0x108
[  135.134171]  call_break_hook+0x170/0x178
[  135.134486]  brk_handler+0x28/0x68
[  135.134792]  do_debug_exception+0x90/0x150
[  135.135051]  el1_dbg+0x18/0x8c
[  135.135260]  default_idle_call+0x0/0x44
[  135.135516]  cpu_startup_entry+0x2c/0x30
[  135.135815]  rest_init+0x1b0/0x280
[  135.136044]  arch_call_rest_init+0x14/0x1c
[  135.136305]  start_kernel+0x4d4/0x500



The exception entry and exit use irq_enter() and irq_exit(), in this
case, correct?  Otherwise RCU will be ignoring this CPU.


This is missing today, which sounds like the underlying bug.


Agreed. I'm not so familier with how debug exception is handled on arm64,
would it be a kind of NMI or IRQ?


Debug exceptions can interrupt both SError (think: machine check) and 
pseudo-NMI, which both in turn interrupt interrupt-masked code. So they 
are a kind of NMI. But, be careful not to call 'nmi_enter()' twice, see 
do_serror() for how we work around this...




Anyway, it seems that normal irqs are also not calling irq_enter/exit
except for arch/arm64/kernel/smp.c
drivers/irqchip/irq-gic.c:gic_handle_irq() either calls 
handle_domain_irq() or handle_IPI(). The enter/exit calls live in those 
functions.



Thanks,

James


Re: [PATCH RFC 2/4] arm64: mm: Add RAS extension system register check to SEA handling

2019-07-17 Thread James Morse
Hi Tyler,

On 11/07/2019 05:14, Tyler Baicar OS wrote:
> On Tue, Jul 9, 2019 at 8:52 PM Tyler Baicar OS 
>  wrote:
>> On Mon, Jul 8, 2019 at 10:10 AM James Morse  wrote:
>>> On 02/07/2019 17:51, Tyler Baicar OS wrote:
>>>> @@ -632,6 +633,8 @@ static int do_sea(unsigned long addr, unsigned int 
>>>> esr, struct pt_regs *regs)
>>>>
>>>>   inf = esr_to_fault_info(esr);
>>>>
>>>> + arch_arm_ras_report_error();
>>>> +
>>>>   /*
>>>>* Return value ignored as we rely on signal merging.
>>>>* Future patches will make this more robust.
>>>>
>>>
>>> If we interrupted a preemptible context, do_sea() is preemptible too... 
>>> This means we
>>> can't know if we're still running on the same CPU as the one that took the 
>>> external-abort.
>>> (until this series, it hasn't mattered).
>>>
>>> Fixing this means cramming something into entry.S's el1_da, as this may 
>>> unmask interrupts
>>> before calling do_mem_abort(). But its going to be ugly because some of 
>>> do_mem_abort()s
>>> ESR values need to be preemptible because they sleep, e.g. page-faults 
>>> calling
>>> handle_mm_fault().
>>> For do_sea(), do_exit() will 'fix' the preempt count if we kill the thread, 
>>> but if we
>>> don't, it still needs to be balanced. Doing all this in assembly is going 
>>> to be unreadable!
>>>
>>> Mark Rutland has a series to move the entry assembly into C [0]. Based on 
>>> that that it
>>> should be possible for the new el1_abort() to spot a 
>>> Synchronous-External-Abort ESR, and
>>> wrap the do_mem_abort() with preempt enable/disable, before inheriting the 
>>> flags. (which
>>> for synchronous exceptions, I think we should always do)
>>>
>>> [0] 
>>> https://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git/log/?h=arm64/entry-deasm

>> Good catch! I didn't think the synchronous route was preemptible.

>> I wasn't seeing this issue when testing this on emulation, but I was able to
>> test and prove the issue on a Neoverse N1 SDP:
>>
>> root@genericarmv8:~# echo 0x1 > /proc/cached_read
>> [   42.985622] Reading from address 0x1
>> [   42.989893] WARNING: CPU: 0 PID: 2812 at 
>> /home/tyler/neoverse/arm-reference-
>> platforms/linux/arch/arm64/kernel/cpufeature.c:1940 
>> this_cpu_has_cap+0x68/0x78

[...]

>> [   43.175647] Internal error: synchronous external abort: 96000410 [#1]
>> PREEMPT SMP

[...]

>> I'll pull Mark's series in and add the preempt enable/disable around the call
>> to do_mem_abort() in el1_abort() and test that out!
> 
> I was able to pull in the series mentioned [0] and add a patch to wrap
> do_mem_abort with preempt disable/enable and the warning has gone away.

Great.

I spoke to Mark who commented he hadn't had the time to finish rebasing it onto
for-next/core. (which I guess is why it didn't get posted!).

I've taken a stab at picking out the 'synchronous' parts and rebasing it onto 
arm64's
for-next/core. That tree is here:
http://www.linux-arm.org/git?p=linux-jm.git;a=shortlog;h=refs/heads/deasm_sync_only/v0

(this should save you doing the rebase)

I'll aim to rebase/retest and post it next week.


> diff --git a/arch/arm64/kernel/entry-common.c 
> b/arch/arm64/kernel/entry-common.c
> index 43aa78331e72..26cdf7db511a 100644
> --- a/arch/arm64/kernel/entry-common.c
> +++ b/arch/arm64/kernel/entry-common.c
> @@ -118,7 +118,25 @@ static void el1_abort(struct pt_regs *regs, unsigned 
> long esr)

el0_ia/da will have the same problem.


>   unsigned long far = read_sysreg(far_el1);
>   local_daif_inherit(regs);
>   far = untagged_addr(far);
> - do_mem_abort(far, esr, regs);
> +
> + switch (esr & ESR_ELx_FSC) {
> + case ESR_ELx_FSC_EXTABT:// Synchronous External Abort
> + case 0x14:  // SEA level 0 translation table walk
> + case 0x15:  // SEA level 1 translation table walk
> + case 0x16:  // SEA level 2 translation table walk
> + case 0x17:  // SEA level 3 translation table walk
> + case 0x18:  // Synchronous ECC error
> + case 0x1c:  // SECC level 0 translation table walk
> + case 0x1d:  // SECC level 1 translation table walk
> + case 0x1e:  // SECC level 2 translation table walk
> + case 0x1f:  // SECC level 3 translation table walk

Re: kprobes sanity test fails on next-20190708

2019-07-08 Thread James Morse
Hi,

On 08/07/2019 15:11, Anders Roxell wrote:
> argh... resending, with plaintext... Sorry =/
> 
> I tried to build a next-201908 defconfig + CONFIG_KPROBES=y and
> CONFIG_KPROBES_SANITY_TEST=y
> 
> I get the following Call trace, any ideas?
> I've tried tags back to next-20190525 and they also failes... I haven't
> found a commit that works yet.
> 
> [0.098694] Kprobe smoke test: started
> [0.102001] audit: type=2000 audit(0.088:1): state=initialized
> audit_enabled=0 res=1
> [0.104753] Internal error: aarch64 BRK: f204 [#1] PREEMPT SMP

This sounds like the issue Mark reported:
https://lore.kernel.org/r/20190702165008.gc34...@lakrids.cambridge.arm.com

It doesn't look like Steve's patch has percolated into next yet:
https://lore.kernel.org/lkml/20190703103715.32579...@gandalf.local.home/

Could you give that a try to see if this is a new issue?


Thanks,

James


Re: [PATCH RFC 2/4] arm64: mm: Add RAS extension system register check to SEA handling

2019-07-08 Thread James Morse
Hey Tyler,

On 02/07/2019 17:51, Tyler Baicar OS wrote:
> On systems that support the ARM RAS extension, synchronous external
> abort syndrome information could be captured in the core's RAS extension
> system registers. So, when handling SEAs check the RAS system registers
> for error syndrome information.

> diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
> index 2d11501..76b42ca 100644
> --- a/arch/arm64/mm/fault.c
> +++ b/arch/arm64/mm/fault.c
> @@ -37,6 +37,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  struct fault_info {
>   int (*fn)(unsigned long addr, unsigned int esr,
> @@ -632,6 +633,8 @@ static int do_sea(unsigned long addr, unsigned int esr, 
> struct pt_regs *regs)
>  
>   inf = esr_to_fault_info(esr);
>  
> + arch_arm_ras_report_error();
> +
>   /*
>* Return value ignored as we rely on signal merging.
>* Future patches will make this more robust.
> 

If we interrupted a preemptible context, do_sea() is preemptible too... This 
means we
can't know if we're still running on the same CPU as the one that took the 
external-abort.
(until this series, it hasn't mattered).

Fixing this means cramming something into entry.S's el1_da, as this may unmask 
interrupts
before calling do_mem_abort(). But its going to be ugly because some of 
do_mem_abort()s
ESR values need to be preemptible because they sleep, e.g. page-faults calling
handle_mm_fault().
For do_sea(), do_exit() will 'fix' the preempt count if we kill the thread, but 
if we
don't, it still needs to be balanced. Doing all this in assembly is going to be 
unreadable!

Mark Rutland has a series to move the entry assembly into C [0]. Based on that 
that it
should be possible for the new el1_abort() to spot a Synchronous-External-Abort 
ESR, and
wrap the do_mem_abort() with preempt enable/disable, before inheriting the 
flags. (which
for synchronous exceptions, I think we should always do)


Thanks,

James

[0] 
https://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git/log/?h=arm64/entry-deasm


Re: [PATCH] arm64: mm: free the initrd reserved memblock in a aligned manner

2019-07-05 Thread James Morse
Hi,

On 04/07/2019 00:59, Yi Wang wrote:
> From: Junhua Huang 
> 
> We should free the reserved memblock in an aligned manner 
> because the initrd reserves the memblock in an aligned manner 
> in arm64_memblock_init(). 
> Otherwise there are some fragments in memblock_reserved regions. e.g.:
> /sys/kernel/debug/memblock # cat reserved 
>0: 0x8008..0x817fafff
>1: 0x8340..0x83ff
>2: 0x9000..0x9000407f
>3: 0xb000..0xb03f
>4: 0xb26184ea..0xb2618fff
> The fragments like the ranges from b000 to b03f and 
> from b26184ea to b2618fff should be freed.
> 
> And we can do free_reserved_area() after memblock_free(),
> as free_reserved_area() calls __free_pages(), once we've done 
> that it could be allocated somewhere else, 
> but memblock and iomem still say this is reserved memory.
> 
> Signed-off-by: Junhua Huang 

You need to add your own Signed-off-by after Junhua Huang's. This tells the 
maintainer
that you're providing the patch with the 'Developer's Certificate of Origin'. 
Details in
/Documentation/process/submitting-patches.rst.


> diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
> index d2adffb81b5d..03774b8bd364 100644
> --- a/arch/arm64/mm/init.c
> +++ b/arch/arm64/mm/init.c
> @@ -580,8 +580,13 @@ void free_initmem(void)
>  #ifdef CONFIG_BLK_DEV_INITRD
>  void __init free_initrd_mem(unsigned long start, unsigned long end)
>  {
> + unsigned long aligned_start, aligned_end;
> +
> + aligned_start = __virt_to_phys(start) & PAGE_MASK;
> + aligned_end = PAGE_ALIGN(__virt_to_phys(end));

> + memblock_free(aligned_end, aligned_end - aligned_start);

We're not free-ing the same memory as we reserved here!
(start/end typo)

>   free_reserved_area((void *)start, (void *)end, 0, "initrd");
> - memblock_free(__virt_to_phys(start), end - start);
> +

(stray newline)

>  }
>  #endif


Thanks,

James


Re: [PATCH] remove the initrd resource in /proc/iomem as the initrd has freed the reserved memblock.

2019-07-05 Thread James Morse
Hi,

(CC: +devicetree list:
memreserving the initrd, which linux then frees causes a zombie memreserve in 
all future
kexec'd kernels)

On 03/07/2019 12:42, huang.jun...@zte.com.cn wrote:
 On 02/07/2019 11:34, Yi Wang wrote:
> From: Junhua Huang 
> The 'commit 50d7ba36b916 ("arm64: export memblock_reserve()d regions via 
> /proc/iomem")'
> show the reserved memblock in /proc/iomem. But the initrd's reserved 
> memblock
> will be freed in free_initrd_mem(), which executes after the 
> reserve_memblock_reserved_regions().
> So there are some incorrect information shown in /proc/iomem. e.g.:
> 8000-bbdf : System RAM
>   8008-813e : Kernel code
>   813f-8156 : reserved
>   8157-817fcfff : Kernel data
>   8340-83ff : reserved
>   9000-90004fff : reserved
>   b000-b2618fff : reserved
>   b8c0-bbbf : reserved
> In this case, the range from b000 to b2618fff is reserved for initrd, 
> which should be
> clean from the resource tree after it was freed.

 (There was some discussion about this over-estimate on the list, but it 
 didn't make it
 into the commit message.) I think a reserved->free change is fine. If 
 user-space thinks
 its still reserved nothing bad happens.
>>
> As kexec-tool will collect the iomem reserved info 
> and use it in second kernel, which causes error message generated a 
> second time.
>>
 What error message?
>>
>>> Sorry, it's my mistake. The kexec-tool could not use iomem reserved info in 
>>> the second kernel.
>>> The error message I mean is that the initrd reserved memblock region will 
>>> be shown in 
>>> second kernel /proc/iomem. But this message comes from the dtb's memreserve 
>>> node, 
>>> not the first kernel /proc/iomem.
>>
>> This doesn't sound right.
>> Is kexec-tool spraying anything reserved in /proc/iomem into the DT as 
>> memreserve?

> No, it isn't. The kexec-tool could not spray resserved info to DT as 
> memreserve.

(well, it generates the second DT, and it has the reserved /proc/iomem entries 
on hand)


> After we started the kernel from uboot, the /sys/firmware/fdt in this kernel 
> has been add some infos,
> incluing the dtb memreserve and ramdisk memreserve info.

Aha! Ugh.
arm64_memblock_init() memblock_reserves() this, as does fdt_init_reserved_mem().
But then we memblock_free() it in free_initrd_mem().


> And then we use this fdt as the kexec dtb, and the second kernel would read 
> the memreserve node and reserve the memblock.

> So we can see the first kernel initrd reserved info from the second kernel 
> /proc/iomem.

[...]

> This phenomenon is not caused by first kernel /proc/iomem or kexec, but dtb.

Right, so the options are:

(1) Delete the memreserve node from the DT when we free the initrd. I don't 
think this is
the sort of thing the kernel should be doing.

(2) Memreserving the initramfs implies 'keepinitrd' on the commandline. The DT 
has told us
to "exclude reserved memory from normal usage". If the bootloader didn't mean 
for us to
reserve this memory forever, it should drop the memreserve.

(3) kexec-tools determines this memreserve is no longer needed and removes it 
from the new
DT. It has the details to do this: /sys/firmware/fdt will show the physical 
addresses of
the initramfs. With your patch /proc/iomem will show that this region is 
regular memory.
(not covered by any reserved type). This is safe with older kernels.


My preference is 2 then 3.

[...]

> I agree. The kexec-tool will use the second-level reserved info to avoid the 
> load address 
> conflict with the important thing, It is true that if the image load address 
> kexec-tools set 
> would belong to other important thing, something would go wrong. 
> So I think we need clean the initrd reserved info from /proc/iomem, it is 
> useless.

If we've freed the memory and we can update that file, sure.
I think at the time of that patch the assumption was only the arch code does
memblock_reserve() early, so we never need to update /proc/iomem.


Thanks,

James


Re: [PATCH v4 2/2] EDAC: add EDAC driver for DMC520

2019-07-04 Thread James Morse
edac_printk(KERN_ERR, EDAC_MC,
>>> +   "interrupt-config error: "
>>> +   "element %d's interrupt mask %d has overlap.\n",
>>> +   intr_index, edac->interrupt_masks[intr_index]);
>>> +   goto err_free_mc;
>>> +   }
>>> +
>>> +   edac->interrupt_mask_all |= edac->interrupt_masks[intr_index];
>>> +   }

>> Ah, so the driver doesn't support overlapping masks... but wasn't this the 
>> reason for
>> describing the interrupts with these masks in the first place?
>> (It looks like the DT-folk want this as named interrupts)
>>
>> lore.kernel.org/r/byapr21mb1319bc4d079b918ab038a4d590...@byapr21mb1319.namprd21.prod.outlook.com
>>
>> Would this driver support the configuration you gave there?

> The interrupt line to mask mapping is to solve how to flexibly adapting 
> to possible hardware implementations. dmc520 supports multiple 
> interrupts (3.3.169 interrupt_control). And these interrupts may have 
> different ways to be wired to physical interrupt lines. As in the above 
> link, in this particular brcm implementation:
> 
> Line 841: source dram_ecc_errc_int
> Line 843: source dram_ecc_errd_int
> Line 839: source dram_ecc_errc_int and dram_ecc_errd_int
> 
> Two straightforward possibilities for implementing ecc counts for ce/ue: 
> 1. We chose to use the single source line. 2. It's possible to implement 
> using the combined-source line too. Our implementation would support 
> either of these cases.
> 
> Of course there might be other possibilities that involve overlapping, 
> such as including all above 3 interrupt lines into the DT. But this 
> unlikely is of any real value of use. Our implementation does not 
> support this case.

Right, so the driver does support this, but not at the same time as independent 
interrupts.


>> With the bool/enum and interrupt-disabling things fixed:
>> Reviewed-by: James Morse 
>>
> New to the upstreaming review process. Does this last comment mean we're 
> closer? :)

Heh, yes. This translates as: If you post a subsequent version with those two 
issues
fixed, please include that Reviewed-by tag next to your Signed-off-by.

{
If you could also summarise the changes you make next to the diffstat, it 
allows people
who have given tags to only look at the bits you changed, (instead of playing 
spot the
difference).

As an example:
https://lore.kernel.org/r/20190521172139.21277-3-julien.gr...@arm.com

git knows to discard the changes-between-versions and diffstat bits when the 
patch is
applied, they don't end up in the log.
}

What happens next? Your series gets more review and collects tags. This will 
include the
maintainers of each tree you're touching either giving tags, or queueing the 
series. From
there it sits in linux-next until the next merge-window, when the maintainer 
will send a
pull-request to Linus. Eventually it ends up in the release-candidates, and 
finally a
released kernel.


(N.B: your mail is still coming base64 encoded, so its very unlikely the 
maintainer can
pick it up.)


Thanks,

James


Re: [PATCH] remove the initrd resource in /proc/iomem as the initrd has freed the reserved memblock.

2019-07-03 Thread James Morse
Hi,

On 03/07/2019 10:16, huang.jun...@zte.com.cn wrote:
>> On 02/07/2019 11:34, Yi Wang wrote:
>>> From: Junhua Huang 
>>> The 'commit 50d7ba36b916 ("arm64: export memblock_reserve()d regions via 
>>> /proc/iomem")'
>>> show the reserved memblock in /proc/iomem. But the initrd's reserved 
>>> memblock
>>> will be freed in free_initrd_mem(), which executes after the 
>>> reserve_memblock_reserved_regions().
>>> So there are some incorrect information shown in /proc/iomem. e.g.:
>>> 8000-bbdf : System RAM
>>>   8008-813e : Kernel code
>>>   813f-8156 : reserved
>>>   8157-817fcfff : Kernel data
>>>   8340-83ff : reserved
>>>   9000-90004fff : reserved
>>>   b000-b2618fff : reserved
>>>   b8c0-bbbf : reserved
>>> In this case, the range from b000 to b2618fff is reserved for initrd, 
>>> which should be
>>> clean from the resource tree after it was freed.
>>
>> (There was some discussion about this over-estimate on the list, but it 
>> didn't make it
>> into the commit message.) I think a reserved->free change is fine. If 
>> user-space thinks
>> its still reserved nothing bad happens.

>>> As kexec-tool will collect the iomem reserved info 
>>> and use it in second kernel, which causes error message generated a second 
>>> time.

>> What error message?

> Sorry, it's my mistake. The kexec-tool could not use iomem reserved info in 
> the second kernel.
> The error message I mean is that the initrd reserved memblock region will be 
> shown in 
> second kernel /proc/iomem. But this message comes from the dtb's memreserve 
> node, 
> not the first kernel /proc/iomem.

This doesn't sound right.
Is kexec-tool spraying anything reserved in /proc/iomem into the DT as 
memreserve?


These top-level 'nomap' and second-level 'reserved' entries exist to stop 
kexec-tools
trying to write the new kernel over the top of something important. This only 
matters
between 'load' and 'exec' during the #1-kernel:

| kexec-tools reads /proc/iomem.
| kexec-tools tells #1-kernel "I want this 10MB image to be located at 0xf00".
| #1-kernel knows 0xf00 is in use, so it stores the data elsewhere until 
kexec-time.
[some time passes]
| #1-kernel kexec's, copying the image to 0xf00
| #2-kernel now owns the machine

This goes wrong if 0xf00 belonged to firmware (nomap), or contained something 
important
(uefi memory map, acpi tables etc).

Once the second kernel has started running it should re-discover where this 
important
stuff is from the EFI and ACPI tables.

We deliberately over-estimate these second-level reserved regions as its the 
simplest
thing to do. (e.g. the per-cpu chunk allocations get swept up too)


Does this mean the amount of usable memory in the system reduces each time you 
kexec? That
shouldn't be true!


Thanks,

James


Re: [PATCH v4 2/2] EDAC: add EDAC driver for DMC520

2019-07-03 Thread James Morse
_ce, );
> +
> + cnt = dmc520_get_dram_ecc_error_count(edac, is_ce);
> +
> + if (cnt > 0) {
> + snprintf(message, ARRAY_SIZE(message),
> +  "rank:%d bank:%d row:%d col:%d",
> +  info.rank, info.bank,
> +  info.row, info.col);
> +
> + spin_lock_irqsave(&edac->ecc_lock, flags);

irqsave/irqrestore is overkill as this function is only called from an 
interrupt handler.
There is no way for this to be called with interrupts unmasked.


> + edac_mc_handle_error((is_ce ? HW_EVENT_ERR_CORRECTED :
> +  HW_EVENT_ERR_UNCORRECTED),
> +  mci, cnt, 0, 0, 0, info.rank, -1, -1,
> +  message, "");
> + spin_unlock_irqrestore(&edac->ecc_lock, flags);
> + }
> +}
> +
> +static irqreturn_t dmc520_edac_dram_ecc_isr(int irq, void *data, bool is_ce)

data here could be struct mem_ctl_info *, as it only has one caller.

> +{
> + u32 i_mask;
> + struct mem_ctl_info *mci;
> + struct dmc520_edac *edac;
> +
> + mci = data;
> + edac = mci->pvt_info;
> +
> + i_mask = is_ce ? DRAM_ECC_INT_CE_MASK : DRAM_ECC_INT_UE_MASK;

(The mask/bit here could be passed in directly, its the value you need most 
often)


> + dmc520_handle_dram_ecc_errors(mci, is_ce);
> +
> + dmc520_write_reg(edac, i_mask, REG_OFFSET_INTERRUPT_CLR);
> +
> + return IRQ_HANDLED;
> +}

[...]


> +static int dmc520_edac_probe(struct platform_device *pdev)
> +{

[...]

> + if (nintr > ARRAY_SIZE(dmc520_isr_array)) {
> + edac_printk(KERN_ERR, EDAC_MOD_NAME,
> + "Invalid device node configuration: # of interrupt 
> config "
> + "elements (%d) can not exeed %ld.\n",

(Nit: exceed)

> + nintr, ARRAY_SIZE(dmc520_isr_array));
> + return -EINVAL;
> + }

[...]

> + ret = of_property_read_u32_array(dev->of_node, "interrupt-config",
> + edac->interrupt_masks, nintr);
> + if (ret) {
> + edac_printk(KERN_ERR, EDAC_MOD_NAME,
> + "Failed to get interrupt-config arrays.\n");
> + goto err_free_mc;
> + }

> + for (intr_index = 0; intr_index < nintr; ++intr_index) {
> + if (edac->interrupt_mask_all & 
> edac->interrupt_masks[intr_index]) {
> + edac_printk(KERN_ERR, EDAC_MC,
> + "interrupt-config error: "
> + "element %d's interrupt mask %d has overlap.\n",
> + intr_index, edac->interrupt_masks[intr_index]);
> + goto err_free_mc;
> + }
> +
> + edac->interrupt_mask_all |= edac->interrupt_masks[intr_index];
> + }

Ah, so the driver doesn't support overlapping masks... but wasn't this the 
reason for
describing the interrupts with these masks in the first place?
(It looks like the DT-folk want this as named interrupts)

lore.kernel.org/r/byapr21mb1319bc4d079b918ab038a4d590...@byapr21mb1319.namprd21.prod.outlook.com

Would this driver support the configuration you gave there?


> + edac->interrupt_mask_all &= ALL_INT_MASK;

This is to remove invalid interrupt fields? Shouldn't we print a warning instead?
Either the DT is invalid, or it's some future hardware that has an extra interrupt
that this driver won't enable.


[...]

> + /* Clear interrupts */
> + reg_val = dmc520_read_reg(edac, REG_OFFSET_INTERRUPT_CONTROL);
> + dmc520_write_reg(edac, reg_val & (~(edac->interrupt_mask_all)),
> + REG_OFFSET_INTERRUPT_CONTROL);
> + dmc520_write_reg(edac, edac->interrupt_mask_all, 
> REG_OFFSET_INTERRUPT_CLR);

[...]

> + /* Enable interrupts */
> + dmc520_write_reg(edac, edac->interrupt_mask_all, 
> REG_OFFSET_INTERRUPT_CONTROL);

Won't this disable any interrupts we weren't told about? You did a read-modify 
write
above. Can we do the same here?
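
(i.e. something like this sketch, reusing the read-modify-write pattern from the
'Clear interrupts' hunk above:)

| 	/* enable only the interrupts this driver was told about */
| 	reg_val = dmc520_read_reg(edac, REG_OFFSET_INTERRUPT_CONTROL);
| 	dmc520_write_reg(edac, reg_val | edac->interrupt_mask_all,
| 			 REG_OFFSET_INTERRUPT_CONTROL);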


> + return 0;
> +
> +err_free_irq:
> + for (intr_index = 0; intr_index < nintr_registered; ++intr_index) {
> + int irq_id = platform_get_irq(pdev, intr_index);
> +     devm_free_irq(&pdev->dev, irq_id, mci);
> + }
> + edac_mc_del_mc(&pdev->dev);
> +err_free_mc:
> + edac_mc_free(mci);
> +
> + return ret;
> +}
> +

[...]

> +static const struct of_device_id dmc520_edac_driver_id[] = {
> + { .compatible = "brcm,dmc-520", },
> + { .compatible = "arm,dmc-520", },

You should only need the "arm,dmc-520" entry here. The additional compatible 
values are
for quirking the driver when integration issues are discovered.
The 'brcm' version should be in the DT from day-one, but the kernel only needs 
to pick it
up when it needs to treat the brcm version differently.


> + { /* end of table */ }
> +};


With the bool/enum and interrupt-disabling things fixed:
Reviewed-by: James Morse 



Thanks,

James

[0] 
https://static.docs.arm.com/10/0200/corelink_dmc520_trm_10_0200_01_en.pdf


Re: [PATCH] remove the initrd resource in /proc/iomem as the initrd has freed the reserved memblock.

2019-07-02 Thread James Morse
Hello,

On 02/07/2019 11:34, Yi Wang wrote:
> From: Junhua Huang 
> 
> The 'commit 50d7ba36b916 ("arm64: export memblock_reserve()d regions via 
> /proc/iomem")'
> show the reserved memblock in /proc/iomem. But the initrd's reserved memblock
> will be freed in free_initrd_mem(), which executes after the 
> reserve_memblock_reserved_regions().
> So there are some incorrect information shown in /proc/iomem. e.g.:
> 8000-bbdf : System RAM
>   8008-813e : Kernel code
>   813f-8156 : reserved
>   8157-817fcfff : Kernel data
>   8340-83ff : reserved
>   9000-90004fff : reserved
>   b000-b2618fff : reserved
>   b8c0-bbbf : reserved
> In this case, the range from b000 to b2618fff is reserved for initrd, 
> which should be
> clean from the resource tree after it was freed.

(There was some discussion about this over-estimate on the list, but it didn't
make it into the commit message.) I think a reserved->free change is fine. If
user-space thinks it's still reserved nothing bad happens.


> As kexec-tool will collect the iomem reserved info 
> and use it in second kernel, which causes error message generated a second 
> time.

What error message?

These second-level reserved regions are to avoid kexec overwriting ACPI tables, 
or bits of
the UEFI memory map with its data.

The second kernel should not expect to find the initramfs here...


> At the same time, we should free the reserved memblock in an aligned manner 
> because 
> the initrd reserves the memblock in an aligned manner in 
> arm64_memblock_init(). 
> Otherwise there are some fragments in memblock_reserved regions. e.g.:
> /sys/kernel/debug/memblock # cat reserved 
>0: 0x8008..0x817fafff
>1: 0x8340..0x83ff
>2: 0x9000..0x9000407f
>3: 0xb000..0xb03f
>4: 0xb26184ea..0xb2618fff
> The fragments like the ranges from b000 to b03f and from b26184ea to 
> b2618fff 
> should be freed.

I agree we should fix this. Could it be done as a separate patch, as this is a 
different
issue?


> diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
> index d2adffb81b5d..14ba8113eab5 100644
> --- a/arch/arm64/mm/init.c
> +++ b/arch/arm64/mm/init.c
> @@ -580,8 +580,16 @@ void free_initmem(void)
>  #ifdef CONFIG_BLK_DEV_INITRD
>  void __init free_initrd_mem(unsigned long start, unsigned long end)
>  {
> + struct resource *res = NULL;

>   free_reserved_area((void *)start, (void *)end, 0, "initrd");

Could we do this last? This calls __free_pages(), once we've done that it could 
be
allocated somewhere else, but memblock and iomem still say this is reserved 
memory.


> - memblock_free(__virt_to_phys(start), end - start);
> + start = __virt_to_phys(start) & PAGE_MASK;
> + end = PAGE_ALIGN(__virt_to_phys(end));
> + memblock_free(start, end - start);

> + res = lookup_resource(&iomem_resource, memblock_start_of_DRAM());
> + if (res != NULL)

("if (res)" is the preferred style)

> + __release_region(res, start, end - start);

Hmm, I can't see where __release_region() splits start/end out of an existing 
resource...

I think this means if you have memblock_reserved() pages either side of the
initramfs they get merged by memblock and put into iomem_resource as one entry.
We'd hit __release_region()'s:
|   if (res->start != start || res->end != end)
|   break;
[...]
|   printk(KERN_WARNING "Trying to free nonexistent resource "
|   "<%016llx-%016llx>\n", (unsigned long long)start,
|   (unsigned long long)end);

It looks like __release_region() is only for exactly the region you reserved. 
We need it
to split one resource into two, the reverse of what reserve_region_with_split() 
does.

I can't see a helper that would do this.


An alternative is to explicitly create an iomem_resource for the initramfs, 
like we do for
the kernel text/data. reserve_region_with_split() will then fill in the gaps 
around it if
there are any, and you can pass it to release_resource() in here instead of 
searching for it.
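
(A rough sketch of that alternative; the 'initrd_res' name and where you register it
are only illustrative, this isn't existing arm64 code:)

| /* boot-time: give the initramfs its own resource, like kernel code/data */
| static struct resource initrd_res = {
| 	.name  = "initrd",
| 	.flags = IORESOURCE_MEM,
| };
|
| 	initrd_res.start = __pa(initrd_start);
| 	initrd_res.end   = __pa(initrd_end) - 1;
| 	request_resource(&iomem_resource, &initrd_res);
|
| /* free_initrd_mem(): release exactly that resource, no lookup needed */
| 	release_resource(&initrd_res);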

It should be possible to test by rounding the initramfs base/size in 
arm64_memblock_init()
by a bit more.


Thanks,

James


Re: [PATCH] EDAC: Fix global-out-of-bounds write when setting edac_mc_poll_msec

2019-06-27 Thread James Morse
Hello,

(CC: +Tony Luck.
 Original Patch: lore.kernel.org/r/20190626054011.30044-1-de...@etsukata.com )

On 26/06/2019 06:40, Eiichi Tsukata wrote:
> Commit 9da21b1509d8 ("EDAC: Poll timeout cannot be zero, p2") assumes
> edac_mc_poll_msec to be unsigned long, but the type of the variable still
> remained as int. Setting edac_mc_poll_msec can trigger out-of-bounds
> write.

Thanks for catching this!


> Fix it by changing the type of edac_mc_poll_msec to unsigned int.

This means reverting more of 9da21b1509d8, but it also fixes signed/unsigned 
issues:
| root@debian-guest:/sys/module/edac_core/parameters# echo 4294967295 >  
edac_mc_poll_msec
| root@debian-guest:/sys/module/edac_core/parameters# cat edac_mc_poll_msec
| -1
| root@debian-guest:/sys/module/edac_core/parameters# echo -1 > 
edac_mc_poll_msec
| -bash: echo: write error: Invalid argument


> The reason why this patch adopts unsigned int rather than unsigned long
> is msecs_to_jiffies() assumes arg to be unsigned int.

Ah, so the range is limited anyway.

It looks like it was switched to long to be consistent with 
edac_mc_workq_setup(), which
has since been removed in preference to msecs_to_jiffies().


Reviewed-by: James Morse 


Thanks,

James


Re: [PATCH 12/21] EDAC, ghes: Add support for legacy API counters

2019-06-26 Thread James Morse
Hi Robert,

On 20/06/2019 07:55, Robert Richter wrote:
> On 19.06.19 18:22:32, James Morse wrote:
>>> In any case, this patch cleans up code as old API's counter code is
>>> isolated and moved to common code. Making the counter's work for ghes
>>> is actually a side-effect here. The cleanup is a prerequisit for
>>> follow on patches.
>>
>> I'm all for removing/warning-its-broken it when ghes_edac is in use. But the 
>> convincing
>> argument is debian ships a 'current' version of edac-utils that predates 
>> 199747106934,
>> (that made all this fake csrow stuff deprecated), and debian's popcon says 
>> ~1000 people
>> have it installed.
> 
> All arm64 distribution kernels that I have checked come with:
> 
> CONFIG_EDAC_SUPPORT=y
> CONFIG_EDAC=y
> CONFIG_EDAC_LEGACY_SYSFS=y
> # CONFIG_EDAC_DEBUG is not set
> CONFIG_EDAC_GHES=y
> CONFIG_EDAC_LAYERSCAPE=m
> CONFIG_EDAC_THUNDERX=m
> CONFIG_EDAC_XGENE=m

(distros also enable drivers for hardware no-one has!)

Who uses this? edac-utils, on both arm64 and x86.


>> If you want it fixed, please don't do it as a side effect of cleanup. Fixes 
>> need to be a
>> small separate series that can be backported. (unless we're confident no-one 
>> uses it, in
>> which case, why fix it?)

> It is not that I am keen on fixing legacy edac sysfs. It just happens
> while unifying the error handlers in ghes_edac and edac_mc. As I see
> you are reluctant on just letting it go, let's just disable
> EDAC_LEGACY_SYSFS for ARM64.

That would break other drivers where those legacy counters expose valid values.

You're painting me as some kind of stubborn villain here. You're right, my
initial reaction
was 'what for?'. Adding new support for legacy counters that have never worked 
with
ghes_edac looks like the wrong thing to do.

But unfortunately edac-utils is still using this legacy interface.

If we're going to fix it, could we fix it properly? (separate series that can be
backported to stable).


> Though, I don't agree with it as there
> still could be some userland tools that use this interface that cannot
> be used any longer after a transition from x86 to arm64.

I don't think this is the right thing to do. ghes_edac's behaviour should not 
change
between architectures.


Where we aren't agreeing is how we fix bugs:

Its either broken, and no-one cares, we should remove it.
Or, we should fix it and those fixes should go to stable.

We can't mix fixes and features in a patch series, as the fixes then can't easily
be backported. If it's ever in doubt, the patches should still be sent as separate
fixes so the maintainer can decide.


Thanks,

James


Re: [PATCH v2 24/24] EDAC, ghes: Disable legacy API for ARM64

2019-06-26 Thread James Morse
On 24/06/2019 16:09, Robert Richter wrote:
> James Morse: "I'm all for removing/warning-its-broken it when
> ghes_edac is in use."

Thanks for taking that out of context. The very next word was 'but':
http://lore.kernel.org/r/c08290d8-3690-efa9-3bc7-37f8b1fdb...@arm.com

followed by details of the user-space that is still using this.


> Let's just disable legacy API for the ghes driver on arm64. Though, I
> don't agree with it as there still could be some userland tools that

Not could. Are. Someone went and found them for you.


> use this interface that cannot be used any longer after a transition
> from x86 to arm64.


[PATCH v2] drivers: base: cacheinfo: Ensure cpu hotplug work is done before Intel RDT

2019-06-24 Thread James Morse
The cacheinfo structures are alloced/freed by cpu online/offline
callbacks. Originally these were only used by sysfs to expose the
cache topology to user space. Without any in-kernel dependencies
CPUHP_AP_ONLINE_DYN was an appropriate choice.

resctrl has started using these structures to identify CPUs that
share a cache. It updates its 'domain' structures from cpu
online/offline callbacks. These depend on the cacheinfo structures
(resctrl_online_cpu()->domain_add_cpu()->get_cache_id()->
 get_cpu_cacheinfo()).
These also run as CPUHP_AP_ONLINE_DYN.

Now that there is an in-kernel dependency, move the cacheinfo
work earlier so we know its done before resctrl's CPUHP_AP_ONLINE_DYN
work runs.

Fixes: 2264d9c74dda1 ("x86/intel_rdt: Build structures for each resource based 
on cache topology")
Cc: 
Cc: Fenghua Yu 
Cc: Reinette Chatre 
Signed-off-by: James Morse 
---
I haven't seen any problems because of this.

Changes since v1: lore.kernel.org/r/20190530161024.85637-1-james.mo...@arm.com
 * CC-list, added fixes/stable tags.

 drivers/base/cacheinfo.c   | 3 ++-
 include/linux/cpuhotplug.h | 1 +
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/base/cacheinfo.c b/drivers/base/cacheinfo.c
index a7359535caf5..b444f89a2041 100644
--- a/drivers/base/cacheinfo.c
+++ b/drivers/base/cacheinfo.c
@@ -655,7 +655,8 @@ static int cacheinfo_cpu_pre_down(unsigned int cpu)
 
 static int __init cacheinfo_sysfs_init(void)
 {
-   return cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "base/cacheinfo:online",
+   return cpuhp_setup_state(CPUHP_AP_BASE_CACHEINFO_ONLINE,
+"base/cacheinfo:online",
 cacheinfo_cpu_online, cacheinfo_cpu_pre_down);
 }
 device_initcall(cacheinfo_sysfs_init);
diff --git a/include/linux/cpuhotplug.h b/include/linux/cpuhotplug.h
index 6a381594608c..50c893f03c21 100644
--- a/include/linux/cpuhotplug.h
+++ b/include/linux/cpuhotplug.h
@@ -175,6 +175,7 @@ enum cpuhp_state {
CPUHP_AP_WATCHDOG_ONLINE,
CPUHP_AP_WORKQUEUE_ONLINE,
CPUHP_AP_RCUTREE_ONLINE,
+   CPUHP_AP_BASE_CACHEINFO_ONLINE,
CPUHP_AP_ONLINE_DYN,
CPUHP_AP_ONLINE_DYN_END = CPUHP_AP_ONLINE_DYN + 30,
CPUHP_AP_X86_HPET_ONLINE,
-- 
2.20.1



Re: [PATCH v5 1/1] EDAC, mellanox: Add ECC support for BlueField DDR4

2019-06-21 Thread James Morse
+ mci = edac_mc_alloc(mc_idx, ARRAY_SIZE(layers), layers, sizeof(*priv));
> + if (!mci)
> + return -ENOMEM;
> +
> + priv = mci->pvt_info;
> +
> + priv->dimm_per_mc = dimm_count;
> + priv->emi_base = devm_ioremap_resource(dev, emi_res);
> + if (IS_ERR(priv->emi_base)) {
> + dev_err(dev, "failed to map EMI IO resource\n");
> + ret = PTR_ERR(priv->emi_base);
> + goto err;
> + }
> +
> + mci->pdev = dev;
> + mci->mtype_cap = MEM_FLAG_DDR4 | MEM_FLAG_RDDR4 |
> +  MEM_FLAG_LRDDR4 | MEM_FLAG_NVDIMM;
> + mci->edac_ctl_cap = EDAC_FLAG_SECDED;
> +
> + mci->mod_name = DRIVER_NAME;
> + mci->ctl_name = "BlueField_Memory_Controller";
> + mci->dev_name = dev_name(dev);
> + mci->edac_check = bluefield_edac_check;
> +
> + /* Initialize mci with the actual populated DIMM information. */
> + bluefield_edac_init_dimms(mci);
> +
> + platform_set_drvdata(pdev, mci);
> +
> + /* Register with EDAC core */
> + rc = edac_mc_add_mc(mci);
> + if (rc) {
> + dev_err(dev, "failed to register with EDAC core\n");
> + ret = rc;
> + goto err;
> + }
> +
> + /* Only POLL mode supported so far. */
> + edac_op_state = EDAC_OPSTATE_POLL;


> + return 0;
> +
> +err:
> + edac_mc_free(mci);
> +
> + return ret;
> +
> +}

With the MLXBF_EDAC_MAX_DIMM_PER_MC check and offset_in_page()/~PAGE_MASK:

Reviewed-by: James Morse 


Thanks,

James


[0]
https://static.docs.arm.com/den0028/b/ARM_DEN0028B_SMC_Calling_Convention.pdf


Re: [PATCH 2/2] edac: add support for Amazon's Annapurna Labs EDAC

2019-06-19 Thread James Morse
Hi Hawa,

On 17/06/2019 14:00, Hawa, Hanna wrote:
>> I don't think it can, on a second reading, it looks to be even more 
>> complicated than I
>> thought! That bit is described as disabling forwarding of uncorrected data, 
>> but it looks
>> like the uncorrected data never actually reaches the other end. (I'm unsure 
>> what 'flush'
>> means in this context.)
>> I was looking for reasons you could 'know' that any reported error was 
>> corrected. This was
>> just a bad suggestion!

> Is there interrupt for un-correctable error?

The answer here is somewhere between 'not really' and 'maybe'.
There is a signal you may have wired-up as an interrupt, but it's not usable
from linux.

A.8.2 "Asynchronous error signals" of the A57 TRM [0] has:
| nINTERRIRQ output Error indicator for an L2 RAM double-bit ECC error.
("7.6 Asynchronous errors" has more on this).

Errors cause L2ECTLR[30] to get set, and this value is output as a signal, which
you may have wired up as an interrupt.

If you did, beware its level sensitive, and can only be cleared by writing to 
L2ECTLR_EL1.
You shouldn't allow linux to access this register as it could mess with the L2
configuration, which could also affect your EL3 and any secure-world software.

The arrival of this interrupt doesn't tell you which L2 tripped the error, and 
you can
only clear it if you write to L2ECTLR_EL1 on a CPU attached to the right L2. So 
this isn't
actually a shared (peripheral) interrupt.

This stuff is expected to be used by firmware, which can know the affinity 
constraints of
signals coming in as interrupts.


> Does 'asynchronous errors' in L2 used to report UE?

From "7.2.4 Error correction code" single-bit errors are always corrected.
A.8.2 quoted above gives the behaviour for double-bit errors.


> In case no interrupt, can we use die-notifier subsystem to check if any error 
> had occur
> while system shutdown?

notify_die() would imply a synchronous exception that killed a thread. SError 
are a whole
lot worse. Before v8.2 these are all treated as 'uncontained': unknown memory 
corruption.
Which in your L2 case is exactly what happened. The arch code will panic().

If your driver can print something useful to help debug the panic(), then a 
panic_notifier
sounds appropriate. But you can't rely on these notifiers being called, as 
kdump has some
hooks that affect if/when they run.
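
(If you go that way, the shape is roughly this; the al_edac_* names are made up
for the example:)

| static int al_edac_panic_notify(struct notifier_block *nb,
| 				unsigned long action, void *data)
| {
| 	/* dump whatever error state the L1/L2 has latched */
| 	return NOTIFY_DONE;
| }
|
| static struct notifier_block al_edac_panic_nb = {
| 	.notifier_call = al_edac_panic_notify,
| };
|
| /* at probe time: */
| 	atomic_notifier_chain_register(&panic_notifier_list, &al_edac_panic_nb);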

(KVM will 'contain' SError that come from a guest to the guest, as we know a 
distinct set
of memory was in use. You may see fatal error counters increasing without the 
system
panic()ing)

contained/uncontained is part of the terminology from the v8.2 RAS spec [1].


Thanks,

James


[0]
http://infocenter.arm.com/help/topic/com.arm.doc.ddi0488c/DDI0488C_cortex_a57_mpcore_r1p0_trm.pdf
[1]
https://static.docs.arm.com/ddi0587/ca/ARM_DDI_0587C_a_RAS.pdf?_ga=2.148234679.1686960568.1560964184-897392434.1556719556


Re: [PATCH 12/21] EDAC, ghes: Add support for legacy API counters

2019-06-19 Thread James Morse
Hi Robert,

On 12/06/2019 19:41, Robert Richter wrote:
> On 29.05.19 16:13:02, James Morse wrote:
>> On 29/05/2019 09:44, Robert Richter wrote:
>>> The ghes driver is not able yet to count legacy API counters in sysfs,
>>> e.g.:
>>>
>>>  /sys/devices/system/edac/mc/mc0/csrow2/ce_count
>>>  /sys/devices/system/edac/mc/mc0/csrow2/ch0_ce_count
>>>  /sys/devices/system/edac/mc/mc0/csrow2/ch1_ce_count
>>>
>>> Make counting csrows/channels generic so that the ghes driver can use
>>> it too.
>>
>> What for?
> 
> With EDAC_LEGACY_SYSFS enabled those counters are exposed to sysfs,
> but the numbers are wrong (all zero).

Excellent, so it's legacy and broken.


>> Is this for an arm64 system? Surely we don't have any systems that used to 
>> work with these
>> legacy counters. Aren't they legacy because we want new software to stop 
>> using them!
> 
> The option is to support legacy userland. If we want to provide a similar
> "user experience" as for x86 the counters should be correct.

The flip-side is arm64 doesn't have the same baggage. These counters have never 
worked
with this driver (even on x86).

This ghes driver also probes on HPE Server platform, so the architecture isn't 
really
relevant. (I was curious why Marvell care).


> Of course it is not a real mapping to csrows, but it makes that i/f
> work.

(...which stinks)


> In any case, this patch cleans up code as old API's counter code is
> isolated and moved to common code. Making the counter's work for ghes
> is actually a side-effect here. The cleanup is a prerequisit for
> follow on patches.

I'm all for removing/warning-its-broken it when ghes_edac is in use. But the 
convincing
argument is debian ships a 'current' version of edac-utils that predates 
199747106934,
(that made all this fake csrow stuff deprecated), and debian's popcon says 
~1000 people
have it installed.


If you want it fixed, please don't do it as a side effect of cleanup. Fixes 
need to be a
small separate series that can be backported. (unless we're confident no-one 
uses it, in
which case, why fix it?)



Thanks,

James


Re: [PATCH v1 2/5] KVM: arm/arm64: Adjust entry/exit and trap related tracepoints

2019-06-17 Thread James Morse
Hi Zenghui,

On 13/06/2019 12:28, Zenghui Yu wrote:
> On 2019/6/12 20:49, James Morse wrote:
>> On 12/06/2019 10:08, Zenghui Yu wrote:
>>> Currently, we use trace_kvm_exit() to report exception type (e.g.,
>>> "IRQ", "TRAP") and exception class (ESR_ELx's bit[31:26]) together.
>>> But hardware only saves the exit class to ESR_ELx on synchronous
>>> exceptions, not on asynchronous exceptions. When the guest exits
>>> due to external interrupts, we will get tracing output like:
>>>
>>> "kvm_exit: IRQ: HSR_EC: 0x (UNKNOWN), PC: 0x87259e30"
>>>
>>> Obviously, "HSR_EC" here is meaningless.

>> I assume we do it this way so there is only one guest-exit tracepoint that 
>> catches all
>> exits.
>> I don't think its a problem if user-space has to know the EC isn't set for 
>> asynchronous
>> exceptions, this is a property of the architecture and anything using these 
>> trace-points
>> is already arch specific.

> Actually, *no* problem in current implementation, and I'm OK to still
> keep the EC in trace_kvm_exit().  What I really want to do is adding the
> EC in trace_trap_enter (the new tracepoint), will explain it later.


>>> This patch splits "exit" and "trap" events by adding two tracepoints
>>> explicitly in handle_trap_exceptions(). Let trace_kvm_exit() report VM
>>> exit events, and trace_kvm_trap_exit() report VM trap events.
>>>
>>> These tracepoints are adjusted also in preparation for supporting
>>> 'perf kvm stat' on arm64.
>>
>> Because the existing tracepoints are ABI, I don't think we can change them.
>>
>> We can add new ones if there is something that a user reasonably needs to 
>> trace, and can't
>> be done any other way.
>>
>> What can't 'perf kvm stat' do with the existing trace points?

> First, how does 'perf kvm stat' interact with tracepoints?

Start at the beginning, good idea. (I've never used this thing!)


> We have three handlers for a specific event (e.g., "VM-EXIT") --
> "is_begin_event", "is_end_event", "decode_key". The first two handlers
> make use of two existing tracepoints ("kvm:kvm_exit" & "kvm:kvm_entry")
> to check when the VM-EXIT events started/ended, thus the time difference
> stats, event start/end time etc. can be calculated.

> "is_begin_event" handler gets a *key* from the "ret" field (exit_code)
> of "kvm:kvm_exit" payload, and "decode_key" handler makes use of the
> *key* to find out the reason for the VM-EXIT event. Of course we should
> maintain the mapping between exit_code and exit_reason in userspace.

Interpreting 'ret' is going to get tricky if we change those values on a whim.
It's internal to the KVM arch code.


> These are all what *patch #4* had done, #4 is a simple patch to review!

> Oh, we can also set "vcpu_id_str" to achieve per vcpu event record, but
> currently, we only have the "vcpu_pc" field in "kvm:kvm_entry", without
> something like "vcpu_id".

Heh, so from the trace-point data, you can't know which one is vcpu-0 and which 
is vcpu-1.


> OK, next comes the more important question - what should/can we do to
> the tracepoints in preparation of 'perf kvm stat' on arm64?
> 
> From the article you've provided, it's clear that we can't remove the EC
> from trace_kvm_exit(). But can we add something like "vcpu_id" into
> (at least) trace_kvm_entry(), just like what this patch has done?

Adding something is still likely to break a badly written user-space that is 
trying to
parse the trace information. A regex picking out the last argument will now get 
a
different value.


> If not, which means we have to keep the existing tracepoints totally
> unchanged, then 'perf kvm stat' will have no way to record/report per
> vcpu VM-EXIT events (other arch like X86, powerpc, s390 etc. have this
> capability, if I understand it correctly).

Well, you get the events, but you don't know which vCPU is which. You can map 
this back to
the pid of the host thread assuming user-space isn't moving vcpu between host 
threads.

If we're really stuck: Adding tracepoints to KVM-core's vcpu get/put, that 
export the
vcpu_id would let you map pid->vcpu_id, which you can then use for the batch of 
enter/exit
events that come before a final vcpu put.
grepping "vcpu_id" shows perf has a mapping for which arch-specific argument in 
enter/exit
is the vcpu-id. Done with this core-code mapping, you could drop that code...

But I'd be a little nervous adding a new trace-point to work around an ABI 
problem, as we
may have jus

Re: [PATCH v3 2/2] EDAC: add EDAC driver for DMC520

2019-06-14 Thread James Morse
Hi Lei,

On 13/06/2019 19:31, Lei Wang (BSP) wrote:
> Please see inline(tagged with [Lei]) below. Thanks!

Please don't do this. Top posting is discouraged[0]. Creating a reply treasure 
hunt is
even worse!

You probably need to change your mail client as your mail is also arriving 
base64 encoded.
To the maintainer's scripts it's going to look like this:
https://lore.kernel.org/lkml/byapr21mb131946e0b469e74d6054c33390...@byapr21mb1319.namprd21.prod.outlook.com/raw


> -Original Message-
> From: James Morse  
> On 16/05/2019 03:55, Lei Wang wrote:
>> New driver supports error detection and correction on the devices with ARM 
>> DMC-520 memory controller.
> 
>> diff --git a/MAINTAINERS b/MAINTAINERS index 7d1246b..23894ac 100644
>> --- a/MAINTAINERS
>> +++ b/MAINTAINERS
>> @@ -5573,6 +5573,12 @@ F:Documentation/driver-api/edac.rst
>>  F:  drivers/edac/
>>  F:  include/linux/edac.h
>>  
>> +EDAC-DMC520
>> +M:  Rui Zhao 
>> +L:  linux-e...@vger.kernel.org
>> +S:  Supported
>> +F:  drivers/edac/dmc520_edac.c
> 
> Hmm, you're listing someone else as maintainer of this driver.
> I think we'd need to see an Ack from Rui Zhao...
> 
> This patch was previously posted by Rui Zhao, this version has your changes 
> and you as author. (But how you arrange the attribution is up to the two of 
> you...)
> [Lei] And Rui and I sync-ed up on this code and this patch was after 
> addressing his feedbacks and his Ack.

And Rui commented that you should be listed as maintainer:
https://lore.kernel.org/lkml/cy4pr21mb0279bb0e40b86cea485cf19ab3...@cy4pr21mb0279.namprd21.prod.outlook.com/T/#u

Please don't add someone else!


>> diff --git a/drivers/edac/dmc520_edac.c b/drivers/edac/dmc520_edac.c 
>> new file mode 100644 index 000..c81bfcc
>> --- /dev/null
>> +++ b/drivers/edac/dmc520_edac.c

>> +static void dmc520_handle_dram_ecc_errors(struct mem_ctl_info *mci,
>> +  bool is_ce)
>> +{
>> +struct ecc_error_info info;
>> +struct dmc520_edac *edac;
>> +u32 cnt;
>> +char message[EDAC_MSG_BUF_SIZE];
>> +
>> +edac = mci->pvt_info;
>> +dmc520_get_dram_ecc_error_info(edac, is_ce, &info);
>> +
>> +cnt = dmc520_get_dram_ecc_error_count(edac, is_ce);
>> +
>> +if (cnt > 0) {
>> +snprintf(message, ARRAY_SIZE(message),
>> + "rank:%d bank:%d row:%d col:%d",
>> + info.rank, info.bank,
>> + info.row, info.col);
>> +
>> +edac_mc_handle_error((is_ce ? HW_EVENT_ERR_CORRECTED :
>> + HW_EVENT_ERR_UNCORRECTED),
>> + mci, cnt, 0, 0, 0, info.rank, -1, -1,
>> + message, "");
> 
> Because you have multiple interrupts, you can be calling 
> edac_mc_handle_error() in
> parallel on different CPUs, for the same mci.
> 
> edac_mc_handle_error() packs all these arguments into mci->error_desc, so two 
> CPUs will
> stomp over each other's values.
> 
> Please add a spinlock in 'struct dmc520_edac' to prevent this.
> 
> [Lei] This round of patch moved away from using mci->error_desc, and the 
> message is on stack now. 

It's not just the message buffer this is a problem for. You call 
edac_mc_handle_error(),
you can call it with the same mci on two different CPUs, at the same time.

> edac_mc_handle_error() packs all these arguments into mci->error_desc, so two 
> CPUs will
> stomp over each other's values.

From drivers/edac/edac_mc.c::edac_mc_handle_error():
|   struct edac_raw_error_desc *e = &mci->error_desc;
[...]
|   e->error_count = error_count;
|   e->top_layer = top_layer;
|   e->mid_layer = mid_layer;
|   e->low_layer = low_layer;
|   e->page_frame_number = page_frame_number;
|   e->offset_in_page = offset_in_page;
|   e->syndrome = syndrome;
|   e->msg = msg;
|   e->other_detail = other_detail;

If this happens on two CPUs at the same time, e->msg could point to the other
CPU's stack.


Thanks,

James

[0] The best summary I found after a quick search is:
https://kernelnewbies.org/PatchCulture 's 'Email etiquette.'


Re: [PATCH 2/2] edac: add support for Amazon's Annapurna Labs EDAC

2019-06-14 Thread James Morse
Hi Hawa,

On 13/06/2019 18:05, James Morse wrote:
> On 11/06/2019 20:56, Hawa, Hanna wrote:
>> James Morse wrote:
>>> Hawa, Hanna wrote:
>>>> +    if (cluster != last_cluster) {
>>>> +    smp_call_function_single(cpu, al_a57_edac_l2merrsr,
>>>> + edac_dev, 0);
>>>> +    last_cluster = cluster;
>>>> +    }
>>> Here you depend on the CPUs being listed in cluster-order in the DT. I'm 
>>> fairly sure the
>>> numbering is arbitrary: On my Juno 0,3,4,5 are the A53 cluster, and 1,2 are 
>>> the A57
>>> cluster.
>>>
>>> If 1,3,5 were cluster-a and 2,4,6 were cluster-b, you would end up calling
>>> al_a57_edac_l2merrsr() for each cpu. As you don't wait, they could race.
>>>
>>> If you can get a cpu-mask for each cluster, smp_call_function_any() would 
>>> to the
>>> pick-one-online-cpu work for you.

>> Again, I rely on that it's alpine SoC specific driver.

An example of where this goes wrong is kexec:
If you offline CPU0, then kexec, the new kernel will start up on the lowest 
numbered
online CPU, which won't be zero. But the new kernel will call it CPU0.

Kdump is even better, as it starts up on whichever CPU called panic(), and 
calls it CPU0.


Thanks,

James


>> How can I get cpu-mask for each cluster? from DT?

> Its not cluster you want, its the L2. Cacheinfo has this for online CPUs, and 
> you're
> already holding the cpus_read_lock().
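
(A rough sketch of that cacheinfo route, assuming the last leaf in info_list is the
unified L2; worth checking against the Alpine topology:)

| 	struct cpu_cacheinfo *cci = get_cpu_cacheinfo(cpu);
| 	struct cacheinfo *l2 = &cci->info_list[cci->num_leaves - 1];
|
| 	/* run the L2MERRSR read on any online CPU sharing this L2 */
| 	smp_call_function_any(&l2->shared_cpu_map, al_a57_edac_l2merrsr,
| 			      edac_dev, 1);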



Re: [PATCH 2/2] edac: add support for Amazon's Annapurna Labs EDAC

2019-06-13 Thread James Morse
Hi Hawa,

On 11/06/2019 20:56, Hawa, Hanna wrote:
> James Morse wrote:
>> Hawa, Hanna wrote:
>>> +    edac_device_handle_ce(edac_dev, 0, 0, "L2 Error");
>>
>> How do we know this was corrected?
>>
>> 6.4.8 "Error Correction Code" has "Double-bit ECC errors set the fatal bit." 
>> in a
>> paragraph talking about the L1 memory system.

> I'll check fatal field to check if it's uncorrected/corrected.


>>> +    edac_printk(KERN_CRIT, DRV_NAME, "CPU%d L1 %serror detected\n",
>>> +    cpu, (fatal) ? "Fatal " : "");
>>> +    edac_printk(KERN_CRIT, DRV_NAME, "RAMID=");
>>> +
>>> +    switch (ramid) {
>>> +    case ARM_CA57_L1_I_TAG_RAM:
>>> +    pr_cont("'L1-I Tag RAM' index=%d way=%d", index, way);
>>> +    break;
>>> +    case ARM_CA57_L1_I_DATA_RAM:
>>> +    pr_cont("'L1-I Data RAM' index=%d bank= %d", index, way);
>>> +    break;
>>
>> Is index/way information really useful? I can't replace way-3 on the system, 
>> nor can I
>> stop it being used. If its useless, I'd rather we don't bother parsing and 
>> printing it out.

> I'll remove the index/way information, and keep CPUMERRSR_EL1 value print 
> (who want this
> information can parse the value and get the index/bank status)

Good idea, just print it raw.


>>> +    pr_cont(", repeat=%d, other=%d (CPUMERRSR_EL1=0x%llx)\n", repeat, 
>>> other,
>>> +    val);
>>
>> 'other' here is another error, but we don't know the ramid.
>> 'repeat' is another error for the same ramid.
>>
>> Could we still feed this stuff into edac? This would make the counters 
>> accurate if the
>> polling frequency isn't quite fast enough.

> There is no API in EDAC to increase the counters, I can increase by accessing 
> the
> ce_count/ue_count from edac_device_ctl_info/edac_device_instance structures 
> if it's okay..

Ah, sorry, I was thinking of the edac_mc_handle_error(), which has an 
error_count parameter.

(I wouldn't go poking in the structures directly).


>>> +static void al_a57_edac_l2merrsr(void *arg)
>>> +{
>>
>>> +    edac_device_handle_ce(edac_dev, 0, 0, "L2 Error");
>>
>> How do we know this is corrected?

>> If looks like L2CTLR_EL1[20] might force fatal 1/0 to map to 
>> uncorrected/corrected. Is
>> this what you are depending on here?

> No - not on this. Reporting all the errors as corrected seems to be bad.
> 
> Can i be depends on fatal field?

That is described as "set to 1 on the first memory error that caused a Data 
Abort". I
assume this is one of the parity-error external-aborts.

If the repeat counter shows, say, 2, and fatal is set, you only know that at 
least one of
these errors caused an abort. But it could have been all three. The repeat 
counter only
matches against the RAMID and friends, otherwise the error is counted in 
'other'.

I don't think there is a right thing to do here, (other than increase the 
scrubbing
frequency). As you can only feed one error into edac at a time then:

> if (fatal)
> edac_device_handle_ue(edac_dev, 0, 0, "L2 Error");
> else
> edac_device_handle_ce(edac_dev, 0, 0, "L2 Error");

seems reasonable. You're reporting the most severe, and 'other/repeat' counter 
values just
go missing.


> How can L2CTLR_EL1[20] force fatal?

I don't think it can, on a second reading, it looks to be even more complicated 
than I
thought! That bit is described as disabling forwarding of uncorrected data, but 
it looks
like the uncorrected data never actually reaches the other end. (I'm unsure 
what 'flush'
means in this context.)
I was looking for reasons you could 'know' that any reported error was 
corrected. This was
just a bad suggestion!


>> (it would be good to have a list of integration-time and firmware 
>> dependencies this driver
>> has, for the next person who tries to enable it on their system and 
>> complains it doesn't
>> work for them)

> Where can i add such information?

The mailing list archive is good enough. I'll ask about any obvious dependency 
I spot from
the TRM, (e.g. that list at the top of my first reply). If you know of anything 
weird
please call it out.


>>> +    pr_cont(", cpuid/way=%d, repeat=%d, other=%d (L2MERRSR_EL1=0x%llx)\n",
>>> +    way, repeat, other, val);
>>
>> cpuid could be useful if you can map it back to the cpu number linux has.
>> If you can spot that cpu-7 is experiencing more errors than it should, you 
>> can leave it
>> offline.
>>
>> To do this you'd 

Re: [PATCH 0/4] support reserving crashkernel above 4G on arm64 kdump

2019-06-13 Thread James Morse
Hi Chen Zhou,

On 13/06/2019 12:27, Chen Zhou wrote:
> On 2019/6/6 0:32, James Morse wrote:
>> On 07/05/2019 04:50, Chen Zhou wrote:
>>> We use crashkernel=X to reserve crashkernel below 4G, which will fail
>>> when there is no enough memory. Currently, crashkernel=Y@X can be used
>>> to reserve crashkernel above 4G, in this case, if swiotlb or DMA buffers
>>> are requierd, capture kernel will boot failure because of no low memory.
>>
>>> When crashkernel is reserved above 4G in memory, kernel should reserve
>>> some amount of low memory for swiotlb and some DMA buffers. So there may
>>> be two crash kernel regions, one is below 4G, the other is above 4G.
>>
>> This is a good argument for supporting the 'crashkernel=...,low' version.
>> What is the 'crashkernel=...,high' version for?
>>
>> Wouldn't it be simpler to relax the ARCH_LOW_ADDRESS_LIMIT if we see 
>> 'crashkernel=...,low'
>> in the kernel cmdline?
>>
>> I don't see what the 'crashkernel=...,high' variant is giving us, it just 
>> complicates the
>> flow of reserve_crashkernel().
>>
>> If we called reserve_crashkernel_low() at the beginning of 
>> reserve_crashkernel() we could
>> use crashk_low_res.end to change some limit variable from 
>> ARCH_LOW_ADDRESS_LIMIT to
>> memblock_end_of_DRAM().
>> I think this is a simpler change that gives you what you want.
> 
> According to your suggestions, we should do like this:
> 1. call reserve_crashkernel_low() at the beginning of reserve_crashkernel()
> 2. mark the low region as 'nomap'
> 3. use crashk_low_res.end to change some limit variable from 
> ARCH_LOW_ADDRESS_LIMIT to
> memblock_end_of_DRAM()
> 4. rename crashk_low_res as "Crash kernel (low)" for arm64

> 5. add an 'linux,low-memory-range' node in DT

(This bit would happen in kexec-tools)


> Do i understand correctly?

Yes, I think this is simpler and still gives you what you want.
It also leaves the existing behaviour unchanged, which helps with keeping 
compatibility
with existing user-space and older kdump kernels.
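
(i.e. something of this shape in reserve_crashkernel(); only a sketch of the flow,
borrowing the reserve_crashkernel_low()/crashk_low_res names from the x86 version:)

| 	phys_addr_t crash_max = ARCH_LOW_ADDRESS_LIMIT;
|
| 	/* reserve the low region first if "crashkernel=...,low" was given */
| 	if (reserve_crashkernel_low() == 0 && crashk_low_res.end)
| 		crash_max = memblock_end_of_DRAM();
|
| 	crash_base = memblock_find_in_range(0, crash_max, crash_size, SZ_2M);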


Thanks,

James


[tip:x86/urgent] x86/resctrl: Don't stop walking closids when a locksetup group is found

2019-06-12 Thread tip-bot for James Morse
Commit-ID:  87d3aa28f345bea77c396855fa5d5fec4c24461f
Gitweb: https://git.kernel.org/tip/87d3aa28f345bea77c396855fa5d5fec4c24461f
Author: James Morse 
AuthorDate: Mon, 3 Jun 2019 18:25:31 +0100
Committer:  Thomas Gleixner 
CommitDate: Wed, 12 Jun 2019 10:31:50 +0200

x86/resctrl: Don't stop walking closids when a locksetup group is found

When a new control group is created __init_one_rdt_domain() walks all
the other closids to calculate the sets of used and unused bits.

If it discovers a pseudo_locksetup group, it breaks out of the loop.  This
means any later closid doesn't get its used bits added to used_b.  These
bits will then get set in unused_b, and added to the new control group's
configuration, even if they were marked as exclusive for a later closid.

When encountering a pseudo_locksetup group, we should continue. This is
because "a resource group enters 'pseudo-locked' mode after the schemata is
written while the resource group is in 'pseudo-locksetup' mode." When we
find a pseudo_locksetup group, its configuration is expected to be
overwritten, we can skip it.

Fixes: dfe9674b04ff6 ("x86/intel_rdt: Enable entering of pseudo-locksetup mode")
Signed-off-by: James Morse 
Signed-off-by: Thomas Gleixner 
Acked-by: Reinette Chatre 
Cc: Fenghua Yu 
Cc: Borislav Petkov 
Cc: H Peter Avin 
Cc: 
Link: https://lkml.kernel.org/r/20190603172531.178830-1-james.mo...@arm.com

---
 arch/x86/kernel/cpu/resctrl/rdtgroup.c | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c 
b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 333c177a2471..869cbef5da81 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -2542,7 +2542,12 @@ static int __init_one_rdt_domain(struct rdt_domain *d, 
struct rdt_resource *r,
if (closid_allocated(i) && i != closid) {
mode = rdtgroup_mode_by_closid(i);
if (mode == RDT_MODE_PSEUDO_LOCKSETUP)
-   break;
+   /*
+* ctrl values for locksetup aren't relevant
+* until the schemata is written, and the mode
+* becomes RDT_MODE_PSEUDO_LOCKED.
+*/
+   continue;
/*
 * If CDP is active include peer domain's
 * usage to ensure there is no overlap


Re: [PATCH 2/2] edac: add support for Amazon's Annapurna Labs EDAC

2019-06-07 Thread James Morse
Hi guys,

On 06/06/2019 12:37, Shenhar, Talel wrote:
>>> Disagree. The various drivers don't depend on each other.
>>> I think we should keep the drivers separated as they are distinct and 
>>> independent IP
>>> blocks.
>> But they don't exist in isolation, they both depend on the 
>> integration-choices/firmware
>> that makes up your platform.
>>
>> Other platforms may have exactly the same IP blocks, configured differently, 
>> or with
>> different features enabled in firmware. This means we can't just probe the 
>> driver based on
>> the presence of the IP block, we need to know the integration choices and 
>> firmware
>> settings match what the driver requires.
>>
>> (Case in point, that A57 ECC support is optional, another A57 may not have 
>> it)
>>
>> Descriptions of what firmware did don't really belong in the DT. Its not a 
>> hardware
>> property.
>>
>> This is why its better to probe this stuff based on the 
>> machine-compatible/platform-name,
>> not the presence of the IP block in the DT.
>>
>>
>> Will either of your separate drivers ever run alone? If they're probed from 
>> the same
>> machine-compatible this won't happen.
>>
>>
>> How does your memory controller report errors? Does it send back some data 
>> with an invalid
>> checksum, or a specific poison/invalid flag? Will the cache report this as a 
>> cache error
>> too, if its an extra signal, does the cache know what it is?
>>
>> All these are integration choices between the two IP blocks, done as 
>> separate drivers we
>> don't have anywhere to store that information. Even if you don't care about 
>> this, making
>> them separate drivers should only be done to make them usable on other 
>> platforms, where
>> these choices may have been different.

> From our perspective, l1/l2 has nothing to do with the ddr memory controller.

I understand you're coming from the position that these things have counters, 
you want
something to read and export them.

I'm coming at this from somewhere else. This stuff has to be considered all the 
way
through the system. Just because each component supports error detection, 
doesn't mean you
aren't going to get silent corruption. Likewise if another platform picks up 
two piecemeal
edac drivers for hardware it happens to have in common with yours, it doesn't 
mean we're
counting all the errors. This stuff has to be viewed for the whole platform.


> Its right that they both use same edac subsystem but they are using totally 
> different APIs
> of it.
> 
> We also even want to have separate control for enabling/disabling l1/l2 edac 
> vs memory
> controller edac.

Curious, what for? Surely you either care about counting errors, or you don't.


> Even from technical point-of-view L1/L2 UE collection method is totally 
> different from
> collecting memory-controller UE. (CPU exception vs actual interrupts).
> 
> So there is less reason why to combine them vs giving each one its own file, 
> e.g.
> al_mc_edac, al_l1_l2_edac (I even don't see why Hanna combined l1 and l2...)

> As we don't have any technical relation between the two we would rather avoid 
> this
> combination.
> 
> Also, Lets assume we have different setups with different memory controllers, 
> having a dt
> binding to control the difference is super easy and flexible.

If the hardware is different you should describe this in the DT. I'm not 
suggesting you
don't describe it.

The discussion here is whether we should probe the driver based on a dummy-node
compatible, (which this 'edac_l1_l2' is) or based on the machine compatible.

At the extreme end: you should paint the CPU and cache nodes with a compatible 
describing
your integration. (I've mangled Juno's DT here:)
| A57_0: cpu@0 {
|   compatible = "amazon-al,cortex-a57", "arm,cortex-a57";
|   reg = <0x0 0x0>;
|   device_type = "cpu";
|   next-level-cache = <&A57_L2>;
| };
|
[...]
|
| A57_L2: l2-cache0 {
|   compatible = "amazon-al,cache", "cache";
|   cpu_map = 
| };


This is the most accurate way to describe what you have here. The driver can 
use this to
know that this integration of CPU and Cache support the edac registers. (This 
doesn't tell
us anything about whether firmware enabled this stuff, or made/left it all 
secure-only)

But this doesn't give you a device you can bind a driver to, to kick this stuff 
off.
This (I assume) is why you added a dummy 'edac_l1_l2' node, that just probes 
the driver.
The hardware is to do with the CPU and caches, 'edac_l1'_l2' doesn't correspond 
to any
distinct part of the soc.

The request is to use the machine compatible, not a dummy node. This wraps up 
the firmware
properties too, and any other platform property we don't know about today.

Once you have this, you don't really need the cpu/cache integration 
annotations, and your
future memory-controller support can be picked up as part of the platform 
driver.
If you have otherwise identical platforms with different memory controllers, OF 
gives you
the API to match 

Re: [PATCH v3 1/3] arm64, vmcoreinfo : Append 'PTRS_PER_PGD' to vmcoreinfo

2019-06-07 Thread James Morse
Hi Bhupesh,

(sorry for the delay on this)

On 04/05/2019 13:53, Bhupesh Sharma wrote:
> On 04/03/2019 11:24 PM, Bhupesh Sharma wrote:
>> On 04/02/2019 10:56 PM, James Morse wrote:
>>> Yes the kernel code is going to move around, this is why the information we 
>>> expose via
>>> vmcoreinfo needs to be thought through: something we would always need, 
>>> regardless of how
>>> the kernel implements it.
>>>

>>> Pointer-auth changes all this again, as we may prefer to use the bits for 
>>> pointer-auth in
>>> one TTB or the other. PTRS_PER_PGD may show the 52bit value in this case, 
>>> but neither TTBR
>>> is mapping 52bits of VA.
>>>
>>>
>>>> So far, I have generally come across discussions where the following 
>>>> variations of the
>>>> address spaces have been proposed/requested:
>>>> - 48bit kernel VA + 48-bit User VA,
>>>> - 48-bit kernel VA + 52-bit User VA,
>>>
>>> + 52bit kernel, because there is excessive quantities of memory, and the 
>>> kernel maps it
>>> all, but 48-bit user, because it never maps all the memory, and we prefer 
>>> the bits for
>>> pointer-auth.
>>>
>>>> - 52-bit kernel VA + 52-bit User VA.
>>>
>>> And...  all four may happen with the same built image. I don't see how you 
>>> can tell these
>>> cases apart with the one (build-time-constant!) PTRS_PER_PGD value.
>>>
>>> I'm sure some of these cases are hypothetical, but by considering it all 
>>> now, we can avoid
>>> three more kernel:vmcoreinfo updates, and three more 
>>> fix-user-space-to-use-the-new-value.
>>
>> Agree.
>>
>>> I think you probably do need PTRS_PER_PGD, as this is the one value the mm 
>>> is using to
>>> generate page tables. I'm pretty sure you also need T0SZ and T1SZ to know 
>>> if that's
>>> actually in use, or the kernel is bodging round it with an offset.
>>
>> Sure, I am open to suggestions (as I realize that we need an additional 
>> VA_BITS_ACTUAL
>> variable export'ed for 52-bit kernel VA changes).

(stepping back a bit:)

I'm against exposing arch-specific #ifdefs that correspond to how we've 
configured the
arch code's interactions with mm. These are all moving targets, we can't have 
any of it
become ABI.

I have a straw-man for this: What is the value of PTE_FILE_MAX_BITS on your 
system?
I have no idea what this value is or means, an afternoon's archaeology would be 
needed(!).
This is something that made sense for one kernel version, a better idea came 
along, and it
was replaced. If we'd exposed this to user-space, we'd have to generate a 
value, even if
it doesn't mean anything. Exposing VA_BITS_ACTUAL is the same.

(Keep an eye out for when we change the kernel memory map, and any 
second-guessing based
on VA_BITS turns out to be wrong)


What we do have are the hardware properties. The kernel can't change these.


>> Also how do we standardize reading T0SZ and T1SZ in user-space. Do you 
>> propose I make an
>> enhancement in the cpu-feature-registers interface (see [1]) or the HWCAPS 
>> interface
>> (see [2]) for the same?

cpufeature won't help you if you've already panic()d and only have the vmcore 
file. This
stuff needs to go in vmcoreinfo.

As long as there is a description of how userspace uses these values, I think 
adding
key/values for TCR_EL1.TxSZ to the vmcoreinfo is a sensible way out of this. 
You probably
need TTBR1_EL1.BADDR too. (it should be specific fields, to prevent 'new uses' 
becoming ABI)

This tells you how the hardware was configured, and covers any combination of 
TxSZ tricks
we play, and whether those address bits are used for VA, or ptrauth for TTBR0
or TTBR1.
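
(e.g. a sketch of what that could look like in arch_crash_save_vmcoreinfo(); the
field names are only illustrative, user-space would need to agree on them:)

| void arch_crash_save_vmcoreinfo(void)
| {
| 	u64 tcr = read_sysreg(tcr_el1);
|
| 	/* T0SZ is TCR_EL1[5:0], T1SZ is TCR_EL1[21:16] */
| 	vmcoreinfo_append_str("NUMBER(TCR_EL1_T0SZ)=%llu\n", tcr & 0x3f);
| 	vmcoreinfo_append_str("NUMBER(TCR_EL1_T1SZ)=%llu\n", (tcr >> 16) & 0x3f);
| 	vmcoreinfo_append_str("NUMBER(TTBR1_EL1_BADDR)=0x%llx\n",
| 			      read_sysreg(ttbr1_el1) & GENMASK_ULL(47, 1));
| }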


> Any comments on the above points? At the moment we have to carry these fixes 
> in the
> distribution kernels and I would like to have these fixed in upstream kernel 
> itself.


Thanks,

James


Re: [PATCH v2] x86/resctrl: Don't stop walking closids when a locksetup group is found

2019-06-07 Thread James Morse
Hi,

On 06/06/2019 14:08, Sasha Levin wrote:
> This commit has been processed because it contains a "Fixes:" tag,
> fixing commit: dfe9674b04ff x86/intel_rdt: Enable entering of 
> pseudo-locksetup mode.
> 
> The bot has tested the following trees: v5.1.7, v5.0.21, v4.19.48.

> v5.1.7: Failed to apply! Possible dependencies:
> 7390619ab9ea ("x86/resctrl: Move per RDT domain initialization to a 
> separate function")
> 
> v5.0.21: Failed to apply! Possible dependencies:
> 7390619ab9ea ("x86/resctrl: Move per RDT domain initialization to a 
> separate function")

That's cleanup, I don't think you want it for stable. I'll do a backport.


> v4.19.48: Failed to apply! Possible dependencies:
> 2a7adf6ce643 ("x86/intel_rdt: Fix initial allocation to consider CDP")

This one changed an adjacent line.


> 723f1a0dd8e2 ("x86/resctrl: Fixup the user-visible strings")
> 7390619ab9ea ("x86/resctrl: Move per RDT domain initialization to a 
> separate function")

> How should we proceed with this patch?

I'll come up with backports for v5.1.x/v5.0.x and v4.19.x once this reaches 
mainline.


Thanks,

James


Re: [PATCH v3 1/8] arm64: Do not enable IRQs for ct_user_exit

2019-06-07 Thread James Morse
Hi Julien,

On 06/06/2019 10:31, Julien Thierry wrote:
> For el0_dbg and el0_error, DAIF bits get explicitly cleared before
> calling ct_user_exit.
> 
> When context tracking is disabled, DAIF gets set (almost) immediately
> after. When context tracking is enabled, among the first things done
> is disabling IRQs.
> 
> What is actually needed is:
> - PSR.D = 0 so the system can be debugged (should be already the case)
> - PSR.A = 0 so async error can be handled during context tracking
> 
> Do not clear PSR.I in those two locations.

(last time I looked at this I wrongly assumed ct_user_exit() should be run with 
interrupts
masked, but that isn't what you're saying).

Reviewed-by: James Morse 


Thanks,

James


Re: [PATCH 5.2 v2 2/2] dt-binding: edac: add NPCM ECC documentation

2019-06-06 Thread James Morse
Hi George,

On 05/06/2019 15:12, George Hung wrote:
> Add device tree documentation for Nuvoton BMC ECC

(Nit: The DT folk prefer patches adding bindings to come first in the series, 
before the
driver that uses them).


> diff --git a/Documentation/devicetree/bindings/edac/npcm7xx-sdram-edac.txt 
> b/Documentation/devicetree/bindings/edac/npcm7xx-sdram-edac.txt
> new file mode 100644
> index ..dd4dac59a5bd
> --- /dev/null
> +++ b/Documentation/devicetree/bindings/edac/npcm7xx-sdram-edac.txt
> @@ -0,0 +1,17 @@
> +Nuvoton NPCM7xx SoC EDAC device driver
> +
> +The Nuvoton NPCM7xx SoC supports DDR4 memory with/without ECC and the driver
> +uses the EDAC framework to implement the ECC detection and corrtection.

The commit message in the driver says this is a Cadence memory controller, can 
we describe
what it is here, and give it an additional compatible?


Thanks,

James

> +Required properties:
> +- compatible:should be "nuvoton,npcm7xx-sdram-edac"
> +- reg:   Memory controller register set should be <0xf0824000 
> 0x1000>
> +- interrupts:should be MC interrupt #25
> +Example:
> +
> + mc: memory-controller@f0824000 {
> + compatible = "nuvoton,npcm7xx-sdram-edac";
> + reg = <0xf0824000 0x1000>;
> + interrupts = <0 25 4>;
> + };
> 



Re: [PATCH 5.2 v2 1/2] edac: npcm: Add Nuvoton NPCM7xx EDAC driver

2019-06-06 Thread James Morse
Hi George,

On 05/06/2019 15:12, George Hung wrote:
> Add support for the Nuvoton NPCM7xx SoC EDAC driver

Could you say something about what the NPCM7xx SoC is, and what errors its 
memory
controller can detect?

The commit message is for describing what/why this code was added.

Is this Cadence DDR Manual available anywhere?


> NPCM7xx ECC datasheet from nuvoton.israel-Poleg:
> "Cadence DDR Controller User’s Manual For DDR3 & DDR4 Memories"
> 
> Tested: Forcing an ECC error event
> 
> Write a value to the xor_check_bits parameter that will trigger
> an ECC event once that word is read
> 
> For example, to force a single-bit correctable error on bit 0 of
> the user-word space shown, write 0x75 into that byte of the
> xor_check_bits parameter and then assert fwc (force write check)
> bit to 'b1' (mem base: 0xf0824000, xor_check_bits reg addr: 0x178)
> 
> $ devmem 0xf0824178 32 0x7501
> 
> To force a double-bit un-correctable error for the user-word space,
> write 0x03 into that byte of the xor_check_bits parameter
> 
> $ devmem 0xf0824178 32 0x301

This is great to know, but it doesn't belong in the commit message. Please move 
it to the
cover letter ... Please add a cover letter to any series with multiple patches!


> diff --git a/drivers/edac/npcm7xx_edac.c b/drivers/edac/npcm7xx_edac.c
> new file mode 100644
> index ..2d2deb81e49c
> --- /dev/null
> +++ b/drivers/edac/npcm7xx_edac.c
> @@ -0,0 +1,424 @@

> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +#include "edac_module.h"
> +
> +#define ECC_ENABLE BIT(24)
> +#define ECC_EN_INT_MASK 0x7f87
> +
> +#define INT_STATUS_ADDR 116
> +#define INT_ACK_ADDR   117
> +#define INT_MASK_ADDR  118

> +#define ECC_EN_ADDR 93
> +#define ECC_C_ADDR_ADDR 98
> +#define ECC_C_DATA_ADDR 100
> +#define ECC_C_ID_ADDR  101
> +#define ECC_C_SYND_ADDR 99
> +#define ECC_U_ADDR_ADDR 95
> +#define ECC_U_DATA_ADDR 97
> +#define ECC_U_ID_ADDR  101
> +#define ECC_U_SYND_ADDR 96

All the users of these multiply them by 4 first.
If these are numbers that are easy to check in the datasheet, then we should 
keep them
visible. But it would be good to wrap the '*4' up in a macro that describes 
what is going
on. Something like:

| #define SLOT(x)   (4*x)
| #define ECC_U_SYND_ADDR   SLOT(96)

This way some future developer can't accidentally use the bare ECC_U_SYND_ADDR 
as an
address/offset.


> +#define ECC_ERROR  -1
> +#define EDAC_MSG_SIZE  256
> +#define EDAC_MOD_NAME  "npcm7xx-edac"
> +
> +struct ecc_error_signature_info {
> + u32 ecc_addr;
> + u32 ecc_data;
> + u32 ecc_id;
> + u32 ecc_synd;
> +};
> +
> +struct npcm7xx_ecc_int_status {

> + u32 int_mask;
> + u32 int_status;
> + u32 int_ack;

Nothing uses these three.


> + u32 ce_cnt;
> + u32 ue_cnt;
> + struct ecc_error_signature_info ceinfo;
> + struct ecc_error_signature_info ueinfo;
> +};
> +
> +struct npcm7xx_edac_priv {
> + void __iomem *baseaddr;
> + char message[EDAC_MSG_SIZE];
> + struct npcm7xx_ecc_int_status stat;
> +};
> +
> +/**
> + * npcm7xx_edac_get_ecc_syndrom - Get the current ecc error info
> + * @base:Pointer to the base address of the ddr memory controller
> + * @p:   Pointer to the Nuvoton ecc status structure
> + *
> + * Determines there is any ecc error or not
> + *
> + * Return: ECC detection status
> + */
> +static int npcm7xx_edac_get_ecc_syndrom(void __iomem *base,
> + struct npcm7xx_ecc_int_status *p)
> +{
> + int status = 0;
> + u32 int_status = 0;

I've muddled these up already! Can status be called ret/rv/edac_error?
Or int_status renamed irq_status...


> + int_status = readl(base + 4*INT_STATUS_ADDR);

> + writel(int_status, base + 4*INT_ACK_ADDR);

Should you read the other addr/data/synd values before writing to this register?
Could the hardware clear those values before we get to read them? or overwrite 
them if a
new error is detected?


> + edac_dbg(3, "int_status: %#08x\n", int_status);
> +

> + if ((int_status & (1 << 6)) == (1 << 6)) {

| #define NPCM7XX_STATUS_MUC  BIT(6)
| if (int_status & EDAC_NPCM7XX_STATUS_MUC) {

is much more readable. This lets you give the bits names that can match the 
datasheet.
If this applies to other potential users of the same Cadence IP block, please 
reflect that
in the name.


> + edac_dbg(3, "6-Mult uncorrectable detected.\n");
> + p->ue_cnt++;
> + status = ECC_ERROR;
> + }
> +
> + if ((int_status & (1 << 5)) == (1 << 5)) {
> + edac_dbg(3, "5-An uncorrectable detected\n");
> + p->ue_cnt++;
> + 

Re: [PATCH 2/2] edac: add support for Amazon's Annapurna Labs EDAC

2019-06-06 Thread James Morse
Hi Hawa,

On 06/06/2019 08:53, Hawa, Hanna wrote:
> On 5/31/2019 8:14 AM, Borislav Petkov wrote:
>> On Fri, May 31, 2019 at 01:15:33AM +, Herrenschmidt, Benjamin wrote:
>>> This isn't terribly helpful, there's nothing telling anybody which of
>>> those files corresponds to an ARM SoC :-)
>>
>> drivers/edac/altera_edac.c is one example.
>>
>> Also, James and I have a small writeup on how an arm driver should look
>> like, we just need to polish it up and post it.
>>
>> James?
>>
>>> That said ...
>>>
>>> You really want a single EDAC driver that contains all the stuff for
>>> the caches, the memory controller, etc... ?
>>
>> Yap.
> 
> Disagree. The various drivers don't depend on each other.
> I think we should keep the drivers separated as they are distinct and 
> independent IP blocks.

But they don't exist in isolation, they both depend on the 
integration-choices/firmware
that makes up your platform.

Other platforms may have exactly the same IP blocks, configured differently, or 
with
different features enabled in firmware. This means we can't just probe the 
driver based on
the presence of the IP block, we need to know the integration choices and 
firmware
settings match what the driver requires.

(Case in point, that A57 ECC support is optional, another A57 may not have it)

Descriptions of what firmware did don't really belong in the DT. Its not a 
hardware property.

This is why its better to probe this stuff based on the 
machine-compatible/platform-name,
not the presence of the IP block in the DT.
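
e.g. (a sketch; the compatible string and driver name here are illustrative):

| static int __init al_edac_init(void)
| {
| 	if (!of_machine_is_compatible("al,alpine-v2"))
| 		return -ENODEV;
|
| 	return platform_driver_register(&al_edac_driver);
| }
| module_init(al_edac_init);

(of_machine_is_compatible() is from linux/of.h)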


Will either of your separate drivers ever run alone? If they're probed from the 
same
machine-compatible this won't happen.


How does your memory controller report errors? Does it send back some data with 
an invalid
checksum, or a specific poison/invalid flag? Will the cache report this as a 
cache error
too, if its an extra signal, does the cache know what it is?

All these are integration choices between the two IP blocks, done as separate 
drivers we
don't have anywhere to store that information. Even if you don't care about 
this, making
them separate drivers should only be done to make them usable on other 
platforms, where
these choices may have been different.


Thanks,

James


Re: [PATCH 2/2] edac: add support for Amazon's Annapurna Labs EDAC

2019-06-05 Thread James Morse
Hi Hana,

On 30/05/2019 11:15, Hanna Hawa wrote:
> Add support for error detection and correction for Amazon's Annapurna
> Labs SoCs for L1/L2 caches.
> 
> Amazon's Annapurna Labs SoCs based on ARM CA57 and CA72, the driver
> support both cortex based on compatible string.

> diff --git a/drivers/edac/amazon_al_ca57_edac.c 
> b/drivers/edac/amazon_al_ca57_edac.c
> new file mode 100644
> index 000..08237c0
> --- /dev/null
> +++ b/drivers/edac/amazon_al_ca57_edac.c
> @@ -0,0 +1,283 @@
> +// SPDX-License-Identifier: GPL-2.0

> +/* Same bit assignments of CPUMERRSR_EL1 and L2MERRSR_EL1 in ARM CA57/CA72 */

Allowing linux to access these implementation-defined registers has come up 
before:
https://www.spinics.net/lists/kernel/msg2750349.html

It looks like you've navigated most of the issues. Accessing 
implementation-defined
registers is frowned on, but this stuff can't be done generically until v8.2.

This can't be done on 'all A57/A72' because some platforms may not have been 
integrated to
have error-checking at L1/L2, (1.3 'Features' of [0] says the ECC protection 
for L1 data
cache etc is optional). Even if they did, this stuff needs turning on in 
L2CTLR_EL1.
These implementation-defined registers may be trapped by the hypervisor, I 
assume for your
platform this is linux booted at EL2, so no hypervisor.


> +#define ARM_CA57_CPUMERRSR_INDEX_OFF (0)
> +#define ARM_CA57_CPUMERRSR_INDEX_MASK  (0x3)

(GENMASK() would make it quicker and easier to compare this with the datasheet)


> +#define  ARM_CA57_L2_TAG_RAM 0x10
> +#define  ARM_CA57_L2_DATA_RAM0x11
> +#define  ARM_CA57_L2_SNOOP_RAM   0x12
> +#define  ARM_CA57_L2_DIRTY_RAM   0x14

> +#define  ARM_CA57_L2_INC_PLRU_RAM  0x18

A57 describes this one as 'PF RAM'...


> +static inline u64 read_cpumerrsr_el1(void)
> +{
> + u64 val;
> +
> + asm volatile("mrs %0, s3_1_c15_c2_2" : "=r" (val));
> +
> + return val;
> +}

Linux supports versions of binutils that choke on this syntax.
See the sys_reg() definitions in arm64's asm/sysreg.h that define something you 
can feed
to read_sysreg_s(). It would save having these wrapper functions.

commit 72c583951526 ("arm64: gicv3: Allow GICv3 compilation with older 
binutils") for the
story.
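
e.g. (a sketch; the SYS_ name is made up, the encoding is the one quoted above):

| #define SYS_CPUMERRSR_EL1	sys_reg(3, 1, 15, 2, 2)
|
| 	u64 val = read_sysreg_s(SYS_CPUMERRSR_EL1);
| 	write_sysreg_s(0, SYS_CPUMERRSR_EL1);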


> +static void al_a57_edac_cpumerrsr(void *arg)
> +{
> + struct edac_device_ctl_info *edac_dev =
> + (struct edac_device_ctl_info *)arg;
> + int cpu;
> + u32 index, way, ramid, repeat, other, fatal;
> + u64 val = read_cpumerrsr_el1();
> +
> + /* Return if no valid error */
> + if (!((val >> ARM_CA57_CPUMERRSR_VALID_OFF) &
> +   ARM_CA57_CPUMERRSR_VALID_MASK))
> + return;

| #define ARM_CA57_CPUMERRSR_VALID  BIT(31)
| if (!(val & ARM_CA57_CPUMERRSR_VALID))

would be easier to read, the same goes for 'fatal' as its a single bit.


> + edac_device_handle_ce(edac_dev, 0, 0, "L2 Error");

How do we know this was corrected?

6.4.8 "Error Correction Code" has "Double-bit ECC errors set the fatal bit." in 
a
paragraph talking about the L1 memory system.

"L2 Error" ? Copy and paste?


> + edac_printk(KERN_CRIT, DRV_NAME, "CPU%d L1 %serror detected\n",
> + cpu, (fatal) ? "Fatal " : "");
> + edac_printk(KERN_CRIT, DRV_NAME, "RAMID=");
> +
> + switch (ramid) {
> + case ARM_CA57_L1_I_TAG_RAM:
> + pr_cont("'L1-I Tag RAM' index=%d way=%d", index, way);
> + break;
> + case ARM_CA57_L1_I_DATA_RAM:
> + pr_cont("'L1-I Data RAM' index=%d bank= %d", index, way);
> + break;

Is index/way information really useful? I can't replace way-3 on the system, 
nor can I
stop it being used. If its useless, I'd rather we don't bother parsing and 
printing it out.


> + pr_cont(", repeat=%d, other=%d (CPUMERRSR_EL1=0x%llx)\n", repeat, other,
> + val);

'other' here is another error, but we don't know the ramid.
'repeat' is another error for the same ramid.

Could we still feed this stuff into edac? This would make the counters accurate 
if the
polling frequency isn't quite fast enough.
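
Something like this (a sketch, reusing the patch's names, and assuming the
repeats were also corrected) would do it:

| 	while (repeat--)
| 		edac_device_handle_ce(edac_dev, 0, 0, "L1 CE (repeat)");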


> + write_cpumerrsr_el1(0);
> +}


> +static void al_a57_edac_l2merrsr(void *arg)
> +{

> + edac_device_handle_ce(edac_dev, 0, 0, "L2 Error");

How do we know this is corrected?

It looks like L2CTLR_EL1[20] might force fatal 1/0 to map to 
uncorrected/corrected. Is
this what you are depending on here?

(it would be good to have a list of integration-time and firmware dependencies 
this driver
has, for the next person who tries to enable it on their system and complains 
it doesn't
work for them)


> + edac_printk(KERN_CRIT, DRV_NAME, "CPU%d L2 %serror detected\n",
> + cpu, (fatal) ? "Fatal " : "");
> + edac_printk(KERN_CRIT, DRV_NAME, "RAMID=");
> +
> + switch (ramid) {
> + case ARM_CA57_L2_TAG_RAM:
> + pr_cont("'L2 Tag RAM'");
> +

Re: [PATCH 2/2] edac: add support for Amazon's Annapurna Labs EDAC

2019-06-05 Thread James Morse
Hi guys,

On 31/05/2019 06:14, Borislav Petkov wrote:
> On Fri, May 31, 2019 at 01:15:33AM +, Herrenschmidt, Benjamin wrote:
>> This isn't terribly helpful, there's nothing telling anybody which of
>> those files corresponds to an ARM SoC :-)
> 
> drivers/edac/altera_edac.c is one example.
> 
> Also, James and I have a small writeup on how an arm driver should look
> like, we just need to polish it up and post it.
> 
> James?

Yes I should get on with that. Its mostly for platforms which end up with 
multiple
piecemeal drivers and some co-ordination is needed. It doesn't look like that 
will be a
problem here.


>> That said ...
>>
>> You really want a single EDAC driver that contains all the stuff for
>> the caches, the memory controller, etc... ?

This has to be platform specific as it has integration-time dependencies and 
firmware
dependencies. Doing it as a platform driver matched from the machine-compatible 
may be
more straightforward today.

The DT will already say "compatible = arm,cortex-a57" for the Alpine-v2, what 
that
'edac_l1_l2' node is telling us is the integration/firmware stuff has been 
done, and the
imp-def instructions can be used.


Thanks,

James


Re: [PATCH] EDAC/altera: Warm Reset option for Stratix10 peripheral DBE

2019-06-04 Thread James Morse
Hi Thor,

(CC: +Mark, Lorenzo and Sudeep for PSCI.
How should SYSTEM_RESET2 be used for a vendor-specific reset?

The original patch is:
lore.kernel.org/r/1559594269-10077-1-git-send-email-thor.tha...@linux.intel.com
)

On 03/06/2019 21:37, thor.tha...@linux.intel.com wrote:
> From: Thor Thayer 
> 
> The Stratix10 peripheral FIFO memories can recover from double
> bit errors with a warm reset instead of a cold reset.
> Add the option of a warm reset for peripheral (USB, Ethernet)
> memories.
> 
> CPU memories such as SDRAM and OCRAM require a cold reset for
> DBEs.
> Filter on whether the error is a SDRAM/OCRAM or a peripheral
> FIFO memory to determine which reset to use when the warm
> reset option is configured.

... so you want to make different SMC calls on each CPU after panic()?


> diff --git a/drivers/edac/altera_edac.c b/drivers/edac/altera_edac.c
> index 8816f74a22b4..179601f14b48 100644
> --- a/drivers/edac/altera_edac.c
> +++ b/drivers/edac/altera_edac.c
> @@ -2036,6 +2036,19 @@ static const struct irq_domain_ops a10_eccmgr_ic_ops = 
> {
>  /* panic routine issues reboot on non-zero panic_timeout */
>  extern int panic_timeout;
>  
> +#ifdef CONFIG_EDAC_ALTERA_ARM64_WARM_RESET
> +/* EL3 SMC call to setup CPUs for warm reset */
> +void panic_smp_self_stop(void)
> +{
> + struct arm_smccc_res result;
> +
> + __cpu_disable();
> + cpu_relax();
> + arm_smccc_smc(INTEL_SIP_SMC_ECC_DBE, S10_WARM_RESET_WFI_FLAG,
> +   S10_WARM_RESET_WFI_FLAG, 0, 0, 0, 0, 0, &result);
> +}
> +#endif

Oooer!

panic_smp_self_stop() isn't for drivers to override: only the arch code.
__cpu_disable() is only for the cpu-hotplug machinery. Nothing else should 
touch it.

Isn't this thing only called if another CPU out there is panic()ing too?


I think one of the problems here is arm64 leaves secondary CPUs running after 
panic().
This would be better fixed by using the appropriate cpu_ops[]->cpu_die() call 
in arm64's
ipi_cpu_stop().


As for passing platform-specific options, PSCI[0] has a 'reset_type' for 
SYSTEM_RESET2,
which looks suspiciously like what you want here. I'm not sure how its expected 
to be
used... hopefully the PSCI maintainers can give us some pointers.

(The existing support is commit 4302e381a870 ("firmware/psci: add support for 
SYSTEM_RESET2"))


Is it possible for firmware to do both the cold/warm reset work when 
SYSTEM_RESET is
called? This would mean you don't have to care here and there are fewer choices 
to be made
overall.
If not, is there anything left behind that can give it the hint? Like non-zero 
error
counters for the USB/Ethernet devices?


> @@ -2067,14 +2080,28 @@ static int s10_edac_dberr_handler(struct 
> notifier_block *this,
>   regmap_write(edac->ecc_mgr_map,
>S10_SYSMGR_UE_ADDR_OFST, err_addr);
>   edac_printk(KERN_ERR, EDAC_DEVICE,
> - "EDAC: [Fatal DBE on %s @ 0x%08X]\n",
> - ed->edac_dev_name, err_addr);
> + "EDAC: [Fatal DBE on %s [CPU=%d] @ 
> 0x%08X]\n",
> + ed->edac_dev_name, raw_smp_processor_id(),
> + err_addr);
>   break;
>   }
>   /* Notify the System through SMC. Reboot delay = 1 second */
> +#ifdef CONFIG_EDAC_ALTERA_ARM64_WARM_RESET
> + /* Handle peripheral FIFO DBE as Warm Resets */
> + if (dberror & S10_COLD_RESET_MASK) {


> + panic_timeout = 1;

Isn't this value supposed to be provided on the kernel commandline? Surely this 
prevents
debug using the commandline option to increase the delay?

(I see you already change it)


> + arm_smccc_smc(INTEL_SIP_SMC_ECC_DBE, dberror, 0, 0, 0,
> +   0, 0, 0, &result);
> + } else {
> + arm_smccc_smc(INTEL_SIP_SMC_ECC_DBE,
> +   S10_WARM_RESET_WFI_FLAG | dberror, 0, 0,
> +   0, 0, 0, 0, &result);
> + }
> +#else
>   panic_timeout = 1;
>   arm_smccc_smc(INTEL_SIP_SMC_ECC_DBE, dberror, 0, 0, 0, 0,
> 0, 0, );
> +#endif
>   }
>  
>   return NOTIFY_DONE;

What do these SMCs do? Are they equivalent to the PSCI CPU online/offline calls?

panic() notifiers aren't robust as they can be skipped if kdump is loaded.


Thanks,

James


[0]
https://static.docs.arm.com/den0022/d/Power_State_Coordination_Interface_PDD_v1_1_DEN0022D.pdf


Re: [PATCH 11/21] EDAC, ghes: Unify trace_mc_event() code with edac_mc driver

2019-06-04 Thread James Morse
Hi Robert,

On 03/06/2019 14:10, Robert Richter wrote:
> On 29.05.19 16:12:38, James Morse wrote:
>> On 29/05/2019 09:44, Robert Richter wrote:
>>> Almost duplicate code, remove it.
>>
>>> Note: there is a difference in the calculation of the grain_bits,
>>> using the edac_mc's version here.
>>
>> But is it the right thing to do?
>>
>> Is this an off-by-one bug being papered over as some cleanup?
>> If so could you post a separate fix that can be picked up for an rc.
>>
>> Do Marvell have firmware that populates this field?
>>
>> ...
>>
>> Unless the argument is no one cares about this...
>>
>> >From ghes_edac_report_mem_error():
>> |/* Error grain */
>> |if (mem_err->validation_bits & CPER_MEM_VALID_PA_MASK)
>> |e->grain = ~(mem_err->physical_addr_mask & ~PAGE_MASK);
>>
>> Fishy, why would the kernel page-size be relevant here?
> 
> That looked broken to me too, I did not put to much effort in fixing
> the grain yet. So I just took the edac_mc version first in the
> assumption, that one is working.

(Ah, it would have been good to note this in the commit-message)


> It looks like the intention here is to limit the grain to the page
> size.

I'm not convinced that makes sense. If some architecture lets you configure the
page-size (as arm64 does), and your hypervisor had a bigger page-size, then any
hardware fault would be rounded up to the hypervisor's page-size.

The kernel's page-size has very little to do with the error, it only matters 
for when we
go unmapping stuff in memory_failure().


> But right, the calculation is wrong here. I am also going to
> reply to your patch you sent on this.

Thanks!


>> If physical_addr_mask were the same as PAGE_MASK, wouldn't this always give ~0?
>> (masking logic like this always does my head in)
>>
>> /me gives it ago:
>> | {1}[Hardware Error]:   physical_address: 0xdeadbeef
>> | {1}[Hardware Error]:   physical_address_mask: 0x
>> | {1}[Hardware Error]:   error_type: 6, master abort
>> | EDAC MC0: 1 CE Master abort on unknown label ( page:0xdead offset:0xbeef
>> | grain:-1 syndrome:0x0 - status(0x0001): reserved)
>>
>> That 'grain:-1' is because the calculated e->grain was an unlikely 
>> 0x.
>> Patch incoming, if you could test it on your platform that'd be great.
>>
>> I don't think ghes_edac.c wants this '+1'.
> 
> The +1 looks odd to me also for the edac_mc driver, but I need to take
> a closer look here as well as some logs suggest the grain is
> calculated correctly.

My theory on this is that ghes_edac.c is generating a grain like 0x1000, fls() 
does the
right thing. Other edac drivers are generating a grain like 0xfff to describe 
the same
size, fls() is now off-by-one, hence the addition.
I don't have a platform where I can trigger any other edac driver to test this 
though.
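
(To spell the theory out with made-up numbers: the same 4KiB grain expressed as
a size is 0x1000, expressed as a mask it is 0xfff.

| 	fls(0x1000);	/* 13 */
| 	fls(0xfff);	/* 12 - hence the '+ 1' in edac_mc */

Feeding ghes_edac.c's size-style grain into the '+ 1' version would then be off
by one.)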

The way round this would be to put the grain_bits in struct edac_raw_error_desc 
so that
ghes_edac.c can calculate it directly.


Thanks,

James


Re: [PATCH v3] EDAC, mellanox: Add ECC support for BlueField DDR4

2019-06-04 Thread James Morse
Hi Junhan,

On 30/05/2019 19:48, Junhan Zhou wrote:
>> -Original Message-
>> From: James Morse 
>> Sent: Thursday, May 23, 2019 1:30 PM
>> To: Junhan Zhou 

>>> +union mlxbf_emi_dram_additional_info_0 {
>>> +   struct {
>>> +   u32 err_bank : 4;
>>> +   u32 err_lrank: 2;
>>> +   u32 __reserved_0 : 2;
>>> +   u32 err_prank: 2;
>>> +   u32 __reserved_1 : 6;
>>> +   u32 err_edge : 8;
>>> +   u32 __reserved_2 : 8;
>>> +   };
>>> +
>>> +   u32 word;
>>> +};
>>
>> ... you're expecting the compiler to pack this bitfield in exactly the same 
>> way
>> your hardware did. I don't think that's guaranteed.
>> It evidently works for your current compiler, but another compiler may pack
>> this structure differently. Toggling endianness will break this, (arm64
>> supports both). If your platform supports aarch32, someone may want to get
>> 32bit arm running, which may have different compiler behaviour.
>>
>> You are also using bitfields between hardware and firmware, so its currently
>> possible the firmware requires the kernel to be built with a compiler that
>> means it can't interact with the hardware...
>>
>> When this has come up in the past, the advice was not to use bitfields:
>> https://lore.kernel.org/lkml/1077080607.1078.109.camel@gaston/
>>
>> Please use shifts and masks.
>>
> Okay if you insist. I do see this kind of style used elsewhere, e.g. in 
> pnd2_edac.h.

Yup, and that is fragile. It can only work if the toolchain for firmware, the 
kernel and
hardware have only one configuration between them. I don't think we have enough 
control of
the platform to assert this.


> And this driver is only used on the BlueField SoC which our bootloader would
> configure it to be only running at Aarch64 little endian.

(endian-ness is something the kernel configures during boot. I can happily 
kexec a
big-endian kernel on my Juno, even though the bootloader chokes and splutters 
when its
asked to do this.)

This stuff is 'implementation defined' for the compiler, we can't complain to 
the compiler
people if it doesn't do what we wanted.


>>> +#define MLXBF_EDAC_DIMM_PER_MC 2
>>> +#define MLXBF_EDAC_ERROR_GRAIN 8
>>
>> If these numbers changed, would it still be a BlueField SoC?
>> (if next years made-up:BlueField2 supports more Dimms per MC, it would be

> BTW, how did you ever guess that the next gen chip would be called BlueField2?
> Did you search for "BlueField" in the registered PCI IDs and accidentally
> found it? (https://fossies.org/linux/systemd/hwdb/pci.ids)

Ha! The 'made-up:' was to indicate I really know nothing about this. I should 
have gone
for Bluefield++


>>> +/*
>>> + * Request MLNX_SIP_GET_DIMM_INFO
>>> + *
>>> + * Retrieve information about DIMM on a certain slot.
>>> + *
>>> + * Call register usage:
>>> + * a0: MLNX_SIP_GET_DIMM_INFO
>>> + * a1: (Memory controller index) << 16 | (Dimm index in memory
>>> +controller)
>>> + * a2-7: not used.
>>> + *
>>> + * Return status:
>>> + * a0: dimm_info_smc union defined below describing the DIMM.
>>> + * a1-3: not used.
>>> + */
>>
>> Have Mellanox published these call numbers/arguments in a document
>> somewhere? If they differ with the firmware, it would be good to know
>> which side needs fixing.
>>
>> It is a little odd that you read the number of memory controllers from the
>> ACPI table, but use an SMC call to read the DIMM information.
>> Is it too-late to describe the DIMMs in the ACPI table too? (this would let
>> firmware hard-code it on platforms where it could never change, instead of
>> having to support a runtime call)
>>
>> The DIMM information should also be in the SMBIOS table. See ghes_edac.c
>> for some code that uses this. SMBIOS isn't popular in the arm world: but
>> edac already uses this, and we want to match DIMM numbers with the text
>> on the board's silk-screen so the user can replace the correct DIMM.
>>
> 
> No we haven't. But, we publish the source code to our firmware so anybody can 
> edit
> and boot their own version.

Cool!
(With a different compiler that does the bitfields differently? *cough* *cough*)

[..]

> So the issue here is that ATF does the memory training, but UEFI is in charge
> of publishing the ACPI tables (which is where the SMBIOS table would be 
> located
> correct?).

Yes, edk2-platforms has a 
"Platform/ARM/JunoPkg/SmbiosPlatfo

[PATCH v2] x86/resctrl: Don't stop walking closids when a locksetup group is found

2019-06-03 Thread James Morse
When a new control group is created __init_one_rdt_domain() walks all
the other closids to calculate the sets of used and unused bits.

If it discovers a pseudo_locksetup group, it breaks out of the loop.
This means any later closid doesn't get its used bits added to used_b.
These bits will then get set in unused_b, and added to the new
control group's configuration, even if they were marked as exclusive
for a later closid.

When encountering a pseudo_locksetup group, we should continue. This
is because "a resource group enters 'pseudo-locked' mode after the
schemata is written while the resource group is in 'pseudo-locksetup'
mode." When we find a pseudo_locksetup group, its configuration is
expected to be overwritten, we can skip it.

Fixes: dfe9674b04ff6 ("x86/intel_rdt: Enable entering of pseudo-locksetup mode")
Cc: 
Acked-by: Reinette Chatre 
Signed-off-by: James Morse 
---
Changes since v1:
 * Removed braces round multi-line comment and whitespace after the if() block

 arch/x86/kernel/cpu/resctrl/rdtgroup.c | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c 
b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 333c177a2471..869cbef5da81 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -2542,7 +2542,12 @@ static int __init_one_rdt_domain(struct rdt_domain *d, 
struct rdt_resource *r,
if (closid_allocated(i) && i != closid) {
mode = rdtgroup_mode_by_closid(i);
if (mode == RDT_MODE_PSEUDO_LOCKSETUP)
-   break;
+   /*
+* ctrl values for locksetup aren't relevant
+* until the schemata is written, and the mode
+* becomes RDT_MODE_PSEUDO_LOCKED.
+*/
+   continue;
/*
 * If CDP is active include peer domain's
 * usage to ensure there is no overlap
-- 
2.20.1



[PATCH] drivers: base: cacheinfo: Ensure cpu hotplug work is done before Intel RDT

2019-05-30 Thread James Morse
The cacheinfo structures are alloced/freed by cpu online/offline
callbacks. Originally these were only used by sysfs to expose the
cache topology to user space. Without any in-kernel dependencies
CPUHP_AP_ONLINE_DYN was an appropriate choice.

resctrl has started using these structures to identify CPUs that
share a cache. It updates its 'domain' structures from cpu
online/offline callbacks. These depend on the cacheinfo structures
(resctrl_online_cpu()->domain_add_cpu()->get_cache_id()->
 get_cpu_cacheinfo()).
These also run as CPUHP_AP_ONLINE_DYN.

Now that there is an in-kernel dependency, move the cacheinfo
work earlier so we know its done before resctrl's CPUHP_AP_ONLINE_DYN
work runs.

Cc: Fenghua Yu 
Cc: Reinette Chatre 
Signed-off-by: James Morse 
---
I haven't seen any problems because of this. If someone thinks it should
go to stable:
Cc:  #4.10.x

The particular patch that added RDT is:
Fixes: 2264d9c74dda1 ("x86/intel_rdt: Build structures for each resource based 
on cache topology")

But as this touches a different set of files, I'm not sure how appropriate
a fixes tag is.

 drivers/base/cacheinfo.c   | 3 ++-
 include/linux/cpuhotplug.h | 1 +
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/base/cacheinfo.c b/drivers/base/cacheinfo.c
index a7359535caf5..b444f89a2041 100644
--- a/drivers/base/cacheinfo.c
+++ b/drivers/base/cacheinfo.c
@@ -655,7 +655,8 @@ static int cacheinfo_cpu_pre_down(unsigned int cpu)
 
 static int __init cacheinfo_sysfs_init(void)
 {
-   return cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "base/cacheinfo:online",
+   return cpuhp_setup_state(CPUHP_AP_BASE_CACHEINFO_ONLINE,
+"base/cacheinfo:online",
 cacheinfo_cpu_online, cacheinfo_cpu_pre_down);
 }
 device_initcall(cacheinfo_sysfs_init);
diff --git a/include/linux/cpuhotplug.h b/include/linux/cpuhotplug.h
index 6a381594608c..50c893f03c21 100644
--- a/include/linux/cpuhotplug.h
+++ b/include/linux/cpuhotplug.h
@@ -175,6 +175,7 @@ enum cpuhp_state {
CPUHP_AP_WATCHDOG_ONLINE,
CPUHP_AP_WORKQUEUE_ONLINE,
CPUHP_AP_RCUTREE_ONLINE,
+   CPUHP_AP_BASE_CACHEINFO_ONLINE,
CPUHP_AP_ONLINE_DYN,
CPUHP_AP_ONLINE_DYN_END = CPUHP_AP_ONLINE_DYN + 30,
CPUHP_AP_X86_HPET_ONLINE,
-- 
2.20.1



[PATCH] x86/resctrl: Don't stop walking closids when a locksetup group is found

2019-05-30 Thread James Morse
When a new control group is created __init_one_rdt_domain() walks all
the other closids to calculate the sets of used and unused bits.

If it discovers a pseudo_locksetup group, it breaks out of the loop.
This means any later closid doesn't get its used bits added to used_b.
These bits will then get set in unused_b, and added to the new
control group's configuration, even if they were marked as exclusive
for a later closid.

When encountering a pseudo_locksetup group, we should continue. This
is because "a resource group enters 'pseudo-locked' mode after the
schemata is written while the resource group is in 'pseudo-locksetup'
mode." When we find a pseudo_locksetup group, its configuration is
expected to be overwritten, we can skip it.

Fixes: dfe9674b04ff6 ("x86/intel_rdt: Enable entering of pseudo-locksetup mode")
Signed-off-by: James Morse 
---
 arch/x86/kernel/cpu/resctrl/rdtgroup.c | 11 +--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c 
b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index 333c177a2471..049ccb709957 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -2541,8 +2541,15 @@ static int __init_one_rdt_domain(struct rdt_domain *d, 
struct rdt_resource *r,
for (i = 0; i < closids_supported(); i++, ctrl++) {
if (closid_allocated(i) && i != closid) {
mode = rdtgroup_mode_by_closid(i);
-   if (mode == RDT_MODE_PSEUDO_LOCKSETUP)
-   break;
+   if (mode == RDT_MODE_PSEUDO_LOCKSETUP) {
+   /*
+* ctrl values for locksetup aren't relevant
+* until the schemata is written, and the mode
+* becomes RDT_MODE_PSEUDO_LOCKED.
+*/
+   continue;
+   }
+
/*
 * If CDP is active include peer domain's
 * usage to ensure there is no overlap
-- 
2.20.1



Re: [PATCH 14/21] EDAC, ghes: Extract numa node information for each dimm

2019-05-29 Thread James Morse
Hi Robert,

On 29/05/2019 09:44, Robert Richter wrote:
> In a later patch we want to have one mc device per node. This patch
> extracts the numa node information for each dimm. This is done by
> collecting the physical address ranges from the DMI table (Memory
> Array Mapped Address - Type 19 of SMBIOS spec). The node information
> for a physical address is already know to a numa aware system (e.g. by
> using the ACPI _PXM method or the ACPI SRAT table), so based on the PA
> we can assign the node id to the dimms.

I think you're letting the smbios information drive you here. We'd like to do
as much as possible without it; all it's really good for is telling us the
label on the PCB.

With this approach, you only get numa information by parsing more smbios, which 
we have to
try and validate, and fall back to some error path if it smells wrong. We end 
up needing
things like a 'fallback memory controller' in the case the firmware fault-time 
value is
missing, or nuts.

What bugs me is we already know the numa information from the address. We could 
expose
that without the smbios tables at all, and it would be useful to someone 
playing the
dimm-bisect game. Not making it depend on smbios means there is a good chance 
it can
become common with other edac drivers.

I don't think we need to know the dimm->node mapping up front. When we get an 
error,
pfn_to_nid() on the address tells us which node that memory is attached to. 
This should be
the only place nid information comes from, that way we don't need to check it. 
Trying to
correlate it with smbios tables is much more code. If the CPER comes with a 
DIMM handle,
we know the DIMM too.
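
At error time that is just something like (a sketch; the per-node lookup is
made up):

| 	int nid = pfn_to_nid(PHYS_PFN(mem_err->physical_addr));
| 	struct mem_ctl_info *mci = ghes_mci[nid];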

So all we really need to know at setup time is how many numa-nodes there 
are, and the
maximum DIMM per node if we don't want phantom-dimms. Type-17 already has a
Physical-Memory-Array-Handle, which points at Type-19... but we don't need to 
read it,
just count them and find the biggest.

If the type-19 information is missing from smbios, or not linked up properly, 
or the
values provided at fault-time don't overlap with the values in the table, or 
there is no
fault-time node information: you still get the numa-node information based on 
the address.

Using the minimum information should give us the least code, and the least 
exposure to
misdescribed tables.


> A fallback that disables numa is implemented in case the node
> information is inconsistent.

> diff --git a/drivers/edac/ghes_edac.c b/drivers/edac/ghes_edac.c
> index 50f4ee36b755..083452a48b42 100644
> --- a/drivers/edac/ghes_edac.c
> +++ b/drivers/edac/ghes_edac.c
> @@ -67,14 +67,34 @@ struct memdev_dmi_entry {
>   u16 conf_mem_clk_speed;
>  } __attribute__((__packed__));
>  
> +/* Memory Array Mapped Address - Type 19 of SMBIOS spec */
> +struct memarr_dmi_entry {
> + u8  type;
> + u8  length;
> + u16 handle;
> + u32 start;
> + u32 end;
> + u16 phys_mem_array_handle;
> + u8  partition_width;
> + u64 ext_start;
> + u64 ext_end;
> +} __attribute__((__packed__));

Any chance we could collect the structures from the smbios spec in a header 
file somewhere?

I'd prefer not to read this thing at all if we can help it.

>  struct ghes_dimm_info {
>   struct dimm_info dimm_info;
>   int idx;
> + int numa_node;

(I thought nid was the preferred term)


> + phys_addr_t start;
> + phys_addr_t end;

I think start and end are deceptive as they overlap with other DIMMs on systems 
with
interleaving memory controllers. I'd prefer not to keep these values around.


> + u16 phys_handle;
>  };
>  
>  struct ghes_mem_info {
> - int num_dimm;
> + int num_dimm;
>   struct ghes_dimm_info *dimms;
> + int num_nodes;

> + int num_per_node[MAX_NUMNODES];

Number of what?


> + bool  enable_numa;

This is used locally in mem_info_setup(), but its not read from here by any of 
the later
patches in the series. Is it needed?

I don't like the idea that this is behaviour that is turned on/off. Its a 
property of the
system. I think it would be better if CONFIG_NUMA causes you to get multiple
memory-controllers created, but if its not actually a NUMA machine there would 
only be
one. This lets us test that code on not-really-numa systems.


>  };
>  
>  struct ghes_mem_info mem_info;
> @@ -97,10 +117,50 @@ static void ghes_dimm_info_init(void)
>  
>   for_each_dimm(dimm) {
>   dimm->idx   = idx;
> + dimm->numa_node = NUMA_NO_NODE;
>   idx++;
>   }
>  }
>  
> +static void ghes_edac_set_nid(const struct dmi_header *dh, void *arg)
> +{
> + struct memarr_dmi_entry *entry = (struct memarr_dmi_entry *)dh;
> + struct ghes_dimm_info *dimm;
> + phys_addr_t start, end;
> + int nid;
> +
> + if 

Re: [PATCH 09/21] EDAC, ghes: Use standard kernel macros for page calculations

2019-05-29 Thread James Morse
Hi Robert,

On 29/05/2019 09:44, Robert Richter wrote:
> Use standard macros for page calculations.

Reviewed-by: James Morse 


Thanks,

James


Re: [PATCH 10/21] EDAC, ghes: Remove pvt->detail_location string

2019-05-29 Thread James Morse
Hi Robert,

On 29/05/2019 09:44, Robert Richter wrote:
> The detail_location[] string in struct ghes_edac_pvt is complete
> useless and data is just copied around. Put everything into
> e->other_detail from the beginning.

We still print all that complete-useless detail_location stuff... so this 
commit message
had me confused about what you're doing here. I think you meant the space for 
the string,
instead of the value!

| detail_location[] is used to collect two location strings so they can be 
passed as one
| to trace_mc_event(). Instead of having an extra copy step, assemble the 
location string
| in other_detail[] from the beginning.


> diff --git a/drivers/edac/ghes_edac.c b/drivers/edac/ghes_edac.c
> index 39702bac5eaf..c18f16bc9e4d 100644
> --- a/drivers/edac/ghes_edac.c
> +++ b/drivers/edac/ghes_edac.c
> @@ -23,8 +23,7 @@ struct ghes_edac_pvt {
>   struct mem_ctl_info *mci;
>  
>   /* Buffers for the error handling routine */
> - char detail_location[240];
> - char other_detail[160];
> + char other_detail[400];
>   char msg[80];
>  };
>  
> @@ -225,13 +224,14 @@ void ghes_edac_report_mem_error(int sev, struct 
> cper_sec_mem_err *mem_err)
>   memset(e, 0, sizeof (*e));
>   e->error_count = 1;
>   strcpy(e->label, "unknown label");
> - e->msg = pvt->msg;
> - e->other_detail = pvt->other_detail;
>   e->top_layer = -1;
>   e->mid_layer = -1;
>   e->low_layer = -1;
> - *pvt->other_detail = '\0';
> + e->msg = pvt->msg;
> + e->other_detail = pvt->other_detail;
> +
>   *pvt->msg = '\0';
> + *pvt->other_detail = '\0';

... so no change? Could you drop this hunk?

Regardless,
Reviewed-by: James Morse 


Thanks,

James


Re: [PATCH 12/21] EDAC, ghes: Add support for legacy API counters

2019-05-29 Thread James Morse
Hi Robert,

On 29/05/2019 09:44, Robert Richter wrote:
> The ghes driver is not able yet to count legacy API counters in sysfs,
> e.g.:
> 
>  /sys/devices/system/edac/mc/mc0/csrow2/ce_count
>  /sys/devices/system/edac/mc/mc0/csrow2/ch0_ce_count
>  /sys/devices/system/edac/mc/mc0/csrow2/ch1_ce_count
> 
> Make counting csrows/channels generic so that the ghes driver can use
> it too.

What for?

Is this for an arm64 system? Surely we don't have any systems that used to work 
with these
legacy counters. Aren't they legacy because we want new software to stop using 
them!


Thanks,

James


Re: [PATCH 11/21] EDAC, ghes: Unify trace_mc_event() code with edac_mc driver

2019-05-29 Thread James Morse
Hi Robert,

On 29/05/2019 09:44, Robert Richter wrote:
> Almost duplicate code, remove it.

almost?


> Note: there is a difference in the calculation of the grain_bits,
> using the edac_mc's version here.

But is it the right thing to do?

Is this an off-by-one bug being papered over as some cleanup?
If so could you post a separate fix that can be picked up for an rc.

Do Marvell have firmware that populates this field?

...

Unless the argument is no one cares about this...

>From ghes_edac_report_mem_error():
|   /* Error grain */
|   if (mem_err->validation_bits & CPER_MEM_VALID_PA_MASK)
|   e->grain = ~(mem_err->physical_addr_mask & ~PAGE_MASK);

Fishy, why would the kernel page-size be relevant here?

If physical_addr_mask were the same as PAGE_MASK, wouldn't this always give ~0?
(masking logic like this always does my head in)

/me gives it ago:
| {1}[Hardware Error]:   physical_address: 0xdeadbeef
| {1}[Hardware Error]:   physical_address_mask: 0x
| {1}[Hardware Error]:   error_type: 6, master abort
| EDAC MC0: 1 CE Master abort on unknown label ( page:0xdead offset:0xbeef
| grain:-1 syndrome:0x0 - status(0x0001): reserved)

That 'grain:-1' is because the calculated e->grain was an unlikely 
0x.
Patch incoming, if you could test it on your platform that'd be great.

I don't think ghes_edac.c wants this '+1'.


Thanks,

James


Re: [PATCH 13/21] EDAC, ghes: Rework memory hierarchy detection

2019-05-29 Thread James Morse
Hi Robert,

On 29/05/2019 09:44, Robert Richter wrote:
> In a later patch we want add more information about the memory
> hierarchy (NUMA topology, DIMM label information). Rework memory
> hierarchy detection to make the code extendable for this.
> 
> The general approach is roughly like:
> 
>   mem_info_setup();
>   for_each_node(nid) {
>   mci = edac_mc_alloc(nid);
>   mci_add_dimm_info(mci);
>   edac_mc_add_mc(mci);
>   };
> 
> This patch introduces mem_info_setup() and mci_add_dimm_info().
> 
> All data of the memory hierarchy is collected in a local struct
> ghes_mem_info.
> 
> Note: Per (NUMA) node registration will be implemented in a later
> patch.


> diff --git a/drivers/edac/ghes_edac.c b/drivers/edac/ghes_edac.c
> index ea4d53043199..50f4ee36b755 100644
> --- a/drivers/edac/ghes_edac.c
> +++ b/drivers/edac/ghes_edac.c
> @@ -67,17 +67,38 @@ struct memdev_dmi_entry {
>   u16 conf_mem_clk_speed;
>  } __attribute__((__packed__));
>  
> -struct ghes_edac_dimm_fill {
> - struct mem_ctl_info *mci;
> - unsigned count;

> +struct ghes_dimm_info {
> + struct dimm_info dimm_info;
> + int idx;
> +};

> +struct ghes_mem_info {
> + int num_dimm;
> + struct ghes_dimm_info *dimms;
>  };
>  
> +struct ghes_mem_info mem_info;

static?


> @@ -94,18 +115,17 @@ static int get_dimm_smbios_index(u16 handle)
>  
>  static void ghes_edac_dmidecode(const struct dmi_header *dh, void *arg)
>  {
> - struct ghes_edac_dimm_fill *dimm_fill = arg;
> - struct mem_ctl_info *mci = dimm_fill->mci;
> -
>   if (dh->type == DMI_ENTRY_MEM_DEVICE) {
> + int *idx = arg;
>   struct memdev_dmi_entry *entry = (struct memdev_dmi_entry *)dh;
> - struct dimm_info *dimm = edac_get_dimm(mci, dimm_fill->count,
> -0, 0);
> + struct ghes_dimm_info *mi = &mem_info.dimms[*idx];
> + struct dimm_info *dimm = &mi->dimm_info;
>   u16 rdr_mask = BIT(7) | BIT(13);


> + mi->phys_handle = entry->phys_mem_array_handle;

Where did this come from, and what is it for?

...

Should this be in a later patch? (did you bisect build this?)


>   if (entry->size == 0x) {
> - pr_info("Can't get DIMM%i size\n",
> - dimm_fill->count);
> + pr_info("Can't get DIMM%i size\n", mi->idx);
>   dimm->nr_pages = MiB_TO_PAGES(32);/* Unknown */
>   } else if (entry->size == 0x7fff) {
>   dimm->nr_pages = MiB_TO_PAGES(entry->extended_size);


> +static int mem_info_setup(void)
> +{
> + int idx = 0;
> +
> + memset(&mem_info, 0, sizeof(mem_info));

Is this necessary? Isn't mem_info in the BSS, so it will already be zero'd?


> + /* Get the number of DIMMs */
> + dmi_walk(ghes_edac_count_dimms, NULL);
> + if (!mem_info.num_dimm)
> + return -EINVAL;

> + mem_info.dimms = kcalloc(mem_info.num_dimm,
> + sizeof(*mem_info.dimms), GFP_KERNEL);
> + if (!mem_info.dimms)
> + return -ENOMEM;

> + ghes_dimm_info_init();

Could you move the kcalloc() into ghes_dimm_info_init()? This would save you 
having an
unnecessarily-different version in mem_info_setup_fake().
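
Roughly (a sketch built from the code above):

| static int ghes_dimm_info_init(int num_dimm)
| {
| 	mem_info.num_dimm = num_dimm;
| 	mem_info.dimms = kcalloc(num_dimm, sizeof(*mem_info.dimms), GFP_KERNEL);
| 	if (!mem_info.dimms)
| 		return -ENOMEM;
|
| 	/* ... existing per-dimm idx init ... */
| 	return 0;
| }

then both callers just pass the number of DIMMs they want.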


> + dmi_walk(ghes_edac_dmidecode, &idx);
> +
> + return 0;
> +}

> +static int mem_info_setup_fake(void)
> +{
> + struct ghes_dimm_info *ghes_dimm;
> + struct dimm_info *dimm;
> +
> + memset(&mem_info, 0, sizeof(mem_info));

Is this necessary? Its only been touched by mem_info_setup(), and you get in 
here because
mem_info.num_dimm == 0...


> + ghes_dimm = kzalloc(sizeof(*mem_info.dimms), GFP_KERNEL);
> + if (!ghes_dimm)
> + return -ENOMEM;

This is common with mem_info_setup(). If ghes_dimm_info_init() read 
mem_info.num_dimm and
did the rest, you'd avoid some duplication here.


> + mem_info.num_dimm = 1;
> + mem_info.dimms = ghes_dimm;
> +
> + ghes_dimm_info_init();
> +
> + dimm = &ghes_dimm->dimm_info;
> + dimm->nr_pages = 1;
> + dimm->grain = 128;
> + dimm->mtype = MEM_UNKNOWN;
> + dimm->dtype = DEV_UNKNOWN;
> + dimm->edac_mode = EDAC_SECDED;
> +
> + return 0;
> +}


> +static void mci_add_dimm_info(struct mem_ctl_info *mci)

(From the name I expected this to be in edac_mc.c)


> +{
> + struct dimm_info *mci_dimm, *dmi_dimm;
> + struct ghes_dimm_info *dimm;
> + int index = 0;
> +
> + for_each_dimm(dimm) {
> + dmi_dimm = &dimm->dimm_info;
> + mci_dimm = edac_get_dimm_by_index(mci, index);
> +
> + index++;
> + if (index > mci->tot_dimms)
> + break;
> +
> + mci_dimm->nr_pages  = dmi_dimm->nr_pages;
> + mci_dimm->mtype = dmi_dimm->mtype;
> + mci_dimm->edac_mode = dmi_dimm->edac_mode;
> + 

Re: [PATCH v3] EDAC, mellanox: Add ECC support for BlueField DDR4

2019-05-23 Thread James Morse
Hi Junhan,

On 21/03/2019 14:31, Junhan Zhou wrote:
> Add ECC support for Mellanox BlueField SoC DDR controller.
> This requires SMC to the running Arm Trusted Firmware to report
> what is the current memory configuration.

Sorry for the delay on this, it slipped through the cracks. (Please don't reply 
with new
patches to the discussion of an old patch/series, this makes it look like 
ongoing
discussion on a v1, and v2 never arrives!)


> diff --git a/drivers/edac/Kconfig b/drivers/edac/Kconfig
> index 47eb4d1..404d853 100644
> --- a/drivers/edac/Kconfig
> +++ b/drivers/edac/Kconfig
> @@ -504,4 +504,11 @@ config EDAC_ASPEED
> First, ECC must be configured in the bootloader. Then, this driver
> will expose error counters via the EDAC kernel framework.
>  
> +config EDAC_BLUEFIELD
> + tristate "Mellanox BlueField Memory ECC"
> + depends on ARM64 && ((MELLANOX_PLATFORM && ACPI) || COMPILE_TEST)

What is the MELLANOX_PLATFORM needed for? Is it just to turn off a set of 
drivers in one
go? I can't see what other infrastructure you depend on.


> + help
> +   Support for error detection and correction on the
> +   Mellanox BlueField SoCs.
> +
>  endif # EDAC


> diff --git a/drivers/edac/bluefield_edac.c b/drivers/edac/bluefield_edac.c
> new file mode 100644
> index 000..88f51f7
> --- /dev/null
> +++ b/drivers/edac/bluefield_edac.c
> @@ -0,0 +1,396 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Bluefield-specific EDAC driver.
> + *
> + * Copyright (c) 2019 Mellanox Technologies.
> + */
> +
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +#include "edac_module.h"
> +
> +#define DRIVER_NAME  "bluefield-edac"
> +
> +/*
> + * Mellanox BlueField EMI (External Memory Interface) register definitions.
> + * Registers which have individual bitfields have a union defining the
> + * bitfields following the address define.
> + */
> +
> +#define MLXBF_EMI_DRAM_ECC_COUNT 0x340
> +
> +union mlxbf_emi_dram_ecc_count {
> + struct {
> + u32 single_error_count : 16;
> + u32 double_error_count : 16;
> + };
> +
> + u32 word;
> +};
> +
> +#define MLXBF_EMI_DRAM_ECC_ERROR 0x348
> +
> +union mlxbf_emi_dram_ecc_error {
> + struct {
> + u32 dram_ecc_single : 1;
> + u32 __reserved_0: 15;
> + u32 dram_ecc_double : 1;
> + u32 __reserved_1: 15;
> + };
> +
> + u32 word;
> +};
> +
> +#define MLXBF_EMI_DRAM_ECC_LATCH_SELECT 0x354
> +
> +union mlxbf_emi_dram_ecc_latch_select {
> + struct {
> + u32 dram_ecc_first : 1;
> + u32 __reserved_0   : 15;
> + u32 edge_sel   : 4;
> + u32 __reserved_1   : 4;
> + u32 start  : 1;
> + u32 __reserved_2   : 7;
> + };
> +
> + u32 word;
> +};
> +
> +#define MLXBF_EMI_DRAM_ERR_ADDR_0 0x358
> +
> +#define MLXBF_EMI_DRAM_ERR_ADDR_1 0x37c
> +
> +#define MLXBF_EMI_DRAM_SYNDROM 0x35c
> +
> +union mlxbf_emi_dram_syndrom {
> + struct {
> + u32 derr : 1;
> + u32 serr : 1;
> + u32 __reserved_0 : 14;
> + /* ECC syndrome (error bit according to the Hamming code). */
> + u32 syndrom  : 10;
> + u32 __reserved_1 : 6;
> + };
> +
> + u32 word;
> +};
> +
> +#define MLXBF_EMI_DRAM_ADDITIONAL_INFO_0 0x364
> +
> +union mlxbf_emi_dram_additional_info_0 {
> + struct {
> + u32 err_bank : 4;
> + u32 err_lrank: 2;
> + u32 __reserved_0 : 2;
> + u32 err_prank: 2;
> + u32 __reserved_1 : 6;
> + u32 err_edge : 8;
> + u32 __reserved_2 : 8;
> + };
> +
> + u32 word;
> +};

... you're expecting the compiler to pack this bitfield in exactly the same way 
your
hardware did. I don't think that's guaranteed.
It evidently works for your current compiler, but another compiler may pack 
this structure
differently. Toggling endianness will break this, (arm64 supports both). If 
your platform
supports aarch32, someone may want to get 32bit arm running, which may have 
different
compiler behaviour.

You are also using bitfields between hardware and firmware, so its currently 
possible the
firmware requires the kernel to be built with a compiler that means it can't 
interact with
the hardware...

When this has come up in the past, the advice was not to use bitfields:
https://lore.kernel.org/lkml/1077080607.1078.109.camel@gaston/

Please use shifts and masks.
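
For the first register above that would look something like (a sketch; the
GENMASK names are made up, base is the ioremap()d EMI base):

| #define MLXBF_ECC_SINGLE_ERROR_COUNT	GENMASK(15, 0)
| #define MLXBF_ECC_DOUBLE_ERROR_COUNT	GENMASK(31, 16)
|
| 	u32 word = readl(base + MLXBF_EMI_DRAM_ECC_COUNT);
| 	u32 sce = FIELD_GET(MLXBF_ECC_SINGLE_ERROR_COUNT, word);
| 	u32 dce = FIELD_GET(MLXBF_ECC_DOUBLE_ERROR_COUNT, word);

(GENMASK() comes from linux/bits.h, FIELD_GET() from linux/bitfield.h)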


> +#define MLXBF_EDAC_DIMM_PER_MC   2
> +#define MLXBF_EDAC_ERROR_GRAIN   8

If these numbers changed, would it still be a BlueField SoC?
(if next years made-up:BlueField2 supports more Dimms per MC, it would be 
better to make
this a property in the firmware table).


> +/*
> + * Request MLNX_SIP_GET_DIMM_INFO
> + *
> + * Retrieve information about DIMM on a certain slot.
> + *
> 

Re: [PATCH v3 2/2] EDAC: add EDAC driver for DMC520

2019-05-23 Thread James Morse
Hi Lei,

(CC: +Rui Zhao)

On 16/05/2019 03:55, Lei Wang wrote:
> New driver supports error detection and correction on the devices with ARM 
> DMC-520 memory controller.

> diff --git a/MAINTAINERS b/MAINTAINERS
> index 7d1246b..23894ac 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -5573,6 +5573,12 @@ F: Documentation/driver-api/edac.rst
>  F:   drivers/edac/
>  F:   include/linux/edac.h
>  
> +EDAC-DMC520
> +M:   Rui Zhao 
> +L:   linux-e...@vger.kernel.org
> +S:   Supported
> +F:   drivers/edac/dmc520_edac.c

Hmm, you're listing someone else as maintainer of this driver.
I think we'd need to see an Ack from Rui Zhao...

This patch was previously posted by Rui Zhao, this version has your changes and 
you as
author. (But how you arrange the attribution is up to the two of you...)


> diff --git a/drivers/edac/dmc520_edac.c b/drivers/edac/dmc520_edac.c
> new file mode 100644
> index 000..c81bfcc
> --- /dev/null
> +++ b/drivers/edac/dmc520_edac.c

> +static irqreturn_t dmc520_edac_dram_all_isr(int irq, void *data, u32 
> interrupt_mask);
> +
> +#define DECLARE_ISR(index) \
> +static irqreturn_t dmc520_isr_##index (int irq, void *data) \
> +{ \
> + struct mem_ctl_info *mci; \
> + struct dmc520_edac *edac; \
> + mci = data; \
> + edac = mci->pvt_info; \
> + return dmc520_edac_dram_all_isr(irq, data, 
> edac->interrupt_masks[index]); \
> +}
> +
> +DECLARE_ISR(0)
> +DECLARE_ISR(1)

(Generating functions like this makes them hard to find when they appear in a 
backtrace)


> +/* More DECLARE_ISR(index) can be added to support more interrupt lines. */
> +
> +irq_handler_t dmc520_isr_array[] = {
> + dmc520_isr_0,
> + dmc520_isr_1
> + /* More dmc520_isr_index can be added to support more interrupt lines. 
> */
> +};

(I'd prefer it if this allocated memory for a 'struct edac_dmc520_interrupt' 
that held the
interrupt_mask and mci pointer. This would be runtime-allocation of memory, 
instead of
compile-time generation of these templates... But its just a matter of taste, 
and this works.)


> +static u32 dmc520_get_dram_ecc_error_count(struct dmc520_edac *edac,
> +bool is_ce)
> +{
> + u32 reg_offset_low, reg_offset_high;
> + u32 err_low, err_high;
> + u32 ce_count;
> +
> + reg_offset_low = is_ce ? REG_OFFSET_ECC_ERRC_COUNT_31_00 :
> +  REG_OFFSET_ECC_ERRD_COUNT_31_00;
> + reg_offset_high = is_ce ? REG_OFFSET_ECC_ERRC_COUNT_63_32 :
> +   REG_OFFSET_ECC_ERRD_COUNT_63_32;
> +
> + err_low = dmc520_read_reg(edac, reg_offset_low);
> + err_high = dmc520_read_reg(edac, reg_offset_high);
> + /* Reset error counters */
> + dmc520_write_reg(edac, 0, reg_offset_low);
> + dmc520_write_reg(edac, 0, reg_offset_high);
> +
> + ce_count = dmc520_calc_dram_ecc_error(err_low) +
> +dmc520_calc_dram_ecc_error(err_high);

(Nit: its a little odd to call this 'ce_count' when !is_ce)


> +
> + return ce_count;
> +}
> +
> +static bool dmc520_get_dram_ecc_error_info(struct dmc520_edac *edac,
> +bool is_ce,
> +struct ecc_error_info *info)
> +{
> + u32 reg_offset_low, reg_offset_high;
> + u32 reg_val_low, reg_val_high;
> + bool valid;
> +
> + reg_offset_low = is_ce ? REG_OFFSET_DRAM_ECC_ERRC_INT_INFO_31_00 :
> +  REG_OFFSET_DRAM_ECC_ERRD_INT_INFO_31_00;
> + reg_offset_high = is_ce ? REG_OFFSET_DRAM_ECC_ERRC_INT_INFO_63_32 :
> +   REG_OFFSET_DRAM_ECC_ERRD_INT_INFO_63_32;
> +
> + reg_val_low = dmc520_read_reg(edac, reg_offset_low);
> + reg_val_high = dmc520_read_reg(edac, reg_offset_high);
> +
> + valid = (FIELD_GET(REG_FIELD_ERR_INFO_LOW_VALID, reg_val_low) != 0) &&
> + (FIELD_GET(REG_FIELD_ERR_INFO_HIGH_VALID, reg_val_high) != 0);
> +
> + if (info) {

This has one caller, which passes info as &info. You don't need to test for
NULL here.


> + if (valid) {
> + info->col = FIELD_GET(REG_FIELD_ERR_INFO_LOW_COL,
> +   reg_val_low);
> + info->row = FIELD_GET(REG_FIELD_ERR_INFO_LOW_ROW,
> +   reg_val_low);
> + info->rank = FIELD_GET(REG_FIELD_ERR_INFO_LOW_RANK,
> +reg_val_low);
> + info->bank = FIELD_GET(REG_FIELD_ERR_INFO_HIGH_BANK,
> +reg_val_high);
> + } else {
> + memset(info, 0, sizeof(struct ecc_error_info));
> + }
> + }
> +
> + return valid;


> +static bool dmc520_is_scrub_configured(struct dmc520_edac *edac)
> +{
> + int chan;
> + u32 scrub_control_offsets[] = {
> + REG_OFFSET_SCRUB_CONTROL0_NOW,
> + 

Re: [PATCH v3 1/2] dt-bindings: edac: arm-dmc520.txt

2019-05-23 Thread James Morse
Hi Lei,

On 16/05/2019 03:35, Lei Wang wrote:
> From: Lei Wang 
> 
> This is the device tree bindings for new EDAC driver dmc520_edac.c.

> diff --git a/Documentation/devicetree/bindings/edac/arm-dmc520.txt 
> b/Documentation/devicetree/bindings/edac/arm-dmc520.txt
> new file mode 100644
> index 000..71e7aa3
> --- /dev/null
> +++ b/Documentation/devicetree/bindings/edac/arm-dmc520.txt
> @@ -0,0 +1,26 @@
> +* ARM DMC-520 EDAC node
> +

> +Required properties:
> +- compatible : "brcm,dmc-520", "arm,dmc-520".
> +- reg: Address range of the DMC-520 registers.
> +- interrupts : DMC-520 interrupt numbers. The example below specifies
> +   two interrupt lines for dram_ecc_errc_int and
> +   dram_ecc_errd_int.
> +- interrupt-config   : This is an array of interrupt masks. For each of the
> +   above interrupt line, add one interrupt mask element 
> to
> +   it. That is, there is a 1:1 mapping from each 
> interrupt
> +   line to an interrupt mask. An interrupt mask can 
> represent
> +   multiple interrupts being enabled.

Cunning!


> +   Refer to interrupt_control
> +   register in DMC-520 TRM for interrupt mapping. In the 
> example
> +   below, the interrupt configuration enables 
> dram_ecc_errc_int
> +   and dram_ecc_errd_int. And each interrupt is 
> connected to
> +   a separate interrupt line.
> +
> +Example:
> +
> +dmc0: dmc@20 {
> + compatible = "brcm,dmc-520", "arm,dmc-520";
> + reg = <0x20 0x8>;
> + interrupts = <0x0 0x349 0x4>, <0x0 0x34B 0x4>;
> + interrupt-config = <0x4>, <0x8>;
> +};
> 

For what its worth:
Acked-by: James Morse 


Thanks,

James


Re: [PATCH v2] edac: sifive: Add EDAC platform driver for SiFive SoCs

2019-05-22 Thread James Morse
Hi Boris,

On 21/05/2019 19:21, Borislav Petkov wrote:
> On Tue, May 21, 2019 at 11:00:59AM +0530, Yash Shah wrote:
>> The prerequisite patch (sifive_l2_cache driver) has been merged into
>> mainline v5.2-rc1
>> It should be OK to merge this edac driver now.
> 
> James?

Still fine by me:
Reviewed-by: James Morse 

(...this patch already has my reviewed-by on it...)

I commented that it couldn't be merged in pieces here:
https://lore.kernel.org/lkml/4072c812-d3bf-9ad5-2b30-6b2a5060b...@arm.com/T/#u

which is what Yash is replying to.


Thanks,

James


Re: [PATCH v3 2/3] arm64: implement update_fdt_pgprot()

2019-05-16 Thread James Morse
Hi!

On 16/05/2019 17:48, Hsin-Yi Wang wrote:
> On Thu, May 16, 2019 at 11:32 PM Rob Herring  wrote:
>> Doesn't kexec operate on a copy because it already does modifications.

It does!

> This patch is to assist "[PATCH v3 3/3] fdt: add support for rng-seed"
> (https://lkml.org/lkml/2019/5/16/257). I thought that by default
> second kernel would use original fdt, so I write new seed back to
> original fdt. Might be wrong.
> 
> ** "[PATCH v3 3/3] fdt: add support for rng-seed" is supposed to
> handle for adding new seed in kexec case, discussed in v2
> (https://lkml.org/lkml/2019/5/13/425)
> 
> By default (not considering user defines their own fdt), if second
> kernel uses copied fdt, when is it copied and can we modify that?

Regular kexec's user-space already updates the dtb for the cmdline and maybe 
the initrd.
For KASLR, it generates its own seed with getrandom():

https://git.kernel.org/pub/scm/utils/kernel/kexec/kexec-tools.git/tree/kexec/arch/arm64/kexec-arm64.c#n483

If user-space can do it, user-space should do it!


Thanks,

James


Re: [PATCH v7 10/13] selftests/resctrl: Add vendor detection mechanism

2019-05-15 Thread James Morse
Hi André,

On 14/05/2019 20:40, André Przywara wrote:
> On 14/05/2019 18:20, James Morse wrote:
>> On 10/05/2019 18:39, Andre Przywara wrote:
>>> On Sat,  9 Feb 2019 18:50:39 -0800
>>> Fenghua Yu  wrote:
>>>> From: Babu Moger 
>>>>
>>>> RESCTRL feature is supported both on Intel and AMD now. Some features
>>>> are implemented differently. Add vendor detection mechanism. Use the vendor
>>>> check where there are differences.
>>>
>>> I don't think vendor detection is the right approach. The Linux userland
>>> interface should be even architecture agnostic, not to speak of different
>>> vendors.
>>>
>>> But even if we need this for some reason ...

>> What do we need it for? Surely it indicates something is wrong with the 
>> kernel interface
>> if you need to know which flavour of CPU this is.
> 
> As you mentioned, we should not need it. I just couldn't find a better
> way (yet) to differentiate between L3 cache ID and physical package ID
> (see patch 11/13). So this is a kludge for now to not break this
> particular code.

[0]? That's broken. It needs to take the 'cache/index?/id' field, and not 
hard-code '3',
search each 'cache/index?/level' instead.


Documentation/x86/resctrl_ui.rst's "Cache IDs" section says:
| On current generation systems there is one L3 cache per socket and L2
| caches are generally just shared by the hyperthreads on a core, but this
| isn't an architectural requirement.
[...]
| So instead of using "socket" or "core" to define the set of logical cpus
| sharing a resource we use a "Cache ID"
[...]
| To find the ID for each logical CPU look in
| /sys/devices/system/cpu/cpu*/cache/index*/id

arch/x86/kernel/cpu/resctrl/core.c:domain_add_cpu() pulls the domain-id out of 
struct
cacheinfo:
|   int id = get_cache_id(cpu, r->cache_level);

drivers/base/cacheinfo.c has some macro-foliage that exports this same field 
via sysfs,
and arch/x86/kernel/cpu/resctrl/ctrlmondata.c:parse_line() matches that id 
against the
value user-space provides in the schemata.

(we've got some horrible code for arm64 to make this work without 'cache id' as 
a hardware
property!)

On x86 these numbers are of the order 0,1,2, so its very likely 
physical_package_id and
cache_id alias, and you get away with it.
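
The selftest fix would be something along these lines (an untested sketch,
plain stdio):

| static int get_cache_id(int cpu, int wanted_level)
| {
| 	char path[128];
| 	int idx, val;
| 	FILE *f;
|
| 	for (idx = 0; ; idx++) {
| 		snprintf(path, sizeof(path),
| 			 "/sys/devices/system/cpu/cpu%d/cache/index%d/level",
| 			 cpu, idx);
| 		f = fopen(path, "r");
| 		if (!f)
| 			return -1;	/* ran out of cache indexes */
| 		if (fscanf(f, "%d", &val) != 1)
| 			val = -1;
| 		fclose(f);
| 		if (val != wanted_level)
| 			continue;
|
| 		snprintf(path, sizeof(path),
| 			 "/sys/devices/system/cpu/cpu%d/cache/index%d/id",
| 			 cpu, idx);
| 		f = fopen(path, "r");
| 		if (!f)
| 			return -1;
| 		if (fscanf(f, "%d", &val) != 1)
| 			val = -1;
| 		fclose(f);
| 		return val;
| 	}
| }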


> Out of curiosity: Is there any userland tool meant to control the
> resources? I guess poking around in sysfs is not how admins are expected
> to use this?

I've come across:
https://github.com/intel/intel-cmt-cat/

but I've never even cloned it. The rdtset man page has:
| With --iface-os (-I) parameter, rdtset uses resctrl filesystem 
(/sys/fs/resctrl)
| instead of accessing MSRs directly.


> This tool would surely run into the same problems, which somewhat tell
> me that the interface is not really right.

At the moment it matches neither the documentation nor the numbers the kernel is
actually using.

I assume this is something that changed in resctrl when it was merged, and this 
selftest
tool just needs updating.


Thanks,

James

[0] https://lkml.org/lkml/2019/2/9/384


Re: [PATCH v7 00/13] selftests/resctrl: Add resctrl selftest

2019-05-14 Thread James Morse
Hi Fenghua,

On 10/05/2019 20:20, Yu, Fenghua wrote:
>> On Friday, May 10, 2019 10:36 AM
>> Andre Przywara [mailto:andre.przyw...@arm.com] wrote:
>> On Sat,  9 Feb 2019 18:50:29 -0800
>> Fenghua Yu  wrote:

>>> With more and more resctrl features are being added by Intel, AMD and
>>> ARM, a test tool is becoming more and more useful to validate that
>>> both hardware and software functionalities work as expected.
>>
>> That's very much appreciated! We decided to use that tool here to detect
>> regressions in James' upcoming resctrl rework series. While doing so we
>> spotted some shortcomings:

>> - There is some unconditional x86 inline assembly which obviously breaks
>> the build on ARM.
> 
> Will fix this as much as possible.
> 
> BTW, does ARM support perf imc_count events which are used in CAT tests?

I've never heard of these. git-grep says its a powerpc pmu...

(after a quick chat with Andre), is this a cache-miss counter?
If so, its a bit murky, (and beware, I don't know much about perf).
The arch code's armv8_pmu has a 'PERF_COUNT_HW_CACHE_MISSES' map entry, so 
yes...
... but if we're measuring this on a cache outside the CPU, we'd need an 
'uncore' pmu
driver, so it would depend on what the manufacturer implemented.


I should admit that I'm expecting this selftest to test resctrl, and not depend 
on a lot
else. Couldn't it use the llc_occupancy value, 50% bitmap gives ~50% lower 
occupancy. Or
(some combination of) the mbm byte counters, (which would require a seeky 
workload to
cause repeat re-fills of the same line)...


Thanks,

James


Re: [PATCH v7 10/13] selftests/resctrl: Add vendor detection mechanism

2019-05-14 Thread James Morse
Hi Andre,

(thanks for digging into this!)

On 10/05/2019 18:39, Andre Przywara wrote:
> On Sat,  9 Feb 2019 18:50:39 -0800
> Fenghua Yu  wrote:
>> From: Babu Moger 
>>
>> RESCTRL feature is supported both on Intel and AMD now. Some features
>> are implemented differently. Add vendor detection mechanism. Use the vendor
>> check where there are differences.
> 
> I don't think vendor detection is the right approach. The Linux userland
> interface should be even architecture agnostic, not to speak of different
> vendors.
> 
> But even if we need this for some reason ...

>> diff --git a/tools/testing/selftests/resctrl/resctrl_tests.c 
>> b/tools/testing/selftests/resctrl/resctrl_tests.c
>> index 3959b2b0671a..1d9adcfbdb4c 100644
>> --- a/tools/testing/selftests/resctrl/resctrl_tests.c
>> +++ b/tools/testing/selftests/resctrl/resctrl_tests.c
>> @@ -14,6 +14,66 @@
>>  #define BENCHMARK_ARGS  64
>>  #define BENCHMARK_ARG_SIZE  64
>>  
>> +/**
>> + * This variables provides information about the vendor
>> + */
>> +int genuine_intel;
>> +int authentic_amd;
>> +
>> +void lcpuid(const unsigned int leaf, const unsigned int subleaf,
>> +struct cpuid_out *out)
> 
> There is a much easier way to detect the vendor, without resorting to
> (unchecked) inline assembly in userland:

> I changed this to scan /proc/cpuinfo for a line starting with vendor_id,
> then use the information there. This should work everywhere.

Everywhere x86. /proc/cpuinfo is unfortunately arch specific, arm64 spells that 
field 'CPU
implementer'. Short of invoking lscpu, I don't know a portable way of finding 
this string.

What do we need it for? Surely it indicates something is wrong with the kernel 
interface
if you need to know which flavour of CPU this is.

The only differences I've spotted are the 'MAX_MBA_BW' on Intel is a 
percentage, on AMD
its a number between 1 and 2K. If user-space needs to know we could add another 
file with
the 'max_bandwidth'. (I haven't spotted how the MBA default_ctrl gets exposed).

The other one is the non-contiguous CBM values, where user-space is expected to 
try it and
see [0]. If user-space needs to know, we could expose some 'sparse_bitmaps=[0 
1]', unaware
user-space keeps working.


Thanks,

James

[0] https://lore.kernel.org/patchwork/patch/991236/#1179944


Re: [PATCH] edac: sifive: Add EDAC platform driver for SiFive SoCs

2019-05-02 Thread James Morse
Hi Yash,

Sorry for the delay on the earlier version of this - I was trying to work out 
what happens
when multiple edac drivers probe based on DT...


On 02/05/2019 12:16, Yash Shah wrote:
> The initial ver of EDAC driver supports:
> - ECC event monitoring and reporting through the EDAC framework for SiFive
>   L2 cache controller.
> 

You probably don't want this bit preserved in the kernel log:
{

> This patch depends on patch
> 'RISC-V: sifive_l2_cache: Add L2 cache controller driver for SiFive SoCs'
> https://lkml.org/lkml/2019/5/2/309

}

> The EDAC driver registers for notifier events from the L2 cache controller
> driver (arch/riscv/mm/sifive_l2_cache.c) for L2 ECC events
> 
> Signed-off-by: Yash Shah 
> ---

(if you put it here, it gets discarded when the patch is applied)

Having a separately posted dependency like this is tricky, as this code can't be
used/tested until the other bits are merged.


>  MAINTAINERS|   6 +++
>  arch/riscv/Kconfig |   1 +
>  drivers/edac/Kconfig   |   6 +++
>  drivers/edac/Makefile  |   1 +
>  drivers/edac/sifive_edac.c | 121 
> +
>  5 files changed, 135 insertions(+)
>  create mode 100644 drivers/edac/sifive_edac.c

> diff --git a/drivers/edac/sifive_edac.c b/drivers/edac/sifive_edac.c
> new file mode 100644
> index 000..eb7a9b9
> --- /dev/null
> +++ b/drivers/edac/sifive_edac.c
> @@ -0,0 +1,121 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * SiFive Platform EDAC Driver
> + *
> + * Copyright (C) 2018-2019 SiFive, Inc.
> + *
> + * This driver is partially based on octeon_edac-pc.c
> + *
> + */
> +#include 
> +#include 
> +#include "edac_module.h"
> +
> +#define DRVNAME "sifive_edac"
> +
> +extern int register_sifive_l2_error_notifier(struct notifier_block *nb);
> +extern int unregister_sifive_l2_error_notifier(struct notifier_block *nb);

Ideally these would live in some header file.
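
Something like this (just a sketch, header path made up) would let the cache driver and
the edac driver share one set of prototypes:

/* e.g. include/soc/sifive/sifive_l2_cache.h (hypothetical path) */
#ifndef __SOC_SIFIVE_L2_CACHE_H
#define __SOC_SIFIVE_L2_CACHE_H

struct notifier_block;

int register_sifive_l2_error_notifier(struct notifier_block *nb);
int unregister_sifive_l2_error_notifier(struct notifier_block *nb);

#endif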


> +struct sifive_edac_priv {
> + struct notifier_block notifier;
> + struct edac_device_ctl_info *dci;
> +};
> +
> +/**
> + * EDAC error callback
> + *
> + * @event: non-zero if unrecoverable.
> + */
> +static
> +int ecc_err_event(struct notifier_block *this, unsigned long event, void 
> *ptr)
> +{
> + const char *msg = (char *)ptr;
> + struct sifive_edac_priv *p;
> +
> + p = container_of(this, struct sifive_edac_priv, notifier);
> +
> + if (event)
> + edac_device_handle_ue(p->dci, 0, 0, msg);
> + else
> + edac_device_handle_ce(p->dci, 0, 0, msg);

This would be easier to read if your SIFIVE_L2_ERR_TYPE_UE were exposed via 
some header file.


> +
> + return NOTIFY_STOP;

Your notifier register calls are EXPORT_SYMBOL()d, but Kconfig forbids building 
this as a
module, so its not for this driver. If there is another user of this 
notifier-chain, won't
NOTIFY_STOP here break it?


> +}
> +
> +static int ecc_register(struct platform_device *pdev)
> +{
> + struct sifive_edac_priv *p;
> +
> + p = devm_kzalloc(&pdev->dev, sizeof(*p), GFP_KERNEL);
> + if (!p)
> + return -ENOMEM;
> +
> + p->notifier.notifier_call = ecc_err_event;
> + platform_set_drvdata(pdev, p);
> +
> + p->dci = edac_device_alloc_ctl_info(sizeof(*p), "sifive_ecc", 1,

sizeof(*p) here is how much space in struct edac_device_ctl_info you need for 
private
storage... but you never touch p->dci->pvt_info, so you aren't using it.

0?


> + "sifive_ecc", 1, 1, NULL, 0,
> + edac_device_alloc_index());
> + if (IS_ERR(p->dci))
> + return PTR_ERR(p->dci);
> +
> + p->dci->dev = &pdev->dev;
> + p->dci->mod_name = "Sifive ECC Manager";
> + p->dci->ctl_name = dev_name(&pdev->dev);
> + p->dci->dev_name = dev_name(&pdev->dev);
> +
> + if (edac_device_add_device(p->dci)) {
> + dev_err(p->dci->dev, "failed to register with EDAC core\n");
> + goto err;
> + }
> +
> + register_sifive_l2_error_notifier(&p->notifier);
> +
> + return 0;
> +
> +err:
> + edac_device_free_ctl_info(p->dci);
> +
> + return -ENXIO;
> +}

> +struct platform_device *sifive_pdev;

static?


> +static int __init sifive_edac_init(void)
> +{
> + int ret;
> +
> + sifive_pdev = platform_device_register_simple(DRVNAME, 0, NULL, 0);
> + if (IS_ERR(sifive_pdev))
> + return PTR_ERR(sifive_pdev);
> +
> + ret = ecc_register(sifive_pdev);
> + if (ret)
> + platform_device_unregister(sifive_pdev);
> +
> + return ret;
> +}
> +
> +static void __exit sifive_edac_exit(void)
> +{
> + ecc_unregister(sifive_pdev);
> + platform_device_unregister(sifive_pdev);
> +}

Looks good to me. I think this patch should go with its two dependencies, I'm 
not sure why
it got split off...

Reviewed-by: James Morse 


Thanks,

James


Re: [RFC PATCH] kprobes/arm64: Blacklist functions called in '_sdei_handler'

2019-04-25 Thread James Morse
Hi Xiongfeng Wang,

On 20/04/2019 10:06, Xiongfeng Wang wrote:
> Functions called in '_sdei_handler' are needed to be marked as
> 'nokprobe'.

+ "Because these functions are called in NMI context and neither the 
arch-code's debug
infrastructure nor kprobes core supports this."


> ---
> When I kprobe 'sdei_smccc_smc',

Heh, when debugging this I put printk()s on the other side.


> the cpu hungs. I am not sure if it is
> becase when SDEI events interrupt EL0, '__sdei_asm_entry_trampoline'
> didn't restore vbar_el1 with kernel vectors.

Yes, we don't expect exceptions from the NMI handler, so VBAR_EL1 is left as 
whatever its
current state is. Today it could be the EL0 trampoline, the kernel vectors 
proper, or KVMs
vectors. This may change in the future, all SDEI does with VBAR_EL1 is emulate 
the
architecture: read it in order to fake up an interrupt on exit.


> Or is it becasue brk
> exception corrupt elr_el1 and trust firmware didn't preserve it for us?

Yes, this too! "5.2.1.1 Client responsibilities" of [0]
| The handler may modify the accessible System registers, but these must be 
restored
| before the handler completes.

We don't do this, because its extra work, and we don't ever anticipate it being 
necessary.


> drivers/firmware/arm_sdei.c | 3 +++
> 1 file changed, 3 insertions(+)

Nit: Because this patch only touches this file, the subject prefix should 
really be
"firmware: arm_sdei:". Running 'git log --online' over a file should give you a 
hint at
what the expected style is. This lets people spot the patches they need to look 
at.



> diff --git a/drivers/firmware/arm_sdei.c b/drivers/firmware/arm_sdei.c
> index e6376f9..9cd70d1 100644
> --- a/drivers/firmware/arm_sdei.c
> +++ b/drivers/firmware/arm_sdei.c
> @@ -165,6 +165,7 @@ static int invoke_sdei_fn(unsigned long function_id, 
> unsigned long arg0,
>  
>   return err;
>  }
> +NOKPROBE_SYMBOL(invoke_sdei_fn);

Ah! These are called via sdei_api_event_context(). Oops.


>  static struct sdei_event *sdei_event_find(u32 event_num)
>  {
> @@ -879,6 +880,7 @@ static void sdei_smccc_smc(unsigned long function_id,
>  {
>   arm_smccc_smc(function_id, arg0, arg1, arg2, arg3, arg4, 0, 0, res);
>  }
> +NOKPROBE_SYMBOL(sdei_smccc_smc);
>  
>  static void sdei_smccc_hvc(unsigned long function_id,
>  unsigned long arg0, unsigned long arg1,
> @@ -887,6 +889,7 @@ static void sdei_smccc_hvc(unsigned long function_id,
>  {
>   arm_smccc_hvc(function_id, arg0, arg1, arg2, arg3, arg4, 0, 0, res);
>  }
> +NOKPROBE_SYMBOL(sdei_smccc_hvc);
>  
>  int sdei_register_ghes(struct ghes *ghes, sdei_event_callback *normal_cb,
>  sdei_event_callback *critical_cb)


Thanks for making this more robust!

Reviewed-by: James Morse 


Thanks,

James


[0]
http://infocenter.arm.com/help/topic/com.arm.doc.den0054a/ARM_DEN0054A_Software_Delegated_Exception_Interface.pdf


Re: [RFC PATCH 2/3] sdei: enable dbg in '_sdei_handler'

2019-04-24 Thread James Morse
Hi Xiongfeng Wang,

On 12/04/2019 13:04, Xiongfeng Wang wrote:
> When we monitor a sdei_event callback using 'kprobe', the singlestep
> handler can not be called because dbg is masked in sdei_handler.

For a good reason: the debug hardware may be in use single-stepping the kernel, 
or worse:
in use by a KVM guest.

The DAIF flags are all set because none of these things are safe for an
SDEI-event/NMI-handler. We haven't done KVM's guest exit work yet, so things 
like the
debug hardware belong to the guest.

Its not safe to take an exception from NMI context, avoid it at all costs!


> This patch enable dbg in '_sdei_handler'.

This is very bad as the debug hardware may have been in use by something else. 
A malicious
guest can now cause you to take breakpoint/watchpoint exceptions.


> When SDEI events interrupt the userspace, 'vbar_el1' contains
> 'tramp_vectors' if CONFIG_UNMAP_KERNEL_AT_EL0 is enabled. So we need to
> restore 'vbar_el1' with kernel vectors, otherwise we will go to the
> wrong place when brk exception or dbg exception happens.

This is the tip of the iceberg. You may have been partway through KVM's 
world-switch,
you'd need to temporarily save/restore the host context. This ends up being a
parody-world-switch, which has to be kept in-sync with KVM. We didn't go this 
way because
its fragile and unmaintainable.


> SDEI events may interrupt 'kernel_entry' before we save 'spsr_el1' and
> 'elr_el1', and dbg exception will corrupts these two registers. So we
> also need to save and restore these two registers.

(or don't do anything that would cause them to get clobbered)


> If SDEI events interrupt an instruction being singlestepped, the
> instruction in '_sdei_handler' will begin to be singlestepped after we
> enable dbg. So we need to disable singlestep in the beginning of
> _sdei_handler if we find out we interrupt a singlestep process.

And now the arch code's do_debug_exception() is re-rentrant, which it doesn't 
expect.

The kprobes core code can't be kprobed, but it can be interrupted by NMI. If 
you can
kprobe the NMI, you've made the kprobes core re-entrant too. A quick look shows
raw_spin_lock() that could deadlock, and it passes values around with a per-cpu 
variable.

You could interrupt any part of the krpobes machinery with an SDEI event, take 
a kprobe,
then take a critical SDEI event, and a third kprobe.

I don't see how its possible to make this safe without re-writing kprobes to be 
NMI safe.


> diff --git a/arch/arm64/kernel/sdei.c b/arch/arm64/kernel/sdei.c
> index ea94cf8..9dd9cf6 100644
> --- a/arch/arm64/kernel/sdei.c
> +++ b/arch/arm64/kernel/sdei.c

> - if (elr != read_sysreg(elr_el1)) {
> - /*
> -  * We took a synchronous exception from the SDEI handler.
> -  * This could deadlock, and if you interrupt KVM it will
> -  * hyp-panic instead.
> -  */
> - pr_warn("unsafe: exception during handler\n");
> - }

This is here to catch a WARN_ON() firing and not killing the system. Its not 
safe to take
an exception from the NMI handler.


Why do you need to kprobe an NMI handler?


Thanks,

James


Re: [RFC PATCH 0/3] Enable kprobe to monitor sdei event handler

2019-04-24 Thread James Morse
Hi Xiongfeng Wang,

On 12/04/2019 13:04, Xiongfeng Wang wrote:
> When I use kprobe to monitor a sdei event handler,

Don't do this! SDEI is like an NMI, it isn't safe to kprobe it as it can interrupt the
kprobe code, causing it to become re-entrant.


> the CPU will hang. It's
> because when I probe the event handler, the instruction will be replaced with 
> brk instruction and brk exception is unmaskable. But 'vbar_el1' contains 
> 'tramp_vectors' in '_sdei_handler' when SDEI events interrupt userspace, so
> we will go to the wrong place if brk exception happens.

This was lucky! Its even more fun if the SDEI event interrupted a guest: the 
kvm vectors
will give you a hyp-panic.

The __kprobes and NOKPROBE_SYMBOL() litter should stop you doing this.


> I notice that 'ghes_sdei_normal_callback' call several funtions that are not
> marked as 'nokprobe'.

Bother. We should probably blacklist those too, its not safe.


> So I was wondering if we can enable kprobe in '_sdei_handler'.

I don't think this can be done safely.


If you need to monitor your SDEI event handler you can just use printk(). Once 
nmi_enter()
has been called these are safe as they stash data in a per-cpu buffer. The SDEI 
handler
will exit via the IRQ vector if it can, which will cause this buffer to be 
flushed to the
console in a timely manner.


Why do you need to kprobe an NMI handler?



Thanks!

James


Re: of_reserved_mem()/kexec interaction

2019-04-02 Thread James Morse
Hi Roy,

On 01/04/2019 20:09, Roy Pledge wrote:
> I'm trying to understand if memory reserved in the device tree via the
> "reserved-memory" facility is preserved during a kexec system call,
> i.e., is the memory at the same location with the contents undisturbed
> when the new kernel starts?

If the reservation is a static-allocation (so the address appears in the DT) 
then the new
kernel should know this is reserved too, and not touch it.

If its a dynamic-allocation, the address isn't in the DT, so the new kernel 
can't know. It
will dynamically allocate a new reservation, which may be in a different place. 
If the
data didn't matter at the first-boot, it probably doesn't matter over 
subsequent kexec either.

As an outlier: the gic has some funny requirements around this. It needs to 
dynamically
allocate a page during first-boot that is preserved over kexec. It does this 
using
efi_mem_reserve_persistent().
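
For reference, the call it makes looks roughly like this (a sketch; 'gic_page_phys'
stands in for whatever page it dynamically allocated at first boot):

	/* from linux/efi.h: keep this dynamically-allocated page across kexec */
	if (efi_enabled(EFI_CONFIG_TABLES))
		efi_mem_reserve_persistent(gic_page_phys, SZ_4K);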


Thanks,

James


Re: [PATCH 1/2] edac: sifive: Add DT documentation for SiFive L2 cache Controller

2019-04-01 Thread James Morse
Hi Rob,

On 29/03/2019 14:11, Rob Herring wrote:
> On Thu, Mar 28, 2019 at 1:47 PM James Morse  wrote:
>> On 28/03/2019 13:16, Rob Herring wrote:
>>> On Tue, Mar 12, 2019 at 02:51:00PM +0530, Yash Shah wrote:
>>>> DT documentation for L2 cache controller added.

>>>> diff --git a/Documentation/devicetree/bindings/edac/sifive-edac-l2.txt 
>>>> b/Documentation/devicetree/bindings/edac/sifive-edac-l2.txt
>>>> new file mode 100644
>>>> index 000..abce09f
>>>> --- /dev/null
>>>> +++ b/Documentation/devicetree/bindings/edac/sifive-edac-l2.txt
>>>> @@ -0,0 +1,31 @@
>>>> +SiFive L2 Cache EDAC driver device tree bindings
>>>> +-
>>>> +This driver uses the EDAC framework to report L2 cache controller ECC 
>>>> errors.
>>>
>>> Bindings are for h/w blocks, not drivers. (And Boris may want a single
>>> driver, but bindings should reflect the h/w, not what Linux (currently)
>>> wants.
>>
>> For h/w block compatibles and edac, I think all we need now is to ensure the 
>> DT contains
>> the three compatible strings: platform (if there is one), soc and ip-name 
>> (if its a
>> re-usable thing).
>> This is so that linux can pick the biggest of the three (usually platform) 
>> to probe the
>> driver from, as this lets us capture platform properties we only find out 
>> about later.
> 
> DT is not the only what to instantiate drivers. If the OS really wants
> to have a single driver for multiple h/w blocks, then it needs to
> instantiate a driver itself (based on the top-level compatible
> probably) and then that driver can find the DT nodes it needs itself.

I think this is where we are heading. (but I need to get my head round this 
top-level thing).

Can the OS do both, depending on the platform?
e.g. on a system with one component the driver runs 'standalone', whereas on a 
bigger
system with multiple components the same driver is used as a library by 
something else.

I don't see how this would work if the common component's DT entry looks the 
same on both
platforms. Wouldn't this depend on the order stuff is done in, or 'but not this 
one'
checks in the driver?


> In any case, it's all irrelevant to the DT binding. We don't design
> bindings around what some particular OS wants.

I agree.

What we want to do is spot the problems on the horizon so we either have the 
right
information in the DT today, or at least know what it looks like so we don't 
cause a
regression when a new platform makes previous behaviour generic/a-library.


Thanks,

James


Re: [PATCH 1/2] edac: sifive: Add DT documentation for SiFive L2 cache Controller

2019-03-28 Thread James Morse
Hi Rob, Yash,

On 28/03/2019 13:16, Rob Herring wrote:
> On Tue, Mar 12, 2019 at 02:51:00PM +0530, Yash Shah wrote:
>> DT documentation for L2 cache controller added.

>> diff --git a/Documentation/devicetree/bindings/edac/sifive-edac-l2.txt 
>> b/Documentation/devicetree/bindings/edac/sifive-edac-l2.txt
>> new file mode 100644
>> index 000..abce09f
>> --- /dev/null
>> +++ b/Documentation/devicetree/bindings/edac/sifive-edac-l2.txt
>> @@ -0,0 +1,31 @@
>> +SiFive L2 Cache EDAC driver device tree bindings
>> +-
>> +This driver uses the EDAC framework to report L2 cache controller ECC 
>> errors.
> 
> Bindings are for h/w blocks, not drivers. (And Boris may want a single 
> driver, but bindings should reflect the h/w, not what Linux (currently) 
> wants.

For h/w block compatibles and edac, I think all we need now is to ensure the DT 
contains
the three compatible strings: platform (if there is one), soc and ip-name (if 
its a
re-usable thing).
This is so that linux can pick the biggest of the three (usually platform) to 
probe the
driver from, as this lets us capture platform properties we only find out about 
later.

The single-driver idea is because ras/edac gets done late, (its not necessary 
to boot
linux on the board), and the edac core has difficulty with multiple components 
feeding
into it.

I don't think we need platform-specific-drivers until someone has a platform 
that needs
one for this multiple-component issue. To let us do that later (and possibly 
your
customer's customer to do it), we'd like to avoid probing based on the smallest
compatible, and use the biggest instead.
This lets us remove the platform-compatible from the L2-cache-controller 
driver's list,
and let some platform driver use it as a library instead. This is to avoid 'but 
not this
one' checks in the driver.

Yes it results in more one-line patches to enable support in the kernel, but 
these are
also statements that this platform/soc supports ras/edac. Its possible to build 
a soc out
of components that all support ras, but not have it working end-to-end.
(oh its optional, was it turned on? Does firmware need to enable something? Is 
there a
side-band signal, was it wired up everywhere).


> Are the only controls for ECC? Are all the cache attributes discoverable 
> (size, ways, line size, level, etc.)? 

>> +Example:
>> +
>> +cache-controller@201 {
>> +compatible = "sifive,fu540-c000-ccache", "sifive,ccache0";

/me googles
fu540-c000 is the soc, but it doesn't have dram, so it must get used in other 
platforms.

If there is anything the platform/firmware can influence with this thing please ensure
they have properties in here. (firmware-privilege enables, or pins that have to be tied
high/low)

As an example of the problem we're trying to avoid: someone could re-use the design of
"sifive,ccache0" in another soc, where the DRAM controller also supports edac. 
From just
"sifive,ccache0", they can't tell their system (which needs a multi-component 
driver) from
yours, which just needs this one.

There may be a clever way of getting DT to do this...


>> +interrupt-parent = <>;
>> +interrupts = <1 2 3>;
>> +reg = <0x0 0x201 0x0 0x1000 0x0 0x800 0x0 0x200>;
>> +reg-names = "control", "sideband";
>> +};
>> -- 
>> 1.9.1
>>


Thanks,

James


Re: [PATCH v2 1/2] EDAC: add EDAC driver for DMC520

2019-03-25 Thread James Morse
Hi Rui,

On 07/03/2019 01:24, Rui Zhao wrote:
> From: Rui Zhao 
> New driver supports error detection and correction on
> the devices with ARM DMC-520 memory controller.


> diff --git a/drivers/edac/dmc520_edac.c b/drivers/edac/dmc520_edac.c
> new file mode 100644
> index 000..c70ce4e
> --- /dev/null
> +++ b/drivers/edac/dmc520_edac.c
> @@ -0,0 +1,619 @@
> +// SPDX-License-Identifier: GPL-2.0+
> +// EDAC driver for DMC-520

[..]

> +// DMC-520 types, masks and bitfields
> +#define MEMORY_TYPE_LPDDR3   0
> +#define MEMORY_TYPE_DDR3 1
> +#define MEMORY_TYPE_DDR4 2
> +#define MEMORY_TYPE_LPDDR4   3

(this might be better as an enum as it hints to the compiler that only these
values can ever happen. An example of where this is handy further down...)


> +#define MEMORY_DEV_WIDTH_X4  0
> +#define MEMORY_DEV_WIDTH_X8  1
> +#define MEMORY_DEV_WIDTH_X16 2
> +#define MEMORY_DEV_WIDTH_X32 3

(this might be better as an enum)


> +struct dmc520_edac {
> + void __iomem *reg_base;

> + char message[EDAC_MSG_BUF_SIZE];

Does there need to be an irq-safe lock for this message buffer?
Your snprintf() into it in dmc520_handle_dram_ecc_errors(), which is called 
from both
dmc520_edac_dram_ue_isr() and dmc520_edac_dram_ce_isr()... if these are wired 
up as
different irqs, they could happen on different CPUs concurrently...
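
If both can fire at once, something like this (just a sketch, reusing your names, with
spin_lock_init() in probe) would keep the buffer consistent:

struct dmc520_edac {
	void __iomem *reg_base;
	spinlock_t error_lock;			/* protects 'message' */
	char message[EDAC_MSG_BUF_SIZE];
};

	/* in the error handler, around the snprintf() and the report */
	unsigned long flags;

	spin_lock_irqsave(&edac->error_lock, flags);
	/* ... snprintf() into edac->message and report it ... */
	spin_unlock_irqrestore(&edac->error_lock, flags);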


> +};


> +static bool dmc520_get_dram_ecc_error_info(struct dmc520_edac *edac,
> +bool is_ce,
> +struct ecc_error_info *info)
> +{
> + u32 reg_offset_low, reg_offset_high;
> + u32 reg_val_low, reg_val_high;
> + bool valid;
> +
> + reg_offset_low = is_ce ? REG_OFFSET_DRAM_ECC_ERRC_INT_INFO_31_00 :
> +  REG_OFFSET_DRAM_ECC_ERRD_INT_INFO_31_00;
> + reg_offset_high = is_ce ? REG_OFFSET_DRAM_ECC_ERRC_INT_INFO_63_32 :
> +   REG_OFFSET_DRAM_ECC_ERRD_INT_INFO_63_32;
> +
> + reg_val_low = dmc520_read_reg(edac, reg_offset_low);
> + reg_val_high = dmc520_read_reg(edac, reg_offset_high);
> +
> + valid = (FIELD_GET(REG_FIELD_ERR_INFO_LOW_VALID, reg_val_low) != 0) &&
> + (FIELD_GET(REG_FIELD_ERR_INFO_HIGH_VALID, reg_val_high) != 0);
> +
> + if (info) {

I see one caller, dmc520_handle_dram_ecc_errors(), which has info on the 
stack... is it
possible for info to be NULL here?


> + if (valid) {
> + info->col = FIELD_GET(REG_FIELD_ERR_INFO_LOW_COL,
> +   reg_val_low);
> + info->row = FIELD_GET(REG_FIELD_ERR_INFO_LOW_ROW,
> +   reg_val_low);
> + info->rank = FIELD_GET(REG_FIELD_ERR_INFO_LOW_RANK,
> +reg_val_low);
> + info->bank = FIELD_GET(REG_FIELD_ERR_INFO_HIGH_BANK,
> +reg_val_high);
> + } else {
> + memset(info, 0, sizeof(struct ecc_error_info));
> + }
> + }
> +
> + return valid;
> +}

> +static enum mem_type dmc520_get_mtype(struct dmc520_edac *edac)
> +{
> + enum mem_type mt;
> + u32 reg_val, type;
> +
> + reg_val = dmc520_read_reg(edac, REG_OFFSET_MEMORY_TYPE_NOW);
> + type = FIELD_GET(REG_FIELD_MEMORY_TYPE, reg_val);
> +
> + switch (type) {
> + case MEMORY_TYPE_LPDDR3:
> + case MEMORY_TYPE_DDR3:
> + mt = MEM_DDR3;
> + break;
> +
> + case MEMORY_TYPE_DDR4:
> + case MEMORY_TYPE_LPDDR4:
> + default:

Is this default just to shut the compiler warning up?
If so, you could use an enum and the compiler won't warn for a switch() that 
has all the
cases covered.
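
i.e. something like this (a sketch, keeping your register field values and names):

enum dmc520_mem_type {
	MEMORY_TYPE_LPDDR3,
	MEMORY_TYPE_DDR3,
	MEMORY_TYPE_DDR4,
	MEMORY_TYPE_LPDDR4,
};

static enum mem_type dmc520_get_mtype(struct dmc520_edac *edac)
{
	enum dmc520_mem_type type;
	u32 reg_val;

	reg_val = dmc520_read_reg(edac, REG_OFFSET_MEMORY_TYPE_NOW);
	type = FIELD_GET(REG_FIELD_MEMORY_TYPE, reg_val);

	switch (type) {
	case MEMORY_TYPE_LPDDR3:
	case MEMORY_TYPE_DDR3:
		return MEM_DDR3;
	case MEMORY_TYPE_DDR4:
	case MEMORY_TYPE_LPDDR4:
		return MEM_DDR4;
	}

	return MEM_DDR4;	/* keeps -Wreturn-type quiet, never reached */
}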


> + mt = MEM_DDR4;
> + break;
> + }
> + return mt;
> +}


> +static irqreturn_t dmc520_edac_dram_ce_isr(int irq, void *data)
> +{
> + return dmc520_edac_dram_ecc_isr(irq, data, true);
> +}
> +
> +static irqreturn_t dmc520_edac_dram_ue_isr(int irq, void *data)
> +{
> + return dmc520_edac_dram_ecc_isr(irq, data, false);
> +}

How come these calls aren't folded into the one-size fits all irq handler below?
If the hardware always sets the status register, does the irqchip driver need 
to call a
different helper?


> +static irqreturn_t dmc520_edac_all_isr(int irq, void *data)
> +{
> + struct mem_ctl_info *mci;
> + struct dmc520_edac *edac;
> + u32 status;
> +
> + mci = data;
> + edac = mci->pvt_info;
> +
> + status = dmc520_read_reg(edac, REG_OFFSET_INTERRUPT_STATUS);
> +
> + if (status & DRAM_ECC_INT_CE_MASK)
> + dmc520_edac_dram_ce_isr(irq, data);
> +
> + if (status & DRAM_ECC_INT_UE_MASK)
> + dmc520_edac_dram_ue_isr(irq, data);
> +
> + // Other interrupt handlers can 

Re: [PATCH v2 1/2] EDAC: add EDAC driver for DMC520

2019-03-25 Thread James Morse
Hi guys,

On 23/03/2019 09:23, Borislav Petkov wrote:
> On Thu, Mar 07, 2019 at 01:24:01AM +, Rui Zhao wrote:
>> From: Rui Zhao 
>>
>> New driver supports error detection and correction on
>> the devices with ARM DMC-520 memory controller.

A question/suggestion on the direction...

Could we avoid probing the driver based on the root hardware compatible?
Could we use the device/chip/platform specific one instead?


We want to avoid per-function edac drivers. If ${my_chip} has edac support for 
L3 and
memory, I should have a ${my_chip}_edac driver that pulls in the appropriate L3 
and memory
code, and presents a sensible view to edac_core.

Thinking out loud...

You have:

>> +static const struct of_device_id dmc520_edac_driver_id[] = {
>> +{ .compatible = "arm,dmc-520", },
>> +{ /* end of table */ }
>> +};
>> +

If you wanted to add another device with edac support, we'd ask you to create
${your_chip}_edac driver and pull in the DMC520 and the other device.

But probing the 'arm,dmc-520' compatible like this leaves us in a tricky place 
if someone
else does this: ${their_device} probes the dmc520 like this too, but they can't 
stop it on
their platform as it will break yours...


It's normal to have a specific compatible, vexpress has:
| compatible = "arm,vexpress,v2f-2xv6,ca7x3", "arm,vexpress,v2f-2xv6", 
"arm,vexpress";

Could we do the same here:
| compatible = "vendor,soc-name-dmc520", "arm,dmc-520";

or even:
| compatible = "microsoft,product-name-dmc520", "arm,dmc-520";
if there is some firmware/board configuration that means vendor/soc isn't 
precise enough.

Then we always probe the driver from "vendor,soc-name-dmc520", never from 
"arm,dmc520".

This means we grow a list of vendor/soc-name that are using this driver, but if 
one of
them wants to support a second edac device, we can remove their vendor/soc-name 
from the
list without affecting anyone else.
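
i.e. something like (names made up):

static const struct of_device_id dmc520_edac_driver_id[] = {
	/* hypothetical platform compatibles that have opted in to this driver */
	{ .compatible = "vendor,soc-name-dmc520", },
	{ .compatible = "microsoft,product-name-dmc520", },
	{ /* end of table */ }
};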


Thanks,

James


Re: [PATCH v2 2/2] dt-bindings: edac: arm-dmc520.txt

2019-03-25 Thread James Morse
Hi Rui,

On 07/03/2019 01:24, Rui Zhao wrote:
> From: Rui Zhao 
> dt-bindings for new EDAC driver dmc520_edac.c.

(minor nit, the DT folk prefer the binding to come first in the series, this 
makes it
easier to review)


> diff --git a/Documentation/devicetree/bindings/edac/arm-dmc520.txt 
> b/Documentation/devicetree/bindings/edac/arm-dmc520.txt
> new file mode 100644
> index 000..7bea7dd
> --- /dev/null
> +++ b/Documentation/devicetree/bindings/edac/arm-dmc520.txt
> @@ -0,0 +1,21 @@
> +* ARM DMC-520 EDAC node
> +
> +Required properties:
> +- compatible : "arm,dmc-520".
> +- reg: Address range of the DMC-520 registers.
> +- interrupts : DMC-520 interrupt numbers.

Your example has two interrupts, what do they correspond to? (It needs to be 
clear from
the binding)

Because this thing has quite a few, it may be worth naming the ones you use. If 
someone
else's platform uses one of the others, they can add it without conflicting 
with DTs for
yours.

Looking through the TRM for things ending in _int, they seem to be:
* ram_ecc_erc
* ram_ecc_erd
* dram_ecc_erc
* dram_ecc_erd
* failed_access
* failed_prog
* link_err
* arch_fsm
* temperature_event
* phy_request
* combined_int

I think this is far too many to enumerate from day one, especially as your 
platform only
needs two. Could we name the two you need so that its clear which ones they 
are, and
others can be added when someone needs them.


> +- interrupt-mask : Interrupts to be enabled, refer to interrupt_control
> +   register in DMC-520 TRM for interrupt mapping.

This sounds like policy. It would be good to omit the interrupts that aren't 
wired up. If
there is a policy like 'use ram not dram on platform Y' we can get the edac 
driver to do
that based on of_machine_is_compatible() (as the altera edac driver already 
does).
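
e.g. something like this in the driver (board name and 'interrupt_mask' member made up):

	/* platform policy lives in the driver, not the DT */
	if (of_machine_is_compatible("vendor,board-name"))
		priv->interrupt_mask = DRAM_ECC_INT_CE_MASK | DRAM_ECC_INT_UE_MASK;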


> +Optional properties:
> +- interrupt-shared   : set this property if and only if all DMC-520
> +   interrupts share the interrupt number.

What if some of them are combined, and some aren't?

(this shared-interrupts property was my example of why we need a documented binding to
work out what is specific to your platform)

I'm not sure how this usually gets described in a DT binding ... couldn't we 
spot this
from duplicate entries in the interrupts property? If we register them with 
IRQF_SHARED,
would it matter?

(We can always tell its our device from the status register, so I think we 
should use
IRQF_SHARED regardless.)
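
Roughly what I had in mind (a sketch, using your existing names):

	/* probe: IRQF_SHARED is harmless even when the line isn't shared */
	ret = devm_request_irq(&pdev->dev, irq, dmc520_edac_all_isr,
			       IRQF_SHARED, dev_name(&pdev->dev), mci);
	if (ret)
		return ret;

	/* handler: bail out early if the status says it wasn't us */
	status = dmc520_read_reg(edac, REG_OFFSET_INTERRUPT_STATUS);
	if (!(status & (DRAM_ECC_INT_CE_MASK | DRAM_ECC_INT_UE_MASK)))
		return IRQ_NONE;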


Thanks,

James


[PATCH v5] arm64: Add workaround for Fujitsu A64FX erratum 010001

2019-02-26 Thread James Morse
From: Zhang Lei 

On the Fujitsu-A64FX cores ver(1.0, 1.1), memory access may cause
an undefined fault (Data abort, DFSC=0b11). This fault occurs under
a specific hardware condition when a load/store instruction performs an
address translation. Any load/store instruction, except non-fault access
including Armv8 and SVE might cause this undefined fault.

The TCR_ELx.NFD1 bit is used by the kernel when CONFIG_RANDOMIZE_BASE
is enabled to mitigate timing attacks against KASLR where the kernel
address space could be probed using the FFR and suppressed fault on
SVE loads.

Since this erratum causes spurious exceptions, which may corrupt
the exception registers, we clear the TCR_ELx.NFDx=1 bits when
booting on an affected CPU.

Signed-off-by: Zhang Lei 
[Generated MIDR value/mask for __cpu_setup(), removed spurious-fault handler
 and always disabled the NFDx bits on affected CPUs]
Signed-off-by: James Morse 
Tested-by: zhang.lei 

---
(and since posted inline: removed the stray ; on
 MIDR_FUJITSU_ERRATUM_010001_MASK)

Changes since [v4]
 * Generated MIDR value/mask for __cpu_setup(),
 * removed spurious-fault handler,
 * always disabled the NFDx bits on affected CPUs

Changes since [v3]
 * Add description of the patch.
 * Add dependency to Kconfig.
  - Set default value of FUJITSU_ERRATUM_010001 depends on RANDOMIZE_BASE.

Changes since [v2]
 * Change TCR_ELx.NFD1.
  - Set TCR_ELx.NFD1 to 0 when entry kernel.
  - Set TCR_ELx.NFD1 to 1 when exit kernel.

Changes since [v1]
 * Use the errata framework to work around for Fujitsu A64FX erratum 010001.


 Documentation/arm64/silicon-errata.txt |  1 +
 arch/arm64/Kconfig | 19 +++
 arch/arm64/include/asm/assembler.h | 20 
 arch/arm64/include/asm/cputype.h   |  9 +
 arch/arm64/include/asm/pgtable-hwdef.h |  1 +
 arch/arm64/mm/proc.S   |  1 +
 6 files changed, 51 insertions(+)

diff --git a/Documentation/arm64/silicon-errata.txt 
b/Documentation/arm64/silicon-errata.txt
index 1f09d043d086..26d64e9f3a35 100644
--- a/Documentation/arm64/silicon-errata.txt
+++ b/Documentation/arm64/silicon-errata.txt
@@ -80,3 +80,4 @@ stable kernels.
 | Qualcomm Tech. | Falkor v1   | E1009   | 
QCOM_FALKOR_ERRATUM_1009|
 | Qualcomm Tech. | QDF2400 ITS | E0065   | 
QCOM_QDF2400_ERRATUM_0065   |
 | Qualcomm Tech. | Falkor v{1,2}   | E1041   | 
QCOM_FALKOR_ERRATUM_1041|
+| Fujitsu| A64FX   | E#010001| FUJITSU_ERRATUM_010001  
|
diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index a4168d366127..b0b7f1c4e816 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -643,6 +643,25 @@ config QCOM_FALKOR_ERRATUM_E1041
 
  If unsure, say Y.
 
+config FUJITSU_ERRATUM_010001
+   bool "Fujitsu-A64FX erratum E#010001: Undefined fault may occur wrongly"
+   default y
+   help
+ This option adds workaround for Fujitsu-A64FX erratum E#010001.
+ On some variants of the Fujitsu-A64FX cores ver(1.0, 1.1), memory
+ accesses may cause undefined fault (Data abort, DFSC=0b11).
+ This fault occurs under a specific hardware condition when a
+ load/store instruction performs an address translation using:
+ case-1  TTBR0_EL1 with TCR_EL1.NFD0 == 1.
+ case-2  TTBR0_EL2 with TCR_EL2.NFD0 == 1.
+ case-3  TTBR1_EL1 with TCR_EL1.NFD1 == 1.
+ case-4  TTBR1_EL2 with TCR_EL2.NFD1 == 1.
+
+ The workaround is to ensure these bits are clear in TCR_ELx.
+ The workaround only affect the Fujitsu-A64FX.
+
+ If unsure, say Y.
+
 endmenu
 
 
diff --git a/arch/arm64/include/asm/assembler.h 
b/arch/arm64/include/asm/assembler.h
index 4feb6119c3c9..128d0fbfcb24 100644
--- a/arch/arm64/include/asm/assembler.h
+++ b/arch/arm64/include/asm/assembler.h
@@ -27,6 +27,7 @@
 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -604,6 +605,25 @@ USER(\label, icivau, \tmp2)// 
invalidate I line PoU
 #endif
.endm
 
+/*
+ * tcr_clear_errata_bits - Clear TCR bits that trigger an errata on this CPU.
+ */
+   .macro  tcr_clear_errata_bits, tcr, tmp1, tmp2
+#ifdef CONFIG_FUJITSU_ERRATUM_010001
+   mrs \tmp1, midr_el1
+
+   mov_q   \tmp2, MIDR_FUJITSU_ERRATUM_010001_MASK
+   and \tmp1, \tmp1, \tmp2
+   mov_q   \tmp2, MIDR_FUJITSU_ERRATUM_010001
+   cmp \tmp1, \tmp2
+   b.ne10f
+
+   mov_q   \tmp2, TCR_CLEAR_FUJITSU_ERRATUM_010001
+   bic \tcr, \tcr, \tmp2
+10:
+#endif /* CONFIG_FUJITSU_ERRATUM_010001 */
+   .endm
+
 /**
  * Errata workaround prior to disable MMU. Insert an ISB immediately prior
  * to executing the MSR that will change SCTLR_ELn[M] from a value of 1 to 0.
diff --git a/arch/arm64/include/asm/cputype.h b/arch/arm64/include/asm/cputype.h
index 951ed1a4e5c9..c6c6b4de0688 100644
--- a/arch/arm64/include/asm/cputype.h
+++ b/arch/arm64/i

Re: [PATCH v6 5/6] arm64/kvm: control accessibility of ptrauth key registers

2019-02-26 Thread James Morse
Hi Amit,

On 19/02/2019 09:24, Amit Daniel Kachhap wrote:
> According to userspace settings, ptrauth key registers are conditionally
> present in guest system register list based on user specified flag
> KVM_ARM_VCPU_PTRAUTH.
> 
> Reset routines still sets these registers to default values but they are
> left like that as they are conditionally accessible (set/get).

What problem is this patch solving?

I think it's that now we have ptrauth support, we have a glut of new registers 
visible to
user-space. This breaks migration and save/resume if the target machine doesn't 
have
pointer-auth configured, even if the guest wasn't using it.
Because we've got a VCPU bit, we can be smarter about this, and only expose the 
registers
if user-space was able to enable ptrauth.


> ---
> This patch needs patch [1] by Dave Martin and adds feature to manage 
> accessibility in a scalable way.
> 
> [1]: 
> https://lore.kernel.org/linux-arm-kernel/1547757219-19439-13-git-send-email-dave.mar...@arm.com/
>  

This is v4. Shortly before you posted this there was a v5 (but the subject 
changed, easily
missed). V5 has subsequently been reviewed. As we can't have both, could you 
rebase onto
that v5 so that there is one way of doing this?

(in general if you could re-post the version you developed/tested with then it 
makes it
clear what is going on)


> diff --git a/Documentation/arm64/pointer-authentication.txt 
> b/Documentation/arm64/pointer-authentication.txt
> index 0529a7d..996e435 100644
> --- a/Documentation/arm64/pointer-authentication.txt
> +++ b/Documentation/arm64/pointer-authentication.txt
> @@ -87,3 +87,7 @@ created by passing a flag (KVM_ARM_VCPU_PTRAUTH) requesting 
> this feature
>  to be enabled. Without this flag, pointer authentication is not enabled
>  in KVM guests and attempted use of the feature will result in an UNDEFINED
>  exception being injected into the guest.
> +
> +Additionally, when KVM_ARM_VCPU_PTRAUTH is not set then KVM will filter
> +out the Pointer Authentication system key registers from KVM_GET/SET_REG_*
> +ioctls.
> diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c
> index f7bcc60..c2f4974 100644
> --- a/arch/arm64/kvm/sys_regs.c
> +++ b/arch/arm64/kvm/sys_regs.c
> @@ -1005,8 +1005,13 @@ static bool trap_ptrauth(struct kvm_vcpu *vcpu,
>   return false;
>  }
>  
> +static bool check_ptrauth(const struct kvm_vcpu *vcpu, const struct 
> sys_reg_desc *rd)
> +{
> + return kvm_arm_vcpu_ptrauth_allowed(vcpu);
> +}
> +
>  #define __PTRAUTH_KEY(k) \
> - { SYS_DESC(SYS_## k), trap_ptrauth, reset_unknown, k }
> + { SYS_DESC(SYS_## k), trap_ptrauth, reset_unknown, k , .check_present = 
> check_ptrauth}


Looks good. I'm pretty sure the changes due to Dave's v5 map neatly.


Thanks,

James


Re: [PATCH v6 4/6] arm64/kvm: add a userspace option to enable pointer authentication

2019-02-26 Thread James Morse
Hi Amit,

On 19/02/2019 09:24, Amit Daniel Kachhap wrote:
> This feature will allow the KVM guest to allow the handling of
> pointer authentication instructions or to treat them as undefined
> if not set. It uses the existing vcpu API KVM_ARM_VCPU_INIT to
> supply this parameter instead of creating a new API.
> 
> A new register is not created to pass this parameter via
> SET/GET_ONE_REG interface as just a flag (KVM_ARM_VCPU_PTRAUTH)
> supplied is enough to enable this feature.

and an attempt to restore the id register with the other version would fail.


> diff --git a/Documentation/arm64/pointer-authentication.txt 
> b/Documentation/arm64/pointer-authentication.txt
> index a25cd21..0529a7d 100644
> --- a/Documentation/arm64/pointer-authentication.txt
> +++ b/Documentation/arm64/pointer-authentication.txt
> @@ -82,7 +82,8 @@ pointers).
>  Virtualization
>  --
>  
> -Pointer authentication is not currently supported in KVM guests. KVM
> -will mask the feature bits from ID_AA64ISAR1_EL1, and attempted use of
> -the feature will result in an UNDEFINED exception being injected into
> -the guest.

> +Pointer authentication is enabled in KVM guest when virtual machine is
> +created by passing a flag (KVM_ARM_VCPU_PTRAUTH)

(This is still mixing VM and VCPU)


> + requesting this feature to be enabled.

.. on each vcpu?


> +Without this flag, pointer authentication is not enabled
> +in KVM guests and attempted use of the feature will result in an UNDEFINED
> +exception being injected into the guest.

'guests' here suggests its a VM property. If you set it on some VCPU but not 
others KVM
will generate undefs instead of enabling the feature. (which is the right thing 
to do)

I think it needs to be clear this is a per-vcpu property.


> diff --git a/arch/arm64/include/uapi/asm/kvm.h 
> b/arch/arm64/include/uapi/asm/kvm.h
> index 97c3478..5f82ca1 100644
> --- a/arch/arm64/include/uapi/asm/kvm.h
> +++ b/arch/arm64/include/uapi/asm/kvm.h
> @@ -102,6 +102,7 @@ struct kvm_regs {
>  #define KVM_ARM_VCPU_EL1_32BIT   1 /* CPU running a 32bit VM */
>  #define KVM_ARM_VCPU_PSCI_0_22 /* CPU uses PSCI v0.2 */
>  #define KVM_ARM_VCPU_PMU_V3  3 /* Support guest PMUv3 */

> +#define KVM_ARM_VCPU_PTRAUTH 4 /* VCPU uses address authentication */

Just address authentication? I agree with Mark we should have two bits to match 
what gets
exposed to EL0. One would then be address, the other generic.
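
i.e. something like this (bit numbers only illustrative):

#define KVM_ARM_VCPU_PTRAUTH_ADDRESS	4 /* VCPU uses address authentication */
#define KVM_ARM_VCPU_PTRAUTH_GENERIC	5 /* VCPU uses generic authentication */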


> diff --git a/arch/arm64/kvm/hyp/ptrauth-sr.c b/arch/arm64/kvm/hyp/ptrauth-sr.c
> index 528ee6e..6846a23 100644
> --- a/arch/arm64/kvm/hyp/ptrauth-sr.c
> +++ b/arch/arm64/kvm/hyp/ptrauth-sr.c
> @@ -93,9 +93,23 @@ void kvm_arm_vcpu_ptrauth_reset(struct kvm_vcpu *vcpu)

> +/**
> + * kvm_arm_vcpu_ptrauth_allowed - checks if ptrauth feature is allowed by 
> user
> + *
> + * @vcpu: The VCPU pointer
> + *
> + * This function will be used to check userspace option to have ptrauth or 
> not
> + * in the guest kernel.
> + */
> +bool kvm_arm_vcpu_ptrauth_allowed(const struct kvm_vcpu *vcpu)
> +{
> + return kvm_supports_ptrauth() &&
> + test_bit(KVM_ARM_VCPU_PTRAUTH, vcpu->arch.features);
> +}

This isn't used from world-switch, could it be moved to guest.c?


> diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c
> index 12529df..f7bcc60 100644
> --- a/arch/arm64/kvm/sys_regs.c
> +++ b/arch/arm64/kvm/sys_regs.c
> @@ -1055,7 +1055,7 @@ static bool access_cntp_cval(struct kvm_vcpu *vcpu,
>  }
>  
>  /* Read a sanitised cpufeature ID register by sys_reg_desc */
> -static u64 read_id_reg(struct sys_reg_desc const *r, bool raz)
> +static u64 read_id_reg(struct kvm_vcpu *vcpu, struct sys_reg_desc const *r, 
> bool raz)

(It might be easier on the reviewer to move these mechanical changes to an 
earlier patch)


Looks good,

Thanks,

James


Re: [PATCH v6 3/6] arm64/kvm: context-switch ptrauth registers

2019-02-26 Thread James Morse
Hi Amit,

On 19/02/2019 09:24, Amit Daniel Kachhap wrote:
> From: Mark Rutland 
> 
> When pointer authentication is supported, a guest may wish to use it.
> This patch adds the necessary KVM infrastructure for this to work, with
> a semi-lazy context switch of the pointer auth state.
> 
> Pointer authentication feature is only enabled when VHE is built
> in the kernel and present into CPU implementation so only VHE code
> paths are modified.

> When we schedule a vcpu, we disable guest usage of pointer
> authentication instructions and accesses to the keys. While these are
> disabled, we avoid context-switching the keys. When we trap the guest
> trying to use pointer authentication functionality, we change to eagerly
> context-switching the keys, and enable the feature. The next time the
> vcpu is scheduled out/in, we start again.

> However the host key registers
> are saved in vcpu load stage as they remain constant for each vcpu
> schedule.

(I think we can get away with doing this later ... with the hope of doing it 
never!)


> Pointer authentication consists of address authentication and generic
> authentication, and CPUs in a system might have varied support for
> either. Where support for either feature is not uniform, it is hidden
> from guests via ID register emulation, as a result of the cpufeature
> framework in the host.
> 
> Unfortunately, address authentication and generic authentication cannot
> be trapped separately, as the architecture provides a single EL2 trap
> covering both. If we wish to expose one without the other, we cannot
> prevent a (badly-written) guest from intermittently using a feature
> which is not uniformly supported (when scheduled on a physical CPU which
> supports the relevant feature). Hence, this patch expects both type of
> authentication to be present in a cpu.


> diff --git a/arch/arm64/include/asm/kvm_host.h 
> b/arch/arm64/include/asm/kvm_host.h
> index 2f1bb86..1bacf78 100644
> --- a/arch/arm64/include/asm/kvm_host.h
> +++ b/arch/arm64/include/asm/kvm_host.h
> @@ -146,6 +146,18 @@ enum vcpu_sysreg {

> +static inline bool kvm_supports_ptrauth(void)
> +{
> + return has_vhe() && system_supports_address_auth() &&
> + system_supports_generic_auth();
> +}

Do we intend to support the implementation defined algorithm? I'd assumed not.

system_supports_address_auth() is defined as:
| static inline bool system_supports_address_auth(void)
| {
|   return IS_ENABLED(CONFIG_ARM64_PTR_AUTH) &&
|   (cpus_have_const_cap(ARM64_HAS_ADDRESS_AUTH_ARCH) ||
|   cpus_have_const_cap(ARM64_HAS_ADDRESS_AUTH_IMP_DEF));
| }


So we could return true from kvm_supports_ptrauth() even if we only support the 
imp-def
algorithm.

I think we should hide the imp-def ptrauth support as KVM hides all other 
imp-def
features. This lets us avoid trying to migrate values that have been signed 
with the
imp-def algorithm.

I'm worried that it could include some value that we can't migrate (e.g. the 
SoC serial
number). Does the ARM-ARM say this can't happen?

All I can find is D5.1.5 "Pointer authentication in AArch64 state" of 
DDI0487D.a which has
this clause for the imp-def algorithm:
| For a set of arguments passed to the function, must give the same result for 
all PEs
| that a thread of execution could migrate between.

... with KVM we've extended the scope of migration significantly.


Could we check the cpus_have_const_cap() values for the two architected 
algorithms directly?
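
i.e. something like (a sketch):

static inline bool kvm_supports_ptrauth(void)
{
	return IS_ENABLED(CONFIG_ARM64_PTR_AUTH) && has_vhe() &&
	       cpus_have_const_cap(ARM64_HAS_ADDRESS_AUTH_ARCH) &&
	       cpus_have_const_cap(ARM64_HAS_GENERIC_AUTH_ARCH);
}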


> diff --git a/arch/arm64/include/asm/kvm_hyp.h 
> b/arch/arm64/include/asm/kvm_hyp.h
> index 6e65cad..09e061a 100644
> --- a/arch/arm64/include/asm/kvm_hyp.h
> +++ b/arch/arm64/include/asm/kvm_hyp.h
> @@ -153,6 +153,13 @@ bool __fpsimd_enabled(void);
>  void activate_traps_vhe_load(struct kvm_vcpu *vcpu);
>  void deactivate_traps_vhe_put(struct kvm_vcpu *vcpu);
>  
> +void __ptrauth_switch_to_guest(struct kvm_vcpu *vcpu,
> +struct kvm_cpu_context *host_ctxt,
> +struct kvm_cpu_context *guest_ctxt);
> +void __ptrauth_switch_to_host(struct kvm_vcpu *vcpu,
> +   struct kvm_cpu_context *guest_ctxt,
> +   struct kvm_cpu_context *host_ctxt);


Why do you need the vcpu and the guest_ctxt?
Would it be possible for these to just take the vcpu, and to pull the host 
context from
the per-cpu variable?
This would avoid any future bugs where the ctxt's are the wrong way round, 
taking two is
unusual in KVM, but necessary here.

As you're calling these from asm you want the compiler to do as much of the 
type mangling
as possible.
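
i.e. roughly this (a sketch; assuming the existing per-cpu host context,
'kvm_host_cpu_state', is reachable from hyp):

void __hyp_text __ptrauth_switch_to_guest(struct kvm_vcpu *vcpu)
{
	struct kvm_cpu_context *host_ctxt;

	host_ctxt = __hyp_this_cpu_ptr(kvm_host_cpu_state);

	/* save the host keys into host_ctxt, then restore the guest's
	 * keys from vcpu->arch.ctxt, as your helpers do today */
}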


> diff --git a/arch/arm64/kernel/traps.c b/arch/arm64/kernel/traps.c
> index 4e2fb87..5cac605 100644
> --- a/arch/arm64/kernel/traps.c
> +++ b/arch/arm64/kernel/traps.c
> @@ -749,6 +749,7 @@ static const char *esr_class_str[] = {
>   [ESR_ELx_EC_CP14_LS]= "CP14 LDC/STC",
>   [ESR_ELx_EC_FP_ASIMD]   = "ASIMD",
>  

Re: [PATCH v6 0/6] Add ARMv8.3 pointer authentication for kvm guest

2019-02-26 Thread James Morse
Hi Amit,

On 19/02/2019 09:24, Amit Daniel Kachhap wrote:
> This patch series adds pointer authentication support for KVM guest and
> is based on top of Linux 5.0-rc6. The basic patches in this series was
> originally posted by Mark Rutland earlier[1,2] and contains some history
> of this work.
> 
> Extension Overview:
> =
> 
> The ARMv8.3 pointer authentication extension adds functionality to detect
> modification of pointer values, mitigating certain classes of attack such as
> stack smashing, and making return oriented programming attacks harder.
> 
> The extension introduces the concept of a pointer authentication code (PAC),
> which is stored in some upper bits of pointers. Each PAC is derived from the
> original pointer, another 64-bit value (e.g. the stack pointer), and a secret
> 128-bit key.
> 
> New instructions are added which can be used to:
> 
> * Insert a PAC into a pointer
> * Strip a PAC from a pointer
> * Authenticate and strip a PAC from a pointer
> 
> The detailed description of ARMv8.3 pointer authentication support in
> userspace/kernel and can be found in Kristina's generic pointer authentication
> patch series[3].


> This patch series is based on just a single patch from Dave Martin [8] which 
> add
> control checks for accessing sys registers. 

Ooeer, if you miss this patch (like I did) the series still applies to rc6, it just
doesn't build. If you depend on extra patches like this, please re-post them as part of
the series. (you need to add your Signed-off-by if you picked the patch up from the
list).

This lets people apply the series from the list (everyone has a script to to do 
this),
without having to go and find the dependencies.


> [8]: 
> https://lore.kernel.org/linux-arm-kernel/1547757219-19439-13-git-send-email-dave.mar...@arm.com/

This is v4 of Dave's patch. He changed the subject and posted a v5 here:
https://lore.kernel.org/linux-arm-kernel/1550519559-15915-13-git-send-email-dave.mar...@arm.com/

Re-posting the patch you tested with would avoid someone accidentally picking up
v5, then trying to work out how it's supposed to work with your series. (check_present() 
was
replaced by a restrictions() bitmask).


As we can't have both, and v5 of that patch has been reviewed, could you rebase 
onto it?
You'll need to pick up any tags and make any changes reviewers asked for. If 
you could
note 'this v7 patch is Dave's v5 with $changes', then it makes it clear what is 
going on.



Thanks,

James


Re: [PATCH v6 1/6] arm64/kvm: preserve host HCR_EL2 value

2019-02-26 Thread James Morse
Hi Amit,

On 25/02/2019 17:39, James Morse wrote:
> On 19/02/2019 09:24, Amit Daniel Kachhap wrote:
>> From: Mark Rutland 
>> When restoring HCR_EL2 for the host, KVM uses HCR_HOST_VHE_FLAGS, which
>> is a constant value. This works today, as the host HCR_EL2 value is
>> always the same, but this will get in the way of supporting extensions
>> that require HCR_EL2 bits to be set conditionally for the host.
>>
>> To allow such features to work without KVM having to explicitly handle
>> every possible host feature combination, this patch has KVM save/restore
>> for the host HCR when switching to/from a guest HCR. The saving of the
>> register is done once during cpu hypervisor initialization state and is
>> just restored after switch from guest.
>>
>> For fetching HCR_EL2 during kvm initialisation, a hyp call is made using
>> kvm_call_hyp and is helpful in NHVE case.
>>
>> For the hyp TLB maintenance code, __tlb_switch_to_host_vhe() is updated
>> to toggle the TGE bit with a RMW sequence, as we already do in
>> __tlb_switch_to_guest_vhe().
>>
>> The value of hcr_el2 is now stored in struct kvm_cpu_context as both host
>> and guest can now use this field in a common way.

>> diff --git a/virt/kvm/arm/arm.c b/virt/kvm/arm/arm.c
>> index 9e350fd3..8e18f7f 100644
>> --- a/virt/kvm/arm/arm.c
>> +++ b/virt/kvm/arm/arm.c
>> @@ -1328,6 +1328,7 @@ static void cpu_hyp_reinit(void)
>>  cpu_init_hyp_mode(NULL);
>>  
>>  kvm_arm_init_debug();
>> +__cpu_copy_hyp_conf();
> 
> Your commit message says:
> | The saving of the register is done once during cpu hypervisor 
> initialization state
> 
> But cpu_hyp_reinit() is called each time secondary CPUs come online. Its also 
> called as
> part of the cpu-idle mechanism via hyp_init_cpu_pm_notifier(). cpu-idle can 
> ask the
> firmware to power-off the CPU until an interrupt becomes pending for it. 
> KVM's EL2 state
> disappears when this happens, these calls take care of setting it back up 
> again. On Juno,
> this can happen tens of times a second, and this adds an extra call to EL2.

The bit I missed was the MDCR_EL2 copy is behind kvm_arm_init_debug(), so we 
already have
an unnecessary EL2 call here, so its nothing new.

Assuming the deactivate_traps_vhe_put() vcpu isn't needed, and with Mark's 
comments addressed:
Reviewed-by: James Morse 


If we can avoid repeated calls to EL2 once we've got HCR_EL2+MDCR_EL2, even 
better!


Thanks,

James


Re: [PATCH v6 1/6] arm64/kvm: preserve host HCR_EL2 value

2019-02-25 Thread James Morse
Hi Amit,

On 19/02/2019 09:24, Amit Daniel Kachhap wrote:
> From: Mark Rutland 
> 
> When restoring HCR_EL2 for the host, KVM uses HCR_HOST_VHE_FLAGS, which
> is a constant value. This works today, as the host HCR_EL2 value is
> always the same, but this will get in the way of supporting extensions
> that require HCR_EL2 bits to be set conditionally for the host.
> 
> To allow such features to work without KVM having to explicitly handle
> every possible host feature combination, this patch has KVM save/restore
> for the host HCR when switching to/from a guest HCR. The saving of the
> register is done once during cpu hypervisor initialization state and is
> just restored after switch from guest.
> 
> For fetching HCR_EL2 during kvm initialisation, a hyp call is made using
> kvm_call_hyp and is helpful in NHVE case.
> 
> For the hyp TLB maintenance code, __tlb_switch_to_host_vhe() is updated
> to toggle the TGE bit with a RMW sequence, as we already do in
> __tlb_switch_to_guest_vhe().
> 
> The value of hcr_el2 is now stored in struct kvm_cpu_context as both host
> and guest can now use this field in a common way.


> diff --git a/arch/arm/include/asm/kvm_host.h b/arch/arm/include/asm/kvm_host.h
> index ca56537..05706b4 100644
> --- a/arch/arm/include/asm/kvm_host.h
> +++ b/arch/arm/include/asm/kvm_host.h
> @@ -273,6 +273,8 @@ static inline void __cpu_init_stage2(void)
>   kvm_call_hyp(__init_stage2_translation);
>  }
>  
> +static inline void __cpu_copy_hyp_conf(void) {}
> +

I agree Mark's suggestion of adding 'host_ctxt' in here makes it clearer what 
it is.


> diff --git a/arch/arm64/include/asm/kvm_emulate.h 
> b/arch/arm64/include/asm/kvm_emulate.h
> index 506386a..0dbe795 100644
> --- a/arch/arm64/include/asm/kvm_emulate.h
> +++ b/arch/arm64/include/asm/kvm_emulate.h

Hmmm, there is still a fair amount of churn due to moving the struct 
definition, but its
easy enough to ignore as its mechanical. A preparatory patch that switched as 
may as
possible to '*vcpu_hcr() = ' would cut the churn down some more, but I don't 
think its
worth the extra effort.


> diff --git a/arch/arm64/include/asm/kvm_hyp.h 
> b/arch/arm64/include/asm/kvm_hyp.h
> index a80a7ef..6e65cad 100644
> --- a/arch/arm64/include/asm/kvm_hyp.h
> +++ b/arch/arm64/include/asm/kvm_hyp.h
> @@ -151,7 +151,7 @@ void __fpsimd_restore_state(struct user_fpsimd_state 
> *fp_regs);
>  bool __fpsimd_enabled(void);
>  
>  void activate_traps_vhe_load(struct kvm_vcpu *vcpu);
> -void deactivate_traps_vhe_put(void);
> +void deactivate_traps_vhe_put(struct kvm_vcpu *vcpu);

I've forgotten why this is needed. You don't add a user of vcpu to
deactivate_traps_vhe_put() in this patch.


> diff --git a/arch/arm64/kvm/hyp/switch.c b/arch/arm64/kvm/hyp/switch.c
> index b0b1478..006bd33 100644
> --- a/arch/arm64/kvm/hyp/switch.c
> +++ b/arch/arm64/kvm/hyp/switch.c
> @@ -191,7 +194,7 @@ void activate_traps_vhe_load(struct kvm_vcpu *vcpu)

> -void deactivate_traps_vhe_put(void)
> +void deactivate_traps_vhe_put(struct kvm_vcpu *vcpu)
>  {
>   u64 mdcr_el2 = read_sysreg(mdcr_el2);
>  

Why does deactivate_traps_vhe_put() need the vcpu?


> diff --git a/arch/arm64/include/asm/kvm_host.h 
> b/arch/arm64/include/asm/kvm_host.h
> index 7732d0b..1b2e05b 100644
> --- a/arch/arm64/include/asm/kvm_host.h
> +++ b/arch/arm64/include/asm/kvm_host.h
> @@ -458,6 +459,16 @@ int kvm_arm_vcpu_arch_has_attr(struct kvm_vcpu *vcpu,
>
>  static inline void __cpu_init_stage2(void) {}
>
> +/**
> + * __cpu_copy_hyp_conf - copy the boot hyp configuration registers
> + *
> + * It is called once per-cpu during CPU hyp initialisation.
> + */

Is it just the boot cpu?


> +static inline void __cpu_copy_hyp_conf(void)
> +{
> + kvm_call_hyp(__kvm_populate_host_regs);
> +}
> +


> diff --git a/arch/arm64/kvm/hyp/sysreg-sr.c b/arch/arm64/kvm/hyp/sysreg-sr.c
> index 68d6f7c..68ddc0f 100644
> --- a/arch/arm64/kvm/hyp/sysreg-sr.c
> +++ b/arch/arm64/kvm/hyp/sysreg-sr.c
> @@ -21,6 +21,7 @@
>  #include 
>  #include 
>  #include 
> +#include 

... what's kvm_mmu.h needed for?
The __hyp_this_cpu_ptr() you add comes from kvm_asm.h.

/me tries it.

Heh, hyp_symbol_addr(). kvm_asm.h should include this, but can't because the
kvm_ksym_ref() dependency is the other-way round. This is just going to bite us 
somewhere
else later!
If we want to fix it now, moving hyp_symbol_addr() to kvm_asm.h would fix it. 
It's
generating adrp/add so the 'asm' label is fair, and it really should live with 
its EL1
counterpart kvm_ksym_ref().


> @@ -294,7 +295,7 @@ void kvm_vcpu_put_sysregs(struct kvm_vcpu *vcpu)
>   if (!has_vhe())
>   return;
>  
> - deactivate_traps_vhe_put();
> + deactivate_traps_vhe_put(vcpu);
>  
>   __sysreg_save_el1_state(guest_ctxt);
>   __sysreg_save_user_state(guest_ctxt);
> @@ -316,3 +317,21 @@ void __hyp_text __kvm_enable_ssbs(void)
>   "msrsctlr_el2, %0"
>   : "=&r" (tmp) : "L" (SCTLR_ELx_DSSBS));
>  }
> +
> +/**
> + * 

Re: [PATCH v4] arm64: Add workaround for Fujitsu A64FX erratum 010001

2019-02-25 Thread James Morse
Hi Zhang,

On 23/02/2019 13:06, Zhang, Lei wrote:
> Zhang, Lei wrote:
>> I think you mean it may be a problem to modify the KPTI trampoline because
>> some patches about KPTI will be merged to mainline in the near future.
>> I understood that.
>> I should discuss with my colleagues whether we can set NFDx=0 all of time on
>> A64FX.
> 
> The result of our investigation also supports your suggestion. 
> We surely agree with you that your proposed method (never set NFDx=1 on A64FX)
> is the best to resolve this erratum.
>   
> For this erratum, James's patch should be merged to mainline
> instead of my previous patches (v1 to v4).
> Since KPTI fully covers the effect of NFD1 for A64FX, KPTI is
> recommended to be used in conjunction with James’s patch.

>> And thanks for your patch.
>> If we can set NFDx=0 all of time, I will review, test and report the result.
> 
> I have already tested James's patch on A64FX, and the result is no problem at 
> all.
> 
> Tested-by:zhang.lei

Thanks, I'll post it properly with this tag.


>> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig index
>> a4168d366127..b0b7f1c4e816 100644
>> --- a/arch/arm64/Kconfig
>> +++ b/arch/arm64/Kconfig
>> @@ -643,6 +643,25 @@ config QCOM_FALKOR_ERRATUM_E1041
>>
>>If unsure, say Y.
>>
>> +config FUJITSU_ERRATUM_010001
>> +bool "Fujitsu-A64FX erratum E#010001: Undefined fault may occur wrongly"
>> +default y
>> +help
>> +  This option adds workaround for Fujitsu-A64FX erratum E#010001.
>> +  On some variants of the Fujitsu-A64FX cores version (1.0, 1.1), memory
>> +  accesses may cause undefined fault (Data abort, DFSC=0b11).
>> +  This fault occurs under a specific hardware condition when a
>> +  load/store instruction performs an address translation using:
>> +  case-1  TTBR0_EL1 with TCR_EL1.NFD0 == 1.
>> +  case-2  TTBR0_EL2 with TCR_EL2.NFD0 == 1.
>> +  case-3  TTBR1_EL1 with TCR_EL1.NFD1 == 1.
>> +  case-4  TTBR1_EL2 with TCR_EL2.NFD1 == 1.
>> +
>> +  The workaround is to ensure these bits are clear in TCR_ELx.
>> +  The workaround only affect the Fujitsu-A64FX.
> 
> I think it is better to add a notice here as follows:
> 
>   Recommend to enable KPTI (UNMAP_KERNEL_AT_EL0 = y).

That unmap option is on by default, you can't turn it off without CONFIG_EXPERT.
While I agree, I don't think we need to spell this out.


Thanks,

James


Re: [PATCH] trace: skip hwasan

2019-02-21 Thread James Morse
Hi!

On 18/02/2019 13:59, Will Deacon wrote:
> [+James, who knows how to decode these things]

Decode is a strong term!

This stuff is printed by Cavium's secure-world software. All I'm doing is
spotting the bits that vary between the output we've seen!


> On Mon, Feb 18, 2019 at 02:56:47PM +0100, Dmitry Vyukov wrote:
>> On Mon, Feb 18, 2019 at 2:27 PM Qian Cai  wrote:
>>> On 2/17/19 2:30 AM, Dmitry Vyukov wrote:
 On Sun, Feb 17, 2019 at 5:34 AM Qian Cai  wrote:
>
> Enabling function tracer with CONFIG_KASAN_SW_TAGS=y (hwasan) tracer
> causes the whole system frozen on ThunderX2 systems with 256 CPUs,
> because there is a burst of too much pointer access, and then KASAN will
> dereference each byte of the shadow address for the tag checking which
> will kill all the CPUs.

 Could you please elaborate what exactly happens and who/why kills
 CPUs? Number of memory accesses should not make any difference.
 With hardware support (MTE) it won't be possible to disable
 instrumentation (loads and stores check tags themselves), so it would
 be useful to keep track of exact reasons we disable instrumentation to
 know how to deal with them with hardware support.
 It would be useful to keep this info in the comment in the Makefile.
>>>
>>> It turns out sometimes it will trigger a hardware error.
>>
>> Please add this to the comment that there is that error, reason is
>> unknown, happens from time to time.
>> "Too much pointer access" is confusing and does not seem to be the
>> root cause (there are lots of source files that cause lots of pointer
>> accesses).

> I don't think this is directly related to KASAN, as I'm sure we've seen this
> RAS error before.

Not quite like this. I've had one choke on some PCIe transaction[0].

This looks like corruption detected in a cache associated with a CPU. 'Write back'
and 'Physical Address' suggest it's the data cache:


>>>  Node 0 NBU 0 Error report :
>>>  NBU BAR Error
[..]
>>>   Physical Address : 0x40011ff00
>>>
>>> NBU BAR Error : Decoded info :
>>> Agent info : CPU
>>> Core ID : 21
>>> Thread ID : 1
>>> Requ: type : 4 : Write Back
>>>  Node 0 NBU 1 Error report :
>>>  NBU BAR Error
[..]
>>>   Physical Address : 0x40011ff40
>>>
>>> NBU BAR Error : Decoded info :
>>> Agent info : CPU
>>> Core ID : 21
>>> Thread ID : 1
>>> Requ: type : 4 : Write Back
>>>  Node 0 NBU 2 Error report :
>>>  NBU BAR Error
[..]
>>>   Physical Address : 0x40011ff80

If you can reproduce it, and it always affects Core:21,Thread:1, I'd suggest
offline-ing all the threads/CPUs in that core. It may be one cache is close to
some threshold, and you can offline the core that it's part of.


Thanks,

James


[0] For comparison, I've had one of these during kexec:
# NBU BAR Error : Decoded info :
#Agent info : IO
#   : PCIE0
#Requ: type : 2 : Read


Re: [PATCH v4] arm64: Add workaround for Fujitsu A64FX erratum 010001

2019-02-14 Thread James Morse
ault.c
>> +++ b/arch/arm64/mm/fault.c
>> @@ -666,6 +666,20 @@ static int do_sea(unsigned long addr, unsigned int esr, 
>> struct pt_regs *regs)
>>  return 0;
>>  }
>>  
>> +static int do_bad_unknown_63(unsigned long addr, unsigned int esr, struct 
>> pt_regs *regs)
>> +{
>> +/*
>> + * On some variants of the Fujitsu-A64FX cores ver(1.0, 1.1),
>> + * memory accesses may spuriously trigger data aborts with
>> + * DFSC=0b11.
>> + */
>> +if (IS_ENABLED(CONFIG_FUJITSU_ERRATUM_010001) &&
>> +cpus_have_cap(ARM64_WORKAROUND_FUJITSU_A64FX_011))
>> +return 0;
>> +return do_bad(addr, esr, regs);
>> +}
> 
> If we always left NFD1 clear on A64FX, we would not need this exception
> handler, as this should never occur.

I think we should do this: never set NFDx on A64FX. I don't think we can
maintain the TCR swivel before any memory access in the KPTI trampoline.
(It already uses the FAR as a scratch register!)

The errata means we can't use these bits. It's simpler than trying to work
around the symptoms.


>> diff --git a/arch/arm64/mm/proc.S b/arch/arm64/mm/proc.S
>> index 73886a5..75f7d99 100644
>> --- a/arch/arm64/mm/proc.S
>> +++ b/arch/arm64/mm/proc.S
>> @@ -453,9 +453,29 @@ ENTRY(__cpu_setup)
>>   * Set/prepare TCR and TTBR. We use 512GB (39-bit) address range for
>>   * both user and kernel.
>>   */
>> +#ifdef CONFIG_FUJITSU_ERRATUM_010001
>>  ldr x10, =TCR_TxSZ(VA_BITS) | TCR_CACHE_FLAGS | TCR_SMP_FLAGS | \
>>  TCR_TG_FLAGS | TCR_KASLR_FLAGS | TCR_ASID16 | \
>>  TCR_TBI0 | TCR_A1 | TCR_KASAN_FLAGS
>> +/* Can use x19/x20/x5 */
>> +mrs x19, midr_el1
>> +/* ERRATA_MIDR_RANGE(MIDR_FUJITSU_A64FX, 0, 0, 1, 0) */
>> +mov w20, #0x10  //#16
>> +movkw20, #0x460f, lsl #16
>> +mov w5, #0xffef

Hmmm this is masking out the least-significant variant bit, to make it match
your 1.0 and 1.1. This could be much more readable with some pre-processor
trickery.

We should also move it out to a macro so the average reader of __cpu_setup()
doesn't need to parse it.
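Something like this in assembler.h (untested sketch: the
MIDR_FUJITSU_ERRATUM_010001* and TCR_CLR_FUJITSU_ERRATUM_010001 names are
assumptions here, they'd need adding to cputype.h/pgtable-hwdef.h):

	/* Clear the TCR bits that trigger erratum 010001 on affected CPUs */
	.macro	tcr_clear_errata_bits tcr, tmp1, tmp2
#ifdef CONFIG_FUJITSU_ERRATUM_010001
	mrs	\tmp1, midr_el1

	mov_q	\tmp2, MIDR_FUJITSU_ERRATUM_010001_MASK
	and	\tmp1, \tmp1, \tmp2
	mov_q	\tmp2, MIDR_FUJITSU_ERRATUM_010001
	cmp	\tmp1, \tmp2
	b.ne	10f

	mov_q	\tmp2, TCR_CLR_FUJITSU_ERRATUM_010001
	bic	\tcr, \tcr, \tmp2
10:
#endif /* CONFIG_FUJITSU_ERRATUM_010001 */
	.endm

__cpu_setup() then only needs a single 'tcr_clear_errata_bits x10, x9, x5'
once it has built the TCR value.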


> Please do this detection and reconfiguration of TCR in C code, rather
> than in the __cpu_setup code.
> 
> In proc.S you can have:
> 
> #if defined(CONFIG_RANDOMIZE_BASE) && !defined(CONFIG_FUJITSU_ERRATUM_010001)
> #define KASLR_FLAGS   TCR_NFD1
> #else
> #define KASLR_FLAGS   0
> #endif
> 
> ... and in cpu_errata.c you can enable NFD1 on any CPU which is not
> A64FX.

I think having an errata callback that runs on unaffected cores is tricky. ("CPU
feature: Not affected by E010001").
As a halfway house, I have the below[0] simplified version of Zhang Lei's patch.
It doesn't do the TCR swivel in C, just masks out the TCR values on affected CPUs.
I obviously haven't tested this on an affected platform.


Thanks,

James

%<
From 41543072930af63f3229a5384061fe3bc1efd19a Mon Sep 17 00:00:00 2001
From: Zhang Lei 
Date: Thu, 14 Feb 2019 07:26:24 +
Subject: [PATCH] arm64: Add workaround for Fujitsu A64FX erratum 010001

On the Fujitsu-A64FX cores ver(1.0, 1.1), memory access may cause
an undefined fault (Data abort, DFSC=0b11). This fault occurs under
a specific hardware condition when a load/store instruction performs an
address translation. Any load/store instruction, except non-fault access
including Armv8 and SVE might cause this undefined fault.

The TCR_ELx.NFD1 bit is used by the kernel when CONFIG_RANDOMIZE_BASE
is enabled to mitigate timing attacks against KASLR where the kernel
address space could be probed using the FFR and suppressed fault on
SVE loads.

Since this erratum causes spurious exceptions, which may overwrite
the exception registers, we clear the TCR_ELx.NFDx=1 bits when
booting on an affected CPU.

Signed-off-by: Zhang Lei 
[Generated MIDR value/mask for __cpu_setup(), removed spurious-fault handler
 and always disabled the NFDx bits on affected CPUs]
Signed-off-by: James Morse 
---
 Documentation/arm64/silicon-errata.txt |  1 +
 arch/arm64/Kconfig | 19 +++
 arch/arm64/include/asm/assembler.h | 20 
 arch/arm64/include/asm/cputype.h   |  9 +
 arch/arm64/include/asm/pgtable-hwdef.h |  1 +
 arch/arm64/mm/proc.S   |  1 +
 6 files changed, 51 insertions(+)

diff --git a/Documentation/arm64/silicon-errata.txt 
b/Documentation/arm64/silicon-errata.txt
index 1f09d043d086..26d64e9f3a35 100644
--- a/Documentation/arm64/silicon-errata.txt
+++ b/Documentation/arm64/silicon-errata.txt
@@ -80,3 +80,4 @@ stable kernels.
 | Qualcomm Tech. | Falkor v1   | E1009   | 
QCOM_

Re: [PATCH] EDAC, dmc520:: add DMC520 EDAC driver

2019-02-05 Thread James Morse
Hi Rui,

On 23/01/2019 22:08, Rui Zhao wrote:
> On Wednesday, January 23, 2019 10:36 AM, James Morse wrote:
>>> If firmware enables it, they're suppose to handle the interrupt.

>> Ah, so you still have resident firmware!
>> How come your firmware trusts linux not to turn off the memory controller?!
>> These things are usually protected by trust zone so the OS can't pull the 
>> memory from under firmware's feet.

> We have firmware to config the memory controller and want to have an EDAC 
> driver to report ECC status.

> Could you please elaborate a bit on the security concern on this approach? 
> Like some malicious app/driver can access
> memory controller registers can cause issue? 

I'm remembering this:
https://lore.kernel.org/linux-arm-kernel/9b9c4cd5-4428-c08d-d4a3-7352c6c80...@arm.com/

Robin Murphy wrote:
| [ For anyone interested, it puts the DRAM controller into sleep mode.
| The kernel can't even panic if all the memory suddenly disappears :D ]

This would be a problem if your Secure-world software needs to keep working,
and depends on the memory behind this controller.

It might be that your secure-world software only uses some other memory, in
which case this wouldn't matter.
It may be linux _is_ your secure-world software, in which case it wouldn't
matter either.


> What's the recommend approach if Linux won't be able to access memory 
> controller
> registers? Have firmware do the ECC 
> status monitoring and some sort of driver to query ECC status from firmware?

If Linux runs in the normal-world, can't you use trust-zone to prevent Linux
from accessing the memory controller?

If you did this, you'd need to handle the UE interrupts in firmware, and
wouldn't be able to use this driver in linux. Your platform hasn't gone this
way, so I guess one of the above cases applies.


Thanks,

James


Re: [PATCH] EDAC, dmc520:: add DMC520 EDAC driver

2019-02-05 Thread James Morse
Hi Boris,

On 23/01/2019 18:46, Borislav Petkov wrote:
> On Wed, Jan 23, 2019 at 06:36:23PM +0000, James Morse wrote:
>>> Would like to know what's the impact if this error happens, and how to fit 
>>> it
>>> with current reporting in EDAC core.
>>
>> At a guess the interrupt triggers when link_err_count increases. (link_err 
>> has
>> an overflow bit, so the interrupt must be related to a counter).
>>
>> If we could associate a link with a layer in edac, we could report errors
>> against that point. But I've no idea how 'links' correspond with 'ranks and 
>> banks'!


> Well, I have no clue what kind of links you guys are talking but if
> those are per-chance coherent links used by cores to communicate in a
> coherent fabric, or cores and devices, what would showing those errors
> to the user bring ya?

(I mentioned this because it's the next interrupt in the register, it's an example
of something that may be added for another platform in the future, which affects
the DT and probing)


> Or are ya talking about different kinds of links?

... whatever the manual means by 'link', good point, it could be the
interconnect side.

'alert_mode_next', in the feature control register talks about DIMM training,
and says 'dfi_err' is treated a a link error. DFI is defined earlier as the 'DDR
PHY interface', so these must be links between the DMC520 and DDR.


> In any case, the first question to ask would be, can some agent or the
> user do something with the information that X or Y link errors happened?
> 
> If not, then why bother?
> If yes, then that's a different story.

I agree. Surely if the DIMMs are socketed link-errors are another reason to
replace the DIMM.

It looks like this doesn't matter on Rui's platform,


Thanks,

James


Re: [PATCH] MAINTAINERS: Add James Morse to the list of APEI reviewers

2019-02-05 Thread James Morse
Hi guys,

On 05/02/2019 08:38, Borislav Petkov wrote:
> From: Borislav Petkov 
> 
> Add James to the list of reviewers of the firmware-assisted RAS glue.

Fine by me,

Acked-by: James Morse 


Thanks,

James


Re: [PATCH v5 3/5] arm64/kvm: context-switch ptrauth register

2019-01-31 Thread James Morse
Hi Amit,

On 28/01/2019 06:58, Amit Daniel Kachhap wrote:
> When pointer authentication is supported, a guest may wish to use it.
> This patch adds the necessary KVM infrastructure for this to work, with
> a semi-lazy context switch of the pointer auth state.
> 
> Pointer authentication feature is only enabled when VHE is built
> into the kernel and present into CPU implementation so only VHE code

~s/into/in the/?

> paths are modified.
> 
> When we schedule a vcpu, we disable guest usage of pointer
> authentication instructions and accesses to the keys. While these are
> disabled, we avoid context-switching the keys. When we trap the guest
> trying to use pointer authentication functionality, we change to eagerly
> context-switching the keys, and enable the feature. The next time the
> vcpu is scheduled out/in, we start again.

> Pointer authentication consists of address authentication and generic
> authentication, and CPUs in a system might have varied support for
> either. Where support for either feature is not uniform, it is hidden
> from guests via ID register emulation, as a result of the cpufeature
> framework in the host.


> Unfortunately, address authentication and generic authentication cannot
> be trapped separately, as the architecture provides a single EL2 trap
> covering both. If we wish to expose one without the other, we cannot
> prevent a (badly-written) guest from intermittently using a feature
> which is not uniformly supported (when scheduled on a physical CPU which
> supports the relevant feature). When the guest is scheduled on a
> physical CPU lacking the feature, these attempts will result in an UNDEF
> being taken by the guest.

This won't be fun. Can't KVM check that both are supported on all CPUs to avoid
this? ...


> diff --git a/arch/arm64/include/asm/cpufeature.h 
> b/arch/arm64/include/asm/cpufeature.h
> index dfcfba7..e1bf2a5 100644
> --- a/arch/arm64/include/asm/cpufeature.h
> +++ b/arch/arm64/include/asm/cpufeature.h
> @@ -612,6 +612,11 @@ static inline bool system_supports_generic_auth(void)
>cpus_have_const_cap(ARM64_HAS_GENERIC_AUTH_IMP_DEF));
>  }
>  
> +static inline bool kvm_supports_ptrauth(void)
> +{
> + return system_supports_address_auth() && system_supports_generic_auth();
> +}

... oh you do check. Could you cover this in the commit message? (to avoid an
UNDEF being taken by the guest we ... )

cpufeature.h is a strange place to put this, there are no other kvm symbols in
there. But there are users of system_supports_foo() in kvm_host.h.


> diff --git a/arch/arm64/kvm/hyp/ptrauth-sr.c b/arch/arm64/kvm/hyp/ptrauth-sr.c
> new file mode 100644
> index 000..0576c01
> --- /dev/null
> +++ b/arch/arm64/kvm/hyp/ptrauth-sr.c
> @@ -0,0 +1,44 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * arch/arm64/kvm/hyp/ptrauth-sr.c: Guest/host ptrauth save/restore
> + *
> + * Copyright 2018 Arm Limited
> + * Author: Mark Rutland 
> + * Amit Daniel Kachhap 
> + */
> +#include 
> +#include 
> +
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +static __always_inline bool __hyp_text __ptrauth_is_enabled(struct kvm_vcpu 
> *vcpu)

Why __always_inline? Doesn't the compiler decide for 'static' symbols in C 
files?


> +{
> + return IS_ENABLED(CONFIG_ARM64_PTR_AUTH) &&
> + vcpu->arch.ctxt.hcr_el2 & (HCR_API | HCR_APK);
> +}
> +
> +void __no_ptrauth __hyp_text __ptrauth_switch_to_guest(struct kvm_vcpu *vcpu,
> +   struct kvm_cpu_context *host_ctxt,
> +   struct kvm_cpu_context *guest_ctxt)
> +{
> + if (!__ptrauth_is_enabled(vcpu))
> + return;
> +

> + ptrauth_keys_store((struct ptrauth_keys *) 
> &host_ctxt->sys_regs[APIAKEYLO_EL1]);

We can't cast part of an array to a structure like this. What happens if the
compiler inserts padding in struct-ptrauth_keys, or the struct randomization
thing gets hold of it: https://lwn.net/Articles/722293/

If we want to use the helpers that take a struct-ptrauth_keys, we need to keep
the keys in a struct-ptrauth_keys. To do this we'd need to provide accessors so
that GET_ONE_REG() of APIAKEYLO_EL1 comes from the struct-ptrauth_keys, instead
of the sys_reg array.
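An untested sketch of what I mean, assuming we keep a real 'struct
ptrauth_keys' in the vcpu context (vcpu_ptrauth_key() is a made-up helper
name):

static unsigned long *vcpu_ptrauth_key(struct kvm_vcpu *vcpu, int reg)
{
	struct ptrauth_keys *keys = &vcpu->arch.ctxt.ptrauth_keys;

	switch (reg) {
	case APIAKEYLO_EL1:	return &keys->apia.lo;
	case APIAKEYHI_EL1:	return &keys->apia.hi;
	case APIBKEYLO_EL1:	return &keys->apib.lo;
	case APIBKEYHI_EL1:	return &keys->apib.hi;
	/* ... and the same for apda, apdb and apga ... */
	default:		return NULL;
	}
}

GET/SET_ONE_REG would go through something like this instead of indexing
sys_regs[], and the hyp code could pass &vcpu->arch.ctxt.ptrauth_keys
straight to the existing helpers.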


Wouldn't the host keys be available somewhere else? (they must get transfer to
secondary CPUs somehow). Can we skip the save step when switching from the host?


> + ptrauth_keys_switch((struct ptrauth_keys *) 
> &guest_ctxt->sys_regs[APIAKEYLO_EL1]);
> +}

> diff --git a/arch/arm64/include/asm/kvm_host.h 
> b/arch/arm64/include/asm/kvm_host.h
> index 1f2d237..c798d0c 100644
> --- a/arch/arm64/include/asm/kvm_host.h
> +++ b/arch/arm64/include/asm/kvm_host.h
> @@ -439,6 +451,18 @@ static inline bool kvm_arch_requires_vhe(void)
>   return false;
>  }
>
> +void kvm_arm_vcpu_ptrauth_enable(struct kvm_vcpu *vcpu);
> +void kvm_arm_vcpu_ptrauth_disable(struct kvm_vcpu *vcpu);
> +
> +static inline void 

Re: [PATCH v5 4/5] arm64/kvm: add a userspace option to enable pointer authentication

2019-01-31 Thread James Morse
Hi Amit,

On 28/01/2019 06:58, Amit Daniel Kachhap wrote:
> This feature will allow the KVM guest to allow the handling of
> pointer authentication instructions or to treat them as undefined
> if not set. It uses the existing vcpu API KVM_ARM_VCPU_INIT to
> supply this parameter instead of creating a new API.
> 
> A new register is not created to pass this parameter via
> SET/GET_ONE_REG interface as just a flag (KVM_ARM_VCPU_PTRAUTH)
> supplied is enough to enable this feature.


> diff --git a/Documentation/arm64/pointer-authentication.txt
b/Documentation/arm64/pointer-authentication.txt
> index a25cd21..0529a7d 100644
> --- a/Documentation/arm64/pointer-authentication.txt
> +++ b/Documentation/arm64/pointer-authentication.txt
> @@ -82,7 +82,8 @@ pointers).
>  Virtualization
>  --
>
> -Pointer authentication is not currently supported in KVM guests. KVM
> -will mask the feature bits from ID_AA64ISAR1_EL1, and attempted use of
> -the feature will result in an UNDEFINED exception being injected into
> -the guest.
> +Pointer authentication is enabled in KVM guest when virtual machine is
> +created by passing a flag (KVM_ARM_VCPU_PTRAUTH)

Isn't that a VCPU flag? Shouldn't this be when each VCPU is created?


> requesting this feature
> +to be enabled. Without this flag, pointer authentication is not enabled
> +in KVM guests and attempted use of the feature will result in an UNDEFINED
> +exception being injected into the guest.

... what happens if KVM's user-space enables ptrauth on some vcpus, but not on
others?

You removed the id-register suppression in the previous patch, but it doesn't
get hooked up to kvm_arm_vcpu_ptrauth_allowed() here. (you could add
kvm_arm_vcpu_ptrauth_allowed() earlier, and default it to true to make it 
easier).

Doesn't this mean that if the CPU supports pointer auth, but user-space doesn't
specify this flag, the guest gets mysterious undef's whenever it tries to use
the advertised feature?

(whether we support big/little virtual-machines is probably a separate issue,
but the id registers need to be consistent with our trap-and-undef behaviour)
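An untested sketch of the read_id_reg() hunk I'd expect, so the ID bits
stay consistent with the traps (using the kvm_arm_vcpu_ptrauth_allowed()
helper this patch adds):

	if (id == SYS_ID_AA64ISAR1_EL1 &&
	    !kvm_arm_vcpu_ptrauth_allowed(vcpu)) {
		const u64 ptrauth_mask = (0xfUL << ID_AA64ISAR1_APA_SHIFT) |
					 (0xfUL << ID_AA64ISAR1_API_SHIFT) |
					 (0xfUL << ID_AA64ISAR1_GPA_SHIFT) |
					 (0xfUL << ID_AA64ISAR1_GPI_SHIFT);

		val &= ~ptrauth_mask;
	}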


> diff --git a/arch/arm64/include/asm/kvm_host.h 
> b/arch/arm64/include/asm/kvm_host.h
> index c798d0c..4a6ec40 100644
> --- a/arch/arm64/include/asm/kvm_host.h
> +++ b/arch/arm64/include/asm/kvm_host.h
> @@ -453,14 +453,15 @@ static inline bool kvm_arch_requires_vhe(void)
>  
>  void kvm_arm_vcpu_ptrauth_enable(struct kvm_vcpu *vcpu);
>  void kvm_arm_vcpu_ptrauth_disable(struct kvm_vcpu *vcpu);
> +bool kvm_arm_vcpu_ptrauth_allowed(struct kvm_vcpu *vcpu);
>  
>  static inline void kvm_arm_vcpu_ptrauth_reset(struct kvm_vcpu *vcpu)
>  {
>   /* Disable ptrauth and use it in a lazy context via traps */
> - if (has_vhe() && kvm_supports_ptrauth())
> + if (has_vhe() && kvm_supports_ptrauth()
> + && kvm_arm_vcpu_ptrauth_allowed(vcpu))
>   kvm_arm_vcpu_ptrauth_disable(vcpu);
>  }
> -
>  void kvm_arm_vcpu_ptrauth_trap(struct kvm_vcpu *vcpu);
>  

> diff --git a/arch/arm64/kvm/handle_exit.c b/arch/arm64/kvm/handle_exit.c
> index 5b980e7..c0e5dcd 100644
> --- a/arch/arm64/kvm/handle_exit.c
> +++ b/arch/arm64/kvm/handle_exit.c
> @@ -179,7 +179,8 @@ static int handle_sve(struct kvm_vcpu *vcpu, struct 
> kvm_run *run)
>   */
>  void kvm_arm_vcpu_ptrauth_trap(struct kvm_vcpu *vcpu)
>  {
> - if (has_vhe() && kvm_supports_ptrauth())
> + if (has_vhe() && kvm_supports_ptrauth()
> + && kvm_arm_vcpu_ptrauth_allowed(vcpu))

Duplication. If has_vhe() moved into kvm_supports_ptrauth(), and
kvm_supports_ptrauth() was called from kvm_arm_vcpu_ptrauth_allowed() it would
be clearer that use of this feature was becoming user-controlled policy.

(We don't need to list the dependencies at every call site)
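i.e. something like this untested sketch:

static inline bool kvm_supports_ptrauth(void)
{
	return has_vhe() && system_supports_address_auth() &&
	       system_supports_generic_auth();
}

static inline bool kvm_arm_vcpu_ptrauth_allowed(struct kvm_vcpu *vcpu)
{
	return kvm_supports_ptrauth() &&
	       test_bit(KVM_ARM_VCPU_PTRAUTH, vcpu->arch.features);
}

... then the call sites only need the kvm_arm_vcpu_ptrauth_allowed() check.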


> diff --git a/arch/arm64/kvm/hyp/ptrauth-sr.c b/arch/arm64/kvm/hyp/ptrauth-sr.c
> index 0576c01..369624f 100644
> --- a/arch/arm64/kvm/hyp/ptrauth-sr.c
> +++ b/arch/arm64/kvm/hyp/ptrauth-sr.c
> @@ -42,3 +42,16 @@ void __no_ptrauth __hyp_text 
> __ptrauth_switch_to_host(struct kvm_vcpu *vcpu,
>   ptrauth_keys_store((struct ptrauth_keys *) 
> &guest_ctxt->sys_regs[APIAKEYLO_EL1]);
>   ptrauth_keys_switch((struct ptrauth_keys *) 
> &host_ctxt->sys_regs[APIAKEYLO_EL1]);
>  }
> +
> +/**
> + * kvm_arm_vcpu_ptrauth_allowed - checks if ptrauth feature is present in 
> vcpu

('enabled by KVM's user-space' may be clearer. 'Present in vcpu' could be down
to a cpufeature thing)


> + *
> + * @vcpu: The VCPU pointer
> + *
> + * This function will be used to enable/disable ptrauth in guest as 
> configured

... but it just tests the bit ...

> + * by the KVM userspace API.
> + */
> +bool kvm_arm_vcpu_ptrauth_allowed(struct kvm_vcpu *vcpu)
> +{
> + return test_bit(KVM_ARM_VCPU_PTRAUTH, vcpu->arch.features);
> +}


Thanks,

James


Re: [PATCH v5 2/5] arm64/kvm: preserve host HCR_EL2/MDCR_EL2 value

2019-01-31 Thread James Morse
Hi Amit,

On 28/01/2019 06:58, Amit Daniel Kachhap wrote:
> When restoring HCR_EL2 for the host, KVM uses HCR_HOST_VHE_FLAGS, which
> is a constant value. This works today, as the host HCR_EL2 value is
> always the same, but this will get in the way of supporting extensions
> that require HCR_EL2 bits to be set conditionally for the host.
> 
> To allow such features to work without KVM having to explicitly handle
> every possible host feature combination, this patch has KVM save/restore
> the host HCR when switching to/from a guest HCR. The saving of the
> register is done once during cpu hypervisor initialization state and is
> just restored after switch from guest.
> 
> For fetching HCR_EL2 during kvm initialisation, a hyp call is made using
> kvm_call_hyp and is helpful in NHVE case.

> For the hyp TLB maintenance code, __tlb_switch_to_host_vhe() is updated
> to toggle the TGE bit with a RMW sequence, as we already do in
> __tlb_switch_to_guest_vhe().


> While at it, host MDCR_EL2 value is fetched in a similar way and restored
> after every switch from host to guest. There should not be any change in
> functionality due to this.

Could this step be done as a separate subsequent patch? It would make review
easier! The MDCR stuff would be a simplification if done second, done in one go
like this it's pretty noisy.

There ought to be some justification for moving hcr/mdcr into the cpu_context in
the commit message.


If you're keeping Mark's 'Signed-off-by' it would be normal to keep Mark as the
author in git. This shows up as an extra 'From:' when you post the patch, and
gets picked up when the maintainer runs git-am.

This patch has changed substantially from Mark's version:
https://lkml.org/lkml/2017/11/27/675

If you keep the signed-off-by, could you add a [note] in the signed-off area
with a terse summary. Something like:
> Signed-off-by: Mark Rutland 
[ Move hcr to cpu_context, added __cpu_copy_hyp_conf()]
> Signed-off-by: Amit Daniel Kachhap 

(9c06602b1b92 is a good picked-at-random example for both of these)


> diff --git a/arch/arm64/include/asm/kvm_asm.h 
> b/arch/arm64/include/asm/kvm_asm.h
> index f5b79e9..2da6e43 100644
> --- a/arch/arm64/include/asm/kvm_asm.h
> +++ b/arch/arm64/include/asm/kvm_asm.h
> @@ -80,6 +80,8 @@ extern void __vgic_v3_init_lrs(void);
>  
>  extern u32 __kvm_get_mdcr_el2(void);
>  
> +extern u64 __kvm_get_hcr_el2(void);

Do we need these in separate helpers? For non-vhe this means two separate trips
to EL2. Something like kvm_populate_host_context(void), and an __ version for
the bit at EL2?

We don't need to pass the host-context to EL2 as once kvm is loaded we can
access host per-cpu variables at EL2 using __hyp_this_cpu_read(). This will save
passing the vcpu around.
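Something like this untested sketch, assuming hcr_el2/mdcr_el2 live in the
cpu_context as they do in this series:

void __hyp_text __kvm_populate_host_regs(void)
{
	struct kvm_cpu_context *host_ctxt;

	host_ctxt = __hyp_this_cpu_ptr(kvm_host_cpu_state);
	host_ctxt->hcr_el2 = read_sysreg(hcr_el2);
	host_ctxt->mdcr_el2 = read_sysreg(mdcr_el2);
}

... one hyp call per CPU, with nothing passed in or returned.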


> @@ -458,6 +457,25 @@ int kvm_arm_vcpu_arch_has_attr(struct kvm_vcpu *vcpu,
>  
>  static inline void __cpu_init_stage2(void) {}
>  
> +/**
> + * __cpu_copy_hyp_conf - copy the boot hyp configuration registers
> + *
> + * It is called once per-cpu during CPU hyp initialisation.
> + */
> +static inline void __cpu_copy_hyp_conf(void)
> +{
> + kvm_cpu_context_t *host_cxt = this_cpu_ptr(&kvm_host_cpu_state);
> +
> + host_cxt->hcr_el2 = kvm_call_hyp(__kvm_get_hcr_el2);
> +
> + /*
> +  * Retrieve the initial value of mdcr_el2 so we can preserve
> +  * MDCR_EL2.HPMN which has presumably been set-up by some
> +  * knowledgeable bootcode.
> +  */
> + host_cxt->mdcr_el2 = kvm_call_hyp(__kvm_get_mdcr_el2);
> +}

It's strange to make this an inline in a header. kvm_arm_init_debug() is a
static-inline for arch/arm, but a regular C function for arch/arm64. Can't we do
the same?


> diff --git a/arch/arm64/kvm/hyp/sysreg-sr.c b/arch/arm64/kvm/hyp/sysreg-sr.c
> index 68d6f7c..22c854a 100644
> --- a/arch/arm64/kvm/hyp/sysreg-sr.c
> +++ b/arch/arm64/kvm/hyp/sysreg-sr.c
> @@ -316,3 +316,14 @@ void __hyp_text __kvm_enable_ssbs(void)
>  	"msr	sctlr_el2, %0"
>  	: "=&r" (tmp) : "L" (SCTLR_ELx_DSSBS));
>  }
> +
> +/**
> + * __read_hyp_hcr_el2 - Returns hcr_el2 register value
> + *
> + * This function acts as a function handler parameter for kvm_call_hyp and
> + * may be called from EL1 exception level to fetch the register value.
> + */
> +u64 __hyp_text __kvm_get_hcr_el2(void)
> +{
> + return read_sysreg(hcr_el2);
> +}

While I'm all in favour of kernel-doc comments for functions, it may be
over-kill in this case!


> diff --git a/virt/kvm/arm/arm.c b/virt/kvm/arm/arm.c
> index 9e350fd3..2d65ada 100644
> --- a/virt/kvm/arm/arm.c
> +++ b/virt/kvm/arm/arm.c
> @@ -1327,10 +1327,10 @@ static void cpu_hyp_reinit(void)
>   else
>   cpu_init_hyp_mode(NULL);
>  
> - kvm_arm_init_debug();
> -
>   if (vgic_present)
>   kvm_vgic_init_cpu_hardware();
> +
> + __cpu_copy_hyp_conf();
>  }

Was there a reason to make this call later than it originally was?
(kvm_vgic_init_cpu_hardware() doesn't use any of those values, so its fine, just
curious!)


Thanks,


Re: [PATCH v5 1/5] arm64: Add utilities to save restore pointer authentication keys

2019-01-31 Thread James Morse
Hi Amit,

On 28/01/2019 06:58, Amit Daniel Kachhap wrote:
> The keys can be switched either inside an assembly or such
> functions which do not have pointer authentication checks, so a GCC
> attribute is added to enable it.
> 
> A function ptrauth_keys_store is added which is similar to existing
> function ptrauth_keys_switch but saves the key values in memory.
> This may be useful for save/restore scenarios when CPU changes
> privilege levels, suspend/resume etc.


> diff --git a/arch/arm64/include/asm/pointer_auth.h 
> b/arch/arm64/include/asm/pointer_auth.h
> index 15d4951..98441ce 100644
> --- a/arch/arm64/include/asm/pointer_auth.h
> +++ b/arch/arm64/include/asm/pointer_auth.h
> @@ -11,6 +11,13 @@
>  
>  #ifdef CONFIG_ARM64_PTR_AUTH
>  /*
> + * Compile the function without pointer authentication instructions. This
> + * allows pointer authentication to be enabled/disabled within the function
> + * (but leaves the function unprotected by pointer authentication).
> + */
> +#define __no_ptrauth __attribute__((target("sign-return-address=none")))

The documentation[0] for this says 'none' is the default. Will this only
take-effect once the kernel supports pointer-auth for the host? (Is this just
documentation until then?)

('noptrauth' would fit with 'notrace' slightly better)


Thanks,

James

[0]
https://gcc.gnu.org/onlinedocs/gcc/AArch64-Function-Attributes.html#AArch64-Function-Attributes



Re: [PATCH v4 3/6] arm64/kvm: add a userspace option to enable pointer authentication

2019-01-31 Thread James Morse
Hi Amit,

On 09/01/2019 10:13, Amit Daniel Kachhap wrote:
> On Sat, Jan 5, 2019 at 12:05 AM James Morse  wrote:
>> On 18/12/2018 07:56, Amit Daniel Kachhap wrote:
>>> This feature will allow the KVM guest to allow the handling of
>>> pointer authentication instructions or to treat them as undefined
>>> if not set. It uses the existing vcpu API KVM_ARM_VCPU_INIT to
>>> supply this parameter instead of creating a new API.
>>>
>>> A new register is not created to pass this parameter via
>>> SET/GET_ONE_REG interface as just a flag (KVM_ARM_VCPU_PTRAUTH)
>>> supplied is enough to select this feature.
>>
>> What is the motivation for doing this? Doesn't ptrauth 'just' need turning 
>> on?
>> It doesn't need additional setup to be useable, or rely on some qemu support 
>> to
>> work properly. There isn't any hidden state that can't be migrated in the 
>> usual way.
>> Is it just because we don't want to commit to the ABI?

> This allows migration of guest to non pointer authenticated supported
> systems and hides the extra ptrauth registers.

The MIDR already has to match, so the hardware must be the same. I guess this
lets us hide the new feature so old-guests can migrate to a new-kernel without a
write to the id registers failing.


> Basically this suggestion was given by Christoffer
> (https://lore.kernel.org/lkml/20180206123847.GY21802@cbox/).

Ah, Christoffer asked for it, that's reason enough!


Thanks,

James


Re: [PATCH v3 0/1] arm64: Add workaround for Fujitsu A64FX erratum 010001

2019-01-30 Thread James Morse

Hi!

On 01/29/2019 12:29 PM, Zhang, Lei wrote:

On some variants of the Fujitsu-A64FX cores ver(1.0, 1.1),
memory accesses may cause undefined fault (Data abort, DFSC=0b11).
This problem will be fixed by next version of Fujitsu-A64FX.

This fault occurs under a specific hardware condition
when a load/store instruction perform an address translation using:
   case-1  TTBR0_EL1 with TCR_EL1.NFD0 == 1.
   case-2  TTBR0_EL2 with TCR_EL2.NFD0 == 1.
   case-3  TTBR1_EL1 with TCR_EL1.NFD1 == 1.
   case-4  TTBR1_EL2 with TCR_EL2.NFD1 == 1.
And this fault occurs completely spurious.

Since TCR_ELx.NFD1 is set to '1' at the kernel in versions
past 4.17, the case-3 or case-4 may happen.


e03e61c3173c ("arm64: kaslr: Set TCR_EL1.NFD1 when 
CONFIG_RANDOMIZE_BASE=y") ?


So you'd never see it if you disabled CONFIG_RANDOMIZE_BASE?


Thanks,

James


Re: [PATCH v3 0/1] arm64: Add workaround for Fujitsu A64FX erratum 010001

2019-01-30 Thread James Morse

Hi guys,

On 01/29/2019 06:10 PM, Catalin Marinas wrote:

Could you please copy the whole description from the cover letter to the
actual patch and only send one email (full description as in here
together with the patch)? If we commit this to the kernel, it would be
useful to have the information in the log for reference later on.

More comments below:

On Tue, Jan 29, 2019 at 12:29:58PM +, Zhang, Lei wrote:

On some variants of the Fujitsu-A64FX cores ver(1.0, 1.1),
memory accesses may cause undefined fault (Data abort, DFSC=0b11).
This problem will be fixed by next version of Fujitsu-A64FX.

This fault occurs under a specific hardware condition
when a load/store instruction perform an address translation using:
   case-1  TTBR0_EL1 with TCR_EL1.NFD0 == 1.
   case-2  TTBR0_EL2 with TCR_EL2.NFD0 == 1.
   case-3  TTBR1_EL1 with TCR_EL1.NFD1 == 1.
   case-4  TTBR1_EL2 with TCR_EL2.NFD1 == 1.
And this fault occurs completely spurious.


So this looks like new information on the hardware behaviour since the
v2 of the patch. Can this fault occur for any type of instruction
accessing the memory or only for SVE instructions?


Since TCR_ELx.NFD1 is set to '1' at the kernel in versions
past 4.17, the case-3 or case-4 may happen.

This fault can be taken only at stage-1,
so this fault is taken from EL0 to EL1/EL2, from EL1 to EL1,
or from EL2 to EL2.

I would like to post a workaround to avoid this problem on
existing Fujitsu-A64FX version.


How likely is it to trigger this erratum? In other words, aren't we
better off with a spurious fault that we ignore rather than toggling the
TCR_ELx.NFD1 bit?


It sounds like the spurious fault can occur as a result of load/store. 
('there is no load/store instruction between'...).


If this can happen in kernel_entry it will overwrite the exception 
registers, and we lose the original ELR.


If load/store trigger it, I don't think we can ignore it.

Thanks,

James


Re: [PATCH v9 01/26] arm64: Fix HCR.TGE status for NMI contexts

2019-01-28 Thread James Morse
Hi Julien,

On 21/01/2019 15:33, Julien Thierry wrote:
> When using VHE, the host needs to clear HCR_EL2.TGE bit in order
> to interract with guest TLBs, switching from EL2&0 translation regime

(interact)


> to EL1&0.
> 
> However, some non-maskable asynchronous event could happen while TGE is
> cleared like SDEI. Because of this address translation operations
> relying on EL2&0 translation regime could fail (tlb invalidation,
> userspace access, ...).
> 
> Fix this by properly setting HCR_EL2.TGE when entering NMI context and
> clear it if necessary when returning to the interrupted context.

Yes please. This would not have been fun to debug!

Reviewed-by: James Morse 



I was looking for why we need core code to do this, instead of updating the
arch's call sites. Your 'irqdesc: Add domain handlers for NMIs' patch (pointed
to from the cover letter) is the reason: core-code calls nmi_enter()/nmi_exit()
itself.


Thanks,

James


> diff --git a/arch/arm64/include/asm/hardirq.h 
> b/arch/arm64/include/asm/hardirq.h
> index 1473fc2..94b7481 100644
> --- a/arch/arm64/include/asm/hardirq.h
> +++ b/arch/arm64/include/asm/hardirq.h
> @@ -19,6 +19,7 @@
>  #include 
>  #include 
>  #include 
> +#include 

percpu.h?
sysreg.h?
barrier.h?


> @@ -37,6 +38,33 @@
>  
>  #define __ARCH_IRQ_EXIT_IRQS_DISABLED	1
>  
> +struct nmi_ctx {
> + u64 hcr;
> +};
> +
> +DECLARE_PER_CPU(struct nmi_ctx, nmi_contexts);
> +
> +#define arch_nmi_enter() 
> \
> + do {
> \
> + if (is_kernel_in_hyp_mode()) {  
> \
> + struct nmi_ctx *nmi_ctx = this_cpu_ptr(&nmi_contexts);  
> \
> + nmi_ctx->hcr = read_sysreg(hcr_el2);
> \
> + if (!(nmi_ctx->hcr & HCR_TGE)) {
> \
> + write_sysreg(nmi_ctx->hcr | HCR_TGE, hcr_el2);  
> \
> + isb();  
> \
> + }   
> \
> + }   
> \
> + } while (0)
> +
> +#define arch_nmi_exit()  
> \
> + do {
> \
> + if (is_kernel_in_hyp_mode()) {  
> \
> + struct nmi_ctx *nmi_ctx = this_cpu_ptr(&nmi_contexts);  
> \
> + if (!(nmi_ctx->hcr & HCR_TGE))  
> \
> + write_sysreg(nmi_ctx->hcr, hcr_el2);
> \
> + }   
> \
> + } while (0)
> +
>  static inline void ack_bad_irq(unsigned int irq)
>  {
>   extern unsigned long irq_err_count;



> diff --git a/include/linux/hardirq.h b/include/linux/hardirq.h
> index 0fbbcdf..da0af63 100644
> --- a/include/linux/hardirq.h
> +++ b/include/linux/hardirq.h
> @@ -60,8 +60,14 @@ static inline void rcu_nmi_exit(void)
>   */
>  extern void irq_exit(void);
>  
> +#ifndef arch_nmi_enter
> +#define arch_nmi_enter() do { } while (0)
> +#define arch_nmi_exit()  do { } while (0)
> +#endif
> +
>  #define nmi_enter()  \
>   do {\
> + arch_nmi_enter();   \
>   printk_nmi_enter(); \
>   lockdep_off();  \
>   ftrace_nmi_enter(); \
> @@ -80,6 +86,7 @@ static inline void rcu_nmi_exit(void)
>   ftrace_nmi_exit();  \
>   lockdep_on();   \
>   printk_nmi_exit();  \
> + arch_nmi_exit();\
>   } while (0)
>  
>  #endif /* LINUX_HARDIRQ_H */
> 



Re: [PATCH] EDAC, dmc520:: add DMC520 EDAC driver

2019-01-23 Thread James Morse
Hi Rui,

On 23/01/2019 00:42, Rui Zhao wrote:
> On Monday, January 21, 2019 9:09 AM, James Morse wrote:
>> It would be good if we can make this generic, so it works on all platforms 
>> with
>> a DMC520, possibly along with other components. (e.g. the a15 L2 driver 
>> posted
>> recently).
>>
>> This will mostly be getting the DT right, as we can refactor the code when a
>> second user comes along, but can't change the DT format.

> Agreed. It'd be good if we can move all platform specific settings to DT such
> that a second user can update the code and won't break the DT for original
> device. I'll think about the DT format to make it more generic.

When the time comes, could you post a dt-binding as the first patch? These add
the documentation under Documentation/device-tree/bindings, and need to be CC'd
to the DT folks.


>> The TRM describes 'a set of interrupts', which ones does the binding 
>> anticipate
>> are wired up for this? There are separate status bits for corrected and
>> uncorrected, and one pair for dram versus ram. (not sure what this 
>> corresponds
>> with).

>> Do we care about the link-error interrupt?

> Will add ram interrupts handling, and make it configurable in DT.

I don't think the code needs to change, we just need to make it clear these are
the dram interrupts, and they are different numbers.
This means someone can add the ram interrupts later, and not have to
disambiguate them to keep your platform going.

(its this stuff that the binding document describes)


> Could you please
> share a bit more info on link-error interrupt? TRM doesn't have detailed info 
> about
> it.

I only have the TRM too! The section titled 'RAS' on page 2-23 talks about 'link
error protection', so it might be relevant to someone. We don't care about this
now, but if someone does in the future, we will need to have a way of adding it
to the DT.

> Would like to know what's the impact if this error happens, and how to fit it
> with current reporting in EDAC core.

At a guess the interrupt triggers when link_err_count increases. (link_err has
an overflow bit, so the interrupt must be related to a counter).

If we could associate a link with a layer in edac, we could report errors
against that point. But I've no idea how 'links' correspond with 'ranks and 
banks'!


>>> + layers[1].type = EDAC_MC_LAYER_CHANNEL;
>>> + layers[1].size = DMC520_EDAC_CHANS;
>>> + layers[1].is_virt_csrow = false;

>> If you can read the rank count, why hard code the bank?
> 
>> (which is what I assume channels corresponds to ... although this has 
>> confused
>> me before [0]).

> I misunderstood what channel meant. For bank, we can read config from the 
> register.

It's an assumption. Channel seems to be an overloaded term with different meanings.
meanings.
Fortunately edac drivers get to pick what their layers mean. It looks like all
this controls is the names that come out of sysfs.

rank/bank are part of the address decoding, as we can read both sizes I think it
makes sense to use both as it should further localise the error.
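i.e. something like (sketch: dmc520_get_bank_count() is a made-up helper,
mirroring dmc520_get_rank_count()):

	layers[1].type = EDAC_MC_LAYER_CHANNEL;
	layers[1].size = dmc520_get_bank_count(reg_base);
	layers[1].is_virt_csrow = false;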


>>> + mci->scrub_cap = SCRUB_HW_SRC;
>>> + mci->scrub_mode = SCRUB_NONE;

>> Is this saying the device supports error scrubbing, but its disabled?
>> Do we know that?

>> Can the user try and turn it on? (I can't find anything that reads this!)

> We don’t want to change what’s already configured.

I don't think we could if we wanted to. This thing has 'READY' and 'CONFIG'
modes, the scrubbing can only be written to in 'CONFIG' mode. I'm willing to bet
we need it to be in 'READY' to be executing the kernel from dram.


> Will change scrub_mode to HW.

Do we know it's enabled? This is something firmware has to set up, someone else's
platform may do it differently.

I think we should read one of the scrub control registers to find out if it's
turned on.

But, I can't find what uses this value ...
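Something like this untested sketch (the register/field names here are
guesses, not the TRM names):

static enum scrub_type dmc520_get_scrub_mode(struct dmc520_edac *edac)
{
	u32 reg_val = dmc520_read_reg(edac, REG_OFFSET_SCRUB_CONTROL0);

	return FIELD_GET(REG_FIELD_SCRUB_EN, reg_val) ? SCRUB_HW_SRC :
							SCRUB_NONE;
}

... and mci->scrub_mode takes whatever that returns.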


>>> + /* Enable interrupts */
>>> + dmc520_write_reg(edac,
>>> +  DRAM_ECC_INT_CE_MASK | DRAM_ECC_INT_UE_MASK,
>>> +  REG_OFFSET_INTERRUPT_CONTROL);

>> What if they were enabled before? (e.g. enabled by firmware, the bootloader 
>> or
>> kdump). If they're already enabled, can we race with the interrupt handler on
>> another CPU and get double reporting of the error counter?

> We don't expect interrupts to be enabled by firmware or bootloader if this 
> driver
> is enabled.

What about a previous instance of this driver? Linux supports kexec, kdump and
hibernate, all of which cause us to inherit a slightly used platform.


> If firmware enables it, they're suppose to handle the interrupt.

Ah, so you still have resident firmware!
How come your firmware trusts linux not to turn off the memory controller?!
These things are usually protected by trust zone so the OS can't pull the memory
from under firmware's feet.


Thanks,

James


Re: [PATCH] arm64 memory accesses may cause undefined fault on Fujitsu-A64FX

2019-01-22 Thread James Morse
Hello,

On 22/01/2019 02:05, Zhang, Lei wrote:
> Mark Rutland wrote:
>> * How often does this fault occur?
> In my test, this fault occurs once every several times
> in the OS boot sequence, and after the completion of OS boot,
> this fault have never occurred.
> In my opinion, this fault rarely occurs
> after the completion of OS boot.

Can you share anything about why this is? You mention a hardware-condition
that is reset at exception entry


>> I'm a bit surprised by the single retry. Is there any guarantee that a
>> thread will eventually stop delivering this fault code?

> I guarantee that a thread will stop delivering this 
> fault code by the this patch.
> The hardware condition which cause this fault is 
> reset at exception entry, therefore execution of at 
> least one instruction is guaranteed by this single retry.

... so it's possible to take this fault during kernel_entry when we've taken an 
irq?

This will overwrite the ELR and SPSR, (and possibly the FAR and ESR), meaning
we've lost that information and can't return to the point in the kernel that
took the irq.

If we try, we might end up spinning through the irq handler, as the ELR might
now point to el1_irq's kernel_entry.

We can spot we took an exception from the entry text ... but all we can do then
is panic().
I'm not sure it's worth working around this if it's just a matter of time before
this happens. (you mention it's less likely after boot, it would be good to know
why...)
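If we did want to catch it, the best I can think of is something in the
fault handler along these lines (untested sketch, assuming the patch adds a
cpucap for the affected CPUs):

	if (cpus_have_cap(ARM64_WORKAROUND_FUJITSU_A64FX_011) &&
	    in_entry_text(instruction_pointer(regs)))
		panic("A64FX erratum 010001: spurious fault in entry code");

... which at least tells you the original ELR/ESR are gone, but can't do
anything to recover them.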


Thanks,

James


Re: [PATCH] EDAC: Add James Morse as a reviewer

2019-01-21 Thread James Morse
Hi Boris,

On 21/01/2019 12:39, Borislav Petkov wrote:
> From: Borislav Petkov 
> 
> With the growing amount of ARM[,64] EDAC enablement happening, add James
> as a reviewer for the ARM bits.

Sure,


> Signed-off-by: Borislav Petkov 

Acked-by: James Morse 


Thanks,

James


Re: [PATCH] EDAC, dmc520:: add DMC520 EDAC driver

2019-01-21 Thread James Morse
Hi Sasha, Rui,

On 18/01/2019 16:23, Sasha Levin wrote:
> From: Rui Zhao 
> New driver supports DRAM error detection and correction on DMC520
> controller.

> Validated on actual hardware: DRAM errors showed up once the DDR core
> voltage was lowered down by 200+mV using test tool.

That's quite cool!


> ---
>  MAINTAINERS|   6 +
>  drivers/edac/Kconfig   |   7 +
>  drivers/edac/Makefile  |   1 +
>  drivers/edac/dmc520_edac.c | 495 +
>  4 files changed, 509 insertions(+)
>  create mode 100644 drivers/edac/dmc520_edac.c

Where do I find the dt-binding for this?

It would be good if we can make this generic, so it works on all platforms with
a DMC520, possibly along with other components. (e.g. the a15 L2 driver posted
recently).

This will mostly be getting the DT right, as we can refactor the code when a
second user comes along, but can't change the DT format.


The TRM describes 'a set of interrupts', which ones does the binding anticipate
are wired up for this? There are separate status bits for corrected and
uncorrected, and one pair for dram versus ram. (not sure what this corresponds
with).

Do we care about the link-error interrupt?


It looks like your platform has wired corrected and uncorrected dram up as separate
SPI, but not touched the ram interrupts. These are choices the soc designers
made, we need to capture this stuff in the binding.

A system may have multiple memory controllers, they may share the interrupts.

(It looks like you've folded these corners out by not using IRQF_SHARED: this
driver only works for independent interrupts, which can run concurrently, but
work fine because the only register they both touch is interrupt_clr, which is
write-only.)


For these pre-v8.2-RAS things the expectation is firmware handles all this
stuff. I'm surprised your platform hasn't made the memory-controller
'secure-only', so only platform-firmware can touch it.

I can't see how this could be used on v8/aarch64, it would imply there is no
firmware, which suggests this is a UP system. (or firmware trusts linux not to
mess it up!)

I'm guessing this is for 32bit, where on your platform linux is running in
'secure'. This is a significant platform policy, to make this generic we'd need
to find a way of describing it. i.e., other platforms may have a dmc520, the DT
may describe where it is, but linux can't touch it.
I believe that today this depends on the bootloader knowing which nodes to
remove when starting linux.

I think this policy-bit can be done by having a soc-family/vendor specific
driver that knows which of the edac components can be used on this platform.
The altera driver does something along these lines in altr_sdram_probe().
Obviously if we can have all the data come from DT that is better.


> diff --git a/drivers/edac/dmc520_edac.c b/drivers/edac/dmc520_edac.c
> new file mode 100644
> index ..5f14889074af
> --- /dev/null
> +++ b/drivers/edac/dmc520_edac.c
> @@ -0,0 +1,495 @@

> +/* Driver settings */
> +#define DMC520_EDAC_CHANS	1
> +#define DMC520_EDAC_ERR_GRAIN	1
> +#define DMC520_EDAC_INT_COUNT	2
> +#define DMC520_EDAC_BUS_WIDTH	8

Should these be in the DT?
If someone else has a dmc520 configured slightly differently, how can we get
both systems going, without having to change your system's DT?


> +static bool dmc520_is_ecc_enabled(struct dmc520_edac *edac)
> +{
> + u32 reg_val = dmc520_read_reg(edac, REG_OFFSET_FEATURE_CONFIG);
> +
> + return (FIELD_GET(REG_FIELD_DRAM_ECC_ENABLED, reg_val) != 0);
> +}

Ah, so there is firmware that sets this up and enables it ...


> +static int dmc520_edac_probe(struct platform_device *pdev)
> +{
> + struct dmc520_edac *edac;
> + struct mem_ctl_info *mci;
> + struct edac_mc_layer layers[2];
> + int ret, irq;
> + struct resource *res;
> + void __iomem *reg_base;
> +
> + res = platform_get_resource(pdev, IORESOURCE_MEM, 0);
> + reg_base = devm_ioremap_resource(&pdev->dev, res);
> + if (IS_ERR(reg_base))
> + return PTR_ERR(reg_base);
> +
> + layers[0].type = EDAC_MC_LAYER_CHIP_SELECT;
> + layers[0].size = dmc520_get_rank_count(reg_base);
> + layers[0].is_virt_csrow = true;
> +
> + layers[1].type = EDAC_MC_LAYER_CHANNEL;
> + layers[1].size = DMC520_EDAC_CHANS;
> + layers[1].is_virt_csrow = false;

If you can read the rank count, why hard code the bank?
(which is what I assume channels corresponds to ... although this has confused
me before [0]).


> + platform_set_drvdata(pdev, mci);
> +
> + mci->pdev = &pdev->dev;
> + mci->mtype_cap = MEM_FLAG_DDR3 | MEM_FLAG_DDR4;
> + mci->edac_ctl_cap = EDAC_FLAG_NONE | EDAC_FLAG_SECDED;

> + mci->scrub_cap = SCRUB_HW_SRC;
> + mci->scrub_mode = SCRUB_NONE;

Is this saying the device supports error scrubbing, but it's disabled?
Do we know that?

Can the user try and turn 
