Re: [PATCH] KVM: PPC: fix oops when checking KVM_CAP_PPC_HTM
On Fri, 15 Sep 2017 07:52:49 +0200 Greg Kurz wrote:
> Dang! The mail relay at OVH has blacklisted Paul's address :-\
>
> : host smtp.samba.org[144.76.82.148] said: 550-blacklisted at
> zen.spamhaus.org 550 https://www.spamhaus.org/sbl/query/SBL370982
> (in reply to RCPT TO command)

Dumb me! It's the opposite... OVH is blacklisted by smtp.samba.org :-\

Sigh.
[PATCH v2] net/ethernet/freescale: fix warning for ucc_geth
uf_info.regs is resource_size_t, i.e. phys_addr_t, which can be either u32
or u64 depending on CONFIG_PHYS_ADDR_T_64BIT. The printk format is thus
adapted to u64 and the regs value cast to u64 to cover both cases.

Signed-off-by: Valentin Longchamp
---
 drivers/net/ethernet/freescale/ucc_geth.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/freescale/ucc_geth.c b/drivers/net/ethernet/freescale/ucc_geth.c
index f77ba9fa257b..a96b838cffce 100644
--- a/drivers/net/ethernet/freescale/ucc_geth.c
+++ b/drivers/net/ethernet/freescale/ucc_geth.c
@@ -3857,8 +3857,9 @@ static int ucc_geth_probe(struct platform_device* ofdev)
 	}
 
 	if (netif_msg_probe(&debug))
-		pr_info("UCC%1d at 0x%8x (irq = %d)\n",
-			ug_info->uf_info.ucc_num + 1, ug_info->uf_info.regs,
+		pr_info("UCC%1d at 0x%8llx (irq = %d)\n",
+			ug_info->uf_info.ucc_num + 1,
+			(u64)ug_info->uf_info.regs,
 			ug_info->uf_info.irq);
 
 	/* Create an ethernet device instance */
-- 
2.13.5
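For reference, printk also provides a %pa specifier that prints a
phys_addr_t/resource_size_t passed by reference and sizes itself to the
configuration automatically. A minimal sketch of that alternative (not the
fix applied above):

	/* Sketch only: %pa takes the address of the resource_size_t and
	 * adapts to CONFIG_PHYS_ADDR_T_64BIT without any cast. */
	pr_info("UCC%1d at %pa (irq = %d)\n",
		ug_info->uf_info.ucc_num + 1,
		&ug_info->uf_info.regs,
		ug_info->uf_info.irq);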
Re: [PATCH] KVM: PPC: fix oops when checking KVM_CAP_PPC_HTM
Dang! The mail relay at OVH has blacklisted Paul's address :-\

: host smtp.samba.org[144.76.82.148] said: 550-blacklisted at
zen.spamhaus.org 550 https://www.spamhaus.org/sbl/query/SBL370982
(in reply to RCPT TO command)

Cc'ing Paul at ozlabs.org

On Fri, 15 Sep 2017 10:48:39 +1000 David Gibson wrote:
> On Thu, Sep 14, 2017 at 11:56:25PM +0200, Greg Kurz wrote:
> > The following program causes a kernel oops:
> >
> > #include <sys/types.h>
> > #include <sys/stat.h>
> > #include <fcntl.h>
> > #include <sys/ioctl.h>
> > #include <linux/kvm.h>
> >
> > main()
> > {
> >     int fd = open("/dev/kvm", O_RDWR);
> >     ioctl(fd, KVM_CHECK_EXTENSION, KVM_CAP_PPC_HTM);
> > }
> >
> > This happens because when using the global KVM fd with
> > KVM_CHECK_EXTENSION, kvm_vm_ioctl_check_extension() gets
> > called with a NULL kvm argument, which gets dereferenced
> > in is_kvmppc_hv_enabled(). Spotted while reading the code.
> >
> > Let's use the hv_enabled fallback variable, like everywhere
> > else in this function.
> >
> > Fixes: 23528bb21ee2 ("KVM: PPC: Introduce KVM_CAP_PPC_HTM")
> > Cc: sta...@vger.kernel.org # v4.7+
> > Signed-off-by: Greg Kurz
>
> I don't think this is right. I'm pretty sure you want to fall back to
> hv_enabled *only when* kvm is NULL. Otherwise if you have a PR guest
> on an HV capable machine, this will give the wrong answer, when called
> for that specific VM.
>

Hmmm... this is what we get with this patch applied:

open("/dev/kvm", O_RDWR)               = 3
ioctl(3, KVM_CHECK_EXTENSION, 0x84)    = 1   <== if HV is present
ioctl(3, KVM_CREATE_VM, 0x1)           = 4   <== HV
ioctl(4, KVM_CHECK_EXTENSION, 0x84)    = 1
ioctl(3, KVM_CREATE_VM, 0x2)           = 5   <== PR
ioctl(5, KVM_CHECK_EXTENSION, 0x84)    = 0

The hv_enabled variable is set as follows:

	/* Assume we're using HV mode when the HV module is loaded */
	int hv_enabled = kvmppc_hv_ops ? 1 : 0;

	if (kvm) {
		/*
		 * Hooray - we know which VM type we're running on. Depend on
		 * that rather than the guess above.
		 */
		hv_enabled = is_kvmppc_hv_enabled(kvm);
	}

so we're good. :)

The last sentence in the commit message is maybe^wprobably not comprehensive
enough... What about the following ?

The hv_enabled variable is initialized to 1 if HV is loaded or 0 otherwise.
In the case KVM_CHECK_EXTENSION is used with a VM fd, hv_enabled is updated
to is_kvmppc_hv_enabled(kvm). Let's use it here, like everywhere else in
this function.

> > ---
> >  arch/powerpc/kvm/powerpc.c |    3 +--
> >  1 file changed, 1 insertion(+), 2 deletions(-)
> >
> > diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
> > index 3480faaf1ef8..ee279c7f4802 100644
> > --- a/arch/powerpc/kvm/powerpc.c
> > +++ b/arch/powerpc/kvm/powerpc.c
> > @@ -644,8 +644,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
> >  		break;
> >  #endif
> >  	case KVM_CAP_PPC_HTM:
> > -		r = cpu_has_feature(CPU_FTR_TM_COMP) &&
> > -			is_kvmppc_hv_enabled(kvm);
> > +		r = cpu_has_feature(CPU_FTR_TM_COMP) && hv_enabled;
> >  		break;
> >  	default:
> >  		r = 0;
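For completeness, a small userspace probe that reproduces the strace above.
This is a sketch assuming a powerpc host, with the HV (0x1) and PR (0x2)
machine types passed to KVM_CREATE_VM exactly as in the trace:

	#include <stdio.h>
	#include <fcntl.h>
	#include <sys/ioctl.h>
	#include <linux/kvm.h>

	int main(void)
	{
		int kvm = open("/dev/kvm", O_RDWR);
		int hv = ioctl(kvm, KVM_CREATE_VM, 0x1);	/* HV, per the trace */
		int pr = ioctl(kvm, KVM_CREATE_VM, 0x2);	/* PR, per the trace */

		/* 0x84 is KVM_CAP_PPC_HTM; expect 1/1/0 on an HV-capable host */
		printf("global fd: %d\n", ioctl(kvm, KVM_CHECK_EXTENSION, KVM_CAP_PPC_HTM));
		printf("HV VM fd:  %d\n", ioctl(hv, KVM_CHECK_EXTENSION, KVM_CAP_PPC_HTM));
		printf("PR VM fd:  %d\n", ioctl(pr, KVM_CHECK_EXTENSION, KVM_CAP_PPC_HTM));
		return 0;
	}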
[PATCH] KVM: PPC: Book3S HV: Check for updated HDSISR on P9 HDSI exception
On POWER9 DD2.1 and below, sometimes on a Hypervisor Data Storage
Interrupt (HDSI) the HDSISR is not updated at all.

To work around this we put a canary value into the HDSISR before returning
to a guest and then check for this canary when we take an HDSI. If we find
the canary on an HDSI, we know the hardware didn't update the HDSISR. In
this case we return to the guest to retake the HDSI, which should correctly
update the HDSISR the second time we take the HDSI.

After talking to Paulus we've applied this workaround to all POWER9 CPUs.
The workaround of returning to the guest shouldn't ever be triggered on a
well behaving CPU, and the extra instructions should have negligible
performance impact.

Signed-off-by: Michael Neuling
---
 arch/powerpc/kvm/book3s_hv_rmhandlers.S | 14 +-
 1 file changed, 13 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
index 663a4a861e..70dca60569 100644
--- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
+++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
@@ -1118,6 +1118,13 @@ END_FTR_SECTION_IFSET(CPU_FTR_HAS_PPR)
 BEGIN_FTR_SECTION
 	mtspr	SPRN_PPR, r0
 END_FTR_SECTION_IFSET(CPU_FTR_HAS_PPR)
+
+/* Move canary into DSISR to check for later */
+BEGIN_FTR_SECTION
+	li	r0, 0x7fff
+	mtspr	SPRN_HDSISR, r0
+END_FTR_SECTION_IFSET(CPU_FTR_ARCH_300)
+
 	ld	r0, VCPU_GPR(R0)(r4)
 	ld	r4, VCPU_GPR(R4)(r4)
 
@@ -1947,9 +1954,14 @@ END_MMU_FTR_SECTION_IFSET(MMU_FTR_TYPE_RADIX)
 kvmppc_hdsi:
 	ld	r3, VCPU_KVM(r9)
 	lbz	r0, KVM_RADIX(r3)
-	cmpwi	r0, 0
 	mfspr	r4, SPRN_HDAR
 	mfspr	r6, SPRN_HDSISR
+BEGIN_FTR_SECTION
+	/* Look for DSISR canary. If we find it, retry instruction */
+	cmpdi	r6, 0x7fff
+	beq	6f
+END_FTR_SECTION_IFSET(CPU_FTR_ARCH_300)
+	cmpwi	r0, 0
 	bne	.Lradix_hdsi		/* on radix, just save DAR/DSISR/ASDR */
 	/* HPTE not found fault or protection fault? */
 	andis.	r0, r6, (DSISR_NOHPTE | DSISR_PROTFAULT)@h
-- 
2.11.0
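To make the workaround easier to follow, here is the canary pattern in C
pseudocode. This is illustrative only; the real workaround is the assembly
above, and mtspr()/mfspr() stand in for the corresponding instructions. The
canary value 0x7fff is the one the patch parks in HDSISR:

	#define HDSISR_CANARY	0x7fff

	/* Before returning to the guest, park the canary in HDSISR. */
	static void arm_hdsisr_canary(void)
	{
		mtspr(SPRN_HDSISR, HDSISR_CANARY);
	}

	/* On HDSI: if the canary is still there, the hardware failed to
	 * update HDSISR, so return to the guest and let it retake the
	 * interrupt, which should produce a fresh HDSISR value. */
	static bool hdsisr_is_stale(void)
	{
		return mfspr(SPRN_HDSISR) == HDSISR_CANARY;
	}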
[PATCH 1/2] powerpc: Add workaround for P9 vector CI load issue
POWER9 DD2.1 and earlier have an issue where some cache inhibited vector
loads will return bad data. The workaround is in two parts: one
firmware/microcode part triggers HMI interrupts when hitting such loads,
and the other part is this patch, which then emulates the instructions in
Linux. The affected instructions are limited to lxvd2x, lxvw4x, lxvb16x and
lxvh8x.

When an instruction triggers the HMI, all threads in the core will be sent
to the HMI handler, not just the one running the vector load. In general,
these spurious HMIs are detected by the emulation code and we just return
back to the running process. Unfortunately, if a spurious interrupt occurs
on a vector load that's to normal memory, we have no way to detect that
it's spurious (unless we walk the page tables, which is very expensive). In
this case we emulate the load, but we need to do so using a vector load
itself to ensure 128-bit atomicity is preserved.

Some additional debugfs emulated instruction counters are also added.

Signed-off-by: Michael Neuling
Signed-off-by: Benjamin Herrenschmidt
---
 arch/powerpc/include/asm/emulated_ops.h |   4 +
 arch/powerpc/include/asm/paca.h         |   1 +
 arch/powerpc/include/asm/uaccess.h      |  17 +++
 arch/powerpc/kernel/exceptions-64s.S    |  16 ++-
 arch/powerpc/kernel/mce.c               |  30 -
 arch/powerpc/kernel/traps.c             | 201 
 arch/powerpc/platforms/powernv/smp.c    |   7 ++
 7 files changed, 271 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/include/asm/emulated_ops.h b/arch/powerpc/include/asm/emulated_ops.h
index f00e10e2a3..651e135449 100644
--- a/arch/powerpc/include/asm/emulated_ops.h
+++ b/arch/powerpc/include/asm/emulated_ops.h
@@ -55,6 +55,10 @@ extern struct ppc_emulated {
 	struct ppc_emulated_entry mfdscr;
 	struct ppc_emulated_entry mtdscr;
 	struct ppc_emulated_entry lq_stq;
+	struct ppc_emulated_entry lxvw4x;
+	struct ppc_emulated_entry lxvh8x;
+	struct ppc_emulated_entry lxvd2x;
+	struct ppc_emulated_entry lxvb16x;
 #endif
 } ppc_emulated;

diff --git a/arch/powerpc/include/asm/paca.h b/arch/powerpc/include/asm/paca.h
index 04b60af027..1e06310ccc 100644
--- a/arch/powerpc/include/asm/paca.h
+++ b/arch/powerpc/include/asm/paca.h
@@ -210,6 +210,7 @@ struct paca_struct {
 	 */
 	u16 in_mce;
 	u8 hmi_event_available;		/* HMI event is available */
+	u8 hmi_p9_special_emu;		/* HMI P9 special emulation */
 #endif
 
 	/* Stuff for accurate time accounting */

diff --git a/arch/powerpc/include/asm/uaccess.h b/arch/powerpc/include/asm/uaccess.h
index 9c0e60ca16..e34f15e727 100644
--- a/arch/powerpc/include/asm/uaccess.h
+++ b/arch/powerpc/include/asm/uaccess.h
@@ -173,6 +173,23 @@ do {								\
 
 extern long __get_user_bad(void);
 
+/*
+ * This does an atomic 128 byte aligned load from userspace.
+ * Up to the caller to do enable_kernel_vmx() before calling!
+ */
+#define __get_user_atomic_128_aligned(kaddr, uaddr, err)	\
+	__asm__ __volatile__(					\
+		"1:	lvx  0,0,%1	# get user\n"		\
+		"	stvx 0,0,%2	# put kernel\n"		\
+		"2:\n"						\
+		".section .fixup,\"ax\"\n"			\
+		"3:	li %0,%3\n"				\
+		"	b 2b\n"					\
+		".previous\n"					\
+		EX_TABLE(1b, 3b)				\
+		: "=r" (err)					\
+		: "b" (uaddr), "b" (kaddr), "i" (-EFAULT), "0" (err))
+
 #define __get_user_asm(x, addr, err, op)		\
 	__asm__ __volatile__(				\
 		"1:	"op" %1,0(%2)	# get_user\n"	\

diff --git a/arch/powerpc/kernel/exceptions-64s.S b/arch/powerpc/kernel/exceptions-64s.S
index 48da0f5d2f..8c32dd4cac 100644
--- a/arch/powerpc/kernel/exceptions-64s.S
+++ b/arch/powerpc/kernel/exceptions-64s.S
@@ -1010,6 +1010,8 @@ TRAMP_REAL_BEGIN(hmi_exception_early)
 	EXCEPTION_PROLOG_COMMON_3(0xe60)
 	addi	r3,r1,STACK_FRAME_OVERHEAD
 	BRANCH_LINK_TO_FAR(hmi_exception_realmode) /* Function call ABI */
+	cmpdi	cr0,r3,0
+
 	/* Windup the stack. */
 	/* Move original HSRR0 and HSRR1 into the respective regs */
 	ld	r9,_MSR(r1)
@@ -1026,10 +1028,15 @@ TRAMP_REAL_BEGIN(hmi_exception_early)
 	REST_8GPRS(2, r1)
 	REST_GPR(10, r1)
 	ld	r11,_CCR(r1)
+	REST_2GPRS(12, r1)
+	bne	1f
 	mtcr	r11
 	REST_GPR(11, r1)
-	REST_2GPRS(12, r1)
-	/* restore original r1. */
+	ld	r1,GPR1(r1)
+	hrfid
+
+1:	mtcr	r11
+	REST_GPR(11, r1)
 	ld	r1,GPR1(r1)
 
 	/*
@@ -1042,8 +1049,9 @@ hmi_exception_after_realmode:
 	EXCEPTION_P
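A hypothetical caller sketch for __get_user_atomic_128_aligned(), following
the comment's requirement that VMX be enabled around the copy.
enable_kernel_vmx() is the name the comment uses; the real call site lives
in the traps.c emulation code, which is not quoted here:

	static int fetch_user_vector128(void *kbuf, const void __user *uaddr)
	{
		int err = 0;

		preempt_disable();
		enable_kernel_vmx();	/* per the comment in uaccess.h above */
		__get_user_atomic_128_aligned(kbuf, uaddr, err);
		preempt_enable();

		return err;		/* 0 on success, -EFAULT on fault */
	}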
[PATCH 2/2] powerpc: Handle MCE on POWER9 with only DSISR bit 33 set
On POWER9 DD2.1 and below, it's possible to get a Machine Check Exception
(MCE) where only DSISR bit 33 is set. This will result in the Linux MCE
handler seeing an unknown event, which causes Linux to crash.

We change this by detecting unknown events in the MCE handler and marking
them as handled so that we no longer crash. We do this only on chip
revisions known to have this problem.

Signed-off-by: Michael Neuling
---
 arch/powerpc/kernel/mce_power.c | 15 +++
 1 file changed, 15 insertions(+)

diff --git a/arch/powerpc/kernel/mce_power.c b/arch/powerpc/kernel/mce_power.c
index b76ca198e0..72ec667136 100644
--- a/arch/powerpc/kernel/mce_power.c
+++ b/arch/powerpc/kernel/mce_power.c
@@ -595,6 +595,7 @@ static long mce_handle_error(struct pt_regs *regs,
 	uint64_t addr;
 	uint64_t srr1 = regs->msr;
 	long handled;
+	unsigned long pvr;
 
 	if (SRR1_MC_LOADSTORE(srr1))
 		handled = mce_handle_derror(regs, dtable, &mce_err, &addr);
@@ -604,6 +605,20 @@ static long mce_handle_error(struct pt_regs *regs,
 	if (!handled && mce_err.error_type == MCE_ERROR_TYPE_UE)
 		handled = mce_handle_ue_error(regs);
 
+	/*
+	 * On POWER9 DD2.1 and below, it's possible to get machine
+	 * check where only DSISR bit 33 is set. This will result in
+	 * the MCE handler seeing an unknown event and us crashing.
+	 * Change this to mark as handled on these revisions.
+	 */
+	pvr = mfspr(SPRN_PVR);
+	if (((PVR_VER(pvr) == PVR_POWER9) &&
+	     (PVR_CFG(pvr) == 2) &&
+	     (PVR_MIN(pvr) <= 1)) || cpu_has_feature(CPU_FTR_POWER9_DD1))
+		/* DD2.1 and below */
+		if (mce_err.error_type == MCE_ERROR_TYPE_UNKNOWN)
+			handled = 1;
+
 	save_mce_event(regs, handled, &mce_err, regs->nip, addr);
 
 	return handled;
-- 
2.11.0
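The revision test in isolation, as a sketch. PVR_VER/PVR_CFG/PVR_MIN are
the kernel's PVR field accessors that the hunk above relies on:

	/* Sketch: true on POWER9 DD2.0/DD2.1 and on DD1 parts. */
	static bool p9_dd2_1_or_below(void)
	{
		unsigned long pvr = mfspr(SPRN_PVR);

		return (PVR_VER(pvr) == PVR_POWER9 &&
			PVR_CFG(pvr) == 2 &&		/* DD2.x */
			PVR_MIN(pvr) <= 1) ||		/* minor rev 0 or 1 */
		       cpu_has_feature(CPU_FTR_POWER9_DD1);
	}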
[PATCH 3/3] powerpc/5200: dts: digsy_mtc.dts: fix rv3029 compatible
The proper compatible for rv3029 is microcrystal,rv3029.

Signed-off-by: Alexandre Belloni
---
 arch/powerpc/boot/dts/digsy_mtc.dts | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/boot/dts/digsy_mtc.dts b/arch/powerpc/boot/dts/digsy_mtc.dts
index c280e75c86bf..c3922fc03e0b 100644
--- a/arch/powerpc/boot/dts/digsy_mtc.dts
+++ b/arch/powerpc/boot/dts/digsy_mtc.dts
@@ -78,7 +78,7 @@
 		};
 
 		rtc@56 {
-			compatible = "mc,rv3029c2";
+			compatible = "microcrystal,rv3029";
 			reg = <0x56>;
 		};
-- 
2.14.1
[PATCH 2/3] ARM: dts: at91: usb_a9g20: fix rtc node
The rv3029 compatible is missing its vendor string, add it. Also fix the
node name to be a proper generic name.

Signed-off-by: Alexandre Belloni
---
 arch/arm/boot/dts/usb_a9g20_common.dtsi | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/arm/boot/dts/usb_a9g20_common.dtsi b/arch/arm/boot/dts/usb_a9g20_common.dtsi
index 088c2c3685ab..81c3fe0465d9 100644
--- a/arch/arm/boot/dts/usb_a9g20_common.dtsi
+++ b/arch/arm/boot/dts/usb_a9g20_common.dtsi
@@ -20,8 +20,8 @@
 	};
 
 	i2c-gpio-0 {
-		rv3029c2@56 {
-			compatible = "rv3029c2";
+		rtc@56 {
+			compatible = "microcrystal,rv3029";
 			reg = <0x56>;
 		};
 	};
-- 
2.14.1
[PATCH 1/3] RTC: rv3029: fix vendor string
The vendor string for Microcrystal is microcrystal.

Signed-off-by: Alexandre Belloni
---
 Documentation/devicetree/bindings/trivial-devices.txt | 2 +-
 drivers/rtc/rtc-rv3029c2.c                            | 2 ++
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/Documentation/devicetree/bindings/trivial-devices.txt b/Documentation/devicetree/bindings/trivial-devices.txt
index af284fbd4d23..aae37352c574 100644
--- a/Documentation/devicetree/bindings/trivial-devices.txt
+++ b/Documentation/devicetree/bindings/trivial-devices.txt
@@ -72,7 +72,6 @@ isil,isl29030	Intersil ISL29030 Ambient Light and Proximity Sensor
 maxim,ds1050		5 Bit Programmable, Pulse-Width Modulator
 maxim,max1237		Low-Power, 4-/12-Channel, 2-Wire Serial, 12-Bit ADCs
 maxim,max6625		9-Bit/12-Bit Temperature Sensors with I²C-Compatible Serial Interface
-mc,rv3029c2		Real Time Clock Module with I2C-Bus
 mcube,mc3230		mCube 3-axis 8-bit digital accelerometer
 memsic,mxc6225		MEMSIC 2-axis 8-bit digital accelerometer
 microchip,mcp4531-502	Microchip 7-bit Single I2C Digital Potentiometer (5k)
@@ -141,6 +140,7 @@ microchip,mcp4662-503	Microchip 8-bit Dual I2C Digital Potentiometer with NV Mem
 microchip,mcp4662-104	Microchip 8-bit Dual I2C Digital Potentiometer with NV Memory (100k)
 microchip,tc654		PWM Fan Speed Controller With Fan Fault Detection
 microchip,tc655		PWM Fan Speed Controller With Fan Fault Detection
+microcrystal,rv3029	Real Time Clock Module with I2C-Bus
 miramems,da226		MiraMEMS DA226 2-axis 14-bit digital accelerometer
 miramems,da280		MiraMEMS DA280 3-axis 14-bit digital accelerometer
 miramems,da311		MiraMEMS DA311 3-axis 12-bit digital accelerometer
diff --git a/drivers/rtc/rtc-rv3029c2.c b/drivers/rtc/rtc-rv3029c2.c
index aa09771de04f..cfe3aece51d1 100644
--- a/drivers/rtc/rtc-rv3029c2.c
+++ b/drivers/rtc/rtc-rv3029c2.c
@@ -876,6 +876,8 @@ static const struct i2c_device_id rv3029_id[] = {
 MODULE_DEVICE_TABLE(i2c, rv3029_id);
 
 static const struct of_device_id rv3029_of_match[] = {
+	{ .compatible = "microcrystal,rv3029" },
+	/* Backward compatibility only, do not use compatibles below: */
 	{ .compatible = "rv3029" },
 	{ .compatible = "rv3029c2" },
 	{ .compatible = "mc,rv3029c2" },
-- 
2.14.1
Re: [PATCH v8 10/11] arm64/kasan: explicitly zero kasan shadow memory
Hi Mark,

Thank you for looking at this. We can't do this because the page table is
not set up until cpu_replace_ttbr1() is called, so we can't do memset() on
this memory until then.

Pasha
Re: [PATCH v2] powerpc/tm: Flush TM only if CPU has TM feature
On Wed, 2017-09-13 at 22:13 -0400, Gustavo Romero wrote:
> Commit cd63f3c ("powerpc/tm: Fix saving of TM SPRs in core dump")
> added code to access TM SPRs in flush_tmregs_to_thread(). However
> flush_tmregs_to_thread() does not check if the TM feature is available
> on the CPU before trying to access TM SPRs in order to copy live state
> to thread structures. flush_tmregs_to_thread() is indeed guarded by
> CONFIG_PPC_TRANSACTIONAL_MEM, but it might be the case that the kernel
> was compiled with CONFIG_PPC_TRANSACTIONAL_MEM enabled and ran on a
> CPU without the TM feature available, resulting in the execution of
> TM instructions that are treated by the CPU as illegal instructions.
>
> The fix is just to add a proper check in flush_tmregs_to_thread()
> that the CPU has the TM feature before accessing any TM-specific
> resource, returning immediately if TM is not available on the CPU.
> Adding that check in flush_tmregs_to_thread() instead of in the places
> where it is called, like vsr_get() and vsr_set(), is better because it
> avoids the same problem cropping up elsewhere.
>
> Cc: sta...@vger.kernel.org # v4.13+
> Fixes: cd63f3c ("powerpc/tm: Fix saving of TM SPRs in core dump")
> Signed-off-by: Gustavo Romero

Keeping in mind that I reviewed cd63f3c, I'm feeling a bit sheepish having
missed this.

Reviewed-by: Cyril Bur

> ---
>  arch/powerpc/kernel/ptrace.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/arch/powerpc/kernel/ptrace.c b/arch/powerpc/kernel/ptrace.c
> index 07cd22e..f52ad5b 100644
> --- a/arch/powerpc/kernel/ptrace.c
> +++ b/arch/powerpc/kernel/ptrace.c
> @@ -131,7 +131,7 @@ static void flush_tmregs_to_thread(struct task_struct *tsk)
>  	 * in the appropriate thread structures from live.
>  	 */
>
> -	if (tsk != current)
> +	if ((!cpu_has_feature(CPU_FTR_TM)) || (tsk != current))
>  		return;
>
>  	if (MSR_TM_SUSPENDED(mfmsr())) {
Re: [PATCH v8 10/11] arm64/kasan: explicitly zero kasan shadow memory
On Thu, Sep 14, 2017 at 06:35:16PM -0400, Pavel Tatashin wrote:
> To optimize the performance of struct page initialization,
> vmemmap_populate() will no longer zero memory.
>
> We must explicitly zero the memory that is allocated by vmemmap_populate()
> for kasan, as this memory does not go through struct page initialization
> path.
>
> Signed-off-by: Pavel Tatashin
> Reviewed-by: Steven Sistare
> Reviewed-by: Daniel Jordan
> Reviewed-by: Bob Picco
> ---
>  arch/arm64/mm/kasan_init.c | 42 ++
>  1 file changed, 42 insertions(+)
>
> diff --git a/arch/arm64/mm/kasan_init.c b/arch/arm64/mm/kasan_init.c
> index 81f03959a4ab..e78a9ecbb687 100644
> --- a/arch/arm64/mm/kasan_init.c
> +++ b/arch/arm64/mm/kasan_init.c
> @@ -135,6 +135,41 @@ static void __init clear_pgds(unsigned long start,
>  		set_pgd(pgd_offset_k(start), __pgd(0));
>  }
>
> +/*
> + * Memory that was allocated by vmemmap_populate is not zeroed, so we must
> + * zero it here explicitly.
> + */
> +static void
> +zero_vmemmap_populated_memory(void)
> +{
> +	struct memblock_region *reg;
> +	u64 start, end;
> +
> +	for_each_memblock(memory, reg) {
> +		start = __phys_to_virt(reg->base);
> +		end = __phys_to_virt(reg->base + reg->size);
> +
> +		if (start >= end)
> +			break;
> +
> +		start = (u64)kasan_mem_to_shadow((void *)start);
> +		end = (u64)kasan_mem_to_shadow((void *)end);
> +
> +		/* Round to the start and end of the mapped pages */
> +		start = round_down(start, SWAPPER_BLOCK_SIZE);
> +		end = round_up(end, SWAPPER_BLOCK_SIZE);
> +		memset((void *)start, 0, end - start);
> +	}
> +
> +	start = (u64)kasan_mem_to_shadow(_text);
> +	end = (u64)kasan_mem_to_shadow(_end);
> +
> +	/* Round to the start and end of the mapped pages */
> +	start = round_down(start, SWAPPER_BLOCK_SIZE);
> +	end = round_up(end, SWAPPER_BLOCK_SIZE);
> +	memset((void *)start, 0, end - start);
> +}

I really don't see the need to duplicate the existing logic to iterate over
memblocks, calculate the addresses, etc.

Why can't we just have a zeroing wrapper? e.g. something like the below.

I really don't see why we couldn't have a generic function in core code to
do this, even if vmemmap_populate() doesn't.

Thanks,
Mark.

---->8----
diff --git a/arch/arm64/mm/kasan_init.c b/arch/arm64/mm/kasan_init.c
index 81f0395..698d065 100644
--- a/arch/arm64/mm/kasan_init.c
+++ b/arch/arm64/mm/kasan_init.c
@@ -135,6 +135,17 @@ static void __init clear_pgds(unsigned long start,
 		set_pgd(pgd_offset_k(start), __pgd(0));
 }
 
+void kasan_populate_shadow(unsigned long shadow_start, unsigned long shadow_end,
+			   int nid)
+{
+	shadow_start = round_down(shadow_start, SWAPPER_BLOCK_SIZE);
+	shadow_end = round_up(shadow_end, SWAPPER_BLOCK_SIZE);
+
+	vmemmap_populate(shadow_start, shadow_end, nid);
+
+	memset((void *)shadow_start, 0, shadow_end - shadow_start);
+}
+
 void __init kasan_init(void)
 {
 	u64 kimg_shadow_start, kimg_shadow_end;
@@ -161,8 +172,8 @@ void __init kasan_init(void)
 
 	clear_pgds(KASAN_SHADOW_START, KASAN_SHADOW_END);
 
-	vmemmap_populate(kimg_shadow_start, kimg_shadow_end,
-			 pfn_to_nid(virt_to_pfn(lm_alias(_text))));
+	kasan_populate_shadow(kimg_shadow_start, kimg_shadow_end,
+			      pfn_to_nid(virt_to_pfn(lm_alias(_text))));
 
 	/*
 	 * vmemmap_populate() has populated the shadow region that covers the
@@ -191,9 +202,9 @@ void __init kasan_init(void)
 		if (start >= end)
 			break;
 
-		vmemmap_populate((unsigned long)kasan_mem_to_shadow(start),
-				 (unsigned long)kasan_mem_to_shadow(end),
-				 pfn_to_nid(virt_to_pfn(start)));
+		kasan_populate_shadow((unsigned long)kasan_mem_to_shadow(start),
+				      (unsigned long)kasan_mem_to_shadow(end),
+				      pfn_to_nid(virt_to_pfn(start)));
 	}
 
 	/*
Re: [PATCH] KVM: PPC: fix oops when checking KVM_CAP_PPC_HTM
On Thu, Sep 14, 2017 at 11:56:25PM +0200, Greg Kurz wrote:
> The following program causes a kernel oops:
>
> #include <sys/types.h>
> #include <sys/stat.h>
> #include <fcntl.h>
> #include <sys/ioctl.h>
> #include <linux/kvm.h>
>
> main()
> {
>     int fd = open("/dev/kvm", O_RDWR);
>     ioctl(fd, KVM_CHECK_EXTENSION, KVM_CAP_PPC_HTM);
> }
>
> This happens because when using the global KVM fd with
> KVM_CHECK_EXTENSION, kvm_vm_ioctl_check_extension() gets
> called with a NULL kvm argument, which gets dereferenced
> in is_kvmppc_hv_enabled(). Spotted while reading the code.
>
> Let's use the hv_enabled fallback variable, like everywhere
> else in this function.
>
> Fixes: 23528bb21ee2 ("KVM: PPC: Introduce KVM_CAP_PPC_HTM")
> Cc: sta...@vger.kernel.org # v4.7+
> Signed-off-by: Greg Kurz

I don't think this is right. I'm pretty sure you want to fall back to
hv_enabled *only when* kvm is NULL. Otherwise if you have a PR guest on an
HV capable machine, this will give the wrong answer, when called for that
specific VM.

> ---
>  arch/powerpc/kvm/powerpc.c |    3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
>
> diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
> index 3480faaf1ef8..ee279c7f4802 100644
> --- a/arch/powerpc/kvm/powerpc.c
> +++ b/arch/powerpc/kvm/powerpc.c
> @@ -644,8 +644,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
>  		break;
>  #endif
>  	case KVM_CAP_PPC_HTM:
> -		r = cpu_has_feature(CPU_FTR_TM_COMP) &&
> -			is_kvmppc_hv_enabled(kvm);
> +		r = cpu_has_feature(CPU_FTR_TM_COMP) && hv_enabled;
>  		break;
>  	default:
>  		r = 0;

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson
[PATCH] KVM: PPC: fix oops when checking KVM_CAP_PPC_HTM
The following program causes a kernel oops:

#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

main()
{
    int fd = open("/dev/kvm", O_RDWR);
    ioctl(fd, KVM_CHECK_EXTENSION, KVM_CAP_PPC_HTM);
}

This happens because when using the global KVM fd with KVM_CHECK_EXTENSION,
kvm_vm_ioctl_check_extension() gets called with a NULL kvm argument, which
gets dereferenced in is_kvmppc_hv_enabled(). Spotted while reading the code.

Let's use the hv_enabled fallback variable, like everywhere else in this
function.

Fixes: 23528bb21ee2 ("KVM: PPC: Introduce KVM_CAP_PPC_HTM")
Cc: sta...@vger.kernel.org # v4.7+
Signed-off-by: Greg Kurz
---
 arch/powerpc/kvm/powerpc.c |    3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
index 3480faaf1ef8..ee279c7f4802 100644
--- a/arch/powerpc/kvm/powerpc.c
+++ b/arch/powerpc/kvm/powerpc.c
@@ -644,8 +644,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 		break;
 #endif
 	case KVM_CAP_PPC_HTM:
-		r = cpu_has_feature(CPU_FTR_TM_COMP) &&
-			is_kvmppc_hv_enabled(kvm);
+		r = cpu_has_feature(CPU_FTR_TM_COMP) && hv_enabled;
 		break;
 	default:
 		r = 0;
[PATCH v8 00/11] complete deferred page initialization
Copy paste error, changing the subject for the header to v8 from v7.

On 09/14/2017 06:35 PM, Pavel Tatashin wrote:

Changelog:
v8 - v7
- Added Acked-by's from Dave Miller for SPARC changes
- Fixed a minor compiling issue on tile architecture reported by kbuild

v7 - v6
- Addressed comments from Michal Hocko
- memblock_discard() patch was removed from this series and integrated
  separately
- Fixed bug reported by kbuild test robot
- new patch: mm: zero reserved and unavailable struct pages
- Removed patch x86/mm: reserve only exiting low pages, as it is not
  needed anymore because of the previous fix
- Re-wrote deferred_init_memmap(), found and fixed an existing bug, where
  the page variable is not reset when zone holes are present.
- Merged several patches together per Michal's request
- Added performance data including raw logs

v6 - v5
- Fixed ARM64 + kasan code, as reported by Ard Biesheuvel
- Tested ARM64 code in qemu and found a few more issues, which I fixed in
  this iteration
- Added page roundup/rounddown to x86 and arm zeroing routines to zero the
  whole allocated range, instead of only the provided address range.
- Addressed SPARC related comment from Sam Ravnborg
- Fixed section mismatch warnings related to memblock_discard().

v5 - v4
- Fixed build issues reported by kbuild on various configurations

v4 - v3
- Rewrote code to zero struct pages in __init_single_page() as suggested
  by Michal Hocko
- Added code to handle issues related to accessing struct page memory
  before they are initialized.

v3 - v2
- Addressed David Miller's comments about one change per patch:
  * Split changes to platforms into 4 patches
  * Made "do not zero vmemmap_buf" a separate patch

v2 - v1
- Per request, added s390 to deferred "struct page" zeroing
- Collected performance data on x86 which proves the importance of keeping
  memset() as a prefetch (see below).

SMP machines can benefit from the DEFERRED_STRUCT_PAGE_INIT config option,
which defers initializing struct pages until all cpus have been started so
it can be done in parallel.

However, this feature is sub-optimal, because the deferred page
initialization code expects that the struct pages have already been zeroed,
and the zeroing is done early in boot with a single thread only. Also, we
access that memory and set flags before struct pages are initialized. All
of this is fixed in this patchset.

In this work we do the following:
- Never read access struct page until it was initialized
- Never set any fields in struct pages before they are initialized
- Zero struct page at the beginning of struct page initialization

==
Performance improvements on x86 machine with 8 nodes:
Intel(R) Xeon(R) CPU E7-8895 v3 @ 2.60GHz and 1T of memory:

                        TIME          SPEED UP
base no deferred:       95.796233s
fix no deferred:        79.978956s    19.77%

base deferred:          77.254713s
fix deferred:           55.050509s    40.34%

==
SPARC M6 3600 MHz with 15T of memory

                        TIME          SPEED UP
base no deferred:       358.335727s
fix no deferred:        302.320936s   18.52%

base deferred:          237.534603s
fix deferred:           182.103003s   30.44%

==
Raw dmesg output with timestamps:
x86 base no deferred:   https://hastebin.com/ofunepurit.scala
x86 base deferred:      https://hastebin.com/ifazegeyas.scala
x86 fix no deferred:    https://hastebin.com/pegocohevo.scala
x86 fix deferred:       https://hastebin.com/ofupevikuk.scala
sparc base no deferred: https://hastebin.com/ibobeteken.go
sparc base deferred:    https://hastebin.com/fariqimiyu.go
sparc fix no deferred:  https://hastebin.com/muhegoheyi.go
sparc fix deferred:     https://hastebin.com/xadinobutu.go

Pavel Tatashin (11):
  x86/mm: setting fields in deferred pages
  sparc64/mm: setting fields in deferred pages
  mm: deferred_init_memmap improvements
  sparc64: simplify vmemmap_populate
  mm: defining memblock_virt_alloc_try_nid_raw
  mm: zero struct pages during initialization
  sparc64: optimized struct page zeroing
  mm: zero reserved and unavailable struct pages
  x86/kasan: explicitly zero kasan shadow memory
  arm64/kasan: explicitly zero kasan shadow memory
  mm: stop zeroing memory during allocation in vmemmap

 arch/arm64/mm/kasan_init.c          |  42 
 arch/sparc/include/asm/pgtable_64.h |  30 ++
 arch/sparc/mm/init_64.c             |  31 +++---
 arch/x86/mm/init_64.c               |   9 +-
 arch/x86/mm/kasan_init_64.c         |  66 
 include/linux/bootmem.h             |  27 +
 include/linux/memblock.h            |  16 +++
 include/linux/mm.h                  |  26 +
 mm/memblock.c                       |  60 +--
 mm/page_alloc.c                     | 207 ++
[PATCH v7 00/11] complete deferred page initialization
Changelog:
v8 - v7
- Added Acked-by's from Dave Miller for SPARC changes
- Fixed a minor compiling issue on tile architecture reported by kbuild

v7 - v6
- Addressed comments from Michal Hocko
- memblock_discard() patch was removed from this series and integrated
  separately
- Fixed bug reported by kbuild test robot
- new patch: mm: zero reserved and unavailable struct pages
- Removed patch x86/mm: reserve only exiting low pages, as it is not
  needed anymore because of the previous fix
- Re-wrote deferred_init_memmap(), found and fixed an existing bug, where
  the page variable is not reset when zone holes are present.
- Merged several patches together per Michal's request
- Added performance data including raw logs

v6 - v5
- Fixed ARM64 + kasan code, as reported by Ard Biesheuvel
- Tested ARM64 code in qemu and found a few more issues, which I fixed in
  this iteration
- Added page roundup/rounddown to x86 and arm zeroing routines to zero the
  whole allocated range, instead of only the provided address range.
- Addressed SPARC related comment from Sam Ravnborg
- Fixed section mismatch warnings related to memblock_discard().

v5 - v4
- Fixed build issues reported by kbuild on various configurations

v4 - v3
- Rewrote code to zero struct pages in __init_single_page() as suggested
  by Michal Hocko
- Added code to handle issues related to accessing struct page memory
  before they are initialized.

v3 - v2
- Addressed David Miller's comments about one change per patch:
  * Split changes to platforms into 4 patches
  * Made "do not zero vmemmap_buf" a separate patch

v2 - v1
- Per request, added s390 to deferred "struct page" zeroing
- Collected performance data on x86 which proves the importance of keeping
  memset() as a prefetch (see below).

SMP machines can benefit from the DEFERRED_STRUCT_PAGE_INIT config option,
which defers initializing struct pages until all cpus have been started so
it can be done in parallel.

However, this feature is sub-optimal, because the deferred page
initialization code expects that the struct pages have already been zeroed,
and the zeroing is done early in boot with a single thread only. Also, we
access that memory and set flags before struct pages are initialized. All
of this is fixed in this patchset.

In this work we do the following:
- Never read access struct page until it was initialized
- Never set any fields in struct pages before they are initialized
- Zero struct page at the beginning of struct page initialization

==
Performance improvements on x86 machine with 8 nodes:
Intel(R) Xeon(R) CPU E7-8895 v3 @ 2.60GHz and 1T of memory:

                        TIME          SPEED UP
base no deferred:       95.796233s
fix no deferred:        79.978956s    19.77%

base deferred:          77.254713s
fix deferred:           55.050509s    40.34%

==
SPARC M6 3600 MHz with 15T of memory

                        TIME          SPEED UP
base no deferred:       358.335727s
fix no deferred:        302.320936s   18.52%

base deferred:          237.534603s
fix deferred:           182.103003s   30.44%

==
Raw dmesg output with timestamps:
x86 base no deferred:   https://hastebin.com/ofunepurit.scala
x86 base deferred:      https://hastebin.com/ifazegeyas.scala
x86 fix no deferred:    https://hastebin.com/pegocohevo.scala
x86 fix deferred:       https://hastebin.com/ofupevikuk.scala
sparc base no deferred: https://hastebin.com/ibobeteken.go
sparc base deferred:    https://hastebin.com/fariqimiyu.go
sparc fix no deferred:  https://hastebin.com/muhegoheyi.go
sparc fix deferred:     https://hastebin.com/xadinobutu.go

Pavel Tatashin (11):
  x86/mm: setting fields in deferred pages
  sparc64/mm: setting fields in deferred pages
  mm: deferred_init_memmap improvements
  sparc64: simplify vmemmap_populate
  mm: defining memblock_virt_alloc_try_nid_raw
  mm: zero struct pages during initialization
  sparc64: optimized struct page zeroing
  mm: zero reserved and unavailable struct pages
  x86/kasan: explicitly zero kasan shadow memory
  arm64/kasan: explicitly zero kasan shadow memory
  mm: stop zeroing memory during allocation in vmemmap

 arch/arm64/mm/kasan_init.c          |  42 
 arch/sparc/include/asm/pgtable_64.h |  30 ++
 arch/sparc/mm/init_64.c             |  31 +++---
 arch/x86/mm/init_64.c               |   9 +-
 arch/x86/mm/kasan_init_64.c         |  66 
 include/linux/bootmem.h             |  27 +
 include/linux/memblock.h            |  16 +++
 include/linux/mm.h                  |  26 +
 mm/memblock.c                       |  60 +--
 mm/page_alloc.c                     | 207 
 mm/sparse-vmemmap.c                 |  15 ++-
 mm/sparse.c                         |   6 +-
 12 files changed, 406 insert
[PATCH v8 02/11] sparc64/mm: setting fields in deferred pages
Without the deferred struct page feature (CONFIG_DEFERRED_STRUCT_PAGE_INIT),
flags and other fields in "struct page"es are never changed prior to first
initializing struct pages by going through __init_single_page().

With the deferred struct page feature enabled, there is a case where we set
some fields prior to initializing:

mem_init() {
	register_page_bootmem_info();
	free_all_bootmem();
	...
}

When register_page_bootmem_info() is called only non-deferred struct pages
are initialized. But, this function goes through some reserved pages which
might be part of the deferred, and thus are not yet initialized.

mem_init
  register_page_bootmem_info
    register_page_bootmem_info_node
      get_page_bootmem
        .. setting fields here ..
        such as: page->freelist = (void *)type;

free_all_bootmem()
  free_low_memory_core_early()
    for_each_reserved_mem_region()
      reserve_bootmem_region()
        init_reserved_page() <- Only if this is deferred reserved page
          __init_single_pfn()
            __init_single_page()
              memset(0) <-- Lose the set fields here

We end up with a similar issue as in the previous patch, where currently we
do not observe a problem as memory is zeroed. But, if flag asserts are
changed we can start hitting issues.

Also, because in this patch series we will stop zeroing struct page memory
during allocation, we must make sure that struct pages are properly
initialized prior to using them.

The deferred-reserved pages are initialized in free_all_bootmem().
Therefore, the fix is to switch the above calls.

Signed-off-by: Pavel Tatashin
Reviewed-by: Steven Sistare
Reviewed-by: Daniel Jordan
Reviewed-by: Bob Picco
Acked-by: David S. Miller
---
 arch/sparc/mm/init_64.c | 8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/arch/sparc/mm/init_64.c b/arch/sparc/mm/init_64.c
index b2ba410b26f4..078f1352736e 100644
--- a/arch/sparc/mm/init_64.c
+++ b/arch/sparc/mm/init_64.c
@@ -2539,9 +2539,15 @@ void __init mem_init(void)
 {
 	high_memory = __va(last_valid_pfn << PAGE_SHIFT);
 
-	register_page_bootmem_info();
 	free_all_bootmem();
 
+	/* Must be done after boot memory is put on freelist, because here we
+	 * might set fields in deferred struct pages that have not yet been
+	 * initialized, and free_all_bootmem() initializes all the reserved
+	 * deferred pages for us.
+	 */
+	register_page_bootmem_info();
+
 	/*
 	 * Set up the zero page, mark it reserved, so that page count
 	 * is not manipulated when freeing the page from user ptes.
-- 
2.14.1
[PATCH v8 07/11] sparc64: optimized struct page zeroing
Add an optimized mm_zero_struct_page(), so struct pages are zeroed without
calling memset(). We do eight to ten regular stores based on the size of
struct page. The compiler optimizes out the conditions of the switch()
statement.

SPARC-M6 with 15T of memory, single thread performance:

                        BASE            FIX             OPTIMIZED_FIX
bootmem_init            28.440467985s   2.305674818s    2.305161615s
free_area_init_nodes    202.845901673s  225.343084508s  172.556506560s
Total                   231.286369658s  227.648759326s  174.861668175s

BASE:          current linux
FIX:           This patch series without "optimized struct page zeroing"
OPTIMIZED_FIX: This patch series including the current patch.

bootmem_init() is where memory for struct pages is zeroed during
allocation. Note, about two seconds in this function is a fixed time: it
does not increase as memory is increased.

Signed-off-by: Pavel Tatashin
Reviewed-by: Steven Sistare
Reviewed-by: Daniel Jordan
Reviewed-by: Bob Picco
Acked-by: David S. Miller
---
 arch/sparc/include/asm/pgtable_64.h | 30 ++
 1 file changed, 30 insertions(+)

diff --git a/arch/sparc/include/asm/pgtable_64.h b/arch/sparc/include/asm/pgtable_64.h
index 4fefe3762083..8ed478abc630 100644
--- a/arch/sparc/include/asm/pgtable_64.h
+++ b/arch/sparc/include/asm/pgtable_64.h
@@ -230,6 +230,36 @@ extern unsigned long _PAGE_ALL_SZ_BITS;
 extern struct page *mem_map_zero;
 #define ZERO_PAGE(vaddr)	(mem_map_zero)
 
+/* This macro must be updated when the size of struct page grows above 80
+ * or reduces below 64.
+ * The idea is that the compiler optimizes out the switch() statement and
+ * only leaves clrx instructions
+ */
+#define	mm_zero_struct_page(pp) do {					\
+	unsigned long *_pp = (void *)(pp);				\
+									\
+	/* Check that struct page is either 64, 72, or 80 bytes */	\
+	BUILD_BUG_ON(sizeof(struct page) & 7);				\
+	BUILD_BUG_ON(sizeof(struct page) < 64);				\
+	BUILD_BUG_ON(sizeof(struct page) > 80);				\
+									\
+	switch (sizeof(struct page)) {					\
+	case 80:							\
+		_pp[9] = 0;	/* fallthrough */			\
+	case 72:							\
+		_pp[8] = 0;	/* fallthrough */			\
+	default:							\
+		_pp[7] = 0;						\
+		_pp[6] = 0;						\
+		_pp[5] = 0;						\
+		_pp[4] = 0;						\
+		_pp[3] = 0;						\
+		_pp[2] = 0;						\
+		_pp[1] = 0;						\
+		_pp[0] = 0;						\
+	}								\
+} while (0)
+
 /* PFNs are real physical page numbers. However, mem_map only begins to record
  * per-page information starting at pfn_base. This is to handle systems where
  * the first physical page in the machine is at some huge physical address,
-- 
2.14.1
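For comparison, an architecture that does not provide its own
mm_zero_struct_page() would fall back to a plain memset(); a minimal
sketch of what such a generic fallback is assumed to look like elsewhere
in this series:

	/* Sketch of the assumed generic fallback: arches without an
	 * optimized override zero the struct page with memset(). */
	#ifndef mm_zero_struct_page
	#define mm_zero_struct_page(pp) \
		((void)memset((pp), 0, sizeof(struct page)))
	#endif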
[PATCH v8 09/11] x86/kasan: explicitly zero kasan shadow memory
To optimize the performance of struct page initialization,
vmemmap_populate() will no longer zero memory.

We must explicitly zero the memory that is allocated by vmemmap_populate()
for kasan, as this memory does not go through the struct page
initialization path.

Signed-off-by: Pavel Tatashin
Reviewed-by: Steven Sistare
Reviewed-by: Daniel Jordan
Reviewed-by: Bob Picco
---
 arch/x86/mm/kasan_init_64.c | 66 +
 1 file changed, 66 insertions(+)

diff --git a/arch/x86/mm/kasan_init_64.c b/arch/x86/mm/kasan_init_64.c
index bc84b73684b7..cc0399032673 100644
--- a/arch/x86/mm/kasan_init_64.c
+++ b/arch/x86/mm/kasan_init_64.c
@@ -84,6 +84,66 @@ static struct notifier_block kasan_die_notifier = {
 };
 #endif
 
+/*
+ * x86 variant of vmemmap_populate() uses either PMD_SIZE pages or base pages
+ * to map allocated memory. This routine determines the page size for the given
+ * address from vmemmap.
+ */
+static u64 get_vmemmap_pgsz(u64 addr)
+{
+	pgd_t *pgd;
+	p4d_t *p4d;
+	pud_t *pud;
+	pmd_t *pmd;
+
+	pgd = pgd_offset_k(addr);
+	BUG_ON(pgd_none(*pgd) || pgd_large(*pgd));
+
+	p4d = p4d_offset(pgd, addr);
+	BUG_ON(p4d_none(*p4d) || p4d_large(*p4d));
+
+	pud = pud_offset(p4d, addr);
+	BUG_ON(pud_none(*pud) || pud_large(*pud));
+
+	pmd = pmd_offset(pud, addr);
+	BUG_ON(pmd_none(*pmd));
+
+	if (pmd_large(*pmd))
+		return PMD_SIZE;
+	return PAGE_SIZE;
+}
+
+/*
+ * Memory that was allocated by vmemmap_populate is not zeroed, so we must
+ * zero it here explicitly.
+ */
+static void
+zero_vmemmap_populated_memory(void)
+{
+	u64 i, start, end;
+
+	for (i = 0; i < E820_MAX_ENTRIES && pfn_mapped[i].end; i++) {
+		void *kaddr_start = pfn_to_kaddr(pfn_mapped[i].start);
+		void *kaddr_end = pfn_to_kaddr(pfn_mapped[i].end);
+
+		start = (u64)kasan_mem_to_shadow(kaddr_start);
+		end = (u64)kasan_mem_to_shadow(kaddr_end);
+
+		/* Round to the start and end of the mapped pages */
+		start = rounddown(start, get_vmemmap_pgsz(start));
+		end = roundup(end, get_vmemmap_pgsz(start));
+		memset((void *)start, 0, end - start);
+	}
+
+	start = (u64)kasan_mem_to_shadow(_stext);
+	end = (u64)kasan_mem_to_shadow(_end);
+
+	/* Round to the start and end of the mapped pages */
+	start = rounddown(start, get_vmemmap_pgsz(start));
+	end = roundup(end, get_vmemmap_pgsz(start));
+	memset((void *)start, 0, end - start);
+}
+
 void __init kasan_early_init(void)
 {
 	int i;
@@ -146,6 +206,12 @@ void __init kasan_init(void)
 	load_cr3(init_top_pgt);
 	__flush_tlb_all();
 
+	/*
+	 * vmemmap_populate does not zero the memory, so we need to zero it
+	 * explicitly
+	 */
+	zero_vmemmap_populated_memory();
+
 	/*
 	 * kasan_zero_page has been used as early shadow memory, thus it may
 	 * contain some garbage. Now we can clear and write protect it, since
-- 
2.14.1
[PATCH v8 05/11] mm: defining memblock_virt_alloc_try_nid_raw
* A new variant of memblock_virt_alloc_* allocations:
memblock_virt_alloc_try_nid_raw()
    - Does not zero the allocated memory
    - Does not panic if request cannot be satisfied

* optimize early system hash allocations

Clients can call alloc_large_system_hash() with flag: HASH_ZERO to specify
that memory that was allocated for the system hash needs to be zeroed;
otherwise the memory does not need to be zeroed, and the client will
initialize it.

If memory does not need to be zero'd, call the new
memblock_virt_alloc_raw() interface, and thus improve the boot performance.

* debug for raw allocator

When CONFIG_DEBUG_VM is enabled, this patch sets all the memory that is
returned by memblock_virt_alloc_try_nid_raw() to ones to ensure that no
places expect zeroed memory.

Signed-off-by: Pavel Tatashin
Reviewed-by: Steven Sistare
Reviewed-by: Daniel Jordan
Reviewed-by: Bob Picco
Acked-by: Michal Hocko
---
 include/linux/bootmem.h | 27 ++
 mm/memblock.c           | 60 +++++
 mm/page_alloc.c         | 15 ++---
 3 files changed, 87 insertions(+), 15 deletions(-)

diff --git a/include/linux/bootmem.h b/include/linux/bootmem.h
index e223d91b6439..ea30b3987282 100644
--- a/include/linux/bootmem.h
+++ b/include/linux/bootmem.h
@@ -160,6 +160,9 @@ extern void *__alloc_bootmem_low_node(pg_data_t *pgdat,
 #define BOOTMEM_ALLOC_ANYWHERE		(~(phys_addr_t)0)
 
 /* FIXME: Move to memblock.h at a point where we remove nobootmem.c */
+void *memblock_virt_alloc_try_nid_raw(phys_addr_t size, phys_addr_t align,
+				      phys_addr_t min_addr,
+				      phys_addr_t max_addr, int nid);
 void *memblock_virt_alloc_try_nid_nopanic(phys_addr_t size,
 		phys_addr_t align, phys_addr_t min_addr,
 		phys_addr_t max_addr, int nid);
@@ -176,6 +179,14 @@ static inline void * __init memblock_virt_alloc(
 					    NUMA_NO_NODE);
 }
 
+static inline void * __init memblock_virt_alloc_raw(
+					phys_addr_t size, phys_addr_t align)
+{
+	return memblock_virt_alloc_try_nid_raw(size, align, BOOTMEM_LOW_LIMIT,
+					    BOOTMEM_ALLOC_ACCESSIBLE,
+					    NUMA_NO_NODE);
+}
+
 static inline void * __init memblock_virt_alloc_nopanic(
 					phys_addr_t size, phys_addr_t align)
 {
@@ -257,6 +268,14 @@ static inline void * __init memblock_virt_alloc(
 	return __alloc_bootmem(size, align, BOOTMEM_LOW_LIMIT);
 }
 
+static inline void * __init memblock_virt_alloc_raw(
+					phys_addr_t size, phys_addr_t align)
+{
+	if (!align)
+		align = SMP_CACHE_BYTES;
+	return __alloc_bootmem_nopanic(size, align, BOOTMEM_LOW_LIMIT);
+}
+
 static inline void * __init memblock_virt_alloc_nopanic(
 					phys_addr_t size, phys_addr_t align)
 {
@@ -309,6 +328,14 @@ static inline void * __init memblock_virt_alloc_try_nid(phys_addr_t size,
 					  min_addr);
 }
 
+static inline void * __init memblock_virt_alloc_try_nid_raw(
+			phys_addr_t size, phys_addr_t align,
+			phys_addr_t min_addr, phys_addr_t max_addr, int nid)
+{
+	return ___alloc_bootmem_node_nopanic(NODE_DATA(nid), size, align,
+				min_addr, max_addr);
+}
+
 static inline void * __init memblock_virt_alloc_try_nid_nopanic(
 			phys_addr_t size, phys_addr_t align,
 			phys_addr_t min_addr, phys_addr_t max_addr, int nid)
diff --git a/mm/memblock.c b/mm/memblock.c
index 91205780e6b1..1f299fb1eb08 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -1327,7 +1327,6 @@ static void * __init memblock_virt_alloc_internal(
 	return NULL;
 done:
 	ptr = phys_to_virt(alloc);
-	memset(ptr, 0, size);
 
 	/*
 	 * The min_count is set to 0 so that bootmem allocated blocks
@@ -1340,6 +1339,45 @@ static void * __init memblock_virt_alloc_internal(
 	return ptr;
 }
 
+/**
+ * memblock_virt_alloc_try_nid_raw - allocate boot memory block without zeroing
+ * memory and without panicking
+ * @size: size of memory block to be allocated in bytes
+ * @align: alignment of the region and block's size
+ * @min_addr: the lower bound of the memory region from where the allocation
+ *	  is preferred (phys address)
+ * @max_addr: the upper bound of the memory region from where the allocation
+ *	      is preferred (phys address), or %BOOTMEM_ALLOC_ACCESSIBLE to
+ *	      allocate only from memory limited by memblock.current_limit value
+ * @nid: nid of the free area to find, %NUMA_NO_NODE for any node
+ *
+ * Public function, provides additional debug information (including caller
+ * info), if enabled. Does not zero allocated mem
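A hedged usage sketch of the new raw interface: a hypothetical client that
writes every field itself can skip the redundant zeroing. The struct entry
type, init_entry() helper and nentries count are made up for illustration;
memblock_virt_alloc_raw() is the wrapper added by this patch:

	struct entry *table;
	unsigned long i;

	/* raw variant: no memset of the block, no panic on failure */
	table = memblock_virt_alloc_raw(nentries * sizeof(*table),
					SMP_CACHE_BYTES);
	if (!table)
		return -ENOMEM;

	for (i = 0; i < nentries; i++)
		init_entry(&table[i]);	/* initializes every field itself */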
[PATCH v8 11/11] mm: stop zeroing memory during allocation in vmemmap
vmemmap_alloc_block() will no longer zero the block, so zero memory at its
call sites for everything except struct pages. Struct page memory is zero'd
by struct page initialization.

Replace allocators in sparse-vmemmap to use the non-zeroing version. So, we
will get the performance improvement by zeroing the memory in parallel when
struct pages are zeroed.

Signed-off-by: Pavel Tatashin
Reviewed-by: Steven Sistare
Reviewed-by: Daniel Jordan
Reviewed-by: Bob Picco
---
 include/linux/mm.h  | 11 +++
 mm/sparse-vmemmap.c | 15 +++
 mm/sparse.c         |  6 +++---
 3 files changed, 21 insertions(+), 11 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index a7bba4ce79ba..25848764570f 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2501,6 +2501,17 @@ static inline void *vmemmap_alloc_block_buf(unsigned long size, int node)
 	return __vmemmap_alloc_block_buf(size, node, NULL);
 }
 
+static inline void *vmemmap_alloc_block_zero(unsigned long size, int node)
+{
+	void *p = vmemmap_alloc_block(size, node);
+
+	if (!p)
+		return NULL;
+	memset(p, 0, size);
+
+	return p;
+}
+
 void vmemmap_verify(pte_t *, int, unsigned long, unsigned long);
 int vmemmap_populate_basepages(unsigned long start, unsigned long end,
 			       int node);
diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
index d1a39b8051e0..c2f5654e7c9d 100644
--- a/mm/sparse-vmemmap.c
+++ b/mm/sparse-vmemmap.c
@@ -41,7 +41,7 @@ static void * __ref __earlyonly_bootmem_alloc(int node,
 				unsigned long align,
 				unsigned long goal)
 {
-	return memblock_virt_alloc_try_nid(size, align, goal,
+	return memblock_virt_alloc_try_nid_raw(size, align, goal,
 					    BOOTMEM_ALLOC_ACCESSIBLE, node);
 }
 
@@ -54,9 +54,8 @@ void * __meminit vmemmap_alloc_block(unsigned long size, int node)
 	if (slab_is_available()) {
 		struct page *page;
 
-		page = alloc_pages_node(node,
-			GFP_KERNEL | __GFP_ZERO | __GFP_RETRY_MAYFAIL,
-			get_order(size));
+		page = alloc_pages_node(node, GFP_KERNEL | __GFP_RETRY_MAYFAIL,
+					get_order(size));
 		if (page)
 			return page_address(page);
 		return NULL;
@@ -183,7 +182,7 @@ pmd_t * __meminit vmemmap_pmd_populate(pud_t *pud, unsigned long addr, int node)
 {
 	pmd_t *pmd = pmd_offset(pud, addr);
 	if (pmd_none(*pmd)) {
-		void *p = vmemmap_alloc_block(PAGE_SIZE, node);
+		void *p = vmemmap_alloc_block_zero(PAGE_SIZE, node);
 		if (!p)
 			return NULL;
 		pmd_populate_kernel(&init_mm, pmd, p);
@@ -195,7 +194,7 @@ pud_t * __meminit vmemmap_pud_populate(p4d_t *p4d, unsigned long addr, int node)
 {
 	pud_t *pud = pud_offset(p4d, addr);
 	if (pud_none(*pud)) {
-		void *p = vmemmap_alloc_block(PAGE_SIZE, node);
+		void *p = vmemmap_alloc_block_zero(PAGE_SIZE, node);
 		if (!p)
 			return NULL;
 		pud_populate(&init_mm, pud, p);
@@ -207,7 +206,7 @@ p4d_t * __meminit vmemmap_p4d_populate(pgd_t *pgd, unsigned long addr, int node)
 {
 	p4d_t *p4d = p4d_offset(pgd, addr);
 	if (p4d_none(*p4d)) {
-		void *p = vmemmap_alloc_block(PAGE_SIZE, node);
+		void *p = vmemmap_alloc_block_zero(PAGE_SIZE, node);
 		if (!p)
 			return NULL;
 		p4d_populate(&init_mm, p4d, p);
@@ -219,7 +218,7 @@ pgd_t * __meminit vmemmap_pgd_populate(unsigned long addr, int node)
 {
 	pgd_t *pgd = pgd_offset_k(addr);
 	if (pgd_none(*pgd)) {
-		void *p = vmemmap_alloc_block(PAGE_SIZE, node);
+		void *p = vmemmap_alloc_block_zero(PAGE_SIZE, node);
 		if (!p)
 			return NULL;
 		pgd_populate(&init_mm, pgd, p);
diff --git a/mm/sparse.c b/mm/sparse.c
index 83b3bf6461af..d22f51bb7c79 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -437,9 +437,9 @@ void __init sparse_mem_maps_populate_node(struct page **map_map,
 	}
 
 	size = PAGE_ALIGN(size);
-	map = memblock_virt_alloc_try_nid(size * map_count,
-					  PAGE_SIZE, __pa(MAX_DMA_ADDRESS),
-					  BOOTMEM_ALLOC_ACCESSIBLE, nodeid);
+	map = memblock_virt_alloc_try_nid_raw(size * map_count,
+					      PAGE_SIZE, __pa(MAX_DMA_ADDRESS),
+					      BOOTMEM_ALLOC_ACCESSIBLE, nodeid);
 	if (map) {
 		for (pnum = pnum_begin; pnum < pnum_end; pnum++) {
 			if (!present_section_nr(pnum))
-- 
2.14.1
[PATCH v8 03/11] mm: deferred_init_memmap improvements
This patch fixes two issues in deferred_init_memmap

=====
In deferred_init_memmap(), where all deferred struct pages are initialized,
we have a check like this:

	if (page->flags) {
		VM_BUG_ON(page_zone(page) != zone);
		goto free_range;
	}

This way we are checking if the current deferred page has already been
initialized. It works, because memory for struct pages has been zeroed, and
the only way flags are not zero is if they went through __init_single_page()
before. But, once we change the current behavior and won't zero the memory
in the memblock allocator, we cannot trust anything inside "struct page"es
until they are initialized. This patch fixes this.

The deferred_init_memmap() is re-written to loop through only free memory
ranges provided by memblock.

=====
This patch fixes another existing issue on systems that have holes in
zones, i.e. when CONFIG_HOLES_IN_ZONE is defined.

In for_each_mem_pfn_range() we have code like this:

	if (!pfn_valid_within(pfn))
		goto free_range;

Note: 'page' is not set to NULL and is not incremented but 'pfn' advances.
This means that if deferred struct pages are enabled on systems with these
kinds of holes, linux would get memory corruptions. I have fixed this issue
by defining a new macro that performs all the necessary operations when we
free the current set of pages.

Signed-off-by: Pavel Tatashin
Reviewed-by: Steven Sistare
Reviewed-by: Daniel Jordan
Reviewed-by: Bob Picco
---
 mm/page_alloc.c | 161 +++-
 1 file changed, 78 insertions(+), 83 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index c841af88836a..d132c801d2c1 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1410,14 +1410,17 @@ void clear_zone_contiguous(struct zone *zone)
 }
 
 #ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
-static void __init deferred_free_range(struct page *page,
-					unsigned long pfn, int nr_pages)
+static void __init deferred_free_range(unsigned long pfn,
+				       unsigned long nr_pages)
 {
-	int i;
+	struct page *page;
+	unsigned long i;
 
-	if (!page)
+	if (!nr_pages)
 		return;
 
+	page = pfn_to_page(pfn);
+
 	/* Free a large naturally-aligned chunk if possible */
 	if (nr_pages == pageblock_nr_pages &&
 	    (pfn & (pageblock_nr_pages - 1)) == 0) {
@@ -1443,19 +1446,82 @@ static inline void __init pgdat_init_report_one_done(void)
 	complete(&pgdat_init_all_done_comp);
 }
 
+#define DEFERRED_FREE(nr_free, free_base_pfn, page)			\
+({									\
+	unsigned long nr = (nr_free);					\
+									\
+	deferred_free_range((free_base_pfn), (nr));			\
+	(free_base_pfn) = 0;						\
+	(nr_free) = 0;							\
+	page = NULL;							\
+	nr;								\
+})
+
+static unsigned long deferred_init_range(int nid, int zid, unsigned long pfn,
+					 unsigned long end_pfn)
+{
+	struct mminit_pfnnid_cache nid_init_state = { };
+	unsigned long nr_pgmask = pageblock_nr_pages - 1;
+	unsigned long free_base_pfn = 0;
+	unsigned long nr_pages = 0;
+	unsigned long nr_free = 0;
+	struct page *page = NULL;
+
+	for (; pfn < end_pfn; pfn++) {
+		/*
+		 * First we check if pfn is valid on architectures where it is
+		 * possible to have holes within pageblock_nr_pages. On systems
+		 * where it is not possible, this function is optimized out.
+		 *
+		 * Then, we check if a current large page is valid by only
+		 * checking the validity of the head pfn.
+		 *
+		 * meminit_pfn_in_nid is checked on systems where pfns can
+		 * interleave within a node: a pfn is between start and end
+		 * of a node, but does not belong to this memory node.
+		 *
+		 * Finally, we minimize pfn page lookups and scheduler checks by
+		 * performing it only once every pageblock_nr_pages.
+		 */
+		if (!pfn_valid_within(pfn)) {
+			nr_pages += DEFERRED_FREE(nr_free, free_base_pfn, page);
+		} else if (!(pfn & nr_pgmask) && !pfn_valid(pfn)) {
+			nr_pages += DEFERRED_FREE(nr_free, free_base_pfn, page);
+		} else if (!meminit_pfn_in_nid(pfn, nid, &nid_init_state)) {
+			nr_pages += DEFERRED_FREE(nr_free, free_base_pfn, page);
+		} else if (page && (pfn & nr_pgmask))
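To illustrate the second bug, a sketch of the pre-fix loop shape (not the
old code verbatim): on a pfn hole the loop jumped ahead without resetting
'page', so a later iteration could advance a stale pointer:

	for (; pfn < end_pfn; pfn++) {
		if (!pfn_valid_within(pfn))
			goto free_range;	/* bug: 'page' is neither reset
						 * to NULL nor advanced, but
						 * 'pfn' keeps moving */
		...
		if (page && (pfn & (pageblock_nr_pages - 1)))
			page++;			/* stale 'page' + hole => this
						 * now points at the wrong
						 * struct page */
		else
			page = pfn_to_page(pfn);
		...
free_range:
		...
	}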
[PATCH v8 10/11] arm64/kasan: explicitly zero kasan shadow memory
To optimize the performance of struct page initialization,
vmemmap_populate() will no longer zero memory.

We must explicitly zero the memory that is allocated by vmemmap_populate()
for kasan, as this memory does not go through the struct page
initialization path.

Signed-off-by: Pavel Tatashin
Reviewed-by: Steven Sistare
Reviewed-by: Daniel Jordan
Reviewed-by: Bob Picco
---
 arch/arm64/mm/kasan_init.c | 42 ++
 1 file changed, 42 insertions(+)

diff --git a/arch/arm64/mm/kasan_init.c b/arch/arm64/mm/kasan_init.c
index 81f03959a4ab..e78a9ecbb687 100644
--- a/arch/arm64/mm/kasan_init.c
+++ b/arch/arm64/mm/kasan_init.c
@@ -135,6 +135,41 @@ static void __init clear_pgds(unsigned long start,
 		set_pgd(pgd_offset_k(start), __pgd(0));
 }
 
+/*
+ * Memory that was allocated by vmemmap_populate is not zeroed, so we must
+ * zero it here explicitly.
+ */
+static void
+zero_vmemmap_populated_memory(void)
+{
+	struct memblock_region *reg;
+	u64 start, end;
+
+	for_each_memblock(memory, reg) {
+		start = __phys_to_virt(reg->base);
+		end = __phys_to_virt(reg->base + reg->size);
+
+		if (start >= end)
+			break;
+
+		start = (u64)kasan_mem_to_shadow((void *)start);
+		end = (u64)kasan_mem_to_shadow((void *)end);
+
+		/* Round to the start and end of the mapped pages */
+		start = round_down(start, SWAPPER_BLOCK_SIZE);
+		end = round_up(end, SWAPPER_BLOCK_SIZE);
+		memset((void *)start, 0, end - start);
+	}
+
+	start = (u64)kasan_mem_to_shadow(_text);
+	end = (u64)kasan_mem_to_shadow(_end);
+
+	/* Round to the start and end of the mapped pages */
+	start = round_down(start, SWAPPER_BLOCK_SIZE);
+	end = round_up(end, SWAPPER_BLOCK_SIZE);
+	memset((void *)start, 0, end - start);
+}
+
 void __init kasan_init(void)
 {
 	u64 kimg_shadow_start, kimg_shadow_end;
@@ -205,8 +240,15 @@ void __init kasan_init(void)
 			pfn_pte(sym_to_pfn(kasan_zero_page), PAGE_KERNEL_RO));
 
 	memset(kasan_zero_page, 0, PAGE_SIZE);
+
 	cpu_replace_ttbr1(lm_alias(swapper_pg_dir));
 
+	/*
+	 * vmemmap_populate does not zero the memory, so we need to zero it
+	 * explicitly
+	 */
+	zero_vmemmap_populated_memory();
+
 	/* At this point kasan is fully initialized. Enable error messages */
 	init_task.kasan_depth = 0;
 	pr_info("KernelAddressSanitizer initialized\n");
-- 
2.14.1
[PATCH v8 04/11] sparc64: simplify vmemmap_populate
Remove duplicating code by using common functions vmemmap_pud_populate and
vmemmap_pgd_populate.

Signed-off-by: Pavel Tatashin
Reviewed-by: Steven Sistare
Reviewed-by: Daniel Jordan
Reviewed-by: Bob Picco
Acked-by: David S. Miller
---
 arch/sparc/mm/init_64.c | 23 ++-
 1 file changed, 6 insertions(+), 17 deletions(-)

diff --git a/arch/sparc/mm/init_64.c b/arch/sparc/mm/init_64.c
index 078f1352736e..fc47afa518f5 100644
--- a/arch/sparc/mm/init_64.c
+++ b/arch/sparc/mm/init_64.c
@@ -2642,30 +2642,19 @@ int __meminit vmemmap_populate(unsigned long vstart, unsigned long vend,
 	vstart = vstart & PMD_MASK;
 	vend = ALIGN(vend, PMD_SIZE);
 	for (; vstart < vend; vstart += PMD_SIZE) {
-		pgd_t *pgd = pgd_offset_k(vstart);
+		pgd_t *pgd = vmemmap_pgd_populate(vstart, node);
 		unsigned long pte;
 		pud_t *pud;
 		pmd_t *pmd;
 
-		if (pgd_none(*pgd)) {
-			pud_t *new = vmemmap_alloc_block(PAGE_SIZE, node);
+		if (!pgd)
+			return -ENOMEM;
 
-			if (!new)
-				return -ENOMEM;
-			pgd_populate(&init_mm, pgd, new);
-		}
-
-		pud = pud_offset(pgd, vstart);
-		if (pud_none(*pud)) {
-			pmd_t *new = vmemmap_alloc_block(PAGE_SIZE, node);
-
-			if (!new)
-				return -ENOMEM;
-			pud_populate(&init_mm, pud, new);
-		}
+		pud = vmemmap_pud_populate(pgd, vstart, node);
+		if (!pud)
+			return -ENOMEM;
 
 		pmd = pmd_offset(pud, vstart);
-
 		pte = pmd_val(*pmd);
 		if (!(pte & _PAGE_VALID)) {
 			void *block = vmemmap_alloc_block(PMD_SIZE, node);
-- 
2.11.0
[PATCH v8 08/11] mm: zero reserved and unavailable struct pages
Some memory is reserved but unavailable: not present in memblock.memory (because not backed by physical pages), but present in memblock.reserved. Such memory has backing struct pages, but they are not initialized by going through __init_single_page(). In some cases these struct pages are accessed even if they do not contain any data. One example: page_to_pfn() might access page->flags if this is where section information is stored (CONFIG_SPARSEMEM, SECTION_IN_PAGE_FLAGS). Since struct pages are zeroed in __init_single_page(), and not at allocation time, we must zero such struct pages explicitly. The patch adds a new memblock iterator: for_each_resv_unavail_range(i, p_start, p_end) which iterates through the reserved && !memory ranges, and we zero those struct pages explicitly by calling mm_zero_struct_page(). Signed-off-by: Pavel Tatashin Reviewed-by: Steven Sistare Reviewed-by: Daniel Jordan Reviewed-by: Bob Picco --- include/linux/memblock.h | 16 include/linux/mm.h | 6 ++ mm/page_alloc.c | 30 ++ 3 files changed, 52 insertions(+) diff --git a/include/linux/memblock.h b/include/linux/memblock.h index bae11c7e7bf3..bdd4268f9323 100644 --- a/include/linux/memblock.h +++ b/include/linux/memblock.h @@ -237,6 +237,22 @@ unsigned long memblock_next_valid_pfn(unsigned long pfn, unsigned long max_pfn); for_each_mem_range_rev(i, &memblock.memory, &memblock.reserved, \ nid, flags, p_start, p_end, p_nid) +/** + * for_each_resv_unavail_range - iterate through reserved and unavailable memory + * @i: u64 used as loop variable + * @p_start: ptr to phys_addr_t for start address of the range, can be %NULL + * @p_end: ptr to phys_addr_t for end address of the range, can be %NULL + * + * Walks over unavailable but reserved (reserved && !memory) areas of memblock. + * Available as soon as memblock is initialized. + * Note: because this memory does not belong to any physical node, flags and + * nid arguments do not make sense and thus are not exposed as arguments. + */ +#define for_each_resv_unavail_range(i, p_start, p_end) \ + for_each_mem_range(i, &memblock.reserved, &memblock.memory, \ + NUMA_NO_NODE, MEMBLOCK_NONE, p_start, p_end, NULL) + static inline void memblock_set_region_flags(struct memblock_region *r, unsigned long flags) { diff --git a/include/linux/mm.h b/include/linux/mm.h index 50b74d628243..a7bba4ce79ba 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -2010,6 +2010,12 @@ extern int __meminit __early_pfn_to_nid(unsigned long pfn, struct mminit_pfnnid_cache *state); #endif +#ifdef CONFIG_HAVE_MEMBLOCK +void zero_resv_unavail(void); +#else +static inline void zero_resv_unavail(void) {} +#endif + extern void set_dma_reserve(unsigned long new_dma_reserve); extern void memmap_init_zone(unsigned long, int, unsigned long, unsigned long, enum memmap_context); diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 4b630ee91430..1d38d391dffd 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -6202,6 +6202,34 @@ void __paginginit free_area_init_node(int nid, unsigned long *zones_size, free_area_init_core(pgdat); } +#ifdef CONFIG_HAVE_MEMBLOCK +/* + * Only struct pages that are backed by physical memory are zeroed and + * initialized by going through __init_single_page(). But, there are some + * struct pages which are reserved in memblock allocator and their fields + * may be accessed (for example page_to_pfn() on some configuration accesses + * flags). We must explicitly zero those struct pages. 
+ */ +void __paginginit zero_resv_unavail(void) +{ + phys_addr_t start, end; + unsigned long pfn; + u64 i, pgcnt; + + /* Loop through ranges that are reserved, but do not have reported +* physical memory backing. +*/ + pgcnt = 0; + for_each_resv_unavail_range(i, &start, &end) { + for (pfn = PFN_DOWN(start); pfn < PFN_UP(end); pfn++) { + mm_zero_struct_page(pfn_to_page(pfn)); + pgcnt++; + } + } + pr_info("Reserved but unavailable: %lld pages", pgcnt); +} +#endif /* CONFIG_HAVE_MEMBLOCK */ + #ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP #if MAX_NUMNODES > 1 @@ -6625,6 +6653,7 @@ void __init free_area_init_nodes(unsigned long *max_zone_pfn) node_set_state(nid, N_MEMORY); check_for_memory(pgdat, nid); } + zero_resv_unavail(); } static int __init cmdline_parse_core(char *p, unsigned long *core) @@ -6788,6 +6817,7 @@ void __init free_area_init(unsigned long *zones_size) { free_area_init_node(0, zones_size, __pa(PAGE_OFFSET) >> PAGE_SHIFT, NULL); + zero_resv_unavail(); }
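The reserved && !memory walk above leans on for_each_mem_range(), which takes two memblock type lists and yields the pieces of the first list that are not covered by the second. A minimal standalone sketch of that set-difference idea over sorted, non-overlapping ranges (hypothetical types and callback, not kernel code):

struct range { unsigned long long start, end; };	/* [start, end) */

/* Emit each piece of an 'a' range that no 'b' range covers; both arrays
 * are sorted by start and internally non-overlapping. */
static void walk_a_minus_b(const struct range *a, int na,
			   const struct range *b, int nb,
			   void (*emit)(unsigned long long, unsigned long long))
{
	for (int i = 0; i < na; i++) {
		unsigned long long cur = a[i].start;

		for (int j = 0; j < nb && cur < a[i].end; j++) {
			if (b[j].end <= cur)
				continue;	/* b range entirely behind us */
			if (b[j].start >= a[i].end)
				break;		/* b range past this a range */
			if (b[j].start > cur)
				emit(cur, b[j].start);	/* uncovered gap */
			if (b[j].end > cur)
				cur = b[j].end;
		}
		if (cur < a[i].end)
			emit(cur, a[i].end);
	}
}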
[PATCH v8 06/11] mm: zero struct pages during initialization
Add struct page zeroing as a part of initialization of other fields in __init_single_page(). These single-thread performance numbers were collected on: Intel(R) Xeon(R) CPU E7-8895 v3 @ 2.60GHz with 1T of memory (268400646 pages in 8 nodes): BASE FIX sparse_init 11.244671836s 0.007199623s zone_sizes_init 4.879775891s 8.355182299s -- Total 16.124447727s 8.362381922s sparse_init is where memory for struct pages is zeroed, and the zeroing part is moved later in this patch into __init_single_page(), which is called from zone_sizes_init(). Signed-off-by: Pavel Tatashin Reviewed-by: Steven Sistare Reviewed-by: Daniel Jordan Reviewed-by: Bob Picco Acked-by: Michal Hocko --- include/linux/mm.h | 9 + mm/page_alloc.c| 1 + 2 files changed, 10 insertions(+) diff --git a/include/linux/mm.h b/include/linux/mm.h index f8c10d336e42..50b74d628243 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -94,6 +94,15 @@ extern int mmap_rnd_compat_bits __read_mostly; #define mm_forbids_zeropage(X) (0) #endif +/* + * On some architectures it is expensive to call memset() for small sizes. + * Those architectures should provide their own implementation of "struct page" + * zeroing by defining this macro in <asm/pgtable.h>. + */ +#ifndef mm_zero_struct_page +#define mm_zero_struct_page(pp) ((void)memset((pp), 0, sizeof(struct page))) +#endif + /* * Default maximum number of active map areas, this limits the number of vmas * per mm struct. Users can overwrite this number by sysctl but there is a diff --git a/mm/page_alloc.c b/mm/page_alloc.c index a8dbd405ed94..4b630ee91430 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -1170,6 +1170,7 @@ static void free_one_page(struct zone *zone, static void __meminit __init_single_page(struct page *page, unsigned long pfn, unsigned long zone, int nid) { + mm_zero_struct_page(page); set_page_links(page, zone, nid, pfn); init_page_count(page); page_mapcount_reset(page); -- 2.14.1
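As a purely hypothetical illustration of the override hook above (this is not the sparc implementation, only the shape an arch-specific definition in <asm/pgtable.h> could take, assuming a 64-byte struct page):

/* Hypothetical arch override: zero struct page as eight 64-bit stores,
 * which can beat a small out-of-line memset() call on some CPUs. */
#define mm_zero_struct_page(pp) __mm_zero_struct_page(pp)

static inline void __mm_zero_struct_page(struct page *page)
{
	unsigned long *p = (unsigned long *)page;

	BUILD_BUG_ON(sizeof(struct page) != 64);
	p[0] = 0; p[1] = 0; p[2] = 0; p[3] = 0;
	p[4] = 0; p[5] = 0; p[6] = 0; p[7] = 0;
}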
[PATCH v8 01/11] x86/mm: setting fields in deferred pages
Without the deferred struct page feature (CONFIG_DEFERRED_STRUCT_PAGE_INIT), flags and other fields in struct pages are never changed prior to first initializing struct pages by going through __init_single_page(). With the deferred struct page feature enabled, however, we set fields in register_page_bootmem_info that are subsequently clobbered right after in free_all_bootmem: mem_init() { register_page_bootmem_info(); free_all_bootmem(); ... } When register_page_bootmem_info() is called, only non-deferred struct pages have been initialized. But this function goes through some reserved pages which might be part of the deferred range, and thus are not yet initialized. mem_init register_page_bootmem_info register_page_bootmem_info_node get_page_bootmem .. setting fields here .. such as: page->freelist = (void *)type; free_all_bootmem() free_low_memory_core_early() for_each_reserved_mem_region() reserve_bootmem_region() init_reserved_page() <- Only if this is deferred reserved page __init_single_pfn() __init_single_page() memset(0) <-- Lose the set fields here We end up with an issue where, currently, we do not observe a problem because memory is explicitly zeroed. But if flag asserts are changed, we can start hitting issues. Also, because in this patch series we will stop zeroing struct page memory during allocation, we must make sure that struct pages are properly initialized prior to using them. The deferred-reserved pages are initialized in free_all_bootmem(). Therefore, the fix is to switch the above calls. Signed-off-by: Pavel Tatashin Reviewed-by: Steven Sistare Reviewed-by: Daniel Jordan Reviewed-by: Bob Picco --- arch/x86/mm/init_64.c | 9 +++-- 1 file changed, 7 insertions(+), 2 deletions(-) diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c index 048fbe8fc274..42b4b7a585c2 100644 --- a/arch/x86/mm/init_64.c +++ b/arch/x86/mm/init_64.c @@ -1173,12 +1173,17 @@ void __init mem_init(void) /* clear_bss() already clear the empty_zero_page */ - register_page_bootmem_info(); - /* this will put all memory onto the freelists */ free_all_bootmem(); after_bootmem = 1; + /* Must be done after boot memory is put on freelist, because here we +* might set fields in deferred struct pages that have not yet been +* initialized, and free_all_bootmem() initializes all the reserved +* deferred pages for us. +*/ + register_page_bootmem_info(); + /* Register memory areas for /proc/kcore */ kclist_add(&kcore_vsyscall, (void *)VSYSCALL_ADDR, PAGE_SIZE, KCORE_OTHER); -- 2.14.1
Re: [v4 07/11] soc/fsl/qbman: Rework portal mapping calls for ARM/PPC
On 9/14/2017 10:00 AM, Catalin Marinas wrote: > On Thu, Aug 24, 2017 at 04:37:51PM -0400, Roy Pledge wrote: >> diff --git a/drivers/soc/fsl/qbman/bman.c b/drivers/soc/fsl/qbman/bman.c >> index ff8998f..e31c843 100644 >> --- a/drivers/soc/fsl/qbman/bman.c >> +++ b/drivers/soc/fsl/qbman/bman.c >> @@ -154,7 +154,7 @@ struct bm_mc { >> }; >> >> struct bm_addr { >> -void __iomem *ce; /* cache-enabled */ >> +void *ce; /* cache-enabled */ >> void __iomem *ci; /* cache-inhibited */ >> }; > > You dropped __iomem from ce, which is fine since it is now set via > memremap. However, I haven't seen (at least not in this patch), a change > to bm_ce_in() which still uses __raw_readl(). > > (it may be worth checking this code with sparse, it may warn about this) Thanks, you're correct; I missed this. I will fix this (and the qman version) and run sparse. > >> diff --git a/drivers/soc/fsl/qbman/bman_portal.c >> b/drivers/soc/fsl/qbman/bman_portal.c >> index 39b39c8..bb03503 100644 >> --- a/drivers/soc/fsl/qbman/bman_portal.c >> +++ b/drivers/soc/fsl/qbman/bman_portal.c >> @@ -91,7 +91,6 @@ static int bman_portal_probe(struct platform_device *pdev) >> struct device_node *node = dev->of_node; >> struct bm_portal_config *pcfg; >> struct resource *addr_phys[2]; >> -void __iomem *va; >> int irq, cpu; >> >> pcfg = devm_kmalloc(dev, sizeof(*pcfg), GFP_KERNEL); >> @@ -123,23 +122,34 @@ static int bman_portal_probe(struct platform_device >> *pdev) >> } >> pcfg->irq = irq; >> >> -va = ioremap_prot(addr_phys[0]->start, resource_size(addr_phys[0]), 0); >> -if (!va) { >> -dev_err(dev, "ioremap::CE failed\n"); >> +/* >> + * TODO: Ultimately we would like to use a cacheable/non-shareable >> + * (coherent) mapping for the portal on both architectures but that >> + * isn't currently available in the kernel. Because of HW differences >> + * PPC needs to be mapped cacheable while ARM SoCs will work with non >> + * cacheable mappings >> + */ > > This comment mentions "cacheable/non-shareable (coherent)". Was this > meant for ARM platforms? Because non-shareable is not coherent, nor is > this combination guaranteed to work with different CPUs and > interconnects. My wording is poor; I should have been clearer that non-shareable == non-coherent. I will fix this. We do understand that cacheable/non-shareable isn't supported on all CPU/interconnect combinations, but we have verified with ARM that for the CPU/interconnects we have integrated QBMan on, our use is OK. The note is here to try to explain why the mapping is different right now. Once we get the basic QBMan support integrated for ARM, we do plan to try to have patches integrated that enable the cacheable mapping, as it gives a significant performance boost. This is step 2; we understand the topic is complex and a little controversial, so I think treating it as an independent change will be easier than mixing it with the less interesting changes in this patchset. > >> +#ifdef CONFIG_PPC >> +/* PPC requires a cacheable/non-coherent mapping of the portal */ >> +pcfg->addr_virt_ce = memremap(addr_phys[0]->start, >> +resource_size(addr_phys[0]), MEMREMAP_WB); >> +#else >> +/* ARM can use a write combine mapping. */ >> +pcfg->addr_virt_ce = memremap(addr_phys[0]->start, >> +resource_size(addr_phys[0]), MEMREMAP_WC); >> +#endif > > Nitpick: you could define something like QBMAN_MAP_ATTR to be different > between PPC and the rest and just keep a single memremap() call. I will change this - it will be a little more compact. 
> > One may complain that "ce" is no longer "cache enabled" but I'm > personally fine to keep the same name for historical reasons. Cache Enabled is also how the 'data sheet' for the processor describes the region, and I think it is useful to keep it aligned so that anyone looking at the manual and the code can easily correlate the terminology. > >> diff --git a/drivers/soc/fsl/qbman/dpaa_sys.h >> b/drivers/soc/fsl/qbman/dpaa_sys.h >> index 81a9a5e..0a1d573 100644 >> --- a/drivers/soc/fsl/qbman/dpaa_sys.h >> +++ b/drivers/soc/fsl/qbman/dpaa_sys.h >> @@ -51,12 +51,12 @@ >> >> static inline void dpaa_flush(void *p) >> { >> +/* >> + * Only PPC needs to flush the cache currently - on ARM the mapping >> + * is non cacheable >> + */ >> #ifdef CONFIG_PPC >> flush_dcache_range((unsigned long)p, (unsigned long)p+64); >> -#elif defined(CONFIG_ARM) >> -__cpuc_flush_dcache_area(p, 64); >> -#elif defined(CONFIG_ARM64) >> -__flush_dcache_area(p, 64); >> #endif >> } > > Dropping the private API cache maintenance is fine and the memory is WC > now for ARM (mapping to Normal NonCacheable). However, do you require > any barriers here? Normal NC doesn't guarantee any ordering. The barrier is done in the code where the command is f
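For reference, a minimal sketch of the more compact form being agreed on here (QBMAN_MAP_ATTR is the reviewer's suggested name, not an identifier from the posted patch):

/* Pick the portal mapping attribute once: PPC wants a cacheable (WB)
 * mapping, everything else uses write-combine (WC). */
#ifdef CONFIG_PPC
#define QBMAN_MAP_ATTR	MEMREMAP_WB
#else
#define QBMAN_MAP_ATTR	MEMREMAP_WC
#endif

	pcfg->addr_virt_ce = memremap(addr_phys[0]->start,
				      resource_size(addr_phys[0]),
				      QBMAN_MAP_ATTR);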
Re: [v4 05/11] soc/fsl/qbman: Drop L1_CACHE_BYTES compile time check
On 9/14/2017 9:49 AM, Catalin Marinas wrote: > On Thu, Aug 24, 2017 at 04:37:49PM -0400, Roy Pledge wrote: >> From: Claudiu Manoil >> >> Not relevant and arch dependent. Overkill for PPC. >> >> Signed-off-by: Claudiu Manoil >> Signed-off-by: Roy Pledge >> --- >> drivers/soc/fsl/qbman/dpaa_sys.h | 4 >> 1 file changed, 4 deletions(-) >> >> diff --git a/drivers/soc/fsl/qbman/dpaa_sys.h >> b/drivers/soc/fsl/qbman/dpaa_sys.h >> index 2ce394a..f85c319 100644 >> --- a/drivers/soc/fsl/qbman/dpaa_sys.h >> +++ b/drivers/soc/fsl/qbman/dpaa_sys.h >> @@ -49,10 +49,6 @@ >> #define DPAA_PORTAL_CE 0 >> #define DPAA_PORTAL_CI 1 >> >> -#if (L1_CACHE_BYTES != 32) && (L1_CACHE_BYTES != 64) >> -#error "Unsupported Cacheline Size" >> -#endif > > Maybe this check was for a reason on PPC as it uses WB memory mappings > for some of the qbman descriptors (which IIUC fit within a cacheline). > You could add a check for CONFIG_PPC if you think there is any chance of > this constant going higher. > No, the reason PPC needs WB (technically any cacheable mapping) is that the QBMan block on those parts will raise an error IRQ if it sees any transaction smaller than the cacheline size. We know that this cannot happen on PPC parts with QBMan when there is a cacheable mapping, because we also developed the interconnect for everything that has a QBMan block. We dropped the check for L1_CACHE_BYTES due to the value being set to 128 on ARM64 even on parts that have smaller caches. I don't think there is much to worry about here, as cacheline size isn't something SW controls in any case. If we produce a part with QBMan that has a larger cache granularity we will need to address that in other parts of the code as well. The check was in the code as a sanity check for PPC, but since the value isn't (in my opinion) meaningful on ARM, we can remove it to avoid problems.
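A minimal sketch of the compromise Catalin floats above, keeping the build-time check only where the cacheable portal mapping actually depends on it (assuming the old 32/64-byte constants still hold for all PPC parts with QBMan):

/* Only PPC maps portal descriptors cacheable, so only there does the
 * cacheline-size assumption matter at build time. */
#if defined(CONFIG_PPC) && (L1_CACHE_BYTES != 32) && (L1_CACHE_BYTES != 64)
#error "Unsupported Cacheline Size"
#endif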
Re: [PATCH 4/7] powerpc: Free up four 64K PTE bits in 64K backed HPTE pages
On Thu, Sep 14, 2017 at 10:54:08AM -0700, Ram Pai wrote: > On Thu, Sep 14, 2017 at 11:44:49AM +1000, Balbir Singh wrote: > > On Fri, 8 Sep 2017 15:44:44 -0700 > > Ram Pai wrote: > > > > > Rearrange 64K PTE bits to free up bits 3, 4, 5 and 6 > > > in the 64K backed HPTE pages. This along with the earlier > > > patch will entirely free up the four bits from 64K PTE. > > > The bit numbers are big-endian as defined in the ISA3.0 > > > > > > This patch does the following change to 64K PTE backed > > > by 64K HPTE. > > > > > > H_PAGE_F_SECOND (S) which occupied bit 4 moves to the > > > second part of the pte to bit 60. > > > H_PAGE_F_GIX (G,I,X) which occupied bit 5, 6 and 7 also > > > moves to the second part of the pte to bit 61, > > > 62, 63, 64 respectively > > > > > > Since bit 7 is now freed up, we move H_PAGE_BUSY (B) from > > > bit 9 to bit 7. > > > > > > The second part of the PTE will hold > > > (H_PAGE_F_SECOND|H_PAGE_F_GIX) at bit 60,61,62,63. > > > NOTE: None of the bits in the secondary PTE were used > > > by 64k-HPTE backed PTE. > > > > > > Before the patch, the 64K HPTE backed 64k PTE format was > > > as follows > > > > > > 0 1 2 3 4 5 6 7 8 9 10...63 > > > : : : : : : : : : : :: > > > v v v v v v v v v v vv > > > > > > ,-,-,-,-,--,--,--,--,-,-,-,-,-,--,-,-,-, > > > |x|x|x| |S |G |I |X |x|B| |x|x||x|x|x|x| <- primary pte > > > '_'_'_'_'__'__'__'__'_'_'_'_'_''_'_'_'_' > > > | | | | | | | | | | | | |..| | | | | <- secondary pte > > > '_'_'_'_'__'__'__'__'_'_'_'_'__'_'_'_'_' > > > > > > After the patch, the 64k HPTE backed 64k PTE format is > > > as follows > > > > > > 0 1 2 3 4 5 6 7 8 9 10...63 > > > : : : : : : : : : : :: > > > v v v v v v v v v v vv > > > > > > ,-,-,-,-,--,--,--,--,-,-,-,-,-,--,-,-,-, > > > |x|x|x| | | | |B |x| | |x|x||.|.|.|.| <- primary pte > > > '_'_'_'_'__'__'__'__'_'_'_'_'_''_'_'_'_' > > > | | | | | | | | | | | | |..|S|G|I|X| <- secondary pte > > > '_'_'_'_'__'__'__'__'_'_'_'_'__'_'_'_'_' > > > > > > The above PTE changes are applicable to hugetlb pages as well. > > > > > > The patch does the following code changes: > > > > > > a) moves the H_PAGE_F_SECOND and H_PAGE_F_GIX to the 4k PTE > > > header since they are no longer needed by the 64k PTEs. > > > b) abstracts out __real_pte() and __rpte_to_hidx() so the > > > caller need not know the bit location of the slot. > > > c) moves the slot bits to the secondary pte. 
> > > > > > Reviewed-by: Aneesh Kumar K.V > > > Signed-off-by: Ram Pai > > > --- > > > arch/powerpc/include/asm/book3s/64/hash-4k.h |3 ++ > > > arch/powerpc/include/asm/book3s/64/hash-64k.h | 29 > > > +++- > > > arch/powerpc/include/asm/book3s/64/hash.h |3 -- > > > arch/powerpc/mm/hash64_64k.c | 23 --- > > > arch/powerpc/mm/hugetlbpage-hash64.c | 18 ++- > > > 5 files changed, 33 insertions(+), 43 deletions(-) > > > > > > diff --git a/arch/powerpc/include/asm/book3s/64/hash-4k.h > > > b/arch/powerpc/include/asm/book3s/64/hash-4k.h > > > index e66bfeb..dc153c6 100644 > > > --- a/arch/powerpc/include/asm/book3s/64/hash-4k.h > > > +++ b/arch/powerpc/include/asm/book3s/64/hash-4k.h > > > @@ -16,6 +16,9 @@ > > > #define H_PUD_TABLE_SIZE (sizeof(pud_t) << H_PUD_INDEX_SIZE) > > > #define H_PGD_TABLE_SIZE (sizeof(pgd_t) << H_PGD_INDEX_SIZE) > > > > > > +#define H_PAGE_F_GIX_SHIFT 56 > > > +#define H_PAGE_F_SECOND _RPAGE_RSV2 /* HPTE is in 2ndary HPTEG */ > > > +#define H_PAGE_F_GIX (_RPAGE_RSV3 | _RPAGE_RSV4 | _RPAGE_RPN44) > > > #define H_PAGE_BUSY _RPAGE_RSV1 /* software: PTE & hash are > > > busy */ > > > > > > /* PTE flags to conserve for HPTE identification */ > > > diff --git a/arch/powerpc/include/asm/book3s/64/hash-64k.h > > > b/arch/powerpc/include/asm/book3s/64/hash-64k.h > > > index e038f1c..89ef5a9 100644 > > > --- a/arch/powerpc/include/asm/book3s/64/hash-64k.h > > > +++ b/arch/powerpc/include/asm/book3s/64/hash-64k.h > > > @@ -12,7 +12,7 @@ > > > */ > > > #define H_PAGE_COMBO _RPAGE_RPN0 /* this is a combo 4k page */ > > > #define H_PAGE_4K_PFN_RPAGE_RPN1 /* PFN is for a single 4k page */ > > > -#define H_PAGE_BUSY _RPAGE_RPN42 /* software: PTE & hash are > > > busy */ > > > +#define H_PAGE_BUSY _RPAGE_RPN44 /* software: PTE & hash are > > > busy */ > > > > > > /* > > > * We need to differentiate between explicit huge page and THP huge > > > @@ -21,8 +21,7 @@ > > > #define H_PAGE_THP_HUGE H_PAGE_4K_PFN > > > > > > /* PTE flags to conserve for HPTE identification */ > >
Re: [PATCH 4/7] powerpc: Free up four 64K PTE bits in 64K backed HPTE pages
On Thu, Sep 14, 2017 at 11:44:49AM +1000, Balbir Singh wrote: > On Fri, 8 Sep 2017 15:44:44 -0700 > Ram Pai wrote: > > > Rearrange 64K PTE bits to free up bits 3, 4, 5 and 6 > > in the 64K backed HPTE pages. This along with the earlier > > patch will entirely free up the four bits from 64K PTE. > > The bit numbers are big-endian as defined in the ISA3.0 > > > > This patch does the following change to 64K PTE backed > > by 64K HPTE. > > > > H_PAGE_F_SECOND (S) which occupied bit 4 moves to the > > second part of the pte to bit 60. > > H_PAGE_F_GIX (G,I,X) which occupied bit 5, 6 and 7 also > > moves to the second part of the pte to bit 61, > > 62, 63, 64 respectively > > > > Since bit 7 is now freed up, we move H_PAGE_BUSY (B) from > > bit 9 to bit 7. > > > > The second part of the PTE will hold > > (H_PAGE_F_SECOND|H_PAGE_F_GIX) at bit 60,61,62,63. > > NOTE: None of the bits in the secondary PTE were used > > by 64k-HPTE backed PTE. > > > > Before the patch, the 64K HPTE backed 64k PTE format was > > as follows > > > > 0 1 2 3 4 5 6 7 8 9 10...63 > > : : : : : : : : : : :: > > v v v v v v v v v v vv > > > > ,-,-,-,-,--,--,--,--,-,-,-,-,-,--,-,-,-, > > |x|x|x| |S |G |I |X |x|B| |x|x||x|x|x|x| <- primary pte > > '_'_'_'_'__'__'__'__'_'_'_'_'_''_'_'_'_' > > | | | | | | | | | | | | |..| | | | | <- secondary pte > > '_'_'_'_'__'__'__'__'_'_'_'_'__'_'_'_'_' > > > > After the patch, the 64k HPTE backed 64k PTE format is > > as follows > > > > 0 1 2 3 4 5 6 7 8 9 10...63 > > : : : : : : : : : : :: > > v v v v v v v v v v vv > > > > ,-,-,-,-,--,--,--,--,-,-,-,-,-,--,-,-,-, > > |x|x|x| | | | |B |x| | |x|x||.|.|.|.| <- primary pte > > '_'_'_'_'__'__'__'__'_'_'_'_'_''_'_'_'_' > > | | | | | | | | | | | | |..|S|G|I|X| <- secondary pte > > '_'_'_'_'__'__'__'__'_'_'_'_'__'_'_'_'_' > > > > The above PTE changes are applicable to hugetlb pages as well. > > > > The patch does the following code changes: > > > > a) moves the H_PAGE_F_SECOND and H_PAGE_F_GIX to the 4k PTE > > header since they are no longer needed by the 64k PTEs. > > b) abstracts out __real_pte() and __rpte_to_hidx() so the > > caller need not know the bit location of the slot. > > c) moves the slot bits to the secondary pte. 
> > > > Reviewed-by: Aneesh Kumar K.V > > Signed-off-by: Ram Pai > > --- > > arch/powerpc/include/asm/book3s/64/hash-4k.h |3 ++ > > arch/powerpc/include/asm/book3s/64/hash-64k.h | 29 > > +++- > > arch/powerpc/include/asm/book3s/64/hash.h |3 -- > > arch/powerpc/mm/hash64_64k.c | 23 --- > > arch/powerpc/mm/hugetlbpage-hash64.c | 18 ++- > > 5 files changed, 33 insertions(+), 43 deletions(-) > > > > diff --git a/arch/powerpc/include/asm/book3s/64/hash-4k.h > > b/arch/powerpc/include/asm/book3s/64/hash-4k.h > > index e66bfeb..dc153c6 100644 > > --- a/arch/powerpc/include/asm/book3s/64/hash-4k.h > > +++ b/arch/powerpc/include/asm/book3s/64/hash-4k.h > > @@ -16,6 +16,9 @@ > > #define H_PUD_TABLE_SIZE (sizeof(pud_t) << H_PUD_INDEX_SIZE) > > #define H_PGD_TABLE_SIZE (sizeof(pgd_t) << H_PGD_INDEX_SIZE) > > > > +#define H_PAGE_F_GIX_SHIFT 56 > > +#define H_PAGE_F_SECOND_RPAGE_RSV2 /* HPTE is in 2ndary HPTEG */ > > +#define H_PAGE_F_GIX (_RPAGE_RSV3 | _RPAGE_RSV4 | _RPAGE_RPN44) > > #define H_PAGE_BUSY_RPAGE_RSV1 /* software: PTE & hash are > > busy */ > > > > /* PTE flags to conserve for HPTE identification */ > > diff --git a/arch/powerpc/include/asm/book3s/64/hash-64k.h > > b/arch/powerpc/include/asm/book3s/64/hash-64k.h > > index e038f1c..89ef5a9 100644 > > --- a/arch/powerpc/include/asm/book3s/64/hash-64k.h > > +++ b/arch/powerpc/include/asm/book3s/64/hash-64k.h > > @@ -12,7 +12,7 @@ > > */ > > #define H_PAGE_COMBO _RPAGE_RPN0 /* this is a combo 4k page */ > > #define H_PAGE_4K_PFN _RPAGE_RPN1 /* PFN is for a single 4k page */ > > -#define H_PAGE_BUSY_RPAGE_RPN42 /* software: PTE & hash are > > busy */ > > +#define H_PAGE_BUSY_RPAGE_RPN44 /* software: PTE & hash are > > busy */ > > > > /* > > * We need to differentiate between explicit huge page and THP huge > > @@ -21,8 +21,7 @@ > > #define H_PAGE_THP_HUGE H_PAGE_4K_PFN > > > > /* PTE flags to conserve for HPTE identification */ > > -#define _PAGE_HPTEFLAGS (H_PAGE_BUSY | H_PAGE_F_SECOND | \ > > -H_PAGE_F_GIX | H_PAGE_HASHPTE | H_PAGE_COMBO) > > +#define _PAGE_HPTEFLAGS (H_PAGE_BUSY | H_PAGE_HASHPTE | H_PAGE_COMBO) > > /* > > * we support 16 fragments per PT
Re: [PATCH v3] Make initramfs honor CONFIG_DEVTMPFS_MOUNT
On Wed, Sep 13, 2017 at 06:51:25PM -0500, Rob Landley wrote: > From: Rob Landley > > Make initramfs honor CONFIG_DEVTMPFS_MOUNT, and move > /dev/console open after devtmpfs mount. > > Add workaround for Debian bug that was copied by Ubuntu. > > Signed-off-by: Rob Landley > --- > > v2 discussion: http://lkml.iu.edu/hypermail/linux/kernel/1705.2/05611.html > > drivers/base/Kconfig | 14 -- > fs/namespace.c | 14 ++ > init/main.c | 15 +-- > 3 files changed, 27 insertions(+), 16 deletions(-) Always run scripts/checkpatch.pl so you don't get grumpy emails from reviewers telling you to run scripts/checkpatch.pl... telling you to run scripts/checkpatch.pl... telling you to run scripts/checkpatch.pl... telling you to run scripts/checkpatch.pl...
Re: [PATCH 5/7] powerpc: Swizzle around 4K PTE bits to free up bit 5 and bit 6
On Thu, Sep 14, 2017 at 11:48:34AM +1000, Balbir Singh wrote: > On Fri, 8 Sep 2017 15:44:45 -0700 > Ram Pai wrote: > > > We need PTE bits 3, 4, 5, 6 and 57 to support protection-keys, > > because these are the bits we want to consolidate on across all > > configurations to support protection keys. > > > > Bits 3, 4, 5 and 6 are currently used on 4K-pte kernels. But bits 9 > > and 10 are available. Hence we use the two available bits and > > free up bits 5 and 6. We will still not be able to free up bits 3 > > and 4. In the absence of any other free bits, we will have to > > stay satisfied with what we have :-(. This means we will not > > be able to support 32 protection keys, but only 8. The bit > > numbers are big-endian as defined in the ISA3.0 > > > > Any chance that for 4k PTEs we can do slot searching for the PTE? > I guess that adds additional complexity Aneesh, I think, is working on moving the slot information out of the PTE. If that happens, we will have room to support more keys. RP
Re: [PATCH 7/7] powerpc: capture the PTE format changes in the dump pte report
On Thu, Sep 14, 2017 at 01:22:27PM +1000, Balbir Singh wrote: > On Fri, 8 Sep 2017 15:44:47 -0700 > Ram Pai wrote: > > > The H_PAGE_F_SECOND,H_PAGE_F_GIX are not in the 64K main-PTE. > > Capture these changes in the dump pte report. > > > > Reviewed-by: Aneesh Kumar K.V > > Signed-off-by: Ram Pai > > --- > > So we lose slot and secondary information for 64K PTE's with > this change? Yes. It was anyway not there for 4k-backed-64k ptes. Now it won't be there for any 64k ptes. RP
Re: Machine Check in P2010(e500v2)
On Sat, 2017-09-09 at 14:59 +0200, Joakim Tjernlund wrote: > On Sat, 2017-09-09 at 14:45 +0200, Joakim Tjernlund wrote: > > On Fri, 2017-09-08 at 22:27 +, Leo Li wrote: > > > > -Original Message- > > > > From: Joakim Tjernlund [mailto:joakim.tjernl...@infinera.com] > > > > Sent: Friday, September 08, 2017 7:51 AM > > > > To: linuxppc-dev@lists.ozlabs.org; Leo Li ; York Sun > > > > > > > > Subject: Re: Machine Check in P2010(e500v2) > > > > > > > > On Fri, 2017-09-08 at 11:54 +0200, Joakim Tjernlund wrote: > > > > > On Thu, 2017-09-07 at 18:54 +, Leo Li wrote: > > > > > > > -Original Message- > > > > > > > From: Joakim Tjernlund [mailto:joakim.tjernl...@infinera.com] > > > > > > > Sent: Thursday, September 07, 2017 3:41 AM > > > > > > > To: linuxppc-dev@lists.ozlabs.org; Leo Li ; > > > > > > > York Sun > > > > > > > Subject: Re: Machine Check in P2010(e500v2) > > > > > > > > > > > > > > On Thu, 2017-09-07 at 00:50 +0200, Joakim Tjernlund wrote: > > > > > > > > On Wed, 2017-09-06 at 21:13 +, Leo Li wrote: > > > > > > > > > > -Original Message- > > > > > > > > > > From: Joakim Tjernlund > > > > > > > > > > [mailto:joakim.tjernl...@infinera.com] > > > > > > > > > > Sent: Wednesday, September 06, 2017 3:54 PM > > > > > > > > > > To: linuxppc-dev@lists.ozlabs.org; Leo Li > > > > > > > > > > ; York Sun > > > > > > > > > > Subject: Re: Machine Check in P2010(e500v2) > > > > > > > > > > > > > > > > > > > > On Wed, 2017-09-06 at 20:28 +, Leo Li wrote: > > > > > > > > > > > > -Original Message- > > > > > > > > > > > > From: Joakim Tjernlund > > > > > > > > > > > > [mailto:joakim.tjernl...@infinera.com] > > > > > > > > > > > > Sent: Wednesday, September 06, 2017 3:17 PM > > > > > > > > > > > > To: linuxppc-dev@lists.ozlabs.org; Leo Li > > > > > > > > > > > > ; York Sun > > > > > > > > > > > > Subject: Re: Machine Check in P2010(e500v2) > > > > > > > > > > > > > > > > > > > > > > > > On Wed, 2017-09-06 at 19:31 +, Leo Li wrote: > > > > > > > > > > > > > > -Original Message- > > > > > > > > > > > > > > From: York Sun > > > > > > > > > > > > > > Sent: Wednesday, September 06, 2017 10:38 AM > > > > > > > > > > > > > > To: Joakim Tjernlund > > > > > > > > > > > > > > ; > > > > > > > > > > > > > > linuxppc- d...@lists.ozlabs.org; Leo Li > > > > > > > > > > > > > > > > > > > > > > > > > > > > Subject: Re: Machine Check in P2010(e500v2) > > > > > > > > > > > > > > > > > > > > > > > > > > > > Scott is no longer with Freescale/NXP. Adding Leo. > > > > > > > > > > > > > > > > > > > > > > > > > > > > On 09/05/2017 01:40 AM, Joakim Tjernlund wrote: > > > > > > > > > > > > > > > So after some debugging I found this bug: > > > > > > > > > > > > > > > @@ -996,7 +998,7 @@ int > > > > > > > > > > > > > > > fsl_pci_mcheck_exception(struct pt_regs > > > > > > > > > > > > > > > > > > > > *regs) > > > > > > > > > > > > > > > if (is_in_pci_mem_space(addr)) { > > > > > > > > > > > > > > > if (user_mode(regs)) { > > > > > > > > > > > > > > > pagefault_disable(); > > > > > > > > > > > > > > > - ret = get_user(regs->nip, > > > > > > > > > > > > > > > &inst); > > > > > > > > > > > > > > > + ret = get_user(inst, > > > > > > > > > > > > > > > + (__u32 __user *)regs->nip); > > > > > > > > > > > > > > > pagefault_enable(); > > > > > > > > > > > > > > > } else { > > > > > > > > > > > > > > > ret = > > > > > > > > > > > > > > > probe_kernel_address(regs->nip, inst); > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > However, the kernel still locked up after fixing > > > > > > > > > > > > > > > that. 
> > > > > > > > > > > > > > > Now I wonder why this fixup is there in the first > > > > > > > > > > > > > > > place? > > > > > > > > > > > > > > > The routine will not really fixup the insn, just > > > > > > > > > > > > > > > return 0xffffffff for the failing read and then > > > > > > > > > > > > > > > advance the > > > > > > > > process NIP. > > > > > > > > > > > > > > > > > > > > > > > > > > You are right. The code here only gives 0xffffffff to > > > > > > > > > > > > > the load instructions and > > > > > > > > > > > > > > > > > > > > > > > > continues with the next instruction when the load > > > > > > > > > > > > instruction is causing the machine check. This will > > > > > > > > > > > > prevent a system lockup when reading from a PCI/RapidIO > > > > > > > > > > > > device > > > > > > > > whose link is down. > > > > > > > > > > > > > > > > > > > > > > > > > > I don't know what the actual problem is in your case. > > > > > > > > > > > > > Maybe it is a write > > > > > > > > > > > > > > > > > > > > > > > > instruction instead of a read? Or the code is in an > > > > > > > > > > > > infinite loop > > > > waiting for > > > > > > > > > > > > > > a > > > > > > > > > > > > > > > > > > > > valid > > > > >
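The bug being discussed is a get_user() argument swap: get_user(x, ptr) copies *ptr into x, so the original code loaded a user word into regs->nip instead of fetching the faulting instruction. A minimal sketch of the corrected fetch, matching the hunk quoted above:

u32 inst;
int ret;

pagefault_disable();
/* get_user(dest, src_ptr) reads *src_ptr into dest: fetch the
 * instruction at the faulting NIP from user space */
ret = get_user(inst, (__u32 __user *)regs->nip);
pagefault_enable();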
Re: [PATCH 01/25] powerpc: initial pkey plumbing
On Thu, Sep 14, 2017 at 01:32:05PM +1000, Balbir Singh wrote: > On Fri, 8 Sep 2017 15:44:49 -0700 > Ram Pai wrote: > > > Basic plumbing to initialize the pkey system. > > Nothing is enabled yet. A later patch will enable it > > once all the infrastructure is in place. > > > > Signed-off-by: Ram Pai > > --- > > arch/powerpc/Kconfig | 16 +++ > > arch/powerpc/include/asm/mmu_context.h |5 +++ > > arch/powerpc/include/asm/pkeys.h | 45 > > > > arch/powerpc/kernel/setup_64.c |4 +++ > > arch/powerpc/mm/Makefile |1 + > > arch/powerpc/mm/hash_utils_64.c|1 + > > arch/powerpc/mm/pkeys.c| 33 +++ > > 7 files changed, 105 insertions(+), 0 deletions(-) > > create mode 100644 arch/powerpc/include/asm/pkeys.h > > create mode 100644 arch/powerpc/mm/pkeys.c > > > > diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig > > index 9fc3c0b..a4cd210 100644 > > --- a/arch/powerpc/Kconfig > > +++ b/arch/powerpc/Kconfig > > @@ -864,6 +864,22 @@ config SECCOMP > > > > If unsure, say Y. Only embedded should say N here. > > > > +config PPC64_MEMORY_PROTECTION_KEYS > > + prompt "PowerPC Memory Protection Keys" > > + def_bool y > > + # Note: only available in 64-bit mode > > + depends on PPC64 > > This is not sufficient, right? You need PPC_BOOK3S_64 > at least for compile time? OK, I had not thought too deeply about this. Thanks for the input. > > > + select ARCH_USES_HIGH_VMA_FLAGS > > + . > > +void __init pkey_initialize(void) > > +{ > > + /* disable the pkey system till everything > > +* is in place. A patch further down the > > +* line will enable it. > > +*/ > > Comment style is broken > checkpatch.pl does not complain. So is it really a broken comment style, or does checkpatch.pl need to be fixed? RP
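A sketch of the Kconfig dependency fix Balbir is asking for (assuming the implementation is hash-MMU/Book3S-64 only, as the review implies; this is not the posted hunk):

config PPC64_MEMORY_PROTECTION_KEYS
	prompt "PowerPC Memory Protection Keys"
	def_bool y
	# pkeys are implemented for the Book3S 64 MMU only
	depends on PPC_BOOK3S_64
	select ARCH_USES_HIGH_VMA_FLAGS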
Re: [PATCH 02/25] powerpc: define an additional vma bit for protection keys.
On Thu, Sep 14, 2017 at 02:38:07PM +1000, Balbir Singh wrote: > On Fri, 8 Sep 2017 15:44:50 -0700 > Ram Pai wrote: > > > powerpc needs an additional vma bit to support 32 keys. > > Until the additional vma bit lands in include/linux/mm.h, > > we have to define it in a powerpc-specific header file. > > This is needed to get pkeys working on power. > > > > Signed-off-by: Ram Pai > > --- > > "This" being an arch specific hack for the additional bit? Yes, an arch-specific hack. I am trying to get the arch-specific changes merged in parallel, along with these patches. I don't know which one will merge first. Regardless of which patch set lands first, I have organized the code such that nothing breaks. RP
Re: [PATCH] net/ethernet/freescale: fix warning for ucc_geth
Hi Christophe, On Thu, 2017-09-14 at 15:24 +0200, Christophe LEROY wrote: > Hi, > > On 14/09/2017 at 14:05, Valentin Longchamp wrote: > > Simple printk format warning for the ucc registers address. > > Did you test your patch with mpc83xx_defconfig? No, I only tested on an 85xx where I had another (similar, because the physical addresses are u64 and not u32) warning. My quick fix indeed did not take into account the different typedefs for phys_addr_t. I will try to come up with a v2 that covers this. Thanks for the feedback. Valentin > > I get a new warning with your patch: > >    CC drivers/net/ethernet/freescale/ucc_geth.o > In file included from ./include/linux/printk.h:6:0, > from ./include/linux/kernel.h:13, > from drivers/net/ethernet/freescale/ucc_geth.c:18: > drivers/net/ethernet/freescale/ucc_geth.c: In function > ‘ucc_geth_probe’: > ./include/linux/kern_levels.h:4:18: warning: format ‘%llx’ expects > argument of type ‘long long unsigned int’, but argument 3 has type > ‘resource_size_t {aka unsigned int}’ [-Wformat=] > #define KERN_SOH "\001" /* ASCII Start Of Header */ >^ > ./include/linux/kern_levels.h:13:19: note: in expansion of macro > ‘KERN_SOH’ > #define KERN_INFO KERN_SOH "6" /* informational */ > ^ > ./include/linux/printk.h:308:9: note: in expansion of macro > ‘KERN_INFO’ >printk(KERN_INFO pr_fmt(fmt), ##__VA_ARGS__) > ^ > drivers/net/ethernet/freescale/ucc_geth.c:3860:3: note: in expansion > of > macro ‘pr_info’ > pr_info("UCC%1d at 0x%8llx (irq = %d)\n", > ^ > > Christophe > > > > > Signed-off-by: Valentin Longchamp > > --- > > drivers/net/ethernet/freescale/ucc_geth.c | 2 +- > > 1 file changed, 1 insertion(+), 1 deletion(-) > > > > diff --git a/drivers/net/ethernet/freescale/ucc_geth.c > > b/drivers/net/ethernet/freescale/ucc_geth.c > > index f77ba9fa257b..56b8fdb35c3b 100644 > > --- a/drivers/net/ethernet/freescale/ucc_geth.c > > +++ b/drivers/net/ethernet/freescale/ucc_geth.c > > @@ -3857,7 +3857,7 @@ static int ucc_geth_probe(struct > > platform_device* ofdev) > > } > > > > if (netif_msg_probe(&debug)) > > - pr_info("UCC%1d at 0x%8x (irq = %d)\n", > > + pr_info("UCC%1d at 0x%8llx (irq = %d)\n", > > ug_info->uf_info.ucc_num + 1, ug_info- > > >uf_info.regs, > > ug_info->uf_info.irq); > >
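For what it's worth, a cast-free alternative that sidesteps the u32/u64 split entirely is the kernel's %pa printk specifier, which prints a phys_addr_t (and hence resource_size_t) passed by reference. A sketch against the same pr_info (untested):

	if (netif_msg_probe(&debug))
		pr_info("UCC%1d at %pa (irq = %d)\n",
			ug_info->uf_info.ucc_num + 1,
			&ug_info->uf_info.regs,	/* note: pass by reference */
			ug_info->uf_info.irq);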
Re: [v4 07/11] soc/fsl/qbman: Rework portal mapping calls for ARM/PPC
On Thu, Aug 24, 2017 at 04:37:51PM -0400, Roy Pledge wrote: > diff --git a/drivers/soc/fsl/qbman/bman.c b/drivers/soc/fsl/qbman/bman.c > index ff8998f..e31c843 100644 > --- a/drivers/soc/fsl/qbman/bman.c > +++ b/drivers/soc/fsl/qbman/bman.c > @@ -154,7 +154,7 @@ struct bm_mc { > }; > > struct bm_addr { > - void __iomem *ce; /* cache-enabled */ > + void *ce; /* cache-enabled */ > void __iomem *ci; /* cache-inhibited */ > }; You dropped __iomem from ce, which is fine since it is now set via memremap. However, I haven't seen (at least not in this patch), a change to bm_ce_in() which still uses __raw_readl(). (it may be worth checking this code with sparse, it may warn about this) > diff --git a/drivers/soc/fsl/qbman/bman_portal.c > b/drivers/soc/fsl/qbman/bman_portal.c > index 39b39c8..bb03503 100644 > --- a/drivers/soc/fsl/qbman/bman_portal.c > +++ b/drivers/soc/fsl/qbman/bman_portal.c > @@ -91,7 +91,6 @@ static int bman_portal_probe(struct platform_device *pdev) > struct device_node *node = dev->of_node; > struct bm_portal_config *pcfg; > struct resource *addr_phys[2]; > - void __iomem *va; > int irq, cpu; > > pcfg = devm_kmalloc(dev, sizeof(*pcfg), GFP_KERNEL); > @@ -123,23 +122,34 @@ static int bman_portal_probe(struct platform_device > *pdev) > } > pcfg->irq = irq; > > - va = ioremap_prot(addr_phys[0]->start, resource_size(addr_phys[0]), 0); > - if (!va) { > - dev_err(dev, "ioremap::CE failed\n"); > + /* > + * TODO: Ultimately we would like to use a cacheable/non-shareable > + * (coherent) mapping for the portal on both architectures but that > + * isn't currently available in the kernel. Because of HW differences > + * PPC needs to be mapped cacheable while ARM SoCs will work with non > + * cacheable mappings > + */ This comment mentions "cacheable/non-shareable (coherent)". Was this meant for ARM platforms? Because non-shareable is not coherent, nor is this combination guaranteed to work with different CPUs and interconnects. > +#ifdef CONFIG_PPC > + /* PPC requires a cacheable/non-coherent mapping of the portal */ > + pcfg->addr_virt_ce = memremap(addr_phys[0]->start, > + resource_size(addr_phys[0]), MEMREMAP_WB); > +#else > + /* ARM can use a write combine mapping. */ > + pcfg->addr_virt_ce = memremap(addr_phys[0]->start, > + resource_size(addr_phys[0]), MEMREMAP_WC); > +#endif Nitpick: you could define something like QBMAN_MAP_ATTR to be different between PPC and the rest and just keep a single memremap() call. One may complain that "ce" is no longer "cache enabled" but I'm personally fine to keep the same name for historical reasons. > diff --git a/drivers/soc/fsl/qbman/dpaa_sys.h > b/drivers/soc/fsl/qbman/dpaa_sys.h > index 81a9a5e..0a1d573 100644 > --- a/drivers/soc/fsl/qbman/dpaa_sys.h > +++ b/drivers/soc/fsl/qbman/dpaa_sys.h > @@ -51,12 +51,12 @@ > > static inline void dpaa_flush(void *p) > { > + /* > + * Only PPC needs to flush the cache currently - on ARM the mapping > + * is non cacheable > + */ > #ifdef CONFIG_PPC > flush_dcache_range((unsigned long)p, (unsigned long)p+64); > -#elif defined(CONFIG_ARM) > - __cpuc_flush_dcache_area(p, 64); > -#elif defined(CONFIG_ARM64) > - __flush_dcache_area(p, 64); > #endif > } Dropping the private API cache maintenance is fine and the memory is WC now for ARM (mapping to Normal NonCacheable). However, do you require any barriers here? Normal NC doesn't guarantee any ordering. 
> diff --git a/drivers/soc/fsl/qbman/qman_portal.c > b/drivers/soc/fsl/qbman/qman_portal.c > index cbacdf4..41fe33a 100644 > --- a/drivers/soc/fsl/qbman/qman_portal.c > +++ b/drivers/soc/fsl/qbman/qman_portal.c > @@ -224,7 +224,6 @@ static int qman_portal_probe(struct platform_device *pdev) > struct device_node *node = dev->of_node; > struct qm_portal_config *pcfg; > struct resource *addr_phys[2]; > - void __iomem *va; > int irq, cpu, err; > u32 val; > > @@ -262,23 +261,34 @@ static int qman_portal_probe(struct platform_device > *pdev) > } > pcfg->irq = irq; > > - va = ioremap_prot(addr_phys[0]->start, resource_size(addr_phys[0]), 0); > - if (!va) { > - dev_err(dev, "ioremap::CE failed\n"); > + /* > + * TODO: Ultimately we would like to use a cacheable/non-shareable > + * (coherent) mapping for the portal on both architectures but that > + * isn't currently available in the kernel. Because of HW differences > + * PPC needs to be mapped cacheable while ARM SoCs will work with non > + * cacheable mappings > + */ Same comment as above on non-shareable. > +#ifdef CONFIG_PPC > + /* PPC requires a cacheable mapping of the portal */ > + pcfg->addr_virt_ce = memremap(addr_phys[0]->start, +
Re: [v4 05/11] soc/fsl/qbman: Drop L1_CACHE_BYTES compile time check
On Thu, Aug 24, 2017 at 04:37:49PM -0400, Roy Pledge wrote: > From: Claudiu Manoil > > Not relevant and arch dependent. Overkill for PPC. > > Signed-off-by: Claudiu Manoil > Signed-off-by: Roy Pledge > --- > drivers/soc/fsl/qbman/dpaa_sys.h | 4 > 1 file changed, 4 deletions(-) > > diff --git a/drivers/soc/fsl/qbman/dpaa_sys.h > b/drivers/soc/fsl/qbman/dpaa_sys.h > index 2ce394a..f85c319 100644 > --- a/drivers/soc/fsl/qbman/dpaa_sys.h > +++ b/drivers/soc/fsl/qbman/dpaa_sys.h > @@ -49,10 +49,6 @@ > #define DPAA_PORTAL_CE 0 > #define DPAA_PORTAL_CI 1 > > -#if (L1_CACHE_BYTES != 32) && (L1_CACHE_BYTES != 64) > -#error "Unsupported Cacheline Size" > -#endif Maybe this check was for a reason on PPC as it uses WB memory mappings for some of the qbman descriptors (which IIUC fit within a cacheline). You could add a check for CONFIG_PPC if you think there is any chance of this constant going higher. -- Catalin
Re: [v4 03/11] dt-bindings: soc/fsl: Update reserved memory binding for QBMan
On Thu, Aug 24, 2017 at 04:37:47PM -0400, Roy Pledge wrote: > Updates the QMan and BMan device tree bindings for reserved memory > nodes. This makes the reserved memory allocation compatible with > the shared-dma-pool usage. > > Signed-off-by: Roy Pledge > --- > Documentation/devicetree/bindings/soc/fsl/bman.txt | 12 +- > Documentation/devicetree/bindings/soc/fsl/qman.txt | 26 -- This needs to be reviewed by the DT maintainers. -- Catalin
Re: [v4 01/11] soc/fsl/qbman: Use shared-dma-pool for BMan private memory allocations
On Thu, Aug 24, 2017 at 04:37:45PM -0400, Roy Pledge wrote: > --- a/drivers/soc/fsl/qbman/bman_ccsr.c > +++ b/drivers/soc/fsl/qbman/bman_ccsr.c [...] > @@ -201,6 +202,38 @@ static int fsl_bman_probe(struct platform_device *pdev) > return -ENODEV; > } > > + /* > + * If FBPR memory wasn't defined using the qbman compatible string > + * try using the of_reserved_mem_device method > + */ > + if (!fbpr_a) { > + ret = of_reserved_mem_device_init(dev); > + if (ret) { > + dev_err(dev, "of_reserved_mem_device_init() failed > 0x%x\n", > + ret); > + return -ENODEV; > + } > + mem_node = of_parse_phandle(dev->of_node, "memory-region", 0); > + if (mem_node) { > + ret = of_property_read_u64(mem_node, "size", &size); > + if (ret) { > + dev_err(dev, "FBPR: of_address_to_resource > fails 0x%x\n", > + ret); > + return -ENODEV; > + } > + fbpr_sz = size; > + } else { > + dev_err(dev, "No memory-region found for FBPR\n"); > + return -ENODEV; > + } > + if (!dma_zalloc_coherent(dev, fbpr_sz, &fbpr_a, 0)) { > + dev_err(dev, "Alloc FBPR memory failed\n"); > + return -ENODEV; > + } > + } At a quick look, I think I spotted this pattern a couple more times in the subsequent patch. Could it be moved to a common function? -- Catalin
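A minimal sketch of the common function being suggested here (hypothetical name and signature, with the error reporting condensed; the real consolidation would live in the shared dpaa_sys code):

static int qbman_init_private_mem(struct device *dev, dma_addr_t *addr,
				  size_t *sz)
{
	struct device_node *mem_node;
	u64 size;
	int ret;

	ret = of_reserved_mem_device_init(dev);
	if (ret)
		return ret;

	mem_node = of_parse_phandle(dev->of_node, "memory-region", 0);
	if (!mem_node)
		return -ENODEV;
	ret = of_property_read_u64(mem_node, "size", &size);
	if (ret)
		return ret;
	*sz = size;

	if (!dma_zalloc_coherent(dev, *sz, addr, 0))
		return -ENOMEM;
	return 0;
}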
Re: [PATCH] net/ethernet/freescale: fix warning for ucc_geth
Hi, On 14/09/2017 at 14:05, Valentin Longchamp wrote: Simple printk format warning for the ucc registers address. Did you test your patch with mpc83xx_defconfig? I get a new warning with your patch: CC drivers/net/ethernet/freescale/ucc_geth.o In file included from ./include/linux/printk.h:6:0, from ./include/linux/kernel.h:13, from drivers/net/ethernet/freescale/ucc_geth.c:18: drivers/net/ethernet/freescale/ucc_geth.c: In function ‘ucc_geth_probe’: ./include/linux/kern_levels.h:4:18: warning: format ‘%llx’ expects argument of type ‘long long unsigned int’, but argument 3 has type ‘resource_size_t {aka unsigned int}’ [-Wformat=] #define KERN_SOH "\001" /* ASCII Start Of Header */ ^ ./include/linux/kern_levels.h:13:19: note: in expansion of macro ‘KERN_SOH’ #define KERN_INFO KERN_SOH "6" /* informational */ ^ ./include/linux/printk.h:308:9: note: in expansion of macro ‘KERN_INFO’ printk(KERN_INFO pr_fmt(fmt), ##__VA_ARGS__) ^ drivers/net/ethernet/freescale/ucc_geth.c:3860:3: note: in expansion of macro ‘pr_info’ pr_info("UCC%1d at 0x%8llx (irq = %d)\n", ^ Christophe Signed-off-by: Valentin Longchamp --- drivers/net/ethernet/freescale/ucc_geth.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/net/ethernet/freescale/ucc_geth.c b/drivers/net/ethernet/freescale/ucc_geth.c index f77ba9fa257b..56b8fdb35c3b 100644 --- a/drivers/net/ethernet/freescale/ucc_geth.c +++ b/drivers/net/ethernet/freescale/ucc_geth.c @@ -3857,7 +3857,7 @@ static int ucc_geth_probe(struct platform_device* ofdev) } if (netif_msg_probe(&debug)) - pr_info("UCC%1d at 0x%8x (irq = %d)\n", + pr_info("UCC%1d at 0x%8llx (irq = %d)\n", ug_info->uf_info.ucc_num + 1, ug_info->uf_info.regs, ug_info->uf_info.irq);
[PATCH] net/ethernet/freescale: fix warning for ucc_geth
Simple printk format warning for the ucc registers address. Signed-off-by: Valentin Longchamp --- drivers/net/ethernet/freescale/ucc_geth.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/net/ethernet/freescale/ucc_geth.c b/drivers/net/ethernet/freescale/ucc_geth.c index f77ba9fa257b..56b8fdb35c3b 100644 --- a/drivers/net/ethernet/freescale/ucc_geth.c +++ b/drivers/net/ethernet/freescale/ucc_geth.c @@ -3857,7 +3857,7 @@ static int ucc_geth_probe(struct platform_device* ofdev) } if (netif_msg_probe(&debug)) - pr_info("UCC%1d at 0x%8x (irq = %d)\n", + pr_info("UCC%1d at 0x%8llx (irq = %d)\n", ug_info->uf_info.ucc_num + 1, ug_info->uf_info.regs, ug_info->uf_info.irq); -- 2.13.5
Re: [RFC PATCH 2/2] powerpc/powernv: implement NMI IPIs with OPAL_SIGNAL_SYSTEM_RESET
On Wed, 13 Sep 2017 02:05:53 +1000 Nicholas Piggin wrote: > There are two complications. The first is that sresets from stop states > come in with SRR1 set to do a powersave wakeup, with an sreset reason > encoded. > > The second is that threads on the same core can't be signalled directly > so we must designate a bounce CPU to reflect the IPI back. This is a revised patch with only DD2 enablement. DD2 allows threads on the same core to be IPIed. It's much simpler, and most of the code is fixing the watchdog and preventing it from triggering from xmon (which will be split into other patches of course). It's probably a better starting point to get this working and merged first, and then revisit bouncing. --- arch/powerpc/include/asm/opal-api.h| 1 + arch/powerpc/include/asm/opal.h| 2 ++ arch/powerpc/kernel/irq.c | 20 ++ arch/powerpc/kernel/watchdog.c | 29 +++--- arch/powerpc/platforms/powernv/opal-wrappers.S | 1 + arch/powerpc/platforms/powernv/powernv.h | 1 + arch/powerpc/platforms/powernv/setup.c | 3 +++ arch/powerpc/platforms/powernv/smp.c | 24 + arch/powerpc/xmon/xmon.c | 17 +++ 9 files changed, 82 insertions(+), 16 deletions(-) diff --git a/arch/powerpc/include/asm/opal-api.h b/arch/powerpc/include/asm/opal-api.h index 450a60b81d2a..9d191ebea706 100644 --- a/arch/powerpc/include/asm/opal-api.h +++ b/arch/powerpc/include/asm/opal-api.h @@ -188,6 +188,7 @@ #define OPAL_XIVE_DUMP 142 #define OPAL_XIVE_RESERVED3143 #define OPAL_XIVE_RESERVED4144 +#define OPAL_SIGNAL_SYSTEM_RESET 145 #define OPAL_NPU_INIT_CONTEXT 146 #define OPAL_NPU_DESTROY_CONTEXT 147 #define OPAL_NPU_MAP_LPAR 148 diff --git a/arch/powerpc/include/asm/opal.h b/arch/powerpc/include/asm/opal.h index 726c23304a57..7d7613c49f2b 100644 --- a/arch/powerpc/include/asm/opal.h +++ b/arch/powerpc/include/asm/opal.h @@ -281,6 +281,8 @@ int opal_get_power_shift_ratio(u32 handle, int token, u32 *psr); int opal_set_power_shift_ratio(u32 handle, int token, u32 psr); int opal_sensor_group_clear(u32 group_hndl, int token); +int64_t opal_signal_system_reset(int32_t cpu); + /* Internal functions */ extern int early_init_dt_scan_opal(unsigned long node, const char *uname, int depth, void *data); diff --git a/arch/powerpc/kernel/irq.c b/arch/powerpc/kernel/irq.c index 4e65bf82f5e0..8ffebb9437e5 100644 --- a/arch/powerpc/kernel/irq.c +++ b/arch/powerpc/kernel/irq.c @@ -407,11 +407,31 @@ static const u8 srr1_to_lazyirq[0x10] = { PACA_IRQ_HMI, 0, 0, 0, 0, 0 }; +static noinline void replay_system_reset(void) +{ + struct pt_regs regs; + + ppc_save_regs(&regs); + + get_paca()->in_nmi = 1; + system_reset_exception(&regs); + get_paca()->in_nmi = 0; +} + void irq_set_pending_from_srr1(unsigned long srr1) { unsigned int idx = (srr1 & SRR1_WAKEMASK_P8) >> 18; /* +* 0100b SRR1 reason is system reset. Take it now, +* which is immediately after registers are restored +* from idle. It's an NMI, so interrupts needn't be +* re-enabled. +*/ + if (unlikely(idx == 4)) + replay_system_reset(); + + /* * The 0 index (SRR1[42:45]=b0000) must always evaluate to 0, * so this can be called unconditionally with srr1 wake reason. 
*/ diff --git a/arch/powerpc/kernel/watchdog.c b/arch/powerpc/kernel/watchdog.c index 2f6eadd9408d..1fb9379dc683 100644 --- a/arch/powerpc/kernel/watchdog.c +++ b/arch/powerpc/kernel/watchdog.c @@ -97,8 +97,7 @@ static void wd_lockup_ipi(struct pt_regs *regs) else dump_stack(); - if (hardlockup_panic) - nmi_panic(regs, "Hard LOCKUP"); + /* Do not panic from here because that can recurse into NMI IPI layer */ } static void set_cpumask_stuck(const struct cpumask *cpumask, u64 tb) @@ -134,15 +133,18 @@ static void watchdog_smp_panic(int cpu, u64 tb) pr_emerg("Watchdog CPU:%d detected Hard LOCKUP other CPUS:%*pbl\n", cpu, cpumask_pr_args(&wd_smp_cpus_pending)); - /* -* Try to trigger the stuck CPUs. -*/ - for_each_cpu(c, &wd_smp_cpus_pending) { - if (c == cpu) - continue; - smp_send_nmi_ipi(c, wd_lockup_ipi, 100); + if (!sysctl_hardlockup_all_cpu_backtrace) { + /* +* Try to trigger the stuck CPUs, unless we are going to +* get a backtrace on all of them anyway. +*/ + for_each_cpu(c, &wd_smp_cpus_pending) { + if (c == cpu) + continue; +
Re: [PATCH 4/5] powerpc/jprobes: Disable preemption when triggered through ftrace
On Thu, 14 Sep 2017 15:55:39 +0530 "Naveen N. Rao" wrote: > On 2017/09/13 05:05PM, Masami Hiramatsu wrote: > > On Thu, 14 Sep 2017 02:50:35 +0530 > > "Naveen N. Rao" wrote: > > > > > KPROBES_SANITY_TEST throws the below splat when CONFIG_PREEMPT is > > > enabled: > > > > > > [3.140410] Kprobe smoke test: started > > > [3.149680] DEBUG_LOCKS_WARN_ON(val > preempt_count()) > > > [3.149684] [ cut here ] > > > [3.149695] WARNING: CPU: 19 PID: 1 at kernel/sched/core.c:3094 > > > preempt_count_sub+0xcc/0x140 > > > [3.149699] Modules linked in: > > > [3.149705] CPU: 19 PID: 1 Comm: swapper/0 Not tainted 4.13.0-rc7-nnr+ > > > #97 > > > [3.149709] task: c000fea8 task.stack: c000feb0 > > > [3.149713] NIP: c011d3dc LR: c011d3d8 CTR: > > > c0a090d0 > > > [3.149718] REGS: c000feb03400 TRAP: 0700 Not tainted > > > (4.13.0-rc7-nnr+) > > > [3.149722] MSR: 80021033 CR: 28000282 > > > XER: > > > [3.149732] CFAR: c015aa18 SOFTE: 0 > > > > > > [3.149786] NIP [c011d3dc] preempt_count_sub+0xcc/0x140 > > > [3.149790] LR [c011d3d8] preempt_count_sub+0xc8/0x140 > > > [3.149794] Call Trace: > > > [3.149798] [c000feb03680] [c011d3d8] > > > preempt_count_sub+0xc8/0x140 (unreliable) > > > [3.149804] [c000feb036e0] [c0046198] > > > kprobe_handler+0x228/0x4b0 > > > [3.149810] [c000feb03750] [c00269c8] > > > program_check_exception+0x58/0x3b0 > > > [3.149816] [c000feb037c0] [c000903c] > > > program_check_common+0x16c/0x170 > > > [3.149822] --- interrupt: 0 at kprobe_target+0x8/0x20 > > >LR = init_test_probes+0x248/0x7d0 > > > [3.149829] [c000feb03ab0] [c0e4f048] kp+0x0/0x80 > > > (unreliable) > > > [3.149835] [c000feb03b10] [c004ea60] > > > livepatch_handler+0x38/0x74 > > > [3.149841] [c000feb03ba0] [c0d0de54] > > > init_kprobes+0x1d8/0x208 > > > [3.149846] [c000feb03c40] [c000daa8] > > > do_one_initcall+0x68/0x1d0 > > > [3.149852] [c000feb03d00] [c0ce44f0] > > > kernel_init_freeable+0x298/0x374 > > > [3.149857] [c000feb03dc0] [c000dd84] > > > kernel_init+0x24/0x160 > > > [3.149863] [c000feb03e30] [c000bfec] > > > ret_from_kernel_thread+0x5c/0x70 > > > [3.149867] Instruction dump: > > > [3.149871] 419effdc 3d22001b 39299240 8129 2f89 409effc8 > > > 3c82ffcb 3c62ffcb > > > [3.149879] 3884bc68 3863bc18 4803d5fd 6000 <0fe0> 4ba8 > > > 6000 6000 > > > [3.149890] ---[ end trace 432dd46b4ce3d29f ]--- > > > [3.166003] Kprobe smoke test: passed successfully > > > > > > The issue is that we aren't disabling preemption in > > > kprobe_ftrace_handler(). Disable it. > > > > Oops, right! Similar patch may need for x86 too. > > Indeed, I will send a patch for that. > > On a related note, I've been looking into why we have !PREEMPT for > CONFIG_OPTPROBES. It looks like the primary reason is x86 having to deal > with replacing multiple instructions. However, that isn't true with arm > and powerpc. So, does it make sense to move 'depends on !PREEMPT' to the > x86 code? Are there other scenarios where it might cause issues for > arm/powerpc? Indeed! What I expected was to have to replace several instructions on RISC processors for jumping to a far site (like load & jump), but you chose a different approach. So there is no reason to prohibit that. Thanks! -- Masami Hiramatsu
Re: [PATCH 4/5] powerpc/jprobes: Disable preemption when triggered through ftrace
On 2017/09/13 05:05PM, Masami Hiramatsu wrote: > On Thu, 14 Sep 2017 02:50:35 +0530 > "Naveen N. Rao" wrote: > > > KPROBES_SANITY_TEST throws the below splat when CONFIG_PREEMPT is > > enabled: > > > > [3.140410] Kprobe smoke test: started > > [3.149680] DEBUG_LOCKS_WARN_ON(val > preempt_count()) > > [3.149684] [ cut here ] > > [3.149695] WARNING: CPU: 19 PID: 1 at kernel/sched/core.c:3094 > > preempt_count_sub+0xcc/0x140 > > [3.149699] Modules linked in: > > [3.149705] CPU: 19 PID: 1 Comm: swapper/0 Not tainted 4.13.0-rc7-nnr+ > > #97 > > [3.149709] task: c000fea8 task.stack: c000feb0 > > [3.149713] NIP: c011d3dc LR: c011d3d8 CTR: > > c0a090d0 > > [3.149718] REGS: c000feb03400 TRAP: 0700 Not tainted > > (4.13.0-rc7-nnr+) > > [3.149722] MSR: 80021033 CR: 28000282 > > XER: > > [3.149732] CFAR: c015aa18 SOFTE: 0 > > > > [3.149786] NIP [c011d3dc] preempt_count_sub+0xcc/0x140 > > [3.149790] LR [c011d3d8] preempt_count_sub+0xc8/0x140 > > [3.149794] Call Trace: > > [3.149798] [c000feb03680] [c011d3d8] > > preempt_count_sub+0xc8/0x140 (unreliable) > > [3.149804] [c000feb036e0] [c0046198] > > kprobe_handler+0x228/0x4b0 > > [3.149810] [c000feb03750] [c00269c8] > > program_check_exception+0x58/0x3b0 > > [3.149816] [c000feb037c0] [c000903c] > > program_check_common+0x16c/0x170 > > [3.149822] --- interrupt: 0 at kprobe_target+0x8/0x20 > >LR = init_test_probes+0x248/0x7d0 > > [3.149829] [c000feb03ab0] [c0e4f048] kp+0x0/0x80 > > (unreliable) > > [3.149835] [c000feb03b10] [c004ea60] > > livepatch_handler+0x38/0x74 > > [3.149841] [c000feb03ba0] [c0d0de54] > > init_kprobes+0x1d8/0x208 > > [3.149846] [c000feb03c40] [c000daa8] > > do_one_initcall+0x68/0x1d0 > > [3.149852] [c000feb03d00] [c0ce44f0] > > kernel_init_freeable+0x298/0x374 > > [3.149857] [c000feb03dc0] [c000dd84] kernel_init+0x24/0x160 > > [3.149863] [c000feb03e30] [c000bfec] > > ret_from_kernel_thread+0x5c/0x70 > > [3.149867] Instruction dump: > > [3.149871] 419effdc 3d22001b 39299240 8129 2f89 409effc8 > > 3c82ffcb 3c62ffcb > > [3.149879] 3884bc68 3863bc18 4803d5fd 6000 <0fe0> 4ba8 > > 6000 6000 > > [3.149890] ---[ end trace 432dd46b4ce3d29f ]--- > > [3.166003] Kprobe smoke test: passed successfully > > > > The issue is that we aren't disabling preemption in > > kprobe_ftrace_handler(). Disable it. > > Oops, right! Similar patch may need for x86 too. Indeed, I will send a patch for that. On a related note, I've been looking into why we have !PREEMPT for CONFIG_OPTPROBES. It looks like the primary reason is x86 having to deal with replacing multiple instructions. However, that isn't true with arm and powerpc. So, does it make sense to move 'depends on !PREEMPT' to the x86 code? Are there other scenarios where it might cause issues for arm/powerpc? Thanks! - Naveen
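For the x86 counterpart mentioned above, the shape of the fix is the same as on powerpc: hold preemption off across the ftrace-based handler. A minimal sketch (hypothetical helper name for the existing pre-handler invocation; this is not the upstream diff):

/* Sketch only: handlers expect to run with preemption disabled, just as
 * they do on the trap-based kprobe path. */
void kprobe_ftrace_handler(unsigned long ip, unsigned long parent_ip,
			   struct ftrace_ops *ops, struct pt_regs *regs)
{
	struct kprobe *p;

	preempt_disable();
	p = get_kprobe((kprobe_opcode_t *)ip);
	if (p && !kprobe_disabled(p))
		handle_kprobe_from_ftrace(p, regs);	/* hypothetical */
	preempt_enable_no_resched();
}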
[PATCH] drivers: of: static DT reservations incorrectly added to dynamic list
There are two types of memory reservations firmware can ask the kernel to make in the device tree: static and dynamic. See Documentation/devicetree/bindings/reserved-memory/reserved-memory.txt If you have more than 16 entries in /reserved-memory (as we do on POWER9 systems) you would get this scary-looking error message: [0.00] OF: reserved mem: not enough space all defined regions. This is harmless if all your reservations are static (which with OPAL on POWER9, they are). It is not harmless if you have any dynamic reservations after the 16th. In the first pass over the fdt to find reservations, the child nodes of /reserved-memory are added to a static array in of_reserved_mem.c so that memory can be reserved in a 2nd pass. The array has 16 entries. This is why, on my dual-socket POWER9 system, I get that error 4 times with 20 static reservations. We don't have a problem on ppc though, as in arch/powerpc/kernel/prom.c we look at the new-style /reserved-ranges property to do reservations, and this logic was introduced in 0962e8004e974 (well before any powernv system shipped). Google turns up no occurrences of that exact error message, so we're probably safe in assuming that no machine people actually use is failing to reserve memory it should. The fix is simple: since there's a different code path for static and dynamic allocations, we just don't add the region to the list if it's static. Since it's a static *OR* dynamic region, this is a perfectly valid thing to do (although I have not checked every real-world device tree on the planet for this condition). Fixes: 3f0c8206644836e4f10a6b9fc47cda6a9a372f9b Signed-off-by: Stewart Smith --- NOTE: I've done only fairly limited testing of this on POWER, I certainly haven't tested on ARM or *anything* with dynamic allocations. So, testing and comments welcome. --- drivers/of/fdt.c | 6 +- 1 file changed, 1 insertion(+), 5 deletions(-) diff --git a/drivers/of/fdt.c b/drivers/of/fdt.c index ce30c9a588a4..a9a44099ed69 100644 --- a/drivers/of/fdt.c +++ b/drivers/of/fdt.c @@ -587,7 +587,7 @@ static int __init __reserved_mem_reserve_reg(unsigned long node, phys_addr_t base, size; int len; const __be32 *prop; - int nomap, first = 1; + int nomap; prop = of_get_flat_dt_prop(node, "reg", &len); if (!prop) @@ -614,10 +614,6 @@ static int __init __reserved_mem_reserve_reg(unsigned long node, uname, &base, (unsigned long)size / SZ_1M); len -= t_len; - if (first) { - fdt_reserved_mem_save_node(node, uname, base, size); - first = 0; - } } return 0; } -- 2.13.5
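To make the static/dynamic distinction concrete, the first-pass scan described above behaves roughly as below once the fix is applied. This is a condensed sketch (the helper name scan_reserved_mem_node and the control flow are simplified from drivers/of/fdt.c, with error handling elided): only dynamic nodes still consume one of the 16 saved entries.

/*
 * Condensed sketch of the first pass over /reserved-memory children:
 * a "reg" property marks a static region that can be reserved on the
 * spot, while a "size" property marks a dynamic region that must be
 * saved so the second pass can allocate it.
 */
static int __init scan_reserved_mem_node(unsigned long node, const char *uname)
{
        /* Static region: fixed addresses, reserve immediately */
        if (of_get_flat_dt_prop(node, "reg", NULL))
                return __reserved_mem_reserve_reg(node, uname);

        /* Dynamic region: the kernel picks the address in the second pass */
        if (of_get_flat_dt_prop(node, "size", NULL))
                fdt_reserved_mem_save_node(node, uname, 0, 0);

        return 0;
}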
Re: [PATCH 3/5] powerpc/kprobes: Fix warnings from __this_cpu_read() on preempt kernels
On Thu, 14 Sep 2017 12:17:20 +0530 "Naveen N. Rao" wrote: > On 2017/09/13 05:36PM, Masami Hiramatsu wrote: > > On Thu, 14 Sep 2017 02:50:34 +0530 > > "Naveen N. Rao" wrote: > > > > > Kamalesh pointed out that we are getting the below call traces with > > > livepatched functions when we enable CONFIG_PREEMPT: > > > > > > [ 495.470721] BUG: using __this_cpu_read() in preemptible [] > > > code: cat/8394 > > > [ 495.471167] caller is is_current_kprobe_addr+0x30/0x90 > > > [ 495.471171] CPU: 4 PID: 8394 Comm: cat Tainted: G K > > > 4.13.0-rc7-nnr+ #95 > > > [ 495.471173] Call Trace: > > > [ 495.471178] [c0008fd9b960] [c09f039c] > > > dump_stack+0xec/0x160 (unreliable) > > > [ 495.471184] [c0008fd9b9a0] [c059169c] > > > check_preemption_disabled+0x15c/0x170 > > > [ 495.471187] [c0008fd9ba30] [c0046460] > > > is_current_kprobe_addr+0x30/0x90 > > > [ 495.471191] [c0008fd9ba60] [c004e9a0] ftrace_call+0x1c/0xb8 > > > [ 495.471195] [c0008fd9bc30] [c0376fd8] seq_read+0x238/0x5c0 > > > [ 495.471199] [c0008fd9bcd0] [c03cfd78] > > > proc_reg_read+0x88/0xd0 > > > [ 495.471203] [c0008fd9bd00] [c033e5d4] __vfs_read+0x44/0x1b0 > > > [ 495.471206] [c0008fd9bd90] [c03402ec] vfs_read+0xbc/0x1b0 > > > [ 495.471210] [c0008fd9bde0] [c0342138] SyS_read+0x68/0x110 > > > [ 495.471214] [c0008fd9be30] [c000bc6c] system_call+0x58/0x6c > > > > > > Commit c05b8c4474c030 ("powerpc/kprobes: Skip livepatch_handler() for > > > jprobes") introduced a helper is_current_kprobe_addr() to help determine > > > if the current function has been livepatched or if it has a jprobe > > > installed, both of which modify the NIP. > > > > > > In the case of a jprobe, kprobe_ftrace_handler() disables pre-emption > > > before calling into setjmp_pre_handler() which returns without disabling > > > pre-emption. This is done to ensure that the jprobe handler won't > > > disappear beneath us if the jprobe is unregistered between the > > > setjmp_pre_handler() and the subsequent longjmp_break_handler() called > > > from the jprobe handler. Due to this, we can use __this_cpu_read() in > > > is_current_kprobe_addr() with the pre-emption check as we know that > > > pre-emption will be disabled. > > > > > > However, if this function has been livepatched, we are still doing this > > > check and when we do so, pre-emption won't necessarily be disabled. This > > > results in the call trace shown above. > > > > > > Fix this by only invoking is_current_kprobe_addr() when pre-emption is > > > disabled. And since we now guard this within a pre-emption check, we can > > > instead use raw_cpu_read() to get the current_kprobe value skipping the > > > check done by __this_cpu_read(). > > > > Hmm, can you disable preempt temporary at this function and read it? > > Yes, but I felt this approach is more optimal specifically for live > patching. We don't normally expect preemption to be disabled while > handling a livepatched function, so the simple 'if (!preemptible())' > check helps us bypass looking at current_kprobe. Ah, I see. This is for checking "jprobes", since kprobes/kretprobes are usually already handled there. > > The check also ensures we can use raw_cpu_read() safely in other > scenarios. Do you see any other concerns with this approach? Yes, there are 2 reasons. First, if a user's custom kprobe pre-handler changes the NIP for their own hand-written livepatching, such a handler will enable preemption before returning. We don't prohibit that. I think this function cannot detect that case. 
Also, the function "name" is very confusing :) Especially since this symbol is exported: if it is meant for checking the jprobes context, it should be renamed, or at least start with "__" to show it is for local use. Thank you, > > Thanks, > Naveen > > > > > > Thank you, > > > > > > > > Fixes: c05b8c4474c030 ("powerpc/kprobes: Skip livepatch_handler() for > > > jprobes") > > > Reported-by: Kamalesh Babulal > > > Signed-off-by: Naveen N. Rao > > > --- > > > arch/powerpc/kernel/kprobes.c | 8 ++-- > > > 1 file changed, 6 insertions(+), 2 deletions(-) > > > > > > diff --git a/arch/powerpc/kernel/kprobes.c b/arch/powerpc/kernel/kprobes.c > > > index e848fe2c93fb..db40b13fd3d1 100644 > > > --- a/arch/powerpc/kernel/kprobes.c > > > +++ b/arch/powerpc/kernel/kprobes.c > > > @@ -45,8 +45,12 @@ struct kretprobe_blackpoint kretprobe_blacklist[] = > > > {{NULL, NULL}}; > > > > > > int is_current_kprobe_addr(unsigned long addr) > > > { > > > - struct kprobe *p = kprobe_running(); > > > - return (p && (unsigned long)p->addr == addr) ? 1 : 0; > > > + if (!preemptible()) { > > > + struct kprobe *p = raw_cpu_read(current_kprobe); > > > + return (p && (unsigned long)p->addr == addr) ? 1 : 0; > > > + } > > > + > > > + return 0; > > > } > > > > > > bool arch_within_kprobe_blacklist(unsigned long addr) > > > --
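For comparison, the alternative Masami suggested earlier in the thread ("can you disable preempt temporary at this function and read it?") would look roughly like the sketch below, which pins the CPU just around the per-cpu read instead of returning 0 whenever the caller happens to be preemptible:

int is_current_kprobe_addr(unsigned long addr)
{
        struct kprobe *p;
        int ret;

        /* Pin the CPU so current_kprobe can't change under us */
        preempt_disable();
        p = raw_cpu_read(current_kprobe);
        ret = (p && (unsigned long)p->addr == addr) ? 1 : 0;
        preempt_enable();

        return ret;
}

As the reply above notes, though, even this variant cannot distinguish a jprobe from some other custom pre-handler that changed the NIP.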
Re: [PATCH kernel] powerpc/powernv: Update comment about shifting IOV BAR
On Thu, 2017-09-14 at 09:27 +, David Laight wrote: > You can logically 'hotplug' PCI(e) on any system [1]. > > The 'problem' is that whatever enumerates the PCI(e) at system > powerup doesn't normally assign extra resources to bridges to allow > for devices that aren't present at boot time. > So you can normally only replace cards with ones that use the same > (or less) resources, or that are not behind any bridges. > This is problematic if you have a docking station connected via > a bridge. There's also the problem of Max Payload Size. If you can hotplug behind a bridge then the standard algorithm of finding the max of all devices behind a host bridge doesn't work anymore and you have to clamp everybody to 128 bytes. > [1] Apart from some annoying x86 Dell servers we have which generate > an NMI when the PCIe link goes down (when we reprogram the fpga). > They also fail to boot if a link doesn't come up... > > David
Re: [PATCH 2/5] powerpc/kprobes: Do not suppress instruction emulation if a single run failed
On 2017/09/14 02:45AM, Masami Hiramatsu wrote: > On Thu, 14 Sep 2017 12:08:07 +0530 > "Naveen N. Rao" wrote: > > > On 2017/09/13 04:53PM, Masami Hiramatsu wrote: > > > On Thu, 14 Sep 2017 02:50:33 +0530 > > > "Naveen N. Rao" wrote: > > > > > > > Currently, we disable instruction emulation if emulate_step() fails for > > > > any reason. However, such failures could be transient and specific to a > > > > particular run. Instead, only disable instruction emulation if we have > > > > never been able to emulate this. If we had emulated this instruction > > > > successfully at least once, then we single step only this probe hit and > > > > continue to try emulating the instruction in subsequent probe hits. > > > > > > Hmm, would this mean that the instruction is emulatable or not depends > > > on context? What kind of situation is considerable? > > > > Yes, as an example, a load/store instruction can cause exceptions > > depending on the address. In some of those cases, we will have to single > > step the instruction, but we will be able to emulate in most scenarios. > > OK, I got it. > Could you add this example as comment in the code so that readers can > easily understand? Sure. Thanks, Naveen
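As a sketch of the retry policy being agreed on here (the flag semantics are illustrative; the posted patch may plumb this differently), the idea is to permanently give up on emulation only for an instruction that has never emulated successfully:

/*
 * Illustrative sketch: a transient emulate_step() failure (e.g. a
 * load/store that would fault for this particular address) only
 * causes a single step for this one probe hit; emulation is disabled
 * for good only if it has never succeeded for this probe.
 */
static int try_to_emulate(struct kprobe *p, struct pt_regs *regs)
{
        int ret;

        if (p->ainsn.boostable == -1)
                return 0;                       /* gave up on emulation for this probe */

        ret = emulate_step(regs, *p->ainsn.insn);
        if (ret > 0) {
                p->ainsn.boostable = 1;         /* emulation has worked at least once */
                return 1;                       /* emulated; no single stepping needed */
        }

        if (!p->ainsn.boostable)
                p->ainsn.boostable = -1;        /* never emulated: stop trying */

        return 0;                               /* fall back to single stepping this hit */
}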
Re: [PATCH 3/5] powerpc/kprobes: Fix warnings from __this_cpu_read() on preempt kernels
On Thursday 14 September 2017 02:50 AM, Naveen N. Rao wrote: Kamalesh pointed out that we are getting the below call traces with livepatched functions when we enable CONFIG_PREEMPT: [ 495.470721] BUG: using __this_cpu_read() in preemptible [] code: cat/8394 [ 495.471167] caller is is_current_kprobe_addr+0x30/0x90 [ 495.471171] CPU: 4 PID: 8394 Comm: cat Tainted: G K 4.13.0-rc7-nnr+ #95 [ 495.471173] Call Trace: [ 495.471178] [c0008fd9b960] [c09f039c] dump_stack+0xec/0x160 (unreliable) [ 495.471184] [c0008fd9b9a0] [c059169c] check_preemption_disabled+0x15c/0x170 [ 495.471187] [c0008fd9ba30] [c0046460] is_current_kprobe_addr+0x30/0x90 [ 495.471191] [c0008fd9ba60] [c004e9a0] ftrace_call+0x1c/0xb8 [ 495.471195] [c0008fd9bc30] [c0376fd8] seq_read+0x238/0x5c0 [ 495.471199] [c0008fd9bcd0] [c03cfd78] proc_reg_read+0x88/0xd0 [ 495.471203] [c0008fd9bd00] [c033e5d4] __vfs_read+0x44/0x1b0 [ 495.471206] [c0008fd9bd90] [c03402ec] vfs_read+0xbc/0x1b0 [ 495.471210] [c0008fd9bde0] [c0342138] SyS_read+0x68/0x110 [ 495.471214] [c0008fd9be30] [c000bc6c] system_call+0x58/0x6c Commit c05b8c4474c030 ("powerpc/kprobes: Skip livepatch_handler() for jprobes") introduced a helper is_current_kprobe_addr() to help determine if the current function has been livepatched or if it has a jprobe installed, both of which modify the NIP. In the case of a jprobe, kprobe_ftrace_handler() disables pre-emption before calling into setjmp_pre_handler() which returns without disabling pre-emption. This is done to ensure that the jprobe handler won't disappear beneath us if the jprobe is unregistered between the setjmp_pre_handler() and the subsequent longjmp_break_handler() called from the jprobe handler. Due to this, we can use __this_cpu_read() in is_current_kprobe_addr() with the pre-emption check as we know that pre-emption will be disabled. However, if this function has been livepatched, we are still doing this check and when we do so, pre-emption won't necessarily be disabled. This results in the call trace shown above. Fix this by only invoking is_current_kprobe_addr() when pre-emption is disabled. And since we now guard this within a pre-emption check, we can instead use raw_cpu_read() to get the current_kprobe value skipping the check done by __this_cpu_read(). Fixes: c05b8c4474c030 ("powerpc/kprobes: Skip livepatch_handler() for jprobes") Reported-by: Kamalesh Babulal Signed-off-by: Naveen N. Rao Thanks, the call traces are not seen anymore with the patch. Tested-by: Kamalesh Babulal --- arch/powerpc/kernel/kprobes.c | 8 ++-- 1 file changed, 6 insertions(+), 2 deletions(-) diff --git a/arch/powerpc/kernel/kprobes.c b/arch/powerpc/kernel/kprobes.c index e848fe2c93fb..db40b13fd3d1 100644 --- a/arch/powerpc/kernel/kprobes.c +++ b/arch/powerpc/kernel/kprobes.c @@ -45,8 +45,12 @@ struct kretprobe_blackpoint kretprobe_blacklist[] = {{NULL, NULL}}; int is_current_kprobe_addr(unsigned long addr) { - struct kprobe *p = kprobe_running(); - return (p && (unsigned long)p->addr == addr) ? 1 : 0; + if (!preemptible()) { + struct kprobe *p = raw_cpu_read(current_kprobe); + return (p && (unsigned long)p->addr == addr) ? 1 : 0; + } + + return 0; } bool arch_within_kprobe_blacklist(unsigned long addr)
Re: [PATCH 2/5] powerpc/kprobes: Do not suppress instruction emulation if a single run failed
On Thu, 14 Sep 2017 12:08:07 +0530 "Naveen N. Rao" wrote: > On 2017/09/13 04:53PM, Masami Hiramatsu wrote: > > On Thu, 14 Sep 2017 02:50:33 +0530 > > "Naveen N. Rao" wrote: > > > > > Currently, we disable instruction emulation if emulate_step() fails for > > > any reason. However, such failures could be transient and specific to a > > > particular run. Instead, only disable instruction emulation if we have > > > never been able to emulate this. If we had emulated this instruction > > > successfully at least once, then we single step only this probe hit and > > > continue to try emulating the instruction in subsequent probe hits. > > > > Hmm, would this mean that the instruction is emulatable or not depends > > on context? What kind of situation is considerable? > > Yes, as an example, a load/store instruction can cause exceptions > depending on the address. In some of those cases, we will have to single > step the instruction, but we will be able to emulate in most scenarios. OK, I got it. Could you add this example as comment in the code so that readers can easily understand? Thank you, > > Thanks for the review! > - Naveen > -- Masami Hiramatsu
Re: [PATCH v3 04/20] mm: VMA sequence count
On (09/14/17 11:15), Laurent Dufour wrote: > On 14/09/2017 11:11, Sergey Senozhatsky wrote: > > On (09/14/17 10:58), Laurent Dufour wrote: > > [..] > >> That's right, but here this is the sequence counter mm->mm_seq, not the > >> vm_seq one. > > > > d'oh... you are right. > > So I'm doubting about the probability of a deadlock here, but I don't like > to see lockdep complaining. Is there an easy way to make it happy ? /* * well... answering your question - it seems raw versions of seqcount * functions don't call lockdep's lock_acquire/lock_release... * * but I have never told you that. never. */ lockdep, perhaps, can be wrong sometimes, and may be it's one of those cases. may be not... I'm not a MM guy myself. below is a lockdep splat I got yesterday. that's v3 of SPF patch set. [ 2763.365898] == [ 2763.365899] WARNING: possible circular locking dependency detected [ 2763.365902] 4.13.0-next-20170913-dbg-00039-ge3c06ea4b028-dirty #1837 Not tainted [ 2763.365903] -- [ 2763.365905] khugepaged/42 is trying to acquire lock: [ 2763.365906] (&mapping->i_mmap_rwsem){}, at: [] rmap_walk_file+0x5a/0x142 [ 2763.365913] but task is already holding lock: [ 2763.365915] (fs_reclaim){+.+.}, at: [] fs_reclaim_acquire+0x12/0x35 [ 2763.365920] which lock already depends on the new lock. [ 2763.365922] the existing dependency chain (in reverse order) is: [ 2763.365924] -> #3 (fs_reclaim){+.+.}: [ 2763.365930]lock_acquire+0x176/0x19e [ 2763.365932]fs_reclaim_acquire+0x32/0x35 [ 2763.365934]__alloc_pages_nodemask+0x6d/0x1f9 [ 2763.365937]pte_alloc_one+0x17/0x62 [ 2763.365940]__pte_alloc+0x1f/0x83 [ 2763.365943]move_page_tables+0x2c3/0x5a2 [ 2763.365944]move_vma.isra.25+0xff/0x29f [ 2763.365946]SyS_mremap+0x41b/0x49e [ 2763.365949]entry_SYSCALL_64_fastpath+0x18/0xad [ 2763.365951] -> #2 (&vma->vm_sequence/1){+.+.}: [ 2763.365955]lock_acquire+0x176/0x19e [ 2763.365958]write_seqcount_begin_nested+0x1b/0x1d [ 2763.365959]__vma_adjust+0x1c4/0x5f1 [ 2763.365961]__split_vma+0x12c/0x181 [ 2763.365963]do_munmap+0x128/0x2af [ 2763.365965]vm_munmap+0x5a/0x73 [ 2763.365968]elf_map+0xb1/0xce [ 2763.365970]load_elf_binary+0x91e/0x137a [ 2763.365973]search_binary_handler+0x70/0x1f3 [ 2763.365974]do_execveat_common+0x45e/0x68e [ 2763.365978]call_usermodehelper_exec_async+0xf7/0x11f [ 2763.365980]ret_from_fork+0x27/0x40 [ 2763.365981] -> #1 (&vma->vm_sequence){+.+.}: [ 2763.365985]lock_acquire+0x176/0x19e [ 2763.365987]write_seqcount_begin_nested+0x1b/0x1d [ 2763.365989]__vma_adjust+0x1a9/0x5f1 [ 2763.365991]__split_vma+0x12c/0x181 [ 2763.365993]do_munmap+0x128/0x2af [ 2763.365994]vm_munmap+0x5a/0x73 [ 2763.365996]elf_map+0xb1/0xce [ 2763.365998]load_elf_binary+0x91e/0x137a [ 2763.365999]search_binary_handler+0x70/0x1f3 [ 2763.366001]do_execveat_common+0x45e/0x68e [ 2763.366003]call_usermodehelper_exec_async+0xf7/0x11f [ 2763.366005]ret_from_fork+0x27/0x40 [ 2763.366006] -> #0 (&mapping->i_mmap_rwsem){}: [ 2763.366010]__lock_acquire+0xa72/0xca0 [ 2763.366012]lock_acquire+0x176/0x19e [ 2763.366015]down_read+0x3b/0x55 [ 2763.366017]rmap_walk_file+0x5a/0x142 [ 2763.366018]page_referenced+0xfc/0x134 [ 2763.366022]shrink_active_list+0x1ac/0x37d [ 2763.366024]shrink_node_memcg.constprop.72+0x3ca/0x567 [ 2763.366026]shrink_node+0x3f/0x14c [ 2763.366028]try_to_free_pages+0x288/0x47a [ 2763.366030]__alloc_pages_slowpath+0x3a7/0xa49 [ 2763.366032]__alloc_pages_nodemask+0xf1/0x1f9 [ 2763.366035]khugepaged+0xc8/0x167c [ 2763.366037]kthread+0x133/0x13b [ 2763.366039]ret_from_fork+0x27/0x40 [ 2763.366040] other info that might help us 
debug this: [ 2763.366042] Chain exists of: &mapping->i_mmap_rwsem --> &vma->vm_sequence/1 --> fs_reclaim [ 2763.366048] Possible unsafe locking scenario: [ 2763.366049]CPU0CPU1 [ 2763.366050] [ 2763.366051] lock(fs_reclaim); [ 2763.366054]lock(&vma->vm_sequence/1); [ 2763.366056]lock(fs_reclaim); [ 2763.366058] lock(&mapping->i_mmap_rwsem); [ 2763.366061] *** DEADLOCK *** [ 2763.366063] 1 lock held by khugepaged/42: [ 2763.366064] #0: (fs_reclaim){+.+.}, at: [] fs_reclaim_acquire+0x12/0x35 [ 2763.366068] stack backtrace: [ 2763.366071] CPU: 2 PID: 42 Comm: khugepaged Not tainted 4.13.0-next-20170913-dbg-000
RE: [PATCH kernel] powerpc/powernv: Update comment about shifting IOV BAR
From: Benjamin Herrenschmidt > Sent: 14 September 2017 04:40 > On Thu, 2017-09-14 at 13:18 +1000, Alexey Kardashevskiy wrote: > > On 14/09/17 13:07, Benjamin Herrenschmidt wrote: > > > On Thu, 2017-09-14 at 12:45 +1000, Alexey Kardashevskiy wrote: > > > > On 31/08/17 13:34, Alexey Kardashevskiy wrote: > > > > > From: Benjamin Herrenschmidt > > > > > > > > Oops, this was not right :) > > > > > > > > Anyway, Ben, please comment. Thanks. > > > > > > This is incorrect, we can do hotplug behind switches afaik. > > > > Do we have an actual system which allows this? > > Tuleta no ? You can logically 'hotplug' PCI(e) on any system [1]. The 'problem' is that whatever enumerates the PCI(e) at system powerup doesn't normally assign extra resources to bridges to allow for devices that aren't present at boot time. So you can normally only replace cards with ones that use the same (or less) resources, or that are not behind any bridges. This is problematic if you have a docking station connected via a bridge. [1] Apart from some annoying x86 Dell servers we have which generate an NMI when the PCIe link goes down (when we reprogram the fpga). They also fail to boot if a link doesn't come up... David
Re: [PATCH v3] Make initramfs honor CONFIG_DEVTMPFS_MOUNT
On 14/09/2017 at 01:51, Rob Landley wrote: From: Rob Landley Make initramfs honor CONFIG_DEVTMPFS_MOUNT, and move /dev/console open after devtmpfs mount. Add workaround for Debian bug that was copied by Ubuntu. Is that a bug only for Debian? Why? Why should a Debian bug be fixed by a workaround in the mainline kernel? Signed-off-by: Rob Landley --- v2 discussion: http://lkml.iu.edu/hypermail/linux/kernel/1705.2/05611.html drivers/base/Kconfig | 14 -- fs/namespace.c | 14 ++ init/main.c | 15 +-- 3 files changed, 27 insertions(+), 16 deletions(-) diff --git a/drivers/base/Kconfig b/drivers/base/Kconfig index f046d21..97352d4 100644 --- a/drivers/base/Kconfig +++ b/drivers/base/Kconfig @@ -48,16 +48,10 @@ config DEVTMPFS_MOUNT bool "Automount devtmpfs at /dev, after the kernel mounted the rootfs" depends on DEVTMPFS help - This will instruct the kernel to automatically mount the - devtmpfs filesystem at /dev, directly after the kernel has - mounted the root filesystem. The behavior can be overridden - with the commandline parameter: devtmpfs.mount=0|1. - This option does not affect initramfs based booting, here - the devtmpfs filesystem always needs to be mounted manually - after the rootfs is mounted. - With this option enabled, it allows to bring up a system in - rescue mode with init=/bin/sh, even when the /dev directory - on the rootfs is completely empty. + Automatically mount devtmpfs at /dev on the root filesystem, which + lets the system to come up in rescue mode with [rd]init=/bin/sh. + Override with devtmpfs.mount=0 on the commandline. Initramfs can + create a /dev dir as needed, other rootfs needs the mount point. Why modify the initial text? And why talk about rescue mode only, when this feature mainly concerns the standard mode? config STANDALONE bool "Select only drivers that don't need compile-time external firmware" diff --git a/fs/namespace.c b/fs/namespace.c index f8893dc..06057d7 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -2417,7 +2417,21 @@ static int do_add_mount(struct mount *newmnt, struct path *path, int mnt_flags) err = -EBUSY; if (path->mnt->mnt_sb == newmnt->mnt.mnt_sb && path->mnt->mnt_root == path->dentry) + { + if (IS_ENABLED(CONFIG_DEVTMPFS_MOUNT) && + !strcmp(path->mnt->mnt_sb->s_type->name, "devtmpfs")) + { + /* Debian's kernel config enables DEVTMPFS_MOUNT, then + its initramfs setup script tries to mount devtmpfs + again, and if the second mount-over-itself fails + the script overmounts a tmpfs on /dev to hide the + existing contents, then boot fails with empty /dev. */ Does it matter for the kernel code what Debian's kernel config does? + printk(KERN_WARNING "Debian bug workaround for devtmpfs overmount."); Is this log message worth it when this modification goes into the mainline kernel? If so, pr_err() should be used instead. + + err = 0; + } goto unlock; + } err = -EINVAL; if (d_is_symlink(newmnt->mnt.mnt_root)) diff --git a/init/main.c b/init/main.c index 0ee9c686..0d8e5ec 100644 --- a/init/main.c +++ b/init/main.c @@ -1065,12 +1065,6 @@ static noinline void __init kernel_init_freeable(void) do_basic_setup(); - /* Open the /dev/console on the rootfs, this should never fail */ - if (sys_open((const char __user *) "/dev/console", O_RDWR, 0) < 0) - pr_err("Warning: unable to open an initial console.\n"); - - (void) sys_dup(0); - (void) sys_dup(0); /* * check if there is an early userspace init. 
If yes, let it do all * the work @@ -1082,8 +1076,17 @@ static noinline void __init kernel_init_freeable(void) if (sys_access((const char __user *) ramdisk_execute_command, 0) != 0) { ramdisk_execute_command = NULL; prepare_namespace(); + } else if (IS_ENABLED(CONFIG_DEVTMPFS_MOUNT)) { + sys_mkdir("/dev", 0755); Why not, but couldn't we also expect the initramfs to already contain that mount point? + devtmpfs_mount("/dev"); } + /* Open the /dev/console on the rootfs, this should never fail */ + if (sys_open((const char __user *) "/dev/console", O_RDWR, 0) < 0) + pr_err("Warning: unable to open an initial console.\n"); + (void) sys_dup(0); + (void) sys_dup(0); + /* * Ok, we have completed the initial bootup, and * we're essentially up and running. Get rid of the Christophe
Re: [PATCH v3 04/20] mm: VMA sequence count
On 14/09/2017 11:11, Sergey Senozhatsky wrote: > On (09/14/17 10:58), Laurent Dufour wrote: > [..] >> That's right, but here this is the sequence counter mm->mm_seq, not the >> vm_seq one. > > d'oh... you are right. So I doubt a deadlock can really happen here, but I don't like to see lockdep complaining. Is there an easy way to make it happy?
Re: [PATCH v3 04/20] mm: VMA sequence count
On (09/14/17 10:58), Laurent Dufour wrote: [..] > That's right, but here this is the sequence counter mm->mm_seq, not the > vm_seq one. d'oh... you are right. -ss
Re: [PATCH v3 04/20] mm: VMA sequence count
On 14/09/2017 10:13, Sergey Senozhatsky wrote: > Hi, > > On (09/14/17 09:55), Laurent Dufour wrote: > [..] >>> so if there are two CPUs, one doing write_seqcount() and the other one >>> doing read_seqcount() then what can happen is something like this >>> >>> CPU0CPU1 >>> >>> fs_reclaim_acquire() >>> write_seqcount_begin() >>> fs_reclaim_acquire()read_seqcount_begin() >>> write_seqcount_end() >>> >>> CPU0 can't write_seqcount_end() because of fs_reclaim_acquire() from >>> CPU1, CPU1 can't read_seqcount_begin() because CPU0 did >>> write_seqcount_begin() >>> and now waits for fs_reclaim_acquire(). makes sense? >> >> Yes, this makes sense. >> >> But in the case of this series, there is no call to >> __read_seqcount_begin(), and the reader (the speculative page fault >> handler), is just checking for (vm_seq & 1) and if this is true, simply >> exit the speculative path without waiting. >> So there is no deadlock possibility. > > probably lockdep just knows that those locks interleave at some > point. > > > by the way, I think there is one path that can spin > > find_vma_srcu() > read_seqbegin() > read_seqcount_begin() >raw_read_seqcount_begin() > __read_seqcount_begin() That's right, but here this is the sequence counter mm->mm_seq, not the vm_seq one. This one is held to protect walking the VMA list "locklessly"... Cheers, Laurent.
Re: [PATCH 4/7] powerpc: Free up four 64K PTE bits in 64K backed HPTE pages
On Fri, 2017-09-08 at 15:44 -0700, Ram Pai wrote: > The second part of the PTE will hold > (H_PAGE_F_SECOND|H_PAGE_F_GIX) at bit 60,61,62,63. > NOTE: None of the bits in the secondary PTE were used > by 64k-HPTE backed PTEs. Have you measured the performance impact of this? With the second part of the PTE being in a different cache line, there could be one... Cheers, Ben.
Re: [PATCH v3 04/20] mm: VMA sequence count
Hi, On (09/14/17 09:55), Laurent Dufour wrote: [..] > > so if there are two CPUs, one doing write_seqcount() and the other one > > doing read_seqcount() then what can happen is something like this > > > > CPU0CPU1 > > > > fs_reclaim_acquire() > > write_seqcount_begin() > > fs_reclaim_acquire()read_seqcount_begin() > > write_seqcount_end() > > > > CPU0 can't write_seqcount_end() because of fs_reclaim_acquire() from > > CPU1, CPU1 can't read_seqcount_begin() because CPU0 did > > write_seqcount_begin() > > and now waits for fs_reclaim_acquire(). makes sense? > > Yes, this makes sense. > > But in the case of this series, there is no call to > __read_seqcount_begin(), and the reader (the speculative page fault > handler), is just checking for (vm_seq & 1) and if this is true, simply > exit the speculative path without waiting. > So there is no deadlock possibility. probably lockdep just knows that those locks interleave at some point. by the way, I think there is one path that can spin find_vma_srcu() read_seqbegin() read_seqcount_begin() raw_read_seqcount_begin() __read_seqcount_begin() -ss
Re: [PATCH 02/25] powerpc: define an additional vma bit for protection keys.
On Thu, 2017-09-14 at 14:38 +1000, Balbir Singh wrote: > On Fri, 8 Sep 2017 15:44:50 -0700 > Ram Pai wrote: > > > powerpc needs an additional vma bit to support 32 keys. > > Till the additional vma bit lands in include/linux/mm.h > > we have to define it in powerpc specific header file. > > This is needed to get pkeys working on power. > > > > Signed-off-by: Ram Pai > > --- > > "This" being an arch specific hack for the additional bit? Arch VMA bits? Really? I'd rather we limit ourselves to 16 keys first, then push for adding the extra bit to the generic code. Ben.
Re: [PATCH v3 04/20] mm: VMA sequence count
Hi, On 14/09/2017 02:31, Sergey Senozhatsky wrote: > Hi, > > On (09/13/17 18:56), Laurent Dufour wrote: >> Hi Sergey, >> >> On 13/09/2017 13:53, Sergey Senozhatsky wrote: >>> Hi, >>> >>> On (09/08/17 20:06), Laurent Dufour wrote: > [..] >>> ok, so what I got on my box is: >>> >>> vm_munmap() -> down_write_killable(&mm->mmap_sem) >>> do_munmap() >>> __split_vma() >>>__vma_adjust() -> write_seqcount_begin(&vma->vm_sequence) >>>-> write_seqcount_begin_nested(&next->vm_sequence, >>> SINGLE_DEPTH_NESTING) >>> >>> so this gives 3 dependencies ->mmap_sem -> ->vm_seq >>> ->vm_seq -> ->vm_seq/1 >>> ->mmap_sem -> ->vm_seq/1 >>> >>> >>> SyS_mremap() -> down_write_killable(¤t->mm->mmap_sem) >>> move_vma() -> write_seqcount_begin(&vma->vm_sequence) >>> -> write_seqcount_begin_nested(&new_vma->vm_sequence, >>> SINGLE_DEPTH_NESTING); >>> move_page_tables() >>>__pte_alloc() >>> pte_alloc_one() >>> __alloc_pages_nodemask() >>> fs_reclaim_acquire() >>> >>> >>> I think here we have prepare_alloc_pages() call, that does >>> >>> -> fs_reclaim_acquire(gfp_mask) >>> -> fs_reclaim_release(gfp_mask) >>> >>> so that adds one more dependency ->mmap_sem -> ->vm_seq-> >>> fs_reclaim >>> ->mmap_sem -> ->vm_seq/1 -> >>> fs_reclaim >>> >>> >>> now, under memory pressure we hit the slow path and perform direct >>> reclaim. direct reclaim is done under fs_reclaim lock, so we end up >>> with the following call chain >>> >>> __alloc_pages_nodemask() >>> __alloc_pages_slowpath() >>> __perform_reclaim() -> fs_reclaim_acquire(gfp_mask); >>>try_to_free_pages() >>> shrink_node() >>> shrink_active_list() >>> rmap_walk_file() -> i_mmap_lock_read(mapping); >>> >>> >>> and this break the existing dependency. since we now take the leaf lock >>> (fs_reclaim) first and the the root lock (->mmap_sem). >> >> Thanks for looking at this. >> I'm sorry, I should have miss something. > > no prob :) > > >> My understanding is that there are 2 chains of locks: >> 1. from __vma_adjust() mmap_sem -> i_mmap_rwsem -> vm_seq >> 2. from move_vmap() mmap_sem -> vm_seq -> fs_reclaim >> 2. from __alloc_pages_nodemask() fs_reclaim -> i_mmap_rwsem > > yes, as far as lockdep warning suggests. > >> So the solution would be to have in __vma_adjust() >> mmap_sem -> vm_seq -> i_mmap_rwsem >> >> But this will raised the following dependency from unmap_mapping_range() >> unmap_mapping_range()-> i_mmap_rwsem >> unmap_mapping_range_tree() >> unmap_mapping_range_vma() >>zap_page_range_single() >> unmap_single_vma() >> unmap_page_range() -> vm_seq >> >> And there is no way to get rid of it easily as in unmap_mapping_range() >> there is no VMA identified yet. >> >> That's being said I can't see any clear way to get lock dependency cleaned >> here. >> Furthermore, this is not clear to me how a deadlock could happen as vm_seq >> is a sequence lock, and there is no way to get blocked here. > > as far as I understand, >seq locks can deadlock, technically. 
not on the write() side, but on the read() side: > > read_seqcount_begin() > raw_read_seqcount_begin() >__read_seqcount_begin() > > and __read_seqcount_begin() spins for ever > >__read_seqcount_begin() >{ > repeat: > ret = READ_ONCE(s->sequence); > if (unlikely(ret & 1)) { > cpu_relax(); > goto repeat; > } > return ret; >} > > > so if there are two CPUs, one doing write_seqcount() and the other one > doing read_seqcount() then what can happen is something like this > > CPU0CPU1 > > fs_reclaim_acquire() > write_seqcount_begin() > fs_reclaim_acquire()read_seqcount_begin() > write_seqcount_end() > > CPU0 can't write_seqcount_end() because of fs_reclaim_acquire() from > CPU1, CPU1 can't read_seqcount_begin() because CPU0 did write_seqcount_begin() > and now waits for fs_reclaim_acquire(). makes sense? Yes, this makes sense. But in the case of this series, there is no call to __read_seqcount_begin(), and the reader (the speculative page fault handler) just checks for (vm_seq & 1); if this is true, it simply exits the speculative path without waiting. So there is no deadlock possibility. The bad case would be to have 2 concurrent threads calling write_seqcount_begin() on the same VMA, leading to an inconsistent sequence count, but this can't happen because the mmap_sem is held for write in such a case. Cheers, Laurent.
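To make the non-blocking reader concrete, the speculative path described above behaves roughly like the sketch below (the function name spf_try_fault is illustrative; vm_sequence is the seqcount this series adds to struct vm_area_struct). The reader never spins on an odd count; it simply retreats to the classic mmap_sem-protected fault path:

static int spf_try_fault(struct vm_area_struct *vma, unsigned long address)
{
        unsigned int seq;

        seq = raw_read_seqcount(&vma->vm_sequence);
        if (seq & 1)
                return VM_FAULT_RETRY; /* writer active: bail out, don't wait */

        /* ... speculative fault work, done without mmap_sem (elided) ... */

        if (read_seqcount_retry(&vma->vm_sequence, seq))
                return VM_FAULT_RETRY; /* the VMA changed under us: bail out */

        return 0;
}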