Re: [PATCH 5/9] kprobes/ftrace: Add recursion protection to the ftrace callback

2020-11-01 Thread Masami Hiramatsu
On Thu, 29 Oct 2020 09:40:01 -0400
Steven Rostedt  wrote:

> On Thu, 29 Oct 2020 16:58:03 +0900
> Masami Hiramatsu  wrote:
> 
> > Hi Steve,
> > 
> > On Wed, 28 Oct 2020 07:52:49 -0400
> > Steven Rostedt  wrote:
> > 
> > > From: "Steven Rostedt (VMware)" 
> > > 
> > > If a ftrace callback does not supply its own recursion protection and
> > > does not set the RECURSION_SAFE flag in its ftrace_ops, then ftrace will
> > > make a helper trampoline to do so before calling the callback instead of
> > > just calling the callback directly.  
> > 
> > So in that case the handlers will be called without preempt disabled?
> > 
> > 
> > > The default for ftrace_ops is going to assume recursion protection unless
> > > otherwise specified.  
> > 
> > This seems to skip the entire handler if ftrace finds recursion.
> > I would like to increment the missed counter even in that case.
> 
> Note, this code does not change the functionality at this point, because
> without having the FL_RECURSION flag set (which kprobes does not set even
> in this patch), it always gets called from the helper function that does
> this:
> 
>   bit = trace_test_and_set_recursion(TRACE_LIST_START, TRACE_LIST_MAX);
>   if (bit < 0)
>   return;
> 
>   preempt_disable_notrace();
> 
>   op->func(ip, parent_ip, op, regs);
> 
>   preempt_enable_notrace();
>   trace_clear_recursion(bit);
> 
> Where this function gets called by op->func().
> 
> In other words, you don't get that count anyway, and I don't think you want
> it. Because it means you traced something that your callback calls.

Got it. So incrementing the nmissed count would be an improvement.

> 
> That bit check is basically a nop, because the last patch in this series
> will make recursion protection the default for everything, but at this
> point in the series the test does this:
> 
>   /* A previous recursion check was made */
>   if ((val & TRACE_CONTEXT_MASK) > max)
>   return 0;
> 
> That condition is always true, because this function is called via the
> helper that already did the trace_test_and_set_recursion(); if it made it
> this far, val is always greater than max.

OK, let me check the last patch too.

> 
> > 
> > [...]
> > e.g.
> > 
> > > diff --git a/arch/csky/kernel/probes/ftrace.c b/arch/csky/kernel/probes/ftrace.c
> > > index 5264763d05be..5eb2604fdf71 100644
> > > --- a/arch/csky/kernel/probes/ftrace.c
> > > +++ b/arch/csky/kernel/probes/ftrace.c
> > > @@ -13,16 +13,21 @@ int arch_check_ftrace_location(struct kprobe *p)
> > >  void kprobe_ftrace_handler(unsigned long ip, unsigned long parent_ip,
> > >  struct ftrace_ops *ops, struct pt_regs *regs)
> > >  {
> > > + int bit;
> > >   bool lr_saver = false;
> > >   struct kprobe *p;
> > >   struct kprobe_ctlblk *kcb;
> > >  
> > > - /* Preempt is disabled by ftrace */
> > > + bit = ftrace_test_recursion_trylock();  
> > 
> > > +
> > > + preempt_disable_notrace();
> > >   p = get_kprobe((kprobe_opcode_t *)ip);
> > >   if (!p) {
> > >   p = get_kprobe((kprobe_opcode_t *)(ip - MCOUNT_INSN_SIZE));
> > >   if (unlikely(!p) || kprobe_disabled(p))
> > > - return;
> > > + goto out;
> > >   lr_saver = true;
> > >   }  
> > 
> > if (bit < 0) {
> > kprobes_inc_nmissed_count(p);
> > goto out;
> > }
> 
> If anything called in get_kprobe() or kprobes_inc_nmissed_count() gets
> traced here, you have zero recursion protection, and this will crash the
> machine with a likely reboot (triple fault).

Oops, ok, those can be traced. 

> 
> Note, the recursion check handles interrupts and won't stop them. bit < 0
> only happens if you recurse because this function called something that
> ends up calling itself. Really, why would you care about missing a kprobe
> on the same kprobe?

Usually, software-breakpoint based kprobes will count that case. Moreover,
all kprobes share a single ftrace_ops. I guess that means any kprobe hit
while another kprobe handler is running (e.g. a recursive call inside a
kprobe pre_handler) will be skipped by ftrace_test_recursion_trylock(),
is that correct?
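
To make sure I follow, a minimal sketch of the ordering you describe
(assuming the ftrace_test_recursion_trylock()/unlock() helpers this
series introduces) would be:

void kprobe_ftrace_handler(unsigned long ip, unsigned long parent_ip,
			   struct ftrace_ops *ops, struct pt_regs *regs)
{
	struct kprobe *p;
	int bit;

	/* Must come first: even get_kprobe() may itself be traced. */
	bit = ftrace_test_recursion_trylock();
	if (bit < 0)
		return;	/* recursed; nothing is safe to call, drop the hit */

	preempt_disable_notrace();
	p = get_kprobe((kprobe_opcode_t *)ip);
	if (p && !kprobe_disabled(p)) {
		/* ... invoke pre/post handlers here ... */
	}
	preempt_enable_notrace();
	ftrace_test_recursion_unlock(bit);
}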

Thank you,

> 
> -- Steve


-- 
Masami Hiramatsu 


Re: [patch V2 00/18] mm/highmem: Preemptible variant of kmap_atomic & friends

2020-11-01 Thread Thomas Gleixner
On Thu, Oct 29 2020 at 23:18, Thomas Gleixner wrote:
>
> There is also a still to be investigated question from Linus on the initial
> posting versus the per cpu / per task mapping stack depth which might need
> to be made larger due to the ability to take page faults within a mapping
> region.

I looked deeper into that and we have a stack depth of 20. That's plenty:
I couldn't find a way to get above 10 nested maps including faults,
interrupts and softirqs. With some stress testing I was not able to get
over a maximum of 6 according to the trace_printk() I added.

For some obscure reason when CONFIG_DEBUG_HIGHMEM is enabled the stack
depth is increased from 20 to 41. But the only thing DEBUG_HIGHMEM does
is to enable a few BUG_ON()'s in the mapping code.

That's a leftover from the historical mapping code which had fixed
entries for various purposes. DEBUG_HIGHMEM inserted guard mappings
between the map types. But that all got ditched when kmap_atomic()
switched to stack based map management. The WITH_KM_FENCE magic survived
without being functional; all it does today is increase the stack depth.

I just made that functional again by keeping the stack depth increase
and utilizing every second slot. That should catch Willy's mapping
problem nicely if he bothers to test on 32bit :)
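
Roughly like this (a sketch of the idea, not the final patch; the names
follow the per-task kmap_ctrl stack introduced in this series):

/* With DEBUG_HIGHMEM use every second slot as a guard: an overrun of a
 * mapping then faults on an unmapped guard slot instead of silently
 * clobbering the neighboring entry.
 */
#ifdef CONFIG_DEBUG_HIGHMEM
# define KM_INCR	2
#else
# define KM_INCR	1
#endif

static inline int kmap_local_idx_push(void)
{
	current->kmap_ctrl.idx += KM_INCR;
	BUG_ON(current->kmap_ctrl.idx >= KM_MAX_IDX);
	return current->kmap_ctrl.idx - 1;
}

static inline void kmap_local_idx_pop(void)
{
	current->kmap_ctrl.idx -= KM_INCR;
	BUG_ON(current->kmap_ctrl.idx < 0);
}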

Thanks,

tglx



[PATCH v2 net-next 3/3] crypto: caam: Replace in_irq() usage.

2020-11-01 Thread Sebastian Andrzej Siewior
The driver uses in_irq() + in_serving_softirq() magic to decide whether
NAPI scheduling is required or the packet can be processed directly.

The usage of in_*() in drivers is being phased out and Linus clearly
requested that code which changes behaviour depending on context should
either be separated or the context be conveyed in an argument passed by
the caller, which usually knows the context.

Use the `sched_napi' argument passed to the callback. It is set to true
if called from the interrupt handler, in which case NAPI should be
scheduled.

Signed-off-by: Sebastian Andrzej Siewior 
Cc: "Horia Geantă" 
Cc: Aymen Sghaier 
Cc: Herbert Xu 
Cc: "David S. Miller" 
Cc: Madalin Bucur 
Cc: Jakub Kicinski 
Cc: Li Yang 
Cc: linux-cry...@vger.kernel.org
Cc: net...@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-arm-ker...@lists.infradead.org
---
 drivers/crypto/caam/qi.c | 12 
 1 file changed, 4 insertions(+), 8 deletions(-)

diff --git a/drivers/crypto/caam/qi.c b/drivers/crypto/caam/qi.c
index 57f6ab6bfb56a..8163f5df8ebf7 100644
--- a/drivers/crypto/caam/qi.c
+++ b/drivers/crypto/caam/qi.c
@@ -545,14 +545,10 @@ static void cgr_cb(struct qman_portal *qm, struct qman_cgr *cgr, int congested)
}
 }
 
-static int caam_qi_napi_schedule(struct qman_portal *p, struct caam_napi *np)
+static int caam_qi_napi_schedule(struct qman_portal *p, struct caam_napi *np,
+bool sched_napi)
 {
-   /*
-* In case of threaded ISR, for RT kernels in_irq() does not return
-* appropriate value, so use in_serving_softirq to distinguish between
-* softirq and irq contexts.
-*/
-   if (unlikely(in_irq() || !in_serving_softirq())) {
+   if (sched_napi) {
/* Disable QMan IRQ source and invoke NAPI */
qman_p_irqsource_remove(p, QM_PIRQ_DQRI);
np->p = p;
@@ -574,7 +570,7 @@ static enum qman_cb_dqrr_result caam_rsp_fq_dqrr_cb(struct qman_portal *p,
struct caam_drv_private *priv = dev_get_drvdata(qidev);
u32 status;
 
-   if (caam_qi_napi_schedule(p, caam_napi))
+   if (caam_qi_napi_schedule(p, caam_napi, sched_napi))
return qman_cb_dqrr_stop;
 
fd = &dqrr->fd;
-- 
2.29.1



[PATCH v2 net-next 2/3] net: dpaa: Replace in_irq() usage.

2020-11-01 Thread Sebastian Andrzej Siewior
The driver uses in_irq() + in_serving_softirq() magic to decide whether
NAPI scheduling is required or the packet can be processed directly.

The usage of in_*() in drivers is being phased out and Linus clearly
requested that code which changes behaviour depending on context should
either be separated or the context be conveyed in an argument passed by
the caller, which usually knows the context.

Use the `sched_napi' argument passed to the callback. It is set to true
if called from the interrupt handler, in which case NAPI should be
scheduled.

Signed-off-by: Sebastian Andrzej Siewior 
Cc: "Horia Geantă" 
Cc: Aymen Sghaier 
Cc: Herbert Xu 
Cc: "David S. Miller" 
Cc: Madalin Bucur 
Cc: Jakub Kicinski 
Cc: Li Yang 
Cc: linux-cry...@vger.kernel.org
Cc: net...@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-arm-ker...@lists.infradead.org
---
 drivers/net/ethernet/freescale/dpaa/dpaa_eth.c | 12 ++--
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/freescale/dpaa/dpaa_eth.c b/drivers/net/ethernet/freescale/dpaa/dpaa_eth.c
index 98ead77c673e5..39c996b64ccec 100644
--- a/drivers/net/ethernet/freescale/dpaa/dpaa_eth.c
+++ b/drivers/net/ethernet/freescale/dpaa/dpaa_eth.c
@@ -2300,9 +2300,9 @@ static void dpaa_tx_conf(struct net_device *net_dev,
 }
 
 static inline int dpaa_eth_napi_schedule(struct dpaa_percpu_priv *percpu_priv,
-struct qman_portal *portal)
+struct qman_portal *portal, bool sched_napi)
 {
-   if (unlikely(in_irq() || !in_serving_softirq())) {
+   if (sched_napi) {
/* Disable QMan IRQ and invoke NAPI */
qman_p_irqsource_remove(portal, QM_PIRQ_DQRI);
 
@@ -2333,7 +2333,7 @@ static enum qman_cb_dqrr_result rx_error_dqrr(struct 
qman_portal *portal,
 
percpu_priv = this_cpu_ptr(priv->percpu_priv);
 
-   if (dpaa_eth_napi_schedule(percpu_priv, portal))
+   if (dpaa_eth_napi_schedule(percpu_priv, portal, sched_napi))
return qman_cb_dqrr_stop;
 
dpaa_eth_refill_bpools(priv);
@@ -2377,7 +2377,7 @@ static enum qman_cb_dqrr_result rx_default_dqrr(struct qman_portal *portal,
percpu_priv = this_cpu_ptr(priv->percpu_priv);
percpu_stats = &percpu_priv->stats;
 
-   if (unlikely(dpaa_eth_napi_schedule(percpu_priv, portal)))
+   if (unlikely(dpaa_eth_napi_schedule(percpu_priv, portal, sched_napi)))
return qman_cb_dqrr_stop;
 
/* Make sure we didn't run out of buffers */
@@ -2474,7 +2474,7 @@ static enum qman_cb_dqrr_result conf_error_dqrr(struct qman_portal *portal,
 
percpu_priv = this_cpu_ptr(priv->percpu_priv);
 
-   if (dpaa_eth_napi_schedule(percpu_priv, portal))
+   if (dpaa_eth_napi_schedule(percpu_priv, portal, sched_napi))
return qman_cb_dqrr_stop;
 
dpaa_tx_error(net_dev, priv, percpu_priv, &dq->fd, fq->fqid);
@@ -2499,7 +2499,7 @@ static enum qman_cb_dqrr_result conf_dflt_dqrr(struct qman_portal *portal,
 
percpu_priv = this_cpu_ptr(priv->percpu_priv);
 
-   if (dpaa_eth_napi_schedule(percpu_priv, portal))
+   if (dpaa_eth_napi_schedule(percpu_priv, portal, sched_napi))
return qman_cb_dqrr_stop;
 
dpaa_tx_conf(net_dev, priv, percpu_priv, &dq->fd, fq->fqid);
-- 
2.29.1



[PATCH v2 net-next 1/3] soc/fsl/qbman: Add an argument to signal if NAPI processing is required.

2020-11-01 Thread Sebastian Andrzej Siewior
dpaa_eth_napi_schedule() and caam_qi_napi_schedule() schedule NAPI if
invoked from:

 - Hard interrupt context
 - Any context which is not serving soft interrupts

Any context which is not serving soft interrupts includes hard interrupts
so the in_irq() check is redundant. caam_qi_napi_schedule() has a comment
about this:

/*
 * In case of threaded ISR, for RT kernels in_irq() does not return
 * appropriate value, so use in_serving_softirq to distinguish between
 * softirq and irq contexts.
 */
 if (in_irq() || !in_serving_softirq())

This has nothing to do with RT. Even on a non-RT kernel, force-threaded
interrupts obviously run in thread context, and therefore in_irq() returns
false when invoked from the handler.

The extension of the in_irq() check with !in_serving_softirq() was there
when the drivers were added, but in the out-of-tree FSL BSP the original
condition was in_irq(), which got extended due to failures on RT.

The usage of in_xxx() in drivers is being phased out and Linus clearly
requested that code which changes behaviour depending on context should
either be separated or the context be conveyed in an argument passed by
the caller, which usually knows the context. Right he is: the above
construct clearly shows why.

The following callchains have been analyzed to end up in
dpaa_eth_napi_schedule():

qman_p_poll_dqrr()
  __poll_portal_fast()
fq->cb.dqrr()
   dpaa_eth_napi_schedule()

portal_isr()
  __poll_portal_fast()
fq->cb.dqrr()
   dpaa_eth_napi_schedule()

Both need to schedule NAPI.
The crypto part has another code path leading up to this:
  kill_fq()
 empty_retired_fq()
   qman_p_poll_dqrr()
 __poll_portal_fast()
fq->cb.dqrr()
   dpaa_eth_napi_schedule()

kill_fq() is called from task context and ends up scheduling NAPI, but
that's pointless and an unintended side effect of the !in_serving_softirq()
check.

The code path:
  caam_qi_poll() -> qman_p_poll_dqrr()

is invoked from NAPI and I *assume* from crypto's NAPI device and not
from qbman's NAPI device. I *guess* it is okay to skip scheduling NAPI
(because this is what happens now) but could be changed if it is wrong
due to `budget' handling.

Add an argument to __poll_portal_fast() which is true if NAPI needs to be
scheduled. This requires propagating the value to the callers, including
the `qman_cb_dqrr' typedef used by the dpaa and the crypto driver.
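
For illustration, the extended callback type looks roughly like this (a
sketch; the actual include/soc/fsl/qman.h hunk is part of the diffstat
below but not quoted here):

typedef enum qman_cb_dqrr_result (*qman_cb_dqrr)(struct qman_portal *qm,
					struct qman_fq *fq,
					const struct qm_dqrr_entry *dqrr,
					bool sched_napi);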

Signed-off-by: Sebastian Andrzej Siewior 
Cc: "Horia Geantă" 
Cc: Aymen Sghaier 
Cc: Herbert Xu 
Cc: "David S. Miller" 
Cc: Madalin Bucur 
Cc: Jakub Kicinski 
Cc: Li Yang 
Cc: linux-cry...@vger.kernel.org
Cc: net...@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-arm-ker...@lists.infradead.org
---
 drivers/crypto/caam/qi.c   |  3 ++-
 drivers/net/ethernet/freescale/dpaa/dpaa_eth.c | 12 
 drivers/soc/fsl/qbman/qman.c   | 12 ++--
 drivers/soc/fsl/qbman/qman_test_api.c  |  6 --
 drivers/soc/fsl/qbman/qman_test_stash.c|  6 --
 include/soc/fsl/qman.h |  3 ++-
 6 files changed, 26 insertions(+), 16 deletions(-)

diff --git a/drivers/crypto/caam/qi.c b/drivers/crypto/caam/qi.c
index ec53528d82058..57f6ab6bfb56a 100644
--- a/drivers/crypto/caam/qi.c
+++ b/drivers/crypto/caam/qi.c
@@ -564,7 +564,8 @@ static int caam_qi_napi_schedule(struct qman_portal *p, struct caam_napi *np)
 
 static enum qman_cb_dqrr_result caam_rsp_fq_dqrr_cb(struct qman_portal *p,
struct qman_fq *rsp_fq,
-   const struct qm_dqrr_entry *dqrr)
+   const struct qm_dqrr_entry *dqrr,
+   bool sched_napi)
 {
struct caam_napi *caam_napi = raw_cpu_ptr(&pcpu_qipriv.caam_napi);
struct caam_drv_req *drv_req;
diff --git a/drivers/net/ethernet/freescale/dpaa/dpaa_eth.c b/drivers/net/ethernet/freescale/dpaa/dpaa_eth.c
index 06cc863f4dd63..98ead77c673e5 100644
--- a/drivers/net/ethernet/freescale/dpaa/dpaa_eth.c
+++ b/drivers/net/ethernet/freescale/dpaa/dpaa_eth.c
@@ -2316,7 +2316,8 @@ static inline int dpaa_eth_napi_schedule(struct dpaa_percpu_priv *percpu_priv,
 
 static enum qman_cb_dqrr_result rx_error_dqrr(struct qman_portal *portal,
  struct qman_fq *fq,
- const struct qm_dqrr_entry *dq)
+ const struct qm_dqrr_entry *dq,
+ bool sched_napi)
 {
struct dpaa_fq *dpaa_fq = container_of(fq, struct dpaa_fq, fq_base);
struct dpaa_percpu_priv *percpu_priv;
@@ -2343,7 +2344,8 @@ static enum qman_cb_dqrr_result rx_error_dqrr(struct qman_portal *portal,
 
 static enum qman_cb_dqrr_result rx_default_dqrr(struct qman_portal *

Re: [PATCH net-next 14/15] net: dpaa: Replace in_irq() usage.

2020-11-01 Thread Sebastian Andrzej Siewior
On 2020-10-31 10:12:15 [-0700], Jakub Kicinski wrote:
> Nit: some networking drivers have a bool napi which means "are we
> running in napi context", the semantics here feel a little backwards,
> at least to me. But if I'm the only one thinking this, so be it.

I renamed it to `sched_napi'.

Sebastian


[PATCH v3 4/4] arch, mm: make kernel_page_present() always available

2020-11-01 Thread Mike Rapoport
From: Mike Rapoport 

For architectures that enable ARCH_HAS_SET_MEMORY having the ability to
verify that a page is mapped in the kernel direct map can be useful
regardless of hibernation.

Add a RISC-V implementation of kernel_page_present(), update its forward
declarations and stubs to be a part of the set_memory API and remove the
ugly ifdefery in include/linux/mm.h around the current declarations of
kernel_page_present().
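
As a usage sketch, with this in place hibernation's safe_copy_page() (see
patch 2 of this series) can rely on the function unconditionally:

	/* check the direct map before dereferencing page_address() */
	if (kernel_page_present(s_page)) {
		do_copy_page(dst, page_address(s_page));
	} else {
		hibernate_map_page(s_page, 1);	/* map the page back first */
		do_copy_page(dst, page_address(s_page));
		hibernate_map_page(s_page, 0);
	}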

Signed-off-by: Mike Rapoport 
---
 arch/arm64/include/asm/cacheflush.h |  1 +
 arch/arm64/mm/pageattr.c|  4 +---
 arch/riscv/include/asm/set_memory.h |  1 +
 arch/riscv/mm/pageattr.c| 29 +
 arch/x86/include/asm/set_memory.h   |  1 +
 arch/x86/mm/pat/set_memory.c|  4 +---
 include/linux/mm.h  |  7 ---
 include/linux/set_memory.h  |  5 +
 8 files changed, 39 insertions(+), 13 deletions(-)

diff --git a/arch/arm64/include/asm/cacheflush.h b/arch/arm64/include/asm/cacheflush.h
index 9384fd8fc13c..45217f21f1fe 100644
--- a/arch/arm64/include/asm/cacheflush.h
+++ b/arch/arm64/include/asm/cacheflush.h
@@ -140,6 +140,7 @@ int set_memory_valid(unsigned long addr, int numpages, int enable);
 
 int set_direct_map_invalid_noflush(struct page *page);
 int set_direct_map_default_noflush(struct page *page);
+bool kernel_page_present(struct page *page);
 
 #include 
 
diff --git a/arch/arm64/mm/pageattr.c b/arch/arm64/mm/pageattr.c
index 439325532be1..92eccaf595c8 100644
--- a/arch/arm64/mm/pageattr.c
+++ b/arch/arm64/mm/pageattr.c
@@ -186,8 +186,8 @@ void __kernel_map_pages(struct page *page, int numpages, int enable)
 
set_memory_valid((unsigned long)page_address(page), numpages, enable);
 }
+#endif /* CONFIG_DEBUG_PAGEALLOC */
 
-#ifdef CONFIG_HIBERNATION
 /*
  * This function is used to determine if a linear map page has been marked as
  * not-valid. Walk the page table and check the PTE_VALID bit. This is based
@@ -234,5 +234,3 @@ bool kernel_page_present(struct page *page)
ptep = pte_offset_kernel(pmdp, addr);
return pte_valid(READ_ONCE(*ptep));
 }
-#endif /* CONFIG_HIBERNATION */
-#endif /* CONFIG_DEBUG_PAGEALLOC */
diff --git a/arch/riscv/include/asm/set_memory.h b/arch/riscv/include/asm/set_memory.h
index 4c5bae7ca01c..d690b08dff2a 100644
--- a/arch/riscv/include/asm/set_memory.h
+++ b/arch/riscv/include/asm/set_memory.h
@@ -24,6 +24,7 @@ static inline int set_memory_nx(unsigned long addr, int numpages) { return 0; }
 
 int set_direct_map_invalid_noflush(struct page *page);
 int set_direct_map_default_noflush(struct page *page);
+bool kernel_page_present(struct page *page);
 
 #endif /* __ASSEMBLY__ */
 
diff --git a/arch/riscv/mm/pageattr.c b/arch/riscv/mm/pageattr.c
index 321b09d2e2ea..87ba5a68bbb8 100644
--- a/arch/riscv/mm/pageattr.c
+++ b/arch/riscv/mm/pageattr.c
@@ -198,3 +198,32 @@ void __kernel_map_pages(struct page *page, int numpages, int enable)
 __pgprot(0), __pgprot(_PAGE_PRESENT));
 }
 #endif
+
+bool kernel_page_present(struct page *page)
+{
+   unsigned long addr = (unsigned long)page_address(page);
+   pgd_t *pgd;
+   pud_t *pud;
+   p4d_t *p4d;
+   pmd_t *pmd;
+   pte_t *pte;
+
+   pgd = pgd_offset_k(addr);
+   if (!pgd_present(*pgd))
+   return false;
+
+   p4d = p4d_offset(pgd, addr);
+   if (!p4d_present(*p4d))
+   return false;
+
+   pud = pud_offset(p4d, addr);
+   if (!pud_present(*pud))
+   return false;
+
+   pmd = pmd_offset(pud, addr);
+   if (!pmd_present(*pmd))
+   return false;
+
+   pte = pte_offset_kernel(pmd, addr);
+   return pte_present(*pte);
+}
diff --git a/arch/x86/include/asm/set_memory.h b/arch/x86/include/asm/set_memory.h
index 5948218f35c5..4352f08bfbb5 100644
--- a/arch/x86/include/asm/set_memory.h
+++ b/arch/x86/include/asm/set_memory.h
@@ -82,6 +82,7 @@ int set_pages_rw(struct page *page, int numpages);
 
 int set_direct_map_invalid_noflush(struct page *page);
 int set_direct_map_default_noflush(struct page *page);
+bool kernel_page_present(struct page *page);
 
 extern int kernel_set_to_readonly;
 
diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
index bc9be96b777f..16f878c26667 100644
--- a/arch/x86/mm/pat/set_memory.c
+++ b/arch/x86/mm/pat/set_memory.c
@@ -2226,8 +2226,8 @@ void __kernel_map_pages(struct page *page, int numpages, int enable)
 
arch_flush_lazy_mmu_mode();
 }
+#endif /* CONFIG_DEBUG_PAGEALLOC */
 
-#ifdef CONFIG_HIBERNATION
 bool kernel_page_present(struct page *page)
 {
unsigned int level;
@@ -2239,8 +2239,6 @@ bool kernel_page_present(struct page *page)
pte = lookup_address((unsigned long)page_address(page), &level);
return (pte_val(*pte) & _PAGE_PRESENT);
 }
-#endif /* CONFIG_HIBERNATION */
-#endif /* CONFIG_DEBUG_PAGEALLOC */
 
 int __init kernel_map_pages_in_pgd(pgd_t *pgd, u64 pfn, unsigned long address,
  

[PATCH v3 3/4] arch, mm: restore dependency of __kernel_map_pages() on DEBUG_PAGEALLOC

2020-11-01 Thread Mike Rapoport
From: Mike Rapoport 

The design of DEBUG_PAGEALLOC presumes that __kernel_map_pages() must never
fail. With this assumption it wouldn't be safe to allow general usage of
this function.

Moreover, some architectures that implement __kernel_map_pages() have this
function guarded by #ifdef DEBUG_PAGEALLOC and some refuse to map/unmap
pages when page allocation debugging is disabled at runtime.

As all the users of __kernel_map_pages() were converted to use
debug_pagealloc_map_pages() it is safe to make it available only when
DEBUG_PAGEALLOC is set.

Signed-off-by: Mike Rapoport 
---
 arch/Kconfig |  3 +++
 arch/arm64/Kconfig   |  4 +---
 arch/arm64/mm/pageattr.c |  8 ++--
 arch/powerpc/Kconfig |  5 +
 arch/riscv/Kconfig   |  4 +---
 arch/riscv/include/asm/pgtable.h |  2 --
 arch/riscv/mm/pageattr.c |  2 ++
 arch/s390/Kconfig|  4 +---
 arch/sparc/Kconfig   |  4 +---
 arch/x86/Kconfig |  4 +---
 arch/x86/mm/pat/set_memory.c |  2 ++
 include/linux/mm.h   | 10 +++---
 12 files changed, 26 insertions(+), 26 deletions(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index 56b6ccc0e32d..56d4752b6db6 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -1028,6 +1028,9 @@ config HAVE_STATIC_CALL_INLINE
bool
depends on HAVE_STATIC_CALL
 
+config ARCH_SUPPORTS_DEBUG_PAGEALLOC
+   bool
+
 source "kernel/gcov/Kconfig"
 
 source "scripts/gcc-plugins/Kconfig"
diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index f858c352f72a..5a01dfb77b93 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -71,6 +71,7 @@ config ARM64
select ARCH_USE_QUEUED_RWLOCKS
select ARCH_USE_QUEUED_SPINLOCKS
select ARCH_USE_SYM_ANNOTATIONS
+   select ARCH_SUPPORTS_DEBUG_PAGEALLOC
select ARCH_SUPPORTS_MEMORY_FAILURE
select ARCH_SUPPORTS_SHADOW_CALL_STACK if CC_HAVE_SHADOW_CALL_STACK
select ARCH_SUPPORTS_ATOMIC_RMW
@@ -1005,9 +1006,6 @@ config HOLES_IN_ZONE
 
 source "kernel/Kconfig.hz"
 
-config ARCH_SUPPORTS_DEBUG_PAGEALLOC
-   def_bool y
-
 config ARCH_SPARSEMEM_ENABLE
def_bool y
select SPARSEMEM_VMEMMAP_ENABLE
diff --git a/arch/arm64/mm/pageattr.c b/arch/arm64/mm/pageattr.c
index 1b94f5b82654..439325532be1 100644
--- a/arch/arm64/mm/pageattr.c
+++ b/arch/arm64/mm/pageattr.c
@@ -155,7 +155,7 @@ int set_direct_map_invalid_noflush(struct page *page)
.clear_mask = __pgprot(PTE_VALID),
};
 
-   if (!rodata_full)
+   if (!debug_pagealloc_enabled() && !rodata_full)
return 0;
 
return apply_to_page_range(&init_mm,
@@ -170,7 +170,7 @@ int set_direct_map_default_noflush(struct page *page)
.clear_mask = __pgprot(PTE_RDONLY),
};
 
-   if (!rodata_full)
+   if (!debug_pagealloc_enabled() && !rodata_full)
return 0;
 
return apply_to_page_range(&init_mm,
@@ -178,6 +178,7 @@ int set_direct_map_default_noflush(struct page *page)
   PAGE_SIZE, change_page_range, &data);
 }
 
+#ifdef CONFIG_DEBUG_PAGEALLOC
 void __kernel_map_pages(struct page *page, int numpages, int enable)
 {
if (!debug_pagealloc_enabled() && !rodata_full)
@@ -186,6 +187,7 @@ void __kernel_map_pages(struct page *page, int numpages, int enable)
set_memory_valid((unsigned long)page_address(page), numpages, enable);
 }
 
+#ifdef CONFIG_HIBERNATION
 /*
  * This function is used to determine if a linear map page has been marked as
  * not-valid. Walk the page table and check the PTE_VALID bit. This is based
@@ -232,3 +234,5 @@ bool kernel_page_present(struct page *page)
ptep = pte_offset_kernel(pmdp, addr);
return pte_valid(READ_ONCE(*ptep));
 }
+#endif /* CONFIG_HIBERNATION */
+#endif /* CONFIG_DEBUG_PAGEALLOC */
diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index e9f13fe08492..ad8a83f3ddca 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -146,6 +146,7 @@ config PPC
select ARCH_MIGHT_HAVE_PC_SERIO
select ARCH_OPTIONAL_KERNEL_RWX if ARCH_HAS_STRICT_KERNEL_RWX
select ARCH_SUPPORTS_ATOMIC_RMW
+   select ARCH_SUPPORTS_DEBUG_PAGEALLOC  if PPC32 || PPC_BOOK3S_64
select ARCH_USE_BUILTIN_BSWAP
select ARCH_USE_CMPXCHG_LOCKREF if PPC64
select ARCH_USE_QUEUED_RWLOCKS  if PPC_QUEUED_SPINLOCKS
@@ -355,10 +356,6 @@ config PPC_OF_PLATFORM_PCI
depends on PCI
depends on PPC64 # not supported on 32 bits yet
 
-config ARCH_SUPPORTS_DEBUG_PAGEALLOC
-   depends on PPC32 || PPC_BOOK3S_64
-   def_bool y
-
 config ARCH_SUPPORTS_UPROBES
def_bool y
 
diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
index 44377fd7860e..9283c6f9ae2a 100644
--- a/arch/riscv/Kconfig
+++ b/arch/riscv/Kconfig
@@ -14,6 +14,7 @@ config RISCV
def_bool y
select ARCH_CLOCKSOURCE_INIT
se

[PATCH v3 2/4] PM: hibernate: make direct map manipulations more explicit

2020-11-01 Thread Mike Rapoport
From: Mike Rapoport 

When DEBUG_PAGEALLOC or ARCH_HAS_SET_DIRECT_MAP is enabled, a page may not
be present in the direct map and has to be explicitly mapped before it can
be copied.

Introduce hibernate_map_page() that will explicitly use
set_direct_map_{default,invalid}_noflush() for ARCH_HAS_SET_DIRECT_MAP case
and debug_pagealloc_map_pages() for DEBUG_PAGEALLOC case.

The remapping of the pages in safe_copy_page() presumes that it only
changes protection bits in an existing PTE and so it is safe to ignore the
return value of set_direct_map_{default,invalid}_noflush().

Still, add a WARN_ON() so that future changes in set_memory APIs will not
silently break hibernation.

Signed-off-by: Mike Rapoport 
Acked-by: Rafael J. Wysocki 
---
 include/linux/mm.h  | 12 
 kernel/power/snapshot.c | 30 --
 2 files changed, 28 insertions(+), 14 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 1fc0609056dc..14e397f3752c 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2927,16 +2927,6 @@ static inline bool debug_pagealloc_enabled_static(void)
 #if defined(CONFIG_DEBUG_PAGEALLOC) || defined(CONFIG_ARCH_HAS_SET_DIRECT_MAP)
 extern void __kernel_map_pages(struct page *page, int numpages, int enable);
 
-/*
- * When called in DEBUG_PAGEALLOC context, the call should most likely be
- * guarded by debug_pagealloc_enabled() or debug_pagealloc_enabled_static()
- */
-static inline void
-kernel_map_pages(struct page *page, int numpages, int enable)
-{
-   __kernel_map_pages(page, numpages, enable);
-}
-
 static inline void debug_pagealloc_map_pages(struct page *page,
 int numpages, int enable)
 {
@@ -2948,8 +2938,6 @@ static inline void debug_pagealloc_map_pages(struct page *page,
 extern bool kernel_page_present(struct page *page);
 #endif /* CONFIG_HIBERNATION */
 #else  /* CONFIG_DEBUG_PAGEALLOC || CONFIG_ARCH_HAS_SET_DIRECT_MAP */
-static inline void
-kernel_map_pages(struct page *page, int numpages, int enable) {}
 static inline void debug_pagealloc_map_pages(struct page *page,
 int numpages, int enable) {}
 #ifdef CONFIG_HIBERNATION
diff --git a/kernel/power/snapshot.c b/kernel/power/snapshot.c
index 46b1804c1ddf..054c8cce4236 100644
--- a/kernel/power/snapshot.c
+++ b/kernel/power/snapshot.c
@@ -76,6 +76,32 @@ static inline void hibernate_restore_protect_page(void *page_address) {}
 static inline void hibernate_restore_unprotect_page(void *page_address) {}
 #endif /* CONFIG_STRICT_KERNEL_RWX  && CONFIG_ARCH_HAS_SET_MEMORY */
 
+static inline void hibernate_map_page(struct page *page, int enable)
+{
+   if (IS_ENABLED(CONFIG_ARCH_HAS_SET_DIRECT_MAP)) {
+   unsigned long addr = (unsigned long)page_address(page);
+   int ret;
+
+   /*
+* This should not fail because remapping a page here means
+* that we only update protection bits in an existing PTE.
+* It is still worth to have WARN_ON() here if something
+* changes and this will no longer be the case.
+*/
+   if (enable)
+   ret = set_direct_map_default_noflush(page);
+   else
+   ret = set_direct_map_invalid_noflush(page);
+
+   if (WARN_ON(ret))
+   return;
+
+   flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
+   } else {
+   debug_pagealloc_map_pages(page, 1, enable);
+   }
+}
+
 static int swsusp_page_is_free(struct page *);
 static void swsusp_set_page_forbidden(struct page *);
 static void swsusp_unset_page_forbidden(struct page *);
@@ -1355,9 +1381,9 @@ static void safe_copy_page(void *dst, struct page *s_page)
if (kernel_page_present(s_page)) {
do_copy_page(dst, page_address(s_page));
} else {
-   kernel_map_pages(s_page, 1, 1);
+   hibernate_map_page(s_page, 1);
do_copy_page(dst, page_address(s_page));
-   kernel_map_pages(s_page, 1, 0);
+   hibernate_map_page(s_page, 0);
}
 }
 
-- 
2.28.0



[PATCH v3 1/4] mm: introduce debug_pagealloc_map_pages() helper

2020-11-01 Thread Mike Rapoport
From: Mike Rapoport 

When CONFIG_DEBUG_PAGEALLOC is enabled, it unmaps pages from the kernel
direct mapping after free_pages(). The pages then need to be mapped back
before they can be used. These mapping operations use __kernel_map_pages()
guarded with debug_pagealloc_enabled().

The only place that calls __kernel_map_pages() without checking whether
DEBUG_PAGEALLOC is enabled is the hibernation code, which presumes
availability of this function when ARCH_HAS_SET_DIRECT_MAP is set.
Still, on arm64, __kernel_map_pages() will bail out when DEBUG_PAGEALLOC is
not enabled, but set_direct_map_invalid_noflush() may render some pages not
present in the direct map, and the hibernation code won't be able to save
such pages.

To make page allocation debugging and hibernation interaction more robust,
the dependency on DEBUG_PAGEALLOC or ARCH_HAS_SET_DIRECT_MAP has to be made
more explicit.

Start with combining the guard condition and the call to
__kernel_map_pages() into a single debug_pagealloc_map_pages() function to
emphasize that __kernel_map_pages() should not be called without
DEBUG_PAGEALLOC and use this new function to map/unmap pages when page
allocation debug is enabled.

Signed-off-by: Mike Rapoport 
Reviewed-by: David Hildenbrand 
---
 include/linux/mm.h  | 10 ++
 mm/memory_hotplug.c |  3 +--
 mm/page_alloc.c |  6 ++
 mm/slab.c   |  8 +++-
 4 files changed, 16 insertions(+), 11 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index ef360fe70aaf..1fc0609056dc 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2936,12 +2936,22 @@ kernel_map_pages(struct page *page, int numpages, int enable)
 {
__kernel_map_pages(page, numpages, enable);
 }
+
+static inline void debug_pagealloc_map_pages(struct page *page,
+int numpages, int enable)
+{
+   if (debug_pagealloc_enabled_static())
+   __kernel_map_pages(page, numpages, enable);
+}
+
 #ifdef CONFIG_HIBERNATION
 extern bool kernel_page_present(struct page *page);
 #endif /* CONFIG_HIBERNATION */
 #else  /* CONFIG_DEBUG_PAGEALLOC || CONFIG_ARCH_HAS_SET_DIRECT_MAP */
 static inline void
 kernel_map_pages(struct page *page, int numpages, int enable) {}
+static inline void debug_pagealloc_map_pages(struct page *page,
+int numpages, int enable) {}
 #ifdef CONFIG_HIBERNATION
 static inline bool kernel_page_present(struct page *page) { return true; }
 #endif /* CONFIG_HIBERNATION */
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index b44d4c7ba73b..e2b6043a4428 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -614,8 +614,7 @@ void generic_online_page(struct page *page, unsigned int order)
 * so we should map it first. This is better than introducing a special
 * case in page freeing fast path.
 */
-   if (debug_pagealloc_enabled_static())
-   kernel_map_pages(page, 1 << order, 1);
+   debug_pagealloc_map_pages(page, 1 << order, 1);
__free_pages_core(page, order);
totalram_pages_add(1UL << order);
 #ifdef CONFIG_HIGHMEM
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 23f5066bd4a5..9a66a1ff9193 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1272,8 +1272,7 @@ static __always_inline bool free_pages_prepare(struct page *page,
 */
arch_free_page(page, order);
 
-   if (debug_pagealloc_enabled_static())
-   kernel_map_pages(page, 1 << order, 0);
+   debug_pagealloc_map_pages(page, 1 << order, 0);
 
kasan_free_nondeferred_pages(page, order);
 
@@ -2270,8 +2269,7 @@ inline void post_alloc_hook(struct page *page, unsigned int order,
set_page_refcounted(page);
 
arch_alloc_page(page, order);
-   if (debug_pagealloc_enabled_static())
-   kernel_map_pages(page, 1 << order, 1);
+   debug_pagealloc_map_pages(page, 1 << order, 1);
kasan_alloc_pages(page, order);
kernel_poison_pages(page, 1 << order, 1);
set_page_owner(page, order, gfp_flags);
diff --git a/mm/slab.c b/mm/slab.c
index b1113561b98b..340db0ce74c4 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -1431,10 +1431,8 @@ static bool is_debug_pagealloc_cache(struct kmem_cache *cachep)
 #ifdef CONFIG_DEBUG_PAGEALLOC
 static void slab_kernel_map(struct kmem_cache *cachep, void *objp, int map)
 {
-   if (!is_debug_pagealloc_cache(cachep))
-   return;
-
-   kernel_map_pages(virt_to_page(objp), cachep->size / PAGE_SIZE, map);
+   debug_pagealloc_map_pages(virt_to_page(objp),
+ cachep->size / PAGE_SIZE, map);
 }
 
 #else
@@ -2062,7 +2060,7 @@ int __kmem_cache_create(struct kmem_cache *cachep, slab_flags_t flags)
 
 #if DEBUG
/*
-* If we're going to use the generic kernel_map_pages()
+* If we're going to use the generic debug_pagealloc_map_pages()
 * poisoning, then it's going to smash the c

[PATCH v3 0/4] arch, mm: improve robustness of direct map manipulation

2020-11-01 Thread Mike Rapoport
From: Mike Rapoport 

Hi,

During recent discussion about KVM protected memory, David raised a concern
about usage of __kernel_map_pages() outside of DEBUG_PAGEALLOC scope [1].

Indeed, for architectures that define CONFIG_ARCH_HAS_SET_DIRECT_MAP it is
possible that __kernel_map_pages() would fail, but since this function is
void, the failure will go unnoticed.

Moreover, there's a lack of consistency in __kernel_map_pages() semantics
across architectures: some guard this function with
#ifdef DEBUG_PAGEALLOC, some refuse to update the direct map if page
allocation debugging is disabled at run time, and some allow modifying the
direct map regardless of DEBUG_PAGEALLOC settings.

This set straightens this out by restoring the dependency of
__kernel_map_pages() on DEBUG_PAGEALLOC and updating the call sites
accordingly.

Since currently the only user of __kernel_map_pages() outside
DEBUG_PAGEALLOC is the hibernation code, it is updated to make direct map
accesses there more explicit.

[1] https://lore.kernel.org/lkml/2759b4bf-e1e3-d006-7d86-78a403482...@redhat.com

v3 changes:
* update arm64 changes to avoid regression, per Rick's comments
* fix bisectability

v2 changes:
* Rephrase patch 2 changelog to better describe the change intentions and
implications
* Move removal of kernel_map_pages() from patch 1 to patch 2, per David
https://lore.kernel.org/lkml/20201029161902.19272-1-r...@kernel.org

v1:
https://lore.kernel.org/lkml/20201025101555.3057-1-r...@kernel.org

Mike Rapoport (4):
  mm: introduce debug_pagealloc_map_pages() helper
  PM: hibernate: make direct map manipulations more explicit
  arch, mm: restore dependency of __kernel_map_pages() on DEBUG_PAGEALLOC
  arch, mm: make kernel_page_present() always available

 arch/Kconfig|  3 +++
 arch/arm64/Kconfig  |  4 +---
 arch/arm64/include/asm/cacheflush.h |  1 +
 arch/arm64/mm/pageattr.c|  6 +++--
 arch/powerpc/Kconfig|  5 +
 arch/riscv/Kconfig  |  4 +---
 arch/riscv/include/asm/pgtable.h|  2 --
 arch/riscv/include/asm/set_memory.h |  1 +
 arch/riscv/mm/pageattr.c| 31 +
 arch/s390/Kconfig   |  4 +---
 arch/sparc/Kconfig  |  4 +---
 arch/x86/Kconfig|  4 +---
 arch/x86/include/asm/set_memory.h   |  1 +
 arch/x86/mm/pat/set_memory.c|  4 ++--
 include/linux/mm.h  | 35 +
 include/linux/set_memory.h  |  5 +
 kernel/power/snapshot.c | 30 +++--
 mm/memory_hotplug.c |  3 +--
 mm/page_alloc.c |  6 ++---
 mm/slab.c   |  8 +++
 20 files changed, 103 insertions(+), 58 deletions(-)

-- 
2.28.0




Re: [PATCH 2/4] PM: hibernate: improve robustness of mapping pages in the direct map

2020-11-01 Thread Mike Rapoport
On Thu, Oct 29, 2020 at 11:19:18PM +, Edgecombe, Rick P wrote:
> On Thu, 2020-10-29 at 09:54 +0200, Mike Rapoport wrote:
> > __kernel_map_pages() on arm64 will also bail out if rodata_full is
> > false:
> > void __kernel_map_pages(struct page *page, int numpages, int enable)
> > {
> > if (!debug_pagealloc_enabled() && !rodata_full)
> > return;
> > 
> > set_memory_valid((unsigned long)page_address(page), numpages,
> > enable);
> > }
> > 
> > So using set_direct_map() to map back pages removed from the direct
> > map
> > with __kernel_map_pages() seems safe to me.
> 
> Heh, one of us must have some simple boolean error in our head. I hope
> it's not me! :) I'll try one more time.

Well, then it's me :)
You are right, I misread this, which is why I could not understand why
!rodata_full bothered you.

> __kernel_map_pages() will bail out if rodata_full is false **AND**
> debug page alloc is off. So it will only bail under conditions where
> there could be nothing unmapped on the direct map.
> 
> Equivalent logic would be:
>   if (!(debug_pagealloc_enabled() || rodata_full))
>   return;
> 
> Or:
>   if (debug_pagealloc_enabled() || rodata_full)
>   set_memory_valid(blah)
> 
> So if either is on, the existing code will try to re-map. But the
> set_direct_map_()'s will only work if rodata_full is on. So switching
> hibernate to set_direct_map() will cause the remap to be missed for the
> debug page alloc case, with !rodata_full.
> 
> It also breaks normal debug page alloc usage with !rodata_full for
> similar reasons after patch 3. The pages would never get unmapped.

I've updated the patches, there should be no regression now.

-- 
Sincerely yours,
Mike.