[PATCH v2] mm/hotplug: fix an imbalance with DEBUG_PAGEALLOC

2019-03-01 Thread Qian Cai
When onlining a memory block with DEBUG_PAGEALLOC, the kernel unmaps the
pages in the block from the kernel linear mapping as they are freed, but
nothing maps those pages back at the beginning of offlining. As a result,
onlining triggers the panic below on ppc64le, which checks that a page is
mapped before unmapping it. The imbalance exists on all arches, however,
where double-unmapping could happen. Therefore, let the kernel map those
pages in generic_online_page() before they are freed into the page
allocator for the first time, where the page count is set to one.

On the other hand, it works fine during boot because, at least on IBM
POWER8, the boot path does,

early_setup
  early_init_mmu
hash__early_init_mmu
  htab_initialize [1]
htab_bolt_mapping [2]

where it effectively maps all memblock regions just like
kernel_map_linear_page() does, so the later mem_init() -> memblock_free_all()
will unmap them just fine without any imbalance. On other arches without
this imbalance check, the pages are still unmapped at most once.

[1]
for_each_memblock(memory, reg) {
base = (unsigned long)__va(reg->base);
size = reg->size;

DBG("creating mapping for region: %lx..%lx (prot: %lx)\n",
base, size, prot);

BUG_ON(htab_bolt_mapping(base, base + size, __pa(base),
prot, mmu_linear_psize, mmu_kernel_ssize));
}

[2] linear_map_hash_slots[paddr >> PAGE_SHIFT] = ret | 0x80;

kernel BUG at arch/powerpc/mm/hash_utils_64.c:1815!
Oops: Exception in kernel mode, sig: 5 [#1]
LE SMP NR_CPUS=256 DEBUG_PAGEALLOC NUMA pSeries
CPU: 2 PID: 4298 Comm: bash Not tainted 5.0.0-rc7+ #15
NIP:  c0062670 LR: c006265c CTR: 
REGS: c005bf8a75b0 TRAP: 0700   Not tainted  (5.0.0-rc7+)
MSR:  8282b033   CR: 28422842
XER: 
CFAR: c0804f44 IRQMASK: 1
GPR00: c006265c c005bf8a7840 c1518200 c13cbcc8
GPR04: 00080004  ccc457e0 c005c4e341d8
GPR08:  0001 c7f4f800 0001
GPR12: 2200 c7f4e100  000139c29710
GPR16: 000139c29714 000139c29788 c13cbcc8 
GPR20: 00034000 c16e05e8  0001
GPR24: 00bf50d9 818e  c16e04b8
GPR28: f0d00040 006420a2f217 f0d0 00ea1b217034
NIP [c0062670] __kernel_map_pages+0x2e0/0x4f0
LR [c006265c] __kernel_map_pages+0x2cc/0x4f0
Call Trace:
[c005bf8a7840] [c006265c] __kernel_map_pages+0x2cc/0x4f0
(unreliable)
[c005bf8a78d0] [c028c4a0] free_unref_page_prepare+0x2f0/0x4d0
[c005bf8a7930] [c0293144] free_unref_page+0x44/0x90
[c005bf8a7970] [c037af24] __online_page_free+0x84/0x110
[c005bf8a79a0] [c037b6e0] online_pages_range+0xc0/0x150
[c005bf8a7a00] [c005aaa8] walk_system_ram_range+0xc8/0x120
[c005bf8a7a50] [c037e710] online_pages+0x280/0x5a0
[c005bf8a7b40] [c06419e4] memory_subsys_online+0x1b4/0x270
[c005bf8a7bb0] [c0616720] device_online+0xc0/0xf0
[c005bf8a7bf0] [c0642570] state_store+0xc0/0x180
[c005bf8a7c30] [c0610b2c] dev_attr_store+0x3c/0x60
[c005bf8a7c50] [c04c0a50] sysfs_kf_write+0x70/0xb0
[c005bf8a7c90] [c04bf40c] kernfs_fop_write+0x10c/0x250
[c005bf8a7ce0] [c03e4b18] __vfs_write+0x48/0x240
[c005bf8a7d80] [c03e4f68] vfs_write+0xd8/0x210
[c005bf8a7dd0] [c03e52f0] ksys_write+0x70/0x120
[c005bf8a7e20] [c000b000] system_call+0x5c/0x70
Instruction dump:
7fbd5278 7fbd4a78 3e42ffeb 7bbd0640 3a523ac8 7e439378 487a2881 6000
e95505f0 7e6aa0ae 6a690080 7929c9c2 <0b09> 7f4aa1ae 7e439378 487a28dd

Signed-off-by: Qian Cai 
---

v2: map pages in the kernel at the beginning of the online path instead of the
offline path, to avoid the possible issues pointed out by Michal.

 mm/memory_hotplug.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 1ad28323fb9f..736e107e2197 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -660,6 +660,7 @@ static void generic_online_page(struct page *page)
 {
__online_page_set_limits(page);
__online_page_increment_counters(page);
+   kernel_map_pages(page, 1, 1);
__online_page_free(page);
 }
 
-- 
2.17.2 (Apple Git-113)



[PATCH v2] x86/mm: fix "old_pte" set but not used

2019-03-01 Thread Qian Cai
The commit 3a19109efbfa ("x86/mm: Fix try_preserve_large_page() to
handle large PAT bit") fixed try_preserve_large_page() by using the
corresponding pud/pmd prot/pfn interfaces, but left the variable
'old_pte' unused because pte_pfn() is no longer called.

Later, the commit 8679de0959e6 ("x86/mm/cpa: Split, rename and clean up
try_preserve_large_page()") renamed try_preserve_large_page() to
__should_split_large_page(), but the unused variable remained.

arch/x86/mm/pageattr.c: In function '__should_split_large_page':
arch/x86/mm/pageattr.c:741:17: warning: variable 'old_pte' set but not
used [-Wunused-but-set-variable]

Signed-off-by: Qian Cai 
---

v2: improve the commit message.

 arch/x86/mm/pageattr.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/mm/pageattr.c b/arch/x86/mm/pageattr.c
index 14e6119838a6..4c570612e24e 100644
--- a/arch/x86/mm/pageattr.c
+++ b/arch/x86/mm/pageattr.c
@@ -738,7 +738,7 @@ static int __should_split_large_page(pte_t *kpte, unsigned 
long address,
 {
unsigned long numpages, pmask, psize, lpaddr, pfn, old_pfn;
pgprot_t old_prot, new_prot, req_prot, chk_prot;
-   pte_t new_pte, old_pte, *tmp;
+   pte_t new_pte, *tmp;
enum pg_level level;
 
/*
@@ -781,7 +781,7 @@ static int __should_split_large_page(pte_t *kpte, unsigned 
long address,
 * Convert protection attributes to 4k-format, as cpa->mask* are set
 * up accordingly.
 */
-   old_pte = *kpte;
+
/* Clear PSE (aka _PAGE_PAT) and move PAT bit to correct position */
req_prot = pgprot_large_2_4k(old_prot);
 
-- 
2.17.2 (Apple Git-113)



[PATCH v2] memcg: fix a bad line

2019-03-01 Thread Qian Cai
The commit 230671533d64 ("mm: memory.low hierarchical behavior") missed
an asterisk in one of the comments.

mm/memcontrol.c:5774: warning: bad line:| 0, otherwise.

Acked-by: Souptick Joarder 
Signed-off-by: Qian Cai 
---

v2: improve the commit message.

 mm/memcontrol.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index af7f18b32389..d4b96dc4bd8a 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5747,7 +5747,7 @@ struct cgroup_subsys memory_cgrp_subsys = {
  *
  * | memory.current, if memory.current < memory.low
  * low_usage = |
-  | 0, otherwise.
+ *| 0, otherwise.
  *
  *
  * Such definition of the effective memory.low provides the expected
-- 
2.17.2 (Apple Git-113)



[PATCH v2] powerpc/mm: fix "section_base" set but not used

2019-03-01 Thread Qian Cai
The commit 24b6d4164348 ("mm: pass the vmem_altmap to vmemmap_free")
removed a line in vmemmap_free(),

altmap = to_vmem_altmap((unsigned long) section_base);

but left behind a variable that is no longer used.

arch/powerpc/mm/init_64.c: In function 'vmemmap_free':
arch/powerpc/mm/init_64.c:277:16: error: variable 'section_base' set but
not used [-Werror=unused-but-set-variable]

Signed-off-by: Qian Cai 
---

v2: improve the commit message.

 arch/powerpc/mm/init_64.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/arch/powerpc/mm/init_64.c b/arch/powerpc/mm/init_64.c
index a5091c034747..a4c155af1597 100644
--- a/arch/powerpc/mm/init_64.c
+++ b/arch/powerpc/mm/init_64.c
@@ -274,7 +274,6 @@ void __ref vmemmap_free(unsigned long start, unsigned long 
end,
 
for (; start < end; start += page_size) {
unsigned long nr_pages, addr;
-   struct page *section_base;
struct page *page;
 
/*
@@ -290,7 +289,6 @@ void __ref vmemmap_free(unsigned long start, unsigned long 
end,
continue;
 
page = pfn_to_page(addr >> PAGE_SHIFT);
-   section_base = pfn_to_page(vmemmap_section_start(start));
nr_pages = 1 << page_order;
base_pfn = PHYS_PFN(addr);
 
-- 
2.17.2 (Apple Git-113)



[PATCH] mm/hugepages: fix "orig_pud" set but not used

2019-02-28 Thread Qian Cai
The commit a00cc7d9dd93 ("mm, x86: add support for PUD-sized transparent
hugepages") introduced pudp_huge_get_and_clear_full(), but nobody uses
its return value, so just make it void.

mm/huge_memory.c: In function 'zap_huge_pud':
mm/huge_memory.c:1982:8: warning: variable 'orig_pud' set but not used
[-Wunused-but-set-variable]
  pud_t orig_pud;
^~~~

Signed-off-by: Qian Cai 
---
 include/asm-generic/pgtable.h | 8 
 mm/huge_memory.c  | 4 +---
 2 files changed, 5 insertions(+), 7 deletions(-)

diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
index 05e61e6c843f..17b789557afe 100644
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -167,11 +167,11 @@ static inline pmd_t pmdp_huge_get_and_clear_full(struct 
mm_struct *mm,
 #endif
 
 #ifndef __HAVE_ARCH_PUDP_HUGE_GET_AND_CLEAR_FULL
-static inline pud_t pudp_huge_get_and_clear_full(struct mm_struct *mm,
-   unsigned long address, pud_t *pudp,
-   int full)
+static inline void pudp_huge_get_and_clear_full(struct mm_struct *mm,
+   unsigned long address,
+   pud_t *pudp, int full)
 {
-   return pudp_huge_get_and_clear(mm, address, pudp);
+   pudp_huge_get_and_clear(mm, address, pudp);
 }
 #endif
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index faf357eaf0ce..9f57a1173e6a 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1979,7 +1979,6 @@ spinlock_t *__pud_trans_huge_lock(pud_t *pud, struct 
vm_area_struct *vma)
 int zap_huge_pud(struct mmu_gather *tlb, struct vm_area_struct *vma,
 pud_t *pud, unsigned long addr)
 {
-   pud_t orig_pud;
spinlock_t *ptl;
 
ptl = __pud_trans_huge_lock(pud, vma);
@@ -1991,8 +1990,7 @@ int zap_huge_pud(struct mmu_gather *tlb, struct 
vm_area_struct *vma,
 * pgtable_trans_huge_withdraw after finishing pudp related
 * operations.
 */
-   orig_pud = pudp_huge_get_and_clear_full(tlb->mm, addr, pud,
-   tlb->fullmm);
+   pudp_huge_get_and_clear_full(tlb->mm, addr, pud, tlb->fullmm);
tlb_remove_pud_tlb_entry(tlb, pud, addr);
if (vma_is_dax(vma)) {
spin_unlock(ptl);
-- 
2.17.2 (Apple Git-113)



[PATCH] x86/mm: fix "cpu" set but not used

2019-02-28 Thread Qian Cai
The commit a2055abe9c67 ("x86/mm: Pass flush_tlb_info to
flush_tlb_others() etc") removed the unnecessary cpu parameter from
uv_flush_tlb_others() but left an unused variable.

arch/x86/mm/tlb.c: In function 'native_flush_tlb_others':
arch/x86/mm/tlb.c:688:16: warning: variable 'cpu' set but not used
[-Wunused-but-set-variable]
   unsigned int cpu;
^~~

Signed-off-by: Qian Cai 
---
 arch/x86/mm/tlb.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 999d6d8f0bef..bc4bc7b2f075 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -685,9 +685,6 @@ void native_flush_tlb_others(const struct cpumask *cpumask,
 * that UV should be updated so that smp_call_function_many(),
 * etc, are optimal on UV.
 */
-   unsigned int cpu;
-
-   cpu = smp_processor_id();
cpumask = uv_flush_tlb_others(cpumask, info);
if (cpumask)
smp_call_function_many(cpumask, flush_tlb_func_remote,
-- 
2.17.2 (Apple Git-113)



[PATCH] x86/mm: fix "prev_pud" set but not used

2019-02-28 Thread Qian Cai
The commit 04b67022fb6d ("x86/mm/dump_pagetables: Speed up page tables
dump for CONFIG_KASAN=y") removed pud_already_checked() but left an
unused variable.

arch/x86/mm/dump_pagetables.c: In function 'walk_pud_level':
arch/x86/mm/dump_pagetables.c:447:9: warning: variable 'prev_pud' set
but not used [-Wunused-but-set-variable]
  pud_t *prev_pud = NULL;
 ^~~~

Signed-off-by: Qian Cai 
---
 arch/x86/mm/dump_pagetables.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/arch/x86/mm/dump_pagetables.c b/arch/x86/mm/dump_pagetables.c
index e3cdc85ce5b6..ee8f8ab46941 100644
--- a/arch/x86/mm/dump_pagetables.c
+++ b/arch/x86/mm/dump_pagetables.c
@@ -444,7 +444,6 @@ static void walk_pud_level(struct seq_file *m, struct 
pg_state *st, p4d_t addr,
int i;
pud_t *start, *pud_start;
pgprotval_t prot, eff;
-   pud_t *prev_pud = NULL;
 
pud_start = start = (pud_t *)p4d_page_vaddr(addr);
 
@@ -462,7 +461,6 @@ static void walk_pud_level(struct seq_file *m, struct 
pg_state *st, p4d_t addr,
} else
note_page(m, st, __pgprot(0), 0, 3);
 
-   prev_pud = start;
start++;
}
 }
-- 
2.17.2 (Apple Git-113)



[PATCH] mm/sparse: fix a bad comparison

2019-02-28 Thread Qian Cai
The only negative value next_present_section_nr() can return is -1, so
just check for -1 specifically; compilers will convert -1 to unsigned
if needed.

mm/sparse.c: In function 'sparse_init_nid':
mm/sparse.c:200:20: warning: comparison of unsigned expression >= 0 is
always true [-Wtype-limits]
   ((section_nr >= 0) &&\
^~
mm/sparse.c:478:2: note: in expansion of macro
'for_each_present_section_nr'
  for_each_present_section_nr(pnum_begin, pnum) {
  ^~~
mm/sparse.c:200:20: warning: comparison of unsigned expression >= 0 is
always true [-Wtype-limits]
   ((section_nr >= 0) &&\
^~
mm/sparse.c:497:2: note: in expansion of macro
'for_each_present_section_nr'
  for_each_present_section_nr(pnum_begin, pnum) {
  ^~~
mm/sparse.c: In function 'sparse_init':
mm/sparse.c:200:20: warning: comparison of unsigned expression >= 0 is
always true [-Wtype-limits]
   ((section_nr >= 0) &&\
^~
mm/sparse.c:520:2: note: in expansion of macro
'for_each_present_section_nr'
  for_each_present_section_nr(pnum_begin + 1, pnum_end) {
  ^~~

Fixes: c4e1be9ec113 ("mm, sparsemem: break out of loops early")
Signed-off-by: Qian Cai 
---
 mm/sparse.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/sparse.c b/mm/sparse.c
index 7ea5dc6c6b19..77a0554fa5bd 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -197,7 +197,7 @@ static inline int next_present_section_nr(int section_nr)
 }
 #define for_each_present_section_nr(start, section_nr) \
for (section_nr = next_present_section_nr(start-1); \
-((section_nr >= 0) &&  \
+((section_nr != -1) && \
  (section_nr <= __highest_present_section_nr));\
 section_nr = next_present_section_nr(section_nr))
 
-- 
2.17.2 (Apple Git-113)



[PATCH] powerpc: fix "sz" set but not used

2019-02-27 Thread Qian Cai
arch/powerpc/mm/hugetlbpage-hash64.c: In function '__hash_page_huge':
arch/powerpc/mm/hugetlbpage-hash64.c:29:28: warning: variable 'sz' set
but not used [-Wunused-but-set-variable]

Signed-off-by: Qian Cai 
---
 arch/powerpc/mm/hugetlbpage-hash64.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/arch/powerpc/mm/hugetlbpage-hash64.c 
b/arch/powerpc/mm/hugetlbpage-hash64.c
index 2e6a8f9345d3..f6b09edc5e6e 100644
--- a/arch/powerpc/mm/hugetlbpage-hash64.c
+++ b/arch/powerpc/mm/hugetlbpage-hash64.c
@@ -26,7 +26,7 @@ int __hash_page_huge(unsigned long ea, unsigned long access, 
unsigned long vsid,
real_pte_t rpte;
unsigned long vpn;
unsigned long old_pte, new_pte;
-   unsigned long rflags, pa, sz;
+   unsigned long rflags, pa;
long slot, offset;
 
BUG_ON(shift != mmu_psize_defs[mmu_psize].shift);
@@ -73,7 +73,6 @@ int __hash_page_huge(unsigned long ea, unsigned long access, 
unsigned long vsid,
offset = PTRS_PER_PMD;
rpte = __real_pte(__pte(old_pte), ptep, offset);
 
-   sz = ((1UL) << shift);
if (!cpu_has_feature(CPU_FTR_COHERENT_ICACHE))
/* No CPU has hugepages but lacks no execute, so we
 * don't need to worry about that case */
-- 
2.17.2 (Apple Git-113)



[PATCH] x86: fix "old_pte" set but not used

2019-02-27 Thread Qian Cai
arch/x86/mm/pageattr.c: In function '__should_split_large_page':
arch/x86/mm/pageattr.c:741:17: warning: variable 'old_pte' set but not
used [-Wunused-but-set-variable]

Signed-off-by: Qian Cai 
---
 arch/x86/mm/pageattr.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/mm/pageattr.c b/arch/x86/mm/pageattr.c
index 14e6119838a6..4c570612e24e 100644
--- a/arch/x86/mm/pageattr.c
+++ b/arch/x86/mm/pageattr.c
@@ -738,7 +738,7 @@ static int __should_split_large_page(pte_t *kpte, unsigned 
long address,
 {
unsigned long numpages, pmask, psize, lpaddr, pfn, old_pfn;
pgprot_t old_prot, new_prot, req_prot, chk_prot;
-   pte_t new_pte, old_pte, *tmp;
+   pte_t new_pte, *tmp;
enum pg_level level;
 
/*
@@ -781,7 +781,7 @@ static int __should_split_large_page(pte_t *kpte, unsigned 
long address,
 * Convert protection attributes to 4k-format, as cpa->mask* are set
 * up accordingly.
 */
-   old_pte = *kpte;
+
/* Clear PSE (aka _PAGE_PAT) and move PAT bit to correct position */
req_prot = pgprot_large_2_4k(old_prot);
 
-- 
2.17.2 (Apple Git-113)



[PATCH] powerpc: fix "section_base" set but not used

2019-02-27 Thread Qian Cai
arch/powerpc/mm/init_64.c: In function 'vmemmap_free':
arch/powerpc/mm/init_64.c:277:16: error: variable 'section_base' set but
not used [-Werror=unused-but-set-variable]

Signed-off-by: Qian Cai 
---
 arch/powerpc/mm/init_64.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/arch/powerpc/mm/init_64.c b/arch/powerpc/mm/init_64.c
index a5091c034747..a4c155af1597 100644
--- a/arch/powerpc/mm/init_64.c
+++ b/arch/powerpc/mm/init_64.c
@@ -274,7 +274,6 @@ void __ref vmemmap_free(unsigned long start, unsigned long 
end,
 
for (; start < end; start += page_size) {
unsigned long nr_pages, addr;
-   struct page *section_base;
struct page *page;
 
/*
@@ -290,7 +289,6 @@ void __ref vmemmap_free(unsigned long start, unsigned long 
end,
continue;
 
page = pfn_to_page(addr >> PAGE_SHIFT);
-   section_base = pfn_to_page(vmemmap_section_start(start));
nr_pages = 1 << page_order;
base_pfn = PHYS_PFN(addr);
 
-- 
2.17.2 (Apple Git-113)



[PATCH] memcg: fix a bad line

2019-02-27 Thread Qian Cai
An asterisk is missing in one of the comments.

Signed-off-by: Qian Cai 
---
 mm/memcontrol.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index af7f18b32389..d4b96dc4bd8a 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5747,7 +5747,7 @@ struct cgroup_subsys memory_cgrp_subsys = {
  *
  * | memory.current, if memory.current < memory.low
  * low_usage = |
-  | 0, otherwise.
+ *| 0, otherwise.
  *
  *
  * Such definition of the effective memory.low provides the expected
-- 
2.17.2 (Apple Git-113)



Re: [PATCH] tmpfs: fix uninitialized return value in shmem_link

2019-02-27 Thread Qian Cai
On Wed, 2019-02-27 at 09:09 -0500, Qian Cai wrote:
> On Mon, 2019-02-25 at 16:07 -0800, Linus Torvalds wrote:
> > On Mon, Feb 25, 2019 at 4:03 PM Qian Cai  wrote:
> > > > 
> > > > Of course, that's just gcc. I have no idea what llvm ends up doing.
> > > 
> > > Clang 7.0:
> > > 
> > > # clang  -O2 -S -Wall /tmp/test.c
> > > /tmp/test.c:46:6: warning: variable 'ret' is used uninitialized whenever
> > > 'if'
> > > condition is false [-Wsometimes-uninitialized]
> > 
> > Ok, good.
> > 
> > Do we have any clang builds in any of the zero-day robot
> > infrastructure or something? Should we?
> > 
> > And maybe this was how Dan noticed the problem in the first place? Or
> > is it just because of his eagle-eyes?
> > 
> 
> BTW, even clang is able to generate warnings in your sample code, it does not
> generate any warnings when compiling the buggy shmem.o via "make CC=clang".
> Here is the objdump for arm64 (with KASAN_SW_TAGS inline).
> 

Ah, due to the commit 6e8d666e9253 ("Disable "maybe-uninitialized" warning
globally"), this type of warning is no longer generated unless "make W=1" is
used, per the commit a76bcf557ef4 ("Kbuild: enable -Wmaybe-uninitialized
warning for 'make W=1'"). Anyway, clang generates the same code with and
without this patch.

d_instantiate(dentry, inode);
4eec:   9400bl  0 
ret = shmem_reserve_inode(inode->i_sb);
4ef0:   2a1f03e0mov w0, wzr < ret = 0
return ret;




[PATCH v3] mm/page_ext: fix an imbalance with kmemleak

2019-02-27 Thread Qian Cai
After offlining a memory block, a kmemleak scan will trigger a crash, as
it encounters a page_ext address that has already been freed during
memory offlining. alloc_page_ext() calls kmemleak_alloc() at the
beginning, but kmemleak_free() is never called in free_page_ext().

BUG: unable to handle kernel paging request at 888453d0
PGD 128a01067 P4D 128a01067 PUD 128a04067 PMD 47e09e067 PTE 800bac2ff060
Oops:  [#1] SMP DEBUG_PAGEALLOC KASAN PTI
CPU: 1 PID: 1594 Comm: bash Not tainted 5.0.0-rc8+ #15
Hardware name: HP ProLiant DL180 Gen9/ProLiant DL180 Gen9, BIOS U20 10/25/2017
RIP: 0010:scan_block+0xb5/0x290
Code: 85 6e 01 00 00 48 b8 00 00 30 f5 81 88 ff ff 48 39 c3 0f 84 5b 01
00 00 48 89 d8 48 c1 e8 03 42 80 3c 20 00 0f 85 87 01 00 00 <4c> 8b 3b
e8 f3 0c fa ff 4c 39 3d 0c 6b 4c 01 0f 87 08 01 00 00 4c
RSP: 0018:8881ec57f8e0 EFLAGS: 00010082
RAX:  RBX: 888453d0 RCX: a61e5a54
RDX:  RSI: 0008 RDI: 888453d0
RBP: 8881ec57f920 R08: fbfff4ed588d R09: fbfff4ed588c
R10: fbfff4ed588c R11: a76ac463 R12: dc00
R13: 888453d00ff9 R14: 8881f80cef48 R15: 8881f80cef48
FS:  7f6c0e3f8740() GS:8881f768() knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
CR2: 888453d0 CR3: 0001c4244003 CR4: 001606a0
Call Trace:
 scan_gray_list+0x269/0x430
 kmemleak_scan+0x5a8/0x10f0
 kmemleak_write+0x541/0x6ca
 full_proxy_write+0xf8/0x190
 __vfs_write+0xeb/0x980
 vfs_write+0x15a/0x4f0
 ksys_write+0xd2/0x1b0
 __x64_sys_write+0x73/0xb0
 do_syscall_64+0xeb/0xaaa
 entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f6c0dad73b8
Code: 89 02 48 c7 c0 ff ff ff ff eb b3 0f 1f 80 00 00 00 00 f3 0f 1e fa
48 8d 05 65 63 2d 00 8b 00 85 c0 75 17 b8 01 00 00 00 0f 05 <48> 3d 00
f0 ff ff 77 58 c3 0f 1f 80 00 00 00 00 41 54 49 89 d4 55
RSP: 002b:7ffd5b863cb8 EFLAGS: 0246 ORIG_RAX: 0001
RAX: ffda RBX: 0005 RCX: 7f6c0dad73b8
RDX: 0005 RSI: 55a9216e1710 RDI: 0001
RBP: 55a9216e1710 R08: 000a R09: 7ffd5b863840
R10: 000a R11: 0246 R12: 7f6c0dda9780
R13: 0005 R14: 7f6c0dda4740 R15: 0005
Modules linked in: nls_iso8859_1 nls_cp437 vfat fat kvm_intel kvm
irqbypass efivars ip_tables x_tables xfs sd_mod ahci libahci igb
i2c_algo_bit libata i2c_core dm_mirror dm_region_hash dm_log dm_mod
efivarfs
CR2: 888453d0
---[ end trace ccf646c7456717c5 ]---
RIP: 0010:scan_block+0xb5/0x290
Code: 85 6e 01 00 00 48 b8 00 00 30 f5 81 88 ff ff 48 39 c3 0f 84 5b 01
00 00 48 89 d8 48 c1 e8 03 42 80 3c 20 00 0f 85 87 01 00 00 <4c> 8b 3b
e8 f3 0c fa ff 4c 39 3d 0c 6b 4c 01 0f 87 08 01 00 00 4c
RSP: 0018:8881ec57f8e0 EFLAGS: 00010082
RAX:  RBX: 888453d0 RCX: a61e5a54
RDX:  RSI: 0008 RDI: 888453d0
RBP: 8881ec57f920 R08: fbfff4ed588d R09: fbfff4ed588c
R10: fbfff4ed588c R11: a76ac463 R12: dc00
R13: 888453d00ff9 R14: 8881f80cef48 R15: 8881f80cef48
FS:  7f6c0e3f8740() GS:8881f768() knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
CR2: 888453d0 CR3: 0001c4244003 CR4: 001606a0
Kernel panic - not syncing: Fatal exception
Shutting down cpus with NMI
Kernel Offset: 0x24c0 from 0x8100 (relocation range:
0x8000-0xbfff)
---[ end Kernel panic - not syncing: Fatal exception ]---

Signed-off-by: Qian Cai 
---

v3: place kmemleak_free() before free_pages_exact() to avoid a small window
where the address has been freed but kmemleak has not yet been informed, as
pointed out by Catalin.

v2: move kmemleak_free() into free_page_ext() as there is no need to call
kmemleak_free() in the vfree() case.

 mm/page_ext.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/mm/page_ext.c b/mm/page_ext.c
index 8c78b8d45117..f116431c3dee 100644
--- a/mm/page_ext.c
+++ b/mm/page_ext.c
@@ -273,6 +273,7 @@ static void free_page_ext(void *addr)
table_size = get_entry_size() * PAGES_PER_SECTION;
 
BUG_ON(PageReserved(page));
+   kmemleak_free(addr);
free_pages_exact(addr, table_size);
}
 }
-- 
2.17.2 (Apple Git-113)



[PATCH v2] mm/page_ext: fix an imbalance with kmemleak

2019-02-27 Thread Qian Cai
After offlining a memory block, a kmemleak scan will trigger a crash, as
it encounters a page_ext address that has already been freed during
memory offlining. alloc_page_ext() calls kmemleak_alloc() at the
beginning, but kmemleak_free() is never called in free_page_ext().

BUG: unable to handle kernel paging request at 888453d0
PGD 128a01067 P4D 128a01067 PUD 128a04067 PMD 47e09e067 PTE 800bac2ff060
Oops:  [#1] SMP DEBUG_PAGEALLOC KASAN PTI
CPU: 1 PID: 1594 Comm: bash Not tainted 5.0.0-rc8+ #15
Hardware name: HP ProLiant DL180 Gen9/ProLiant DL180 Gen9, BIOS U20 10/25/2017
RIP: 0010:scan_block+0xb5/0x290
Code: 85 6e 01 00 00 48 b8 00 00 30 f5 81 88 ff ff 48 39 c3 0f 84 5b 01
00 00 48 89 d8 48 c1 e8 03 42 80 3c 20 00 0f 85 87 01 00 00 <4c> 8b 3b
e8 f3 0c fa ff 4c 39 3d 0c 6b 4c 01 0f 87 08 01 00 00 4c
RSP: 0018:8881ec57f8e0 EFLAGS: 00010082
RAX:  RBX: 888453d0 RCX: a61e5a54
RDX:  RSI: 0008 RDI: 888453d0
RBP: 8881ec57f920 R08: fbfff4ed588d R09: fbfff4ed588c
R10: fbfff4ed588c R11: a76ac463 R12: dc00
R13: 888453d00ff9 R14: 8881f80cef48 R15: 8881f80cef48
FS:  7f6c0e3f8740() GS:8881f768() knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
CR2: 888453d0 CR3: 0001c4244003 CR4: 001606a0
Call Trace:
 scan_gray_list+0x269/0x430
 kmemleak_scan+0x5a8/0x10f0
 kmemleak_write+0x541/0x6ca
 full_proxy_write+0xf8/0x190
 __vfs_write+0xeb/0x980
 vfs_write+0x15a/0x4f0
 ksys_write+0xd2/0x1b0
 __x64_sys_write+0x73/0xb0
 do_syscall_64+0xeb/0xaaa
 entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f6c0dad73b8
Code: 89 02 48 c7 c0 ff ff ff ff eb b3 0f 1f 80 00 00 00 00 f3 0f 1e fa
48 8d 05 65 63 2d 00 8b 00 85 c0 75 17 b8 01 00 00 00 0f 05 <48> 3d 00
f0 ff ff 77 58 c3 0f 1f 80 00 00 00 00 41 54 49 89 d4 55
RSP: 002b:7ffd5b863cb8 EFLAGS: 0246 ORIG_RAX: 0001
RAX: ffda RBX: 0005 RCX: 7f6c0dad73b8
RDX: 0005 RSI: 55a9216e1710 RDI: 0001
RBP: 55a9216e1710 R08: 000a R09: 7ffd5b863840
R10: 000a R11: 0246 R12: 7f6c0dda9780
R13: 0005 R14: 7f6c0dda4740 R15: 0005
Modules linked in: nls_iso8859_1 nls_cp437 vfat fat kvm_intel kvm
irqbypass efivars ip_tables x_tables xfs sd_mod ahci libahci igb
i2c_algo_bit libata i2c_core dm_mirror dm_region_hash dm_log dm_mod
efivarfs
CR2: 888453d0
---[ end trace ccf646c7456717c5 ]---
RIP: 0010:scan_block+0xb5/0x290
Code: 85 6e 01 00 00 48 b8 00 00 30 f5 81 88 ff ff 48 39 c3 0f 84 5b 01
00 00 48 89 d8 48 c1 e8 03 42 80 3c 20 00 0f 85 87 01 00 00 <4c> 8b 3b
e8 f3 0c fa ff 4c 39 3d 0c 6b 4c 01 0f 87 08 01 00 00 4c
RSP: 0018:8881ec57f8e0 EFLAGS: 00010082
RAX:  RBX: 888453d0 RCX: a61e5a54
RDX:  RSI: 0008 RDI: 888453d0
RBP: 8881ec57f920 R08: fbfff4ed588d R09: fbfff4ed588c
R10: fbfff4ed588c R11: a76ac463 R12: dc00
R13: 888453d00ff9 R14: 8881f80cef48 R15: 8881f80cef48
FS:  7f6c0e3f8740() GS:8881f768() knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
CR2: 888453d0 CR3: 0001c4244003 CR4: 001606a0
Kernel panic - not syncing: Fatal exception
Shutting down cpus with NMI
Kernel Offset: 0x24c0 from 0x8100 (relocation range:
0x8000-0xbfff)
---[ end Kernel panic - not syncing: Fatal exception ]---

Signed-off-by: Qian Cai 
---

v2: move kmemleak_free() into free_page_ext() as there is no need to call
kmemleak_free() in the vfree() case.

 mm/page_ext.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/mm/page_ext.c b/mm/page_ext.c
index 8c78b8d45117..0b6637d7bae9 100644
--- a/mm/page_ext.c
+++ b/mm/page_ext.c
@@ -274,6 +274,7 @@ static void free_page_ext(void *addr)
 
BUG_ON(PageReserved(page));
free_pages_exact(addr, table_size);
+   kmemleak_free(addr);
}
 }
 
-- 
2.17.2 (Apple Git-113)



[PATCH] mm/page_ext: fix an imbalance with kmemleak

2019-02-27 Thread Qian Cai
After offlining a memory block, a kmemleak scan will trigger a crash, as
it encounters a page_ext address that has already been freed during
memory offlining. alloc_page_ext() calls kmemleak_alloc() at the
beginning, but kmemleak_free() is never called in __free_page_ext().

BUG: unable to handle kernel paging request at 888453d0
PGD 128a01067 P4D 128a01067 PUD 128a04067 PMD 47e09e067 PTE 800bac2ff060
Oops:  [#1] SMP DEBUG_PAGEALLOC KASAN PTI
CPU: 1 PID: 1594 Comm: bash Not tainted 5.0.0-rc8+ #15
Hardware name: HP ProLiant DL180 Gen9/ProLiant DL180 Gen9, BIOS U20 10/25/2017
RIP: 0010:scan_block+0xb5/0x290
Code: 85 6e 01 00 00 48 b8 00 00 30 f5 81 88 ff ff 48 39 c3 0f 84 5b 01
00 00 48 89 d8 48 c1 e8 03 42 80 3c 20 00 0f 85 87 01 00 00 <4c> 8b 3b
e8 f3 0c fa ff 4c 39 3d 0c 6b 4c 01 0f 87 08 01 00 00 4c
RSP: 0018:8881ec57f8e0 EFLAGS: 00010082
RAX:  RBX: 888453d0 RCX: a61e5a54
RDX:  RSI: 0008 RDI: 888453d0
RBP: 8881ec57f920 R08: fbfff4ed588d R09: fbfff4ed588c
R10: fbfff4ed588c R11: a76ac463 R12: dc00
R13: 888453d00ff9 R14: 8881f80cef48 R15: 8881f80cef48
FS:  7f6c0e3f8740() GS:8881f768() knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
CR2: 888453d0 CR3: 0001c4244003 CR4: 001606a0
Call Trace:
 scan_gray_list+0x269/0x430
 kmemleak_scan+0x5a8/0x10f0
 kmemleak_write+0x541/0x6ca
 full_proxy_write+0xf8/0x190
 __vfs_write+0xeb/0x980
 vfs_write+0x15a/0x4f0
 ksys_write+0xd2/0x1b0
 __x64_sys_write+0x73/0xb0
 do_syscall_64+0xeb/0xaaa
 entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f6c0dad73b8
Code: 89 02 48 c7 c0 ff ff ff ff eb b3 0f 1f 80 00 00 00 00 f3 0f 1e fa
48 8d 05 65 63 2d 00 8b 00 85 c0 75 17 b8 01 00 00 00 0f 05 <48> 3d 00
f0 ff ff 77 58 c3 0f 1f 80 00 00 00 00 41 54 49 89 d4 55
RSP: 002b:7ffd5b863cb8 EFLAGS: 0246 ORIG_RAX: 0001
RAX: ffda RBX: 0005 RCX: 7f6c0dad73b8
RDX: 0005 RSI: 55a9216e1710 RDI: 0001
RBP: 55a9216e1710 R08: 000a R09: 7ffd5b863840
R10: 000a R11: 0246 R12: 7f6c0dda9780
R13: 0005 R14: 7f6c0dda4740 R15: 0005
Modules linked in: nls_iso8859_1 nls_cp437 vfat fat kvm_intel kvm
irqbypass efivars ip_tables x_tables xfs sd_mod ahci libahci igb
i2c_algo_bit libata i2c_core dm_mirror dm_region_hash dm_log dm_mod
efivarfs
CR2: 888453d0
---[ end trace ccf646c7456717c5 ]---
RIP: 0010:scan_block+0xb5/0x290
Code: 85 6e 01 00 00 48 b8 00 00 30 f5 81 88 ff ff 48 39 c3 0f 84 5b 01
00 00 48 89 d8 48 c1 e8 03 42 80 3c 20 00 0f 85 87 01 00 00 <4c> 8b 3b
e8 f3 0c fa ff 4c 39 3d 0c 6b 4c 01 0f 87 08 01 00 00 4c
RSP: 0018:8881ec57f8e0 EFLAGS: 00010082
RAX:  RBX: 888453d0 RCX: a61e5a54
RDX:  RSI: 0008 RDI: 888453d0
RBP: 8881ec57f920 R08: fbfff4ed588d R09: fbfff4ed588c
R10: fbfff4ed588c R11: a76ac463 R12: dc00
R13: 888453d00ff9 R14: 8881f80cef48 R15: 8881f80cef48
FS:  7f6c0e3f8740() GS:8881f768() knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
CR2: 888453d0 CR3: 0001c4244003 CR4: 001606a0
Kernel panic - not syncing: Fatal exception
Shutting down cpus with NMI
Kernel Offset: 0x24c0 from 0x8100 (relocation range:
0x8000-0xbfff)
---[ end Kernel panic - not syncing: Fatal exception ]---

Signed-off-by: Qian Cai 
---
 mm/page_ext.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/mm/page_ext.c b/mm/page_ext.c
index 8c78b8d45117..b68f2a58ea3b 100644
--- a/mm/page_ext.c
+++ b/mm/page_ext.c
@@ -288,6 +288,7 @@ static void __free_page_ext(unsigned long pfn)
base = get_entry(ms->page_ext, pfn);
free_page_ext(base);
ms->page_ext = NULL;
+   kmemleak_free(base);
 }
 
 static int __meminit online_page_ext(unsigned long start_pfn,
-- 
2.17.2 (Apple Git-113)



Re: [PATCH] tmpfs: fix uninitialized return value in shmem_link

2019-02-27 Thread Qian Cai
On Mon, 2019-02-25 at 16:07 -0800, Linus Torvalds wrote:
> On Mon, Feb 25, 2019 at 4:03 PM Qian Cai  wrote:
> > > 
> > > Of course, that's just gcc. I have no idea what llvm ends up doing.
> > 
> > Clang 7.0:
> > 
> > # clang  -O2 -S -Wall /tmp/test.c
> > /tmp/test.c:46:6: warning: variable 'ret' is used uninitialized whenever
> > 'if'
> > condition is false [-Wsometimes-uninitialized]
> 
> Ok, good.
> 
> Do we have any clang builds in any of the zero-day robot
> infrastructure or something? Should we?
> 
> And maybe this was how Dan noticed the problem in the first place? Or
> is it just because of his eagle-eyes?
> 

BTW, even though clang is able to generate warnings for your sample code, it
does not generate any warnings when compiling the buggy shmem.o via "make
CC=clang". Here is the objdump for arm64 (with KASAN_SW_TAGS inline).

effc :
{
effc:   f81c0ff7str x23, [sp, #-64]!
f000:   a90157f6stp x22, x21, [sp, #16]
f004:   a9024ff4stp x20, x19, [sp, #32]
f008:   a9037bfdstp x29, x30, [sp, #48]
f00c:   9100c3fdadd x29, sp, #0x30
f010:   aa0203f3mov x19, x2
f014:   aa0103f5mov x21, x1
f018:   aa0003f4mov x20, x0
f01c:   9400bl  0 <_mcount>
f020:   91016280add x0, x20, #0x58
f024:   d2c20017mov x23, #0x1000//
#17592186044416
f028:   b2481c08orr x8, x0, #0xff00
f02c:   f2fdfff7movkx23, #0xefff, lsl #48
f030:   d344fd08lsr x8, x8, #4
f034:   38776909ldrbw9, [x8, x23]
f038:   940017d5bl  14f8c 
f03c:   5460b.eqf048   // b.none
f040:   7103fd1fcmp w8, #0xff
f044:   54000981b.nef174   // b.any
f048:   f9400014ldr x20, [x0]
if (inode->i_nlink) {
f04c:   91010280add x0, x20, #0x40
f050:   b2481c08orr x8, x0, #0xff00
f054:   d344fd08lsr x8, x8, #4
f058:   38776909ldrbw9, [x8, x23]
f05c:   940017ccbl  14f8c 
f060:   5460b.eqf06c   // b.none
f064:   7103fd1fcmp w8, #0xff
f068:   540008a1b.nef17c   // b.any
f06c:   b948ldr w8, [x0]
f070:   34000148cbz w8, f098 
f074:   940018fdbl  15468 
ret = shmem_reserve_inode(inode->i_sb);
f078:   38776909ldrbw9, [x8, x23]
f07c:   940017c4bl  14f8c 
f080:   5460b.eqf08c   // b.none
f084:   7103fd1fcmp w8, #0xff
f088:   540007e1b.nef184   // b.any
f08c:   f940ldr x0, [x0]
f090:   97fffcf6bl  e468 
if (ret)
f094:   35000660cbnzw0, f160 
dir->i_size += BOGO_DIRENT_SIZE;
f098:   910122a0add x0, x21, #0x48
f09c:   b2481c08orr x8, x0, #0xff00
f0a0:   d344fd09lsr x9, x8, #4
f0a4:   3877692aldrbw10, [x9, x23]
f0a8:   94001828bl  15148 
f0ac:   5460b.eqf0b8   // b.none
f0b0:   7103fd1fcmp w8, #0xff
f0b4:   540006c1b.nef18c   // b.any
f0b8:   38776929ldrbw9, [x9, x23]
f0bc:   94001a4abl  159e4 
f0c0:   5460b.eqf0cc   // b.none
f0c4:   7103fd1fcmp w8, #0xff
f0c8:   54000661b.nef194   // b.any
f0cc:   f909str x9, [x0]
inode->i_ctime = dir->i_ctime = dir->i_mtime = current_time(inode);
f0d0:   aa1403e0mov x0, x20
f0d4:   910182b6add x22, x21, #0x60
f0d8:   9400bl  0 
f0dc:   b2481ec9orr x9, x22, #0xff00
f0e0:   d344fd29lsr x9, x9, #4



Re: [PATCH] mm/hotplug: fix an imbalance with DEBUG_PAGEALLOC

2019-02-26 Thread Qian Cai
On Tue, 2019-02-26 at 20:40 +0100, Michal Hocko wrote:
> It seems you have missed the point of my question. It simply doesn't
> make much sense to have offline memory mapped. That memory is not
> accessible in general. So mapping it at the offline time is dubious at
> best. 

Well, kernel_map_pages() is like other debug features which could look
"unusual".

> Also you do not get through the offlining phase on a newly
> hotplugged (and not yet onlined) memory. So the patch doesn't look
> correct to me and it all smells like the bug you are seeing is a wrong
> reporting.
> 

That (physical memory hot-add) is a special case, just like boot. The patch
strictly deals with offlining/onlining memory, i.e., logical/soft memory
hotplug.


Re: [PATCH] mm/hotplug: fix an imbalance with DEBUG_PAGEALLOC

2019-02-26 Thread Qian Cai
On Tue, 2019-02-26 at 19:20 +0100, Michal Hocko wrote:
> Btw. what happens if the offlined pfn range is removed completely? Is
> the range still mapped? What kind of consequences does this have?

Well, the pages are still marked as reserved as well, so it is up to the
physical memory hot-remove handler to free the kernel direct-mapping page tables,
the virtual memory mapping pages, and the virtual memory mapping page tables by
design, although I have no way to test it.

> Also when does this tweak happens on a completely new hotplugged memory
> range?

I suppose it will call online_pages(), which in turn calls
kernel_unmap_linear_page(), which may or may not have the same issue, but I have
no way to test that path.


Re: [PATCH] mm/hotplug: fix an imbalance with DEBUG_PAGEALLOC

2019-02-26 Thread Qian Cai
On Tue, 2019-02-26 at 15:23 +0100, Michal Hocko wrote:
> On Tue 26-02-19 09:16:30, Qian Cai wrote:
> > 
> > 
> > On 2/26/19 7:35 AM, Michal Hocko wrote:
> > > On Mon 25-02-19 14:17:10, Qian Cai wrote:
> > > > When onlining memory pages, it calls kernel_unmap_linear_page(),
> > > > However, it does not call kernel_map_linear_page() while offlining
> > > > memory pages. As the result, it triggers a panic below while onlining on
> > > > ppc64le as it checks if the pages are mapped before unmapping,
> > > > Therefore, let it call kernel_map_linear_page() when setting all pages
> > > > as reserved.
> > > 
> > > This really begs for much more explanation. All the pages should be
> > > unmapped as they get freed AFAIR. So why do we need a special handing
> > > here when this path only offlines free pages?
> > > 
> > 
> > It sounds like this is exact the point to explain the imbalance. When
> > offlining,
> > every page has already been unmapped and marked reserved. When onlining, it
> > tries to free those reserved pages via __online_page_free(). Since those
> > pages
> > are order 0, it goes free_unref_page() which in-turn call
> > kernel_unmap_linear_page() again without been mapped first.
> 
> How is this any different from an initial page being freed to the
> allocator during the boot?
> 

At least for IBM POWER8, it does this during boot,

early_setup
  early_init_mmu
    hash__early_init_mmu
  htab_initialize [1]
htab_bolt_mapping [2]

where it effectively maps all memblock regions just like
kernel_map_linear_page(), so later mem_init() -> memblock_free_all() will unmap
them just fine.

[1]
for_each_memblock(memory, reg) {
base = (unsigned long)__va(reg->base);
size = reg->size;

DBG("creating mapping for region: %lx..%lx (prot: %lx)\n",
base, size, prot);

BUG_ON(htab_bolt_mapping(base, base + size, __pa(base),
prot, mmu_linear_psize, mmu_kernel_ssize));
}

[2] linear_map_hash_slots[paddr >> PAGE_SHIFT] = ret | 0x80;




Re: [PATCH] mm/hotplug: fix an imbalance with DEBUG_PAGEALLOC

2019-02-26 Thread Qian Cai



On 2/26/19 7:35 AM, Michal Hocko wrote:
> On Mon 25-02-19 14:17:10, Qian Cai wrote:
>> When onlining memory pages, it calls kernel_unmap_linear_page(),
>> However, it does not call kernel_map_linear_page() while offlining
>> memory pages. As the result, it triggers a panic below while onlining on
>> ppc64le as it checks if the pages are mapped before unmapping,
>> Therefore, let it call kernel_map_linear_page() when setting all pages
>> as reserved.
> 
> This really begs for much more explanation. All the pages should be
> unmapped as they get freed AFAIR. So why do we need a special handing
> here when this path only offlines free pages?
> 

It sounds like this is exactly the point that explains the imbalance. When offlining,
every page has already been unmapped and marked reserved. When onlining, it
tries to free those reserved pages via __online_page_free(). Since those pages
are order 0, it goes through free_unref_page(), which in turn calls
kernel_unmap_linear_page() again without them having been mapped first.
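The double-unmap can be modeled in a few lines of plain C (a toy sketch, not kernel code: `slot`, `map_page()`, `offline_page()` and friends are invented stand-ins for the `linear_map_hash_slots[]` bookkeeping in arch/powerpc/mm/hash_utils_64.c):

```c
#include <assert.h>

/* Toy model of the ppc64 linear-map bookkeeping: bit 0x80 records
 * whether the page is currently mapped in the kernel linear mapping. */
static unsigned char slot = 0x80;	/* page starts mapped */

static void unmap_page(void)
{
	assert(slot & 0x80);	/* toy analogue of the BUG_ON() that fires */
	slot &= ~0x80;
}

static void map_page(void)
{
	slot |= 0x80;
}

/* Offlining unmaps the page and leaves it reserved. */
static void offline_page(void)
{
	unmap_page();
}

/* Onlining frees the page, and free_unref_page() unmaps it again. */
static void online_page_fixed(void)
{
	map_page();	/* the proposed fix: rebalance before freeing */
	unmap_page();	/* done by free_unref_page() */
}
```

Dropping the map_page() call from online_page_fixed() makes the second unmap_page() trip the assertion, which is the toy analogue of hitting the BUG_ON at arch/powerpc/mm/hash_utils_64.c:1815.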


Re: [PATCH] mm/hotplug: fix an imbalance with DEBUG_PAGEALLOC

2019-02-26 Thread Qian Cai



On 2/26/19 7:13 AM, Souptick Joarder wrote:
> On Tue, Feb 26, 2019 at 12:47 AM Qian Cai  wrote:
>>
>> When onlining memory pages, it calls kernel_unmap_linear_page(),
>> However, it does not call kernel_map_linear_page() while offlining
>> memory pages. As the result, it triggers a panic below while onlining on
>> ppc64le as it checks if the pages are mapped before unmapping,
>> Therefore, let it call kernel_map_linear_page() when setting all pages
>> as reserved.
>>
>> kernel BUG at arch/powerpc/mm/hash_utils_64.c:1815!
>> Oops: Exception in kernel mode, sig: 5 [#1]
>> LE SMP NR_CPUS=256 DEBUG_PAGEALLOC NUMA pSeries
>> CPU: 2 PID: 4298 Comm: bash Not tainted 5.0.0-rc7+ #15
>> NIP:  c0062670 LR: c006265c CTR: 
>> REGS: c005bf8a75b0 TRAP: 0700   Not tainted  (5.0.0-rc7+)
>> MSR:  8282b033   CR: 28422842  XER: 
>> 
>> CFAR: c0804f44 IRQMASK: 1
>> GPR00: c006265c c005bf8a7840 c1518200 c13cbcc8
>> GPR04: 00080004  ccc457e0 c005c4e341d8
>> GPR08:  0001 c7f4f800 0001
>> GPR12: 2200 c7f4e100  000139c29710
>> GPR16: 000139c29714 000139c29788 c13cbcc8 
>> GPR20: 00034000 c16e05e8  0001
>> GPR24: 00bf50d9 818e  c16e04b8
>> GPR28: f0d00040 006420a2f217 f0d0 00ea1b217034
>> NIP [c0062670] __kernel_map_pages+0x2e0/0x4f0
>> LR [c006265c] __kernel_map_pages+0x2cc/0x4f0
>> Call Trace:
>> [c005bf8a7840] [c006265c] __kernel_map_pages+0x2cc/0x4f0 
>> (unreliable)
>> [c005bf8a78d0] [c028c4a0] free_unref_page_prepare+0x2f0/0x4d0
>> [c005bf8a7930] [c0293144] free_unref_page+0x44/0x90
>> [c005bf8a7970] [c037af24] __online_page_free+0x84/0x110
>> [c005bf8a79a0] [c037b6e0] online_pages_range+0xc0/0x150
>> [c005bf8a7a00] [c005aaa8] walk_system_ram_range+0xc8/0x120
>> [c005bf8a7a50] [c037e710] online_pages+0x280/0x5a0
>> [c005bf8a7b40] [c06419e4] memory_subsys_online+0x1b4/0x270
>> [c005bf8a7bb0] [c0616720] device_online+0xc0/0xf0
>> [c005bf8a7bf0] [c0642570] state_store+0xc0/0x180
>> [c005bf8a7c30] [c0610b2c] dev_attr_store+0x3c/0x60
>> [c005bf8a7c50] [c04c0a50] sysfs_kf_write+0x70/0xb0
>> [c005bf8a7c90] [c04bf40c] kernfs_fop_write+0x10c/0x250
>> [c005bf8a7ce0] [c03e4b18] __vfs_write+0x48/0x240
>> [c005bf8a7d80] [c03e4f68] vfs_write+0xd8/0x210
>> [c005bf8a7dd0] [c03e52f0] ksys_write+0x70/0x120
>> [c005bf8a7e20] [c000b000] system_call+0x5c/0x70
>> Instruction dump:
>> 7fbd5278 7fbd4a78 3e42ffeb 7bbd0640 3a523ac8 7e439378 487a2881 6000
>> e95505f0 7e6aa0ae 6a690080 7929c9c2 <0b09> 7f4aa1ae 7e439378 487a28dd
>>
>> Signed-off-by: Qian Cai 
>> ---
>>  mm/page_alloc.c | 1 +
>>  1 file changed, 1 insertion(+)
>>
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index 10d0f2ed9f69..025fc93d1518 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -8349,6 +8349,7 @@ __offline_isolated_pages(unsigned long start_pfn, 
>> unsigned long end_pfn)
>> for (i = 0; i < (1 << order); i++)
>> SetPageReserved((page+i));
>> pfn += (1 << order);
>> +   kernel_map_pages(page, 1 << order, 1);
> 
> Doubt , Not sure, but does this change will have any impact on
> drivers/base/memory.c#L249
> memory_block_action() ->  offline_pages() ?

Yes, it does.


Re: [PATCH] tmpfs: fix uninitialized return value in shmem_link

2019-02-25 Thread Qian Cai



On 2/25/19 6:58 PM, Linus Torvalds wrote:
> On Mon, Feb 25, 2019 at 2:34 PM Linus Torvalds
>  wrote:
>>
>> On Mon, Feb 25, 2019 at 12:34 PM Hugh Dickins  wrote:
>>>
>>> Seems like a gcc bug? But I don't have a decent recent gcc to hand
>>> to submit a proper report, hope someone else can shed light on it.
>>
>> I don't have a _very_ recent gcc either [..]
> 
> Well, that was quick. Yup, it's considered a gcc bug.
> 
> Sadly, it's just a different version of a really old bug:
> 
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=18501
> 
> which goes back to 2004.
> 
> Which I guess means we should not expect this to be fixed in gcc any time 
> soon.
> 
> The *good* news (I guess) is that if we have other situations with
> that pattern, and that lack of warning, it really is because gcc will
> have generated code as if it was initialized (to the value that we
> tested it must have been in the one basic block where it *was*
> initialized).
> 
> So it won't leak random kernel data, and with the common error
> condition case (like in this example - checking that we didn't have an
> error) it will actually end up doing the right thing.
> 
> Entirely by mistake, and without a warniing, but still.. It could have
> been much worse. Basically at least for this pattern, "lack of
> warning" ends up meaning "it got initialized to the expected value".
> 
> Of course, that's just gcc. I have no idea what llvm ends up doing.
> 

Clang 7.0:

# clang  -O2 -S -Wall /tmp/test.c
/tmp/test.c:46:6: warning: variable 'ret' is used uninitialized whenever 'if'
condition is false [-Wsometimes-uninitialized]
if (inode->i_nlink) {
^~
/tmp/test.c:60:9: note: uninitialized use occurs here
return ret;
   ^~~
/tmp/test.c:46:2: note: remove the 'if' if its condition is always true
if (inode->i_nlink) {
^~~~
/tmp/test.c:37:9: note: initialize the variable 'ret' to silence this warning
int ret;
   ^
= 0
1 warning generated.
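For reference, the pattern reduces to something like the following (my own reduction, not the actual /tmp/test.c; `fake_reserve()` and `buggy_link()` are invented stand-ins for shmem_reserve_inode() and shmem_link()). clang -O2 -Wall flags it with -Wsometimes-uninitialized, while gcc stays silent per PR 18501:

```c
#include <assert.h>

/* Pretend the reservation always succeeds. */
static int fake_reserve(void)
{
	return 0;
}

/* Mirrors the shmem_link() shape: 'ret' is only assigned inside the
 * conditional, so when nlink == 0 the function returns an
 * uninitialized value. */
static int buggy_link(int nlink)
{
	int ret;	/* not initialized */

	if (nlink)
		ret = fake_reserve();
	/* when nlink == 0, ret is returned uninitialized */
	return ret;
}
```

Calling buggy_link(1) is well-defined and returns 0; buggy_link(0) is the undefined-behavior case the clang warning is about.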


[PATCH] mm/hotplug: fix an imbalance with DEBUG_PAGEALLOC

2019-02-25 Thread Qian Cai
When onlining memory pages, it calls kernel_unmap_linear_page().
However, it does not call kernel_map_linear_page() while offlining
memory pages. As a result, it triggers a panic below while onlining on
ppc64le, as it checks whether the pages are mapped before unmapping.
Therefore, let it call kernel_map_linear_page() when setting all pages
as reserved.

kernel BUG at arch/powerpc/mm/hash_utils_64.c:1815!
Oops: Exception in kernel mode, sig: 5 [#1]
LE SMP NR_CPUS=256 DEBUG_PAGEALLOC NUMA pSeries
CPU: 2 PID: 4298 Comm: bash Not tainted 5.0.0-rc7+ #15
NIP:  c0062670 LR: c006265c CTR: 
REGS: c005bf8a75b0 TRAP: 0700   Not tainted  (5.0.0-rc7+)
MSR:  8282b033   CR: 28422842  XER: 

CFAR: c0804f44 IRQMASK: 1
GPR00: c006265c c005bf8a7840 c1518200 c13cbcc8
GPR04: 00080004  ccc457e0 c005c4e341d8
GPR08:  0001 c7f4f800 0001
GPR12: 2200 c7f4e100  000139c29710
GPR16: 000139c29714 000139c29788 c13cbcc8 
GPR20: 00034000 c16e05e8  0001
GPR24: 00bf50d9 818e  c16e04b8
GPR28: f0d00040 006420a2f217 f0d0 00ea1b217034
NIP [c0062670] __kernel_map_pages+0x2e0/0x4f0
LR [c006265c] __kernel_map_pages+0x2cc/0x4f0
Call Trace:
[c005bf8a7840] [c006265c] __kernel_map_pages+0x2cc/0x4f0 
(unreliable)
[c005bf8a78d0] [c028c4a0] free_unref_page_prepare+0x2f0/0x4d0
[c005bf8a7930] [c0293144] free_unref_page+0x44/0x90
[c005bf8a7970] [c037af24] __online_page_free+0x84/0x110
[c005bf8a79a0] [c037b6e0] online_pages_range+0xc0/0x150
[c005bf8a7a00] [c005aaa8] walk_system_ram_range+0xc8/0x120
[c005bf8a7a50] [c037e710] online_pages+0x280/0x5a0
[c005bf8a7b40] [c06419e4] memory_subsys_online+0x1b4/0x270
[c005bf8a7bb0] [c0616720] device_online+0xc0/0xf0
[c005bf8a7bf0] [c0642570] state_store+0xc0/0x180
[c005bf8a7c30] [c0610b2c] dev_attr_store+0x3c/0x60
[c005bf8a7c50] [c04c0a50] sysfs_kf_write+0x70/0xb0
[c005bf8a7c90] [c04bf40c] kernfs_fop_write+0x10c/0x250
[c005bf8a7ce0] [c03e4b18] __vfs_write+0x48/0x240
[c005bf8a7d80] [c03e4f68] vfs_write+0xd8/0x210
[c005bf8a7dd0] [c03e52f0] ksys_write+0x70/0x120
[c005bf8a7e20] [c000b000] system_call+0x5c/0x70
Instruction dump:
7fbd5278 7fbd4a78 3e42ffeb 7bbd0640 3a523ac8 7e439378 487a2881 6000
e95505f0 7e6aa0ae 6a690080 7929c9c2 <0b09> 7f4aa1ae 7e439378 487a28dd

Signed-off-by: Qian Cai 
---
 mm/page_alloc.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 10d0f2ed9f69..025fc93d1518 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -8349,6 +8349,7 @@ __offline_isolated_pages(unsigned long start_pfn, 
unsigned long end_pfn)
for (i = 0; i < (1 << order); i++)
SetPageReserved((page+i));
pfn += (1 << order);
+   kernel_map_pages(page, 1 << order, 1);
}
spin_unlock_irqrestore(>lock, flags);
 }
-- 
2.17.2 (Apple Git-113)



Re: [PATCH] [v2] kasan: turn off asan-stack for clang-8 and earlier

2019-02-22 Thread Qian Cai



On 2/22/19 5:29 PM, Arnd Bergmann wrote:
> Building an arm64 allmodconfig kernel with clang results in over 140 warnings
> about overly large stack frames, the worst ones being:
> 
> drivers/gpu/drm/panel/panel-sitronix-st7789v.c:196:12: error: stack frame 
> size of 20224 bytes in function 'st7789v_prepare'
> drivers/video/fbdev/omap2/omapfb/displays/panel-tpo-td028ttec1.c:196:12: 
> error: stack frame size of 13120 bytes in function 'td028ttec1_panel_enable'
> drivers/usb/host/max3421-hcd.c:1395:1: error: stack frame size of 10048 bytes 
> in function 'max3421_spi_thread'
> drivers/net/wan/slic_ds26522.c:209:12: error: stack frame size of 9664 bytes 
> in function 'slic_ds26522_probe'
> drivers/crypto/ccp/ccp-ops.c:2434:5: error: stack frame size of 8832 bytes in 
> function 'ccp_run_cmd'
> drivers/media/dvb-frontends/stv0367.c:1005:12: error: stack frame size of 
> 7840 bytes in function 'stv0367ter_algo'
> 
> None of these happen with gcc today, and almost all of these are the result
> of a single known issue in llvm.  Hopefully it will eventually get fixed with
> the clang-9 release.
> 
> In the meantime, the best idea I have is to turn off asan-stack for clang-8
> and earlier, so we can produce a kernel that is safe to run.
> 
> I have posted three patches that address the frame overflow warnings that are
> not addressed by turning off asan-stack, so in combination with this change,
> we get much closer to a clean allmodconfig build, which in turn is necessary
> to do meaningful build regression testing.
> 
> It is still possible to turn on the CONFIG_ASAN_STACK option on all versions
> of clang, and it's always enabled for gcc, but when CONFIG_COMPILE_TEST is
> set, the option remains invisible, so allmodconfig and randconfig builds
> (which are normally done with a forced CONFIG_COMPILE_TEST) will still result
> in a mostly clean build.
> 
> Cc: Andrey Ryabinin 
> Cc: Dmitry Vyukov 
> Cc: Nick Desaulniers 
> Cc: Mark Brown 
> Cc: Qian Cai 
> Cc: Kostya Serebryany 
> Cc: Andrey Konovalov 
> Link: https://bugs.llvm.org/show_bug.cgi?id=38809
> Signed-off-by: Arnd Bergmann 
> ---

Reviewed-by: Qian Cai 


Re: [PATCH -next] mm/debug: use lx% for atomic64_read() on ppc64le

2019-02-20 Thread Qian Cai
Please ignore this patch.

On 2/20/19 8:08 PM, Qian Cai wrote:
> atomic64_read() on ppc64le returns "long int" while "long long" seems on
> all other arches, so deal the special case for ppc64le.
> 
> In file included from ./include/linux/printk.h:7,
>  from ./include/linux/kernel.h:15,
>  from mm/debug.c:9:
> mm/debug.c: In function 'dump_mm':
> ./include/linux/kern_levels.h:5:18: warning: format '%llx' expects
> argument of type 'long long unsigned int', but argument 19 has type
> 'long int' [-Wformat=]
>  #define KERN_SOH "\001"  /* ASCII Start Of Header */
>   ^~
> ./include/linux/kern_levels.h:8:20: note: in expansion of macro
> 'KERN_SOH'
>  #define KERN_EMERG KERN_SOH "0" /* system is unusable */
> ^~~~
> ./include/linux/printk.h:297:9: note: in expansion of macro 'KERN_EMERG'
>   printk(KERN_EMERG pr_fmt(fmt), ##__VA_ARGS__)
>  ^~
> mm/debug.c:133:2: note: in expansion of macro 'pr_emerg'
>   pr_emerg("mm %px mmap %px seqnum %llu task_size %lu\n"
>   ^~~~
> mm/debug.c:140:17: note: format string is defined here
>"pinned_vm %llx data_vm %lx exec_vm %lx stack_vm %lx\n"
>   ~~~^
>   %lx
> 
> Fixes: 70f8a3ca68d3 ("mm: make mm->pinned_vm an atomic64 counter")
> Signed-off-by: Qian Cai 
> ---
>  mm/debug.c | 7 ++-
>  1 file changed, 6 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/debug.c b/mm/debug.c
> index c0b31b6c3877..e4ec3d68833e 100644
> --- a/mm/debug.c
> +++ b/mm/debug.c
> @@ -137,7 +137,12 @@ void dump_mm(const struct mm_struct *mm)
>   "mmap_base %lu mmap_legacy_base %lu highest_vm_end %lu\n"
>   "pgd %px mm_users %d mm_count %d pgtables_bytes %lu map_count 
> %d\n"
>   "hiwater_rss %lx hiwater_vm %lx total_vm %lx locked_vm %lx\n"
> - "pinned_vm %llx data_vm %lx exec_vm %lx stack_vm %lx\n"
> +#ifdef __powerpc64__
> + "pinned_vm %lx "
> +#else
> + "pinned_vm %llx "
> +#endif
> + "data_vm %lx exec_vm %lx stack_vm %lx\n"
>   "start_code %lx end_code %lx start_data %lx end_data %lx\n"
>   "start_brk %lx brk %lx start_stack %lx\n"
>   "arg_start %lx arg_end %lx env_start %lx env_end %lx\n"
> 


[PATCH -next] mm/debug: add a cast to u64 for atomic64_read()

2019-02-20 Thread Qian Cai
atomic64_read() on ppc64le returns "long int", so fix it the same way as
commit d549f545e690 ("drm/virtio: use %llu format string form
atomic64_t") by adding a cast to u64, which makes it work on all arches.

In file included from ./include/linux/printk.h:7,
 from ./include/linux/kernel.h:15,
 from mm/debug.c:9:
mm/debug.c: In function 'dump_mm':
./include/linux/kern_levels.h:5:18: warning: format '%llx' expects
argument of type 'long long unsigned int', but argument 19 has type
'long int' [-Wformat=]
 #define KERN_SOH "\001"  /* ASCII Start Of Header */
  ^~
./include/linux/kern_levels.h:8:20: note: in expansion of macro
'KERN_SOH'
 #define KERN_EMERG KERN_SOH "0" /* system is unusable */
^~~~
./include/linux/printk.h:297:9: note: in expansion of macro 'KERN_EMERG'
  printk(KERN_EMERG pr_fmt(fmt), ##__VA_ARGS__)
 ^~
mm/debug.c:133:2: note: in expansion of macro 'pr_emerg'
  pr_emerg("mm %px mmap %px seqnum %llu task_size %lu\n"
  ^~~~
mm/debug.c:140:17: note: format string is defined here
   "pinned_vm %llx data_vm %lx exec_vm %lx stack_vm %lx\n"
  ~~~^
  %lx

Fixes: 70f8a3ca68d3 ("mm: make mm->pinned_vm an atomic64 counter")
Signed-off-by: Qian Cai 
---
 mm/debug.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/debug.c b/mm/debug.c
index c0b31b6c3877..45d9eb77b84e 100644
--- a/mm/debug.c
+++ b/mm/debug.c
@@ -168,7 +168,7 @@ void dump_mm(const struct mm_struct *mm)
mm_pgtables_bytes(mm),
mm->map_count,
mm->hiwater_rss, mm->hiwater_vm, mm->total_vm, mm->locked_vm,
-   atomic64_read(>pinned_vm),
+   (u64)atomic64_read(>pinned_vm),
mm->data_vm, mm->exec_vm, mm->stack_vm,
mm->start_code, mm->end_code, mm->start_data, mm->end_data,
mm->start_brk, mm->brk, mm->start_stack,
-- 
2.17.2 (Apple Git-113)
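The mismatch and the cast are easy to demonstrate in userspace (a sketch: `format_pinned()` is an invented stand-in; a plain long and snprintf() stand in for atomic64_read() and pr_emerg()):

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* On ppc64le the atomic64 accessor yields a plain long; casting to
 * unsigned long long makes the argument match %llx on every arch. */
static void format_pinned(long pinned, char *buf, unsigned long len)
{
	snprintf(buf, len, "pinned_vm %llx", (unsigned long long)pinned);
}

/* Self-check: the cast produces the expected formatting even though
 * the underlying type was long rather than long long. */
static int format_matches(long pinned, const char *expect)
{
	char buf[64];

	format_pinned(pinned, buf, sizeof(buf));
	return strcmp(buf, expect) == 0;
}
```

Without the cast, passing a long to a %llx conversion is undefined and is exactly what -Wformat= complains about in dump_mm().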



[PATCH -next] mm/debug: use lx% for atomic64_read() on ppc64le

2019-02-20 Thread Qian Cai
atomic64_read() on ppc64le returns "long int" while it seems to return
"long long" on all other arches, so handle the special case for ppc64le.

In file included from ./include/linux/printk.h:7,
 from ./include/linux/kernel.h:15,
 from mm/debug.c:9:
mm/debug.c: In function 'dump_mm':
./include/linux/kern_levels.h:5:18: warning: format '%llx' expects
argument of type 'long long unsigned int', but argument 19 has type
'long int' [-Wformat=]
 #define KERN_SOH "\001"  /* ASCII Start Of Header */
  ^~
./include/linux/kern_levels.h:8:20: note: in expansion of macro
'KERN_SOH'
 #define KERN_EMERG KERN_SOH "0" /* system is unusable */
^~~~
./include/linux/printk.h:297:9: note: in expansion of macro 'KERN_EMERG'
  printk(KERN_EMERG pr_fmt(fmt), ##__VA_ARGS__)
 ^~
mm/debug.c:133:2: note: in expansion of macro 'pr_emerg'
  pr_emerg("mm %px mmap %px seqnum %llu task_size %lu\n"
  ^~~~
mm/debug.c:140:17: note: format string is defined here
   "pinned_vm %llx data_vm %lx exec_vm %lx stack_vm %lx\n"
  ~~~^
  %lx

Fixes: 70f8a3ca68d3 ("mm: make mm->pinned_vm an atomic64 counter")
Signed-off-by: Qian Cai 
---
 mm/debug.c | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/mm/debug.c b/mm/debug.c
index c0b31b6c3877..e4ec3d68833e 100644
--- a/mm/debug.c
+++ b/mm/debug.c
@@ -137,7 +137,12 @@ void dump_mm(const struct mm_struct *mm)
"mmap_base %lu mmap_legacy_base %lu highest_vm_end %lu\n"
"pgd %px mm_users %d mm_count %d pgtables_bytes %lu map_count 
%d\n"
"hiwater_rss %lx hiwater_vm %lx total_vm %lx locked_vm %lx\n"
-   "pinned_vm %llx data_vm %lx exec_vm %lx stack_vm %lx\n"
+#ifdef __powerpc64__
+   "pinned_vm %lx "
+#else
+   "pinned_vm %llx "
+#endif
+   "data_vm %lx exec_vm %lx stack_vm %lx\n"
"start_code %lx end_code %lx start_data %lx end_data %lx\n"
"start_brk %lx brk %lx start_stack %lx\n"
"arg_start %lx arg_end %lx env_start %lx env_end %lx\n"
-- 
2.17.2 (Apple Git-113)



Re: [PATCH 1/4] kasan: prevent tracing of tags.c

2019-02-20 Thread Qian Cai



On 2/20/19 7:45 AM, Andrey Konovalov wrote:
> Similarly to 0d0c8de8 ("kasan: mark file common so ftrace doesn't trace
> it") add the -pg flag to mm/kasan/tags.c to prevent conflicts with
> tracing.
> 
> Reported-by: Qian Cai 
> Signed-off-by: Andrey Konovalov 

Tested-by: Qian Cai 


[PATCH] slub: fix a crash with SLUB_DEBUG + KASAN_SW_TAGS

2019-02-19 Thread Qian Cai
In process_slab(), "p = get_freepointer()" could return a tagged
pointer, but "addr = page_address()" always returns a native pointer. As
a result, slab_index() is messed up here,

return (p - addr) / s->size;

All other callers of slab_index() have the same situation where "addr"
comes from page_address(), so we only need to untag "p".

 # cat /sys/kernel/slab/hugetlbfs_inode_cache/alloc_calls

[18868.759419] Unable to handle kernel paging request at virtual address 
2bff808aa4856d48
[18868.767341] Mem abort info:
[18868.770133]   ESR = 0x9607
[18868.773187]   Exception class = DABT (current EL), IL = 32 bits
[18868.779103]   SET = 0, FnV = 0
[18868.782155]   EA = 0, S1PTW = 0
[18868.785292] Data abort info:
[18868.788170]   ISV = 0, ISS = 0x0007
[18868.792003]   CM = 0, WnR = 0
[18868.794973] swapper pgtable: 64k pages, 48-bit VAs, pgdp = 02498338
[18868.801932] [2bff808aa4856d48] pgd=0097fcfd0003, pud=0097fcfd0003, 
pmd=0097fca30003, pte=00e8008b24850712
[18868.812597] Internal error: Oops: 9607 [#1] SMP
[18868.835088] CPU: 3 PID: 79210 Comm: read_all Tainted: G L
5.0.0-rc7+ #84
[18868.843087] Hardware name: HPE Apollo 70 /C01_APACHE_MB 
, BIOS L50_5.13_1.0.6 07/10/2018
[18868.852915] pstate: 00400089 (nzcv daIf +PAN -UAO)
[18868.857710] pc : get_map+0x78/0xec
[18868.861109] lr : get_map+0xa0/0xec
[18868.864505] sp : aeff808989e3f8e0
[18868.867816] x29: aeff808989e3f940 x28: 80082620
[18868.873128] x27: 100012d47000 x26: 97002500
[18868.878440] x25: 0001 x24: 52ff8008200131f8
[18868.883753] x23: 52ff8008200130a0 x22: 52ff800820013098
[18868.889065] x21: 80082620 x20: 100013172ba0
[18868.894377] x19: 2bff808a8971bc00 x18: 1000148f5538
[18868.899690] x17: 001b x16: 00ff
[18868.905002] x15: 1000148f5000 x14: 00d2
[18868.910314] x13: 0001 x12: 
[18868.915626] x11: 2002 x10: 2bff808aa4856d48
[18868.920937] x9 : 0200 x8 : 68ff80082620ebb0
[18868.926249] x7 :  x6 : 1000105da1dc
[18868.931561] x5 :  x4 : 
[18868.936872] x3 : 0010 x2 : 2bff808a8971bc00
[18868.942184] x1 : 7fe002098800 x0 : 80082620ceb0
[18868.947499] Process read_all (pid: 79210, stack limit = 0xf65b9361)
[18868.954454] Call trace:
[18868.956899]  get_map+0x78/0xec
[18868.959952]  process_slab+0x7c/0x47c
[18868.963526]  list_locations+0xb0/0x3c8
[18868.967273]  alloc_calls_show+0x34/0x40
[18868.971107]  slab_attr_show+0x34/0x48
[18868.974768]  sysfs_kf_seq_show+0x2e4/0x570
[18868.978864]  kernfs_seq_show+0x12c/0x1a0
[18868.982786]  seq_read+0x48c/0xf84
[18868.986099]  kernfs_fop_read+0xd4/0x448
[18868.989935]  __vfs_read+0x94/0x5d4
[18868.993334]  vfs_read+0xcc/0x194
[18868.996560]  ksys_read+0x6c/0xe8
[18868.999786]  __arm64_sys_read+0x68/0xb0
[18869.003622]  el0_svc_handler+0x230/0x3bc
[18869.007544]  el0_svc+0x8/0xc
[18869.010428] Code: d3467d2a 9ac92329 8b0a0e6a f9800151 (c85f7d4b)
[18869.016742] ---[ end trace a383a9a44ff13176 ]---
[18869.021356] Kernel panic - not syncing: Fatal exception
[18869.026705] SMP: stopping secondary CPUs
[18870.254279] SMP: failed to stop secondary CPUs 1-7,32,40,127
[18870.259942] Kernel Offset: disabled
[18870.263434] CPU features: 0x002,2c18
[18870.267358] Memory Limit: none
[18870.270725] ---[ end Kernel panic - not syncing: Fatal exception ]---

Signed-off-by: Qian Cai 
---
 mm/slub.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/slub.c b/mm/slub.c
index 4a61959e1887..289c22f1b0c4 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -311,7 +311,7 @@ static inline void set_freepointer(struct kmem_cache *s, 
void *object, void *fp)
 /* Determine object index from a given position */
 static inline unsigned int slab_index(void *p, struct kmem_cache *s, void 
*addr)
 {
-   return (p - addr) / s->size;
+   return (kasan_reset_tag(p) - addr) / s->size;
 }
 
 static inline unsigned int order_objects(unsigned int order, unsigned int size)
-- 
2.17.2 (Apple Git-113)
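The tag arithmetic can be sketched outside the kernel (a toy model: `TAG_SHIFT`, `reset_tag()` and the tag value are invented to mimic KASAN_SW_TAGS top-byte tags and kasan_reset_tag()):

```c
#include <assert.h>
#include <stdint.h>

#define TAG_SHIFT 56	/* SW tags live in the top byte of the pointer */

/* mimic kasan_reset_tag(): clear the top byte of the address */
static inline uintptr_t reset_tag(uintptr_t p)
{
	return p & ~(0xffULL << TAG_SHIFT);
}

/* slab_index() with the object pointer untagged first, as in the patch */
static unsigned int slab_index(uintptr_t p, uintptr_t addr,
			       unsigned int size)
{
	return (unsigned int)((reset_tag(p) - addr) / size);
}
```

With a tag in the top byte, (p - addr) without the reset is off by tag << 56, so the computed index and the derived map address are wild; note the 0x2b tag byte in the faulting address 2bff808aa4856d48 in the oops above.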



Re: [PATCH] kasan: turn off asan-stack for clang-8 and earlier

2019-02-19 Thread Qian Cai



On 2/19/19 7:33 PM, Kostya Serebryany wrote:
>>> Well, I am using clang 8.0 on arm64 and running the kernel just fine for a 
>>> few
>>> weeks now and never trigger a single stack overflow (THREAD_SHIFT = 15) 
>>> because
>>> I never use any of those drivers you mentioned above. I don't think it is a 
>>> good
>>> idea to blankly remove the testing coverage here and affect people don't 
>>> use all
>>> those offensive functions at all.
>>
>> Thanks for the patch, Arnd!  Hopefully we can fix that up in Clang
>> soon.  Qian, I guess the alternative would be to add `-mllvm
>> -asan-stack=0` on potentially up to 140 Makefiles?

It depends on the exact stack consumption of those 140 functions. For example, I
don't care if you turn off -asan-stack as long as the current stack size is < 32k.

For those functions that always consume a lot of stack, like 8k or more, fix them
in the Makefile if there are not too many of them.


Re: [PATCH] kasan: turn off asan-stack for clang-8 and earlier

2019-02-19 Thread Qian Cai
On Tue, 2019-02-19 at 22:49 +0100, Arnd Bergmann wrote:
> Building an arm64 allmodconfig kernel with clang results in over 140 warnings
> about overly large stack frames, the worst ones being:
> 
> drivers/gpu/drm/panel/panel-sitronix-st7789v.c:196:12: error: stack frame size
> of 20224 bytes in function 'st7789v_prepare'
> drivers/video/fbdev/omap2/omapfb/displays/panel-tpo-td028ttec1.c:196:12:
> error: stack frame size of 13120 bytes in function 'td028ttec1_panel_enable'
> drivers/usb/host/max3421-hcd.c:1395:1: error: stack frame size of 10048 bytes
> in function 'max3421_spi_thread'
> drivers/net/wan/slic_ds26522.c:209:12: error: stack frame size of 9664 bytes
> in function 'slic_ds26522_probe'
> drivers/crypto/ccp/ccp-ops.c:2434:5: error: stack frame size of 8832 bytes in
> function 'ccp_run_cmd'
> drivers/media/dvb-frontends/stv0367.c:1005:12: error: stack frame size of 7840
> bytes in function 'stv0367ter_algo'
> 
> None of these happen with gcc today, and almost all of these are the result
> of a single known bug in llvm.  Hopefully it will eventually get fixed with
> the
> clang-9 release.
> 
> In the meantime, the best idea I have is to turn off asan-stack for clang-8
> and earlier, so we can produce a kernel that is safe to run.
> 
> I have posted three patches that address the frame overflow warnings that are
> not addressed by turning off asan-stack, so in combination with this change,
> we get much closer to a clean allmodconfig build, which in turn is necessary
> to do meaningful build regression testing.

Well, I have been using clang 8.0 on arm64 and running the kernel just fine for a few
weeks now, and have never triggered a single stack overflow (THREAD_SHIFT = 15), because
I never use any of the drivers you mentioned above. I don't think it is a good
idea to blanketly remove the testing coverage here and affect people who don't use
any of those offending functions at all.


Re: [PATCH] trace: skip hwasan

2019-02-18 Thread Qian Cai



On 2/18/19 10:25 AM, Andrey Konovalov wrote:
> On Sun, Feb 17, 2019 at 5:34 AM Qian Cai  wrote:
>>
>> Enabling function tracer with CONFIG_KASAN_SW_TAGS=y (hwasan) tracer
>> causes the whole system frozen on ThunderX2 systems with 256 CPUs,
>> because there is a burst of too much pointer access, and then KASAN will
>> dereference each byte of the shadow address for the tag checking which
>> will kill all the CPUs.
> 
> Hi Qian,
> 
> Could you check if adding "CFLAGS_REMOVE_tags.o = -pg" into
> mm/kasan/Makefile helps with that?

Yes, you nailed it!


Re: [PATCH] trace: skip hwasan

2019-02-18 Thread Qian Cai



On 2/17/19 2:30 AM, Dmitry Vyukov wrote:
> On Sun, Feb 17, 2019 at 5:34 AM Qian Cai  wrote:
>>
>> Enabling function tracer with CONFIG_KASAN_SW_TAGS=y (hwasan) tracer
>> causes the whole system frozen on ThunderX2 systems with 256 CPUs,
>> because there is a burst of too much pointer access, and then KASAN will
>> dereference each byte of the shadow address for the tag checking which
>> will kill all the CPUs.
> 
> Hi Qian,
> 
> Could you please elaborate what exactly happens and who/why kills
> CPUs? Number of memory accesses should not make any difference.
> With hardware support (MTE) it won't be possible to disable
> instrumentation (loads and stores check tags themselves), so it would
> be useful to keep track of exact reasons we disable instrumentation to
> know how to deal with them with hardware support.
> It would be useful to keep this info in the comment in the Makefile.

It turns out that it sometimes triggers a hardware error.

# echo function > /sys/kernel/debug/tracing/current_trace

RAS CONTROLLER: Fatal unrecoverable error detected

*** NBU BAR Error ***


  MPIDR= 0x8100
  CTX_X0= 10001032eb9c
  CTX_X1= 100010205f08
  CTX_X2= 0
  CTX_X3= 100010205efc
  CTX_X4= 8
  CTX_X5= 40
  CTX_X6= 3f
  CTX_X7= 0
  CTX_X8= ff
  CTX_X9= 0808ba65ab46
  CTX_X10= 0808ba65ab45
  CTX_X11= da
  CTX_X12= 10071651
  CTX_X13= fff60658
  CTX_X14= 1000140d5000
  CTX_X15= 100013855578
  CTX_X16= 804b004a
  CTX_X17= 1000100
  CTX_X18= 0
  CTX_X19= 100010205f08
  CTX_X20= 100012531cd0
  CTX_X21= 100010205f08
  CTX_X22= 10001032eb9c
  CTX_X23= 0
  CTX_X24= 100012531cc0
  CTX_X25= 12af
  CTX_X26= fffdba05
  CTX_X27= daff808ba65ab460
  CTX_X28= 100012531cc0
  CTX_X29= 808a2c617320
  CTX_X30= 10001009b5a4
  CTX_X31= 100012531cc0
  CTX_SCR_EL3= 735
  CTX_RUNTIME_SP= 6e545c0
  CTX_SPSR_EL3= 604003c9
  CTX_ELR_EL3= 100010205ecc
 Node 0 NBU 0 Error report :
 NBU BAR Error
  NBU_REG_BAR_ADDRESS_ERROR_REG0 : 0x0040554c
  NBU_REG_BAR_ADDRESS_ERROR_REG1 : 0x0011ff00
  NBU_REG_BAR_ADDRESS_ERROR_REG2 : 0x0004
  Physical Address : 0x40011ff00

NBU BAR Error : Decoded info :
Agent info : CPU
Core ID : 21
Thread ID : 1
Requ: type : 4 : Write Back
 Node 0 NBU 1 Error report :
 NBU BAR Error
  NBU_REG_BAR_ADDRESS_ERROR_REG0 : 0x0040554c
  NBU_REG_BAR_ADDRESS_ERROR_REG1 : 0x0011ff40
  NBU_REG_BAR_ADDRESS_ERROR_REG2 : 0x0004
  Physical Address : 0x40011ff40

NBU BAR Error : Decoded info :
Agent info : CPU
Core ID : 21
Thread ID : 1
Requ: type : 4 : Write Back
 Node 0 NBU 2 Error report :
 NBU BAR Error
  NBU_REG_BAR_ADDRESS_ERROR_REG0 : 0x0040554c
  NBU_REG_BAR_ADDRESS_ERROR_REG1 : 0x0011ff80
  NBU_REG_BAR_ADDRESS_ERROR_REG2 : 0x0004
  Physical Address : 0x40011ff80

NBU BAR Error : Decoded info :
Agent info : CPU
Core ID : 21
Thread ID : 1
Requ: type : 4 : Write Back
 Node 0 NBU 3 Error report :
 NBU BAR Error
  NBU_REG_BAR_ADDRESS_ERROR_REG0 : 0x0040554c
  NBU_REG_BAR_ADDRESS_ERROR_REG1 : 0x0011ffc0
  NBU_REG_BAR_ADDRESS_ERROR_REG2 : 0x0004
  Physical Address : 0x40011ffc0

NBU BAR Error : Decoded info :
Agent info : CPU
Core ID : 21
Thread ID : 1
Requ: type : 4 : Write Back
 Node 0 NBU 4 Error report :
 NBU BAR Error
  NBU_REG_BAR_ADDRESS_ERROR_REG0 : 0x0040554c
  NBU_REG_BAR_ADDRESS_ERROR_REG1 : 0x0011fe00
  NBU_REG_BAR_ADDRESS_ERROR_REG2 : 0x0004
  Physical Address : 0x40011fe00

NBU BAR Error : Decoded info :
Agent info : CPU
Core ID : 21
Thread ID : 1
Requ: type : 4 : Write Back
 Node 0 NBU 5 Error report :
 NBU BAR Error
  NBU_REG_BAR_ADDRESS_ERROR_REG0 : 0x0040554c
  NBU_REG_BAR_ADDRESS_ERROR_REG1 : 0x0011fe40
  NBU_REG_BAR_ADDRESS_ERROR_REG2 : 0x0004
  Physical Address : 0x40011fe40

NBU BAR Error : Decoded info :
Agent info : CPU
Core ID : 21
Thread ID : 1
Requ: type : 4 : Write Back
 Node 0 NBU 6 Error report :
 NBU BAR Error
  NBU_REG_BAR_ADDRESS_ERROR_REG0 : 0x0040554c
  NBU_REG_BAR_ADDRESS_ERROR_REG1 : 0x0011fe80
  NBU_REG_BAR_ADDRESS_ERROR_REG2 : 0x0004
  Physical Address : 0x40011fe80

NBU BAR Error : Decoded info :
Agent info : CPU
Core ID : 21
Thread ID : 1
Requ: type : 4 : Write Back
 Node 0 NBU 7 Error report :
 NBU BAR Error
  NBU_REG_BAR_ADDRESS_ERROR_REG0 : 0x0040554c
  NBU_REG_BAR_ADDRESS_ERROR_REG1 : 0x0011fee0
  NBU_REG_BAR_ADDRESS_ERROR_REG2 : 0x0004
  Physical Address : 0x40011fee0

NBU BAR Error : Decoded info :
Agent info : CPU
Core ID : 21
Thread ID : 1

[PATCH] trace: skip hwasan

2019-02-16 Thread Qian Cai
Enabling the function tracer with CONFIG_KASAN_SW_TAGS=y (hwasan) freezes
the whole system on ThunderX2 systems with 256 CPUs, because there is a
burst of pointer accesses, and KASAN then dereferences each byte of the
shadow address for tag checking, which overwhelms all the CPUs.

Signed-off-by: Qian Cai 
---
 kernel/trace/Makefile | 5 +
 1 file changed, 5 insertions(+)

diff --git a/kernel/trace/Makefile b/kernel/trace/Makefile
index c2b2148bb1d2..fdd547a68385 100644
--- a/kernel/trace/Makefile
+++ b/kernel/trace/Makefile
@@ -28,6 +28,11 @@ ifdef CONFIG_GCOV_PROFILE_FTRACE
 GCOV_PROFILE := y
 endif
 
+# Too much pointer access will kill hwasan.
+ifdef CONFIG_KASAN_SW_TAGS
+KASAN_SANITIZE := n
+endif
+
 CFLAGS_trace_benchmark.o := -I$(src)
 CFLAGS_trace_events_filter.o := -I$(src)
 
-- 
2.17.2 (Apple Git-113)



[PATCH] arm64/mm: skip hwasan callbacks for pgtable walker

2019-02-15 Thread Qian Cai
Page table walkers trigger the soft lockups below with KASAN_SW_TAGS in
outline mode on a large ThunderX2 system, because calling
check_memory_region() for every memory access is too expensive: it has to
dereference every byte of the corresponding KASAN shadow address to
validate the tag.

[   76.531328] watchdog: BUG: soft lockup - CPU#65 stuck for 23s! [swapper/0:1]
[   76.538372] Modules linked in:
[   76.541433] CPU: 65 PID: 1 Comm: swapper/0 Not tainted 5.0.0-rc6+ #62
[   76.557697] pstate: 6049 (nZCv daif +PAN -UAO)
[   76.562491] pc : check_memory_region+0x64/0x94
[   76.566934] lr : __hwasan_load8_noabort+0x20/0x2c
[   76.571633] sp : 7eff808ba0247ca0
[   76.574943] x29: 7eff808ba0247cc0 x28: 068cef72
[   76.580256] x27: 0800 x26: 00600793
[   76.585568] x25: 068d x24: 83537b98
[   76.590880] x23: 7eff808ba0247e08 x22: 
[   76.596192] x21: 7eff808ba0247e08 x20: 0008
[   76.601503] x19: 1000100a8d64 x18: 
[   76.606814] x17: 01000100 x16: 
[   76.612125] x15: 100013805578 x14: 100014085000
[   76.617437] x13: 30373a2e x12: 00f00793
[   76.622749] x11: 808ba0247e0f x10: 0808ba0247e0
[   76.628060] x9 : 0808ba0247e0 x8 : 007e
[   76.633371] x7 :  x6 : 0002
[   76.638682] x5 :  x4 : 00e00793
[   76.643994] x3 : 1000100a8d64 x2 : 
[   76.649305] x1 : 0008 x0 : 7eff808ba0247e08
[   76.654617] Call trace:
[   76.657066]  check_memory_region+0x64/0x94
[   76.661162]  __hwasan_load8_noabort+0x20/0x2c
[   76.665519]  note_page+0x84/0x708
[   76.668833]  walk_pgd+0x174/0x258
[   76.672147]  ptdump_check_wx+0x90/0xfc
[   76.675894]  mark_rodata_ro+0x38/0x44
[   76.679557]  kernel_init+0x48/0x180
[   76.683045]  ret_from_fork+0x10/0x18

Signed-off-by: Qian Cai 
---
 arch/arm64/mm/Makefile | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/arch/arm64/mm/Makefile b/arch/arm64/mm/Makefile
index 849c1df3d214..4b9a7a50faaf 100644
--- a/arch/arm64/mm/Makefile
+++ b/arch/arm64/mm/Makefile
@@ -12,3 +12,9 @@ KASAN_SANITIZE_physaddr.o += n
 
 obj-$(CONFIG_KASAN)+= kasan_init.o
 KASAN_SANITIZE_kasan_init.o:= n
+
+ifdef CONFIG_KASAN_SW_TAGS
+ifdef CONFIG_KASAN_OUTLINE
+KASAN_SANITIZE_dump.o  := n
+endif
+endif
-- 
2.17.2 (Apple Git-113)



Re: [PATCH] kasan, slub: fix more conflicts with CONFIG_SLAB_FREELIST_HARDENED

2019-02-13 Thread Qian Cai



On 2/13/19 7:27 PM, Andrey Konovalov wrote:
> On Thu, Feb 14, 2019 at 1:25 AM Andrey Konovalov  
> wrote:
>>
>> When CONFIG_KASAN_SW_TAGS is enabled, ptr_addr might be tagged.
>> Normally, this doesn't cause any issues, as both set_freepointer()
>> and get_freepointer() are called with a pointer with the same tag.
>> However, there are some issues with CONFIG_SLUB_DEBUG code. For
>> example, when __free_slab() iterates over objects in a cache, it
>> passes untagged pointers to check_object(). check_object() in turn
>> calls get_freepointer() with an untagged pointer, which causes the
>> freepointer to be restored incorrectly.
>>
>> Add kasan_reset_tag to freelist_ptr(). Also add a detailed comment.
>>
>> Signed-off-by: Andrey Konovalov 
> 
> Reported-by: Qian Cai 

Tested-by: Qian Cai 


Re: [PATCH] slub: untag object before slab end

2019-02-13 Thread Qian Cai
On Wed, 2019-02-13 at 11:31 +0100, Andrey Konovalov wrote:
> On Wed, Feb 13, 2019 at 3:06 AM Qian Cai  wrote:
> > 
> > get_freepointer() could return NULL if there are no more free objects
> > in the slab. However, it could return a tagged pointer (like
> > 0x2200) with KASAN_SW_TAGS, which would escape the NULL
> > object check in check_valid_pointer() and trigger the errors below, so
> > untag the object before checking for a NULL object there.
> 
> I think this solution is just masking the issue. get_freepointer()
> shouldn't return tagged NULLs. Apparently when we save a freelist
> pointer, the object where the pointer gets written is tagged
> differently, than this same object when the pointer gets read. I found
> one case where this happens (the last patch out my 5 patch series),
> but apparently there are more.

Well, the problem is that,

__free_slab
  for_each_object(p, s, page_address(page) [1]
check_object(s, page, p ...)
  get_freepointer(s, p)

[1]: p += s->size

page_address() tags the address using page_kasan_tag(page), so each "p" here has
that tag.

However, at the beginning, in allocate_slab(), it tags each object with a
random tag and then calls set_freepointer(s, p, NULL).

As a result, get_freepointer() returns a tagged NULL, because it can no
longer obtain the original tag of the object, and this calculation is now
wrong.

return (void *)((unsigned long)ptr ^ s->random ^ ptr_addr);

This also explains why this patch works, as it unifies the tags.

https://marc.info/?l=linux-mm=154955366113951=2



Re: [PATCH v2 3/5] kmemleak: account for tagged pointers when calculating pointer range

2019-02-13 Thread Qian Cai
On Wed, 2019-02-13 at 14:58 +0100, Andrey Konovalov wrote:
> kmemleak keeps two global variables, min_addr and max_addr, which store
> the range of valid (encountered by kmemleak) pointer values, which it
> later uses to speed up pointer lookup when scanning blocks.
> 
> With tagged pointers this range will get bigger than it needs to be.
> This patch makes kmemleak untag pointers before saving them to min_addr
> and max_addr and when performing a lookup.
> 
> Signed-off-by: Andrey Konovalov 

Tested-by: Qian Cai 


Re: [PATCH 5/5] kasan, slub: fix conflicts with CONFIG_SLAB_FREELIST_HARDENED

2019-02-12 Thread Qian Cai



On 2/12/19 11:07 AM, Qian Cai wrote:
> https://git.sr.ht/~cai/linux-debug/tree/master/dmesg
> 

FYI, I just sent a patch to take care of this.
https://marc.info/?l=linux-mm=155002356527913=2


[PATCH] slub: untag object before slab end

2019-02-12 Thread Qian Cai
[   36.212111]  __free_slab+0x9c/0x31c
[   36.215638]  discard_slab+0x78/0xa8
[   36.219165]  kfree+0x918/0x980
[   36.59]  destroy_sched_domain+0xa0/0xcc
[   36.226489]  cpu_attach_domain+0x304/0xb34
[   36.230631]  build_sched_domains+0x4654/0x495c
[   36.235125]  sched_init_domains+0x184/0x1d8
[   36.239357]  sched_init_smp+0x38/0xd4
[   36.243060]  kernel_init_freeable+0x67c/0x1104
[   36.247555]  kernel_init+0x18/0x2a4
[   36.251083]  ret_from_fork+0x10/0x18

Signed-off-by: Qian Cai 
---

Depends on slub-fix-slab_consistency_checks-kasan_sw_tags.patch.

 mm/slub.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/slub.c b/mm/slub.c
index 4a61959e1887..2fd1cf39914c 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -503,11 +503,11 @@ static inline int check_valid_pointer(struct kmem_cache *s,
 {
void *base;
 
+   object = kasan_reset_tag(object);
if (!object)
return 1;
 
base = page_address(page);
-   object = kasan_reset_tag(object);
object = restore_red_left(s, object);
if (object < base || object >= base + page->objects * s->size ||
(object - base) % s->size) {
-- 
2.17.2 (Apple Git-113)



Re: [PATCH 5/5] kasan, slub: fix conflicts with CONFIG_SLAB_FREELIST_HARDENED

2019-02-12 Thread Qian Cai



On 2/12/19 9:42 AM, Andrey Konovalov wrote:
> On Tue, Feb 12, 2019 at 2:43 PM Qian Cai  wrote:
>>
>>
>>
>> On 2/12/19 8:26 AM, Andrey Konovalov wrote:
>>> Hm, did you apply all 6 patches (the one that you sent and these five)
>> Yes.
> 
> I'm failing to reproduce this in QEMU. You're still using the same
> config, right? Could you share whole dmesg until the first BUG?
> 

Yes, same config and,

https://git.sr.ht/~cai/linux-debug/tree/master/dmesg


Re: [PATCH 5/5] kasan, slub: fix conflicts with CONFIG_SLAB_FREELIST_HARDENED

2019-02-12 Thread Qian Cai



On 2/12/19 8:26 AM, Andrey Konovalov wrote:
> Hm, did you apply all 6 patches (the one that you sent and these five)
Yes.


Re: [PATCH 5/5] kasan, slub: fix conflicts with CONFIG_SLAB_FREELIST_HARDENED

2019-02-11 Thread Qian Cai



On 2/11/19 4:59 PM, Andrey Konovalov wrote:
> CONFIG_SLAB_FREELIST_HARDENED hashes freelist pointer with the address
> of the object where the pointer gets stored. With tag based KASAN we don't
> account for that when building freelist, as we call set_freepointer() with
> the first argument untagged. This patch changes the code to properly
> propagate tags throughout the loop.
> 
> Reported-by: Qian Cai 
> Signed-off-by: Andrey Konovalov 
> ---
>  mm/slub.c | 20 +++-
>  1 file changed, 7 insertions(+), 13 deletions(-)
> 
> diff --git a/mm/slub.c b/mm/slub.c
> index ce874a5c9ee7..0d32f8d30752 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -303,11 +303,6 @@ static inline void set_freepointer(struct kmem_cache *s, 
> void *object, void *fp)
>   __p < (__addr) + (__objects) * (__s)->size; \
>   __p += (__s)->size)
>  
> -#define for_each_object_idx(__p, __idx, __s, __addr, __objects) \
> - for (__p = fixup_red_left(__s, __addr), __idx = 1; \
> - __idx <= __objects; \
> - __p += (__s)->size, __idx++)
> -
>  /* Determine object index from a given position */
>  static inline unsigned int slab_index(void *p, struct kmem_cache *s, void 
> *addr)
>  {
> @@ -1655,17 +1650,16 @@ static struct page *allocate_slab(struct kmem_cache 
> *s, gfp_t flags, int node)
>   shuffle = shuffle_freelist(s, page);
>  
>   if (!shuffle) {
> - for_each_object_idx(p, idx, s, start, page->objects) {
> - if (likely(idx < page->objects)) {
> - next = p + s->size;
> - next = setup_object(s, page, next);
> - set_freepointer(s, p, next);
> - } else
> - set_freepointer(s, p, NULL);
> - }
>   start = fixup_red_left(s, start);
>   start = setup_object(s, page, start);
>   page->freelist = start;
> + for (idx = 0, p = start; idx < page->objects - 1; idx++) {
> + next = p + s->size;
> + next = setup_object(s, page, next);
> + set_freepointer(s, p, next);
> + p = next;
> + }
> + set_freepointer(s, p, NULL);
>   }
>  
>   page->inuse = page->objects;
> 

Well, this one patch does not work here, as it throws the endless errors
below during boot. I still need this patch to fix it.

https://marc.info/?l=linux-mm=154955366113951=2

[   85.744772] BUG kmemleak_object (Tainted: GBL   ): Freepointer
corrupt
[   85.744776]
-
[   85.744776]
[   85.744788] INFO: Allocated in create_object+0x88/0x9c8 age=2564 cpu=153 
pid=1
[   85.744797]  kmem_cache_alloc+0x39c/0x4ec
[   85.744803]  create_object+0x88/0x9c8
[   85.744811]  kmemleak_alloc+0xbc/0x180
[   85.744818]  kmem_cache_alloc+0x3ec/0x4ec
[   85.744825]  acpi_ut_create_generic_state+0x64/0xc4
[   85.744832]  acpi_ut_create_pkg_state+0x24/0x1c8
[   85.744840]  acpi_ut_walk_package_tree+0x268/0x564
[   85.744848]  acpi_ns_init_one_package+0x80/0x114
[   85.744856]  acpi_ns_init_one_object+0x214/0x3d8
[   85.744862]  acpi_ns_walk_namespace+0x288/0x384
[   85.744869]  acpi_walk_namespace+0xac/0xe8
[   85.744877]  acpi_ns_initialize_objects+0x50/0x98
[   85.744883]  acpi_load_tables+0xac/0x120
[   85.744891]  acpi_init+0x128/0x850
[   85.744898]  do_one_initcall+0x3ac/0x8c0
[   85.744906]  kernel_init_freeable+0xcdc/0x1104
[   85.744916] INFO: Freed in free_object_rcu+0x200/0x228 age=3 cpu=153 pid=0
[   85.744923]  free_object_rcu+0x200/0x228
[   85.744931]  rcu_process_callbacks+0xb00/0x12c0
[   85.744937]  __do_softirq+0x644/0xfd0
[   85.744944]  irq_exit+0x29c/0x370
[   85.744952]  __handle_domain_irq+0xe0/0x1c4
[   85.744958]  gic_handle_irq+0x1c4/0x3b0
[   85.744964]  el1_irq+0xb0/0x140
[   85.744971]  arch_cpu_idle+0x26c/0x594
[   85.744978]  default_idle_call+0x44/0x5c
[   85.744985]  do_idle+0x180/0x260
[   85.744993]  cpu_startup_entry+0x24/0x28
[   85.745001]  secondary_start_kernel+0x36c/0x440
[   85.745009] INFO: Slab 0x(ptrval) objects=91 used=0
fp=0x(ptrval) flags=0x17ffc000200
[   85.745015] INFO: Object 0x(ptrval) @offset=35296 
fp=0x(ptrval)

k4.226750] Redzone (ptrval): bb bb bb bb bb bb bb bb bb bb bb bb bb
bb bb bb  
[   84.22[   84.226765] ORedzone (ptrptrval): 5a worker/223:0 Tainted: G
   BL5.0.0-rc6+ #36
[   84.226790] Hardware name: HPE Apollo 70 /C01_APACHE_MB ,
BIOS L50_5.13_1.0.6 07/10/2018
[   84.226798] Workqueue: events free_obj_work
[   84.226802] Call trace:
[   84.2268

[PATCH] slub: remove an unused addr argument

2019-02-11 Thread Qian Cai
"addr" function argument is not used in alloc_consistency_checks() at
all, so remove it.

Fixes: becfda68abca ("slub: convert SLAB_DEBUG_FREE to SLAB_CONSISTENCY_CHECKS")
Signed-off-by: Qian Cai 
---
 mm/slub.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index 075ebc529788..4a61959e1887 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1077,8 +1077,7 @@ static void setup_object_debug(struct kmem_cache *s, struct page *page,
 }
 
 static inline int alloc_consistency_checks(struct kmem_cache *s,
-   struct page *page,
-   void *object, unsigned long addr)
+   struct page *page, void *object)
 {
if (!check_slab(s, page))
return 0;
@@ -1099,7 +1098,7 @@ static noinline int alloc_debug_processing(struct kmem_cache *s,
void *object, unsigned long addr)
 {
if (s->flags & SLAB_CONSISTENCY_CHECKS) {
-   if (!alloc_consistency_checks(s, page, object, addr))
+   if (!alloc_consistency_checks(s, page, object))
goto bad;
}
 
-- 
2.17.2 (Apple Git-113)



[PATCH] slub: fix SLAB_CONSISTENCY_CHECKS + KASAN_SW_TAGS

2019-02-08 Thread Qian Cai
]  show_stack+0x20/0x2c
[0.00]  __dump_stack+0x20/0x28
[0.00]  dump_stack+0xa0/0xfc
[0.00]  print_trailer+0x1bc/0x1d0
[0.00]  object_err+0x40/0x50
[0.00]  alloc_debug_processing+0xf0/0x19c
[0.00]  ___slab_alloc+0x554/0x704
[0.00]  kmem_cache_alloc+0x2f8/0x440
[0.00]  radix_tree_node_alloc+0x90/0x2fc
[0.00]  idr_get_free+0x1e8/0x6d0
[0.00]  idr_alloc_u32+0x11c/0x2a4
[0.00]  idr_alloc+0x74/0xe0
[0.00]  worker_pool_assign_id+0x5c/0xbc
[0.00]  workqueue_init_early+0x49c/0xd50
[0.00]  start_kernel+0x52c/0xac4
[0.00] FIX radix_tree_node: Marking all objects used
[0.00]

Signed-off-by: Qian Cai 
---
 mm/slub.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/mm/slub.c b/mm/slub.c
index 1e3d0ec4e200..075ebc529788 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -507,6 +507,7 @@ static inline int check_valid_pointer(struct kmem_cache *s,
return 1;
 
base = page_address(page);
+   object = kasan_reset_tag(object);
object = restore_red_left(s, object);
if (object < base || object >= base + page->objects * s->size ||
(object - base) % s->size) {
-- 
2.17.2 (Apple Git-113)



[PATCH -next] x86_64: tidy up KASAN_EXTRA

2019-02-05 Thread Qian Cai
The commit 585a4c1f9978 ("kasan: remove use after scope bugs
detection.") removed KASAN_EXTRA, so tidy up leftovers.

Signed-off-by: Qian Cai 
---
 arch/x86/include/asm/page_64_types.h | 4 
 1 file changed, 4 deletions(-)

diff --git a/arch/x86/include/asm/page_64_types.h b/arch/x86/include/asm/page_64_types.h
index 0ce558a8150d..8f657286d599 100644
--- a/arch/x86/include/asm/page_64_types.h
+++ b/arch/x86/include/asm/page_64_types.h
@@ -7,11 +7,7 @@
 #endif
 
 #ifdef CONFIG_KASAN
-#ifdef CONFIG_KASAN_EXTRA
-#define KASAN_STACK_ORDER 2
-#else
 #define KASAN_STACK_ORDER 1
-#endif
 #else
 #define KASAN_STACK_ORDER 0
 #endif
-- 
2.17.2 (Apple Git-113)



[PATCH -next] mm/compaction: no stuck in __reset_isolation_pfn()

2019-02-05 Thread Qian Cai
The commit c68d77911c23 ("mm, compaction: be selective about what
pageblocks to clear skip hints") introduced an infinite loop: if a pfn is
invalid, the loop continues without advancing the page counters. It can
be reproduced by running LTP tests on an arm64 server.

 # oom01 (swapping)
 # hugemmap01
tst_test.c:1096: INFO: Timeout per run is 0h 05m 00s
mem.c:814: INFO: set nr_hugepages to 128
Test timeouted, sending SIGKILL!
Test timeouted, sending SIGKILL!
Test timeouted, sending SIGKILL!
Test timeouted, sending SIGKILL!
Test timeouted, sending SIGKILL!
Test timeouted, sending SIGKILL!
Test timeouted, sending SIGKILL!
Test timeouted, sending SIGKILL!
Test timeouted, sending SIGKILL!
Test timeouted, sending SIGKILL!
Test timeouted, sending SIGKILL!
Cannot kill test processes!
Congratulation, likely test hit a kernel bug.

Also, triggers soft lockups.

[  456.232228] watchdog: BUG: soft lockup - CPU#122 stuck for 22s! 
[kswapd0:1375]
[  456.273354] pstate: 8049 (Nzcv daif +PAN -UAO)
[  456.278143] pc : pfn_valid+0x54/0xdc
[  456.281713] lr : __reset_isolation_pfn+0x3a8/0x584
[  456.369358] Call trace:
[  456.371798]  pfn_valid+0x54/0xdc
[  456.375019]  __reset_isolation_pfn+0x3a8/0x584
[  456.379455]  __reset_isolation_suitable+0x1bc/0x280
[  456.384325]  reset_isolation_suitable+0xb8/0xe0
[  456.388847]  kswapd+0xd08/0x1048
[  456.392067]  kthread+0x2f4/0x30c
[  456.395289]  ret_from_fork+0x10/0x18

Fixes: c68d77911c23 ("mm, compaction: be selective about what pageblocks to clear skip hints")
Signed-off-by: Qian Cai 
---
 mm/compaction.c | 19 +--
 1 file changed, 9 insertions(+), 10 deletions(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index 03804ab412f3..1cc871da3fda 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -282,17 +282,16 @@ __reset_isolation_pfn(struct zone *zone, unsigned long pfn, bool check_source,
end_page = pfn_to_page(pfn);
 
do {
-   if (!pfn_valid_within(pfn))
-   continue;
-
-   if (check_source && PageLRU(page)) {
-   clear_pageblock_skip(page);
-   return true;
-   }
+   if (pfn_valid_within(pfn)) {
+   if (check_source && PageLRU(page)) {
+   clear_pageblock_skip(page);
+   return true;
+   }
 
-   if (check_target && PageBuddy(page)) {
-   clear_pageblock_skip(page);
-   return true;
+   if (check_target && PageBuddy(page)) {
+   clear_pageblock_skip(page);
+   return true;
+   }
}
 
page += (1 << PAGE_ALLOC_COSTLY_ORDER);
-- 
2.17.2 (Apple Git-113)



Re: mm: race in put_and_wait_on_page_locked()

2019-02-05 Thread Qian Cai


>> Cai, can you please check if you can reproduce this issue in your
>> environment with 5.0-rc5?
> 
> Yes, please do - practical confirmation more convincing than my certainty.

Indeed, I am no longer able to reproduce this.


Re: mm: race in put_and_wait_on_page_locked()

2019-02-04 Thread Qian Cai



On 2/4/19 3:42 PM, Hugh Dickins wrote:
> On Mon, 4 Feb 2019, Artem Savkov wrote:
> 
>> Hi Hugh,
>>
>> Your recent patch 9a1ea439b16b "mm: put_and_wait_on_page_locked() while
>> page is migrated" seems to have introduced a race into page migration
>> process. I have a host that eagerly reproduces the following BUG under
>> stress:
>>
>> [  302.847402] page:f0021700 count:0 mapcount:0 
>> mapping:c000b2710bb0 index:0x19
>> [  302.848096] xfs_address_space_operations [xfs] 
>> [  302.848100] name:"libc-2.28.so" 
>> [  302.848244] flags: 0x380006(referenced|uptodate)
>> [  302.848521] raw: 00380006 5deadbeef100 5deadbeef200 
>> 
>> [  302.848724] raw: 0019  0001 
>> c000bc0b1000
>> [  302.848919] page dumped because: VM_BUG_ON_PAGE(page_ref_count(page) == 0)
>> [  302.849076] page->mem_cgroup:c000bc0b1000
>> [  302.849269] [ cut here ]
>> [  302.849397] kernel BUG at include/linux/mm.h:546!
>> [  302.849586] Oops: Exception in kernel mode, sig: 5 [#1]
>> [  302.849711] LE SMP NR_CPUS=2048 NUMA pSeries
>> [  302.849839] Modules linked in: pseries_rng sunrpc xts vmx_crypto 
>> virtio_balloon xfs libcrc32c virtio_net net_failover virtio_console failover 
>> virtio_blk
>> [  302.850400] CPU: 3 PID: 8759 Comm: cc1 Not tainted 5.0.0-rc4+ #36
>> [  302.850571] NIP:  c039c8b8 LR: c039c8b4 CTR: 
>> c080a0e0
>> [  302.850758] REGS: c000b0d7f7e0 TRAP: 0700   Not tainted  (5.0.0-rc4+)
>> [  302.850952] MSR:  80029033   CR: 48024422  
>> XER: 
>> [  302.851150] CFAR: c03ff584 IRQMASK: 0 
>> [  302.851150] GPR00: c039c8b4 c000b0d7fa70 c1bcca00 
>> 0021 
>> [  302.851150] GPR04: c000b044c628 0007 55a0 
>> c1fc3760 
>> [  302.851150] GPR08: 0007  c000b0d7c000 
>> c000b0d7f5ff 
>> [  302.851150] GPR12: 4400 c0003fffae80  
>>  
>> [  302.851150] GPR16:    
>>  
>> [  302.851150] GPR20: c000689f5aa8 c0002a13ee48  
>> c1da29b0 
>> [  302.851150] GPR24: c1bf7d80 c000689f5a00  
>>  
>> [  302.851150] GPR28: c1bf9e80 c000b0d7fab8 0001 
>> f0021700 
>> [  302.852914] NIP [c039c8b8] put_and_wait_on_page_locked+0x398/0x3d0
>> [  302.853080] LR [c039c8b4] put_and_wait_on_page_locked+0x394/0x3d0
>> [  302.853235] Call Trace:
>> [  302.853305] [c000b0d7fa70] [c039c8b4] 
>> put_and_wait_on_page_locked+0x394/0x3d0 (unreliable)
>> [  302.853540] [c000b0d7fb10] [c047b838] 
>> __migration_entry_wait+0x178/0x250
>> [  302.853738] [c000b0d7fb50] [c040c928] do_swap_page+0xd78/0xf60
>> [  302.853997] [c000b0d7fbd0] [c0411078] 
>> __handle_mm_fault+0xbf8/0xe80
>> [  302.854187] [c000b0d7fcb0] [c0411548] 
>> handle_mm_fault+0x248/0x450
>> [  302.854379] [c000b0d7fd00] [c0078ca4] 
>> __do_page_fault+0x2d4/0xdf0
>> [  302.854877] [c000b0d7fde0] [c00797f8] do_page_fault+0x38/0xf0
>> [  302.855057] [c000b0d7fe20] [c000a7c4] 
>> handle_page_fault+0x18/0x38
>> [  302.855300] Instruction dump:
>> [  302.855432] 4bfffcf0 6000 3948 4bfffd20 6000 6000 
>> 3c82ff36 7fe3fb78 
>> [  302.855689] fb210068 38843b78 48062f09 6000 <0fe0> 6000 
>> 3b41 3b61 
>> [  302.855950] ---[ end trace a52140e0f9751ae0 ]---
>>
>> What seems to be happening is migrate_page_move_mapping() calling
>> page_ref_freeze() on another cpu somewhere between __migration_entry_wait()
>> taking a reference and wait_on_page_bit_common() calling page_put().
> 
> Thank you for reporting, Artem.
> 
> And see the mm thread https://marc.info/?l=linux-mm=154821775401218=2
> 
> That was on arm64, you are on power I think: both point towards xfs
> (Cai could not reproduce it on ext4), but that should not be taken too
> seriously - it could just be easier to reproduce on one than the other.

Agreed, although I have never been able to trigger it on ext4, running LTP
migrate_pages03 exclusively overnight (500+ iterations) and spontaneously
for a few weeks now. It might just be luck.


[PATCH -next] hugetlbfs: a terminator for hugetlb_param_specs[]

2019-02-04 Thread Qian Cai
Booting up an arm64 server with CONFIG_VALIDATE_FS_PARSER=n triggers the
out-of-bounds error below, because the commit 2284cf59cbce ("hugetlbfs:
Convert to fs_context") missed a terminator for hugetlb_param_specs[].
As a result, this loop in fs_lookup_key(),

for (p = desc->specs; p->name; p++)

cannot exit properly, because p->name never becomes NULL.

[   91.575203] BUG: KASAN: global-out-of-bounds in fs_lookup_key+0x60/0x94
[   91.581810] Read of size 8 at addr 200010deeb10 by task mount/2461
[   91.597350] Call trace:
[   91.597357]  dump_backtrace+0x0/0x2b0
[   91.597361]  show_stack+0x24/0x30
[   91.597373]  dump_stack+0xc0/0xf8
[   91.623263]  print_address_description+0x64/0x2b0
[   91.627965]  kasan_report+0x150/0x1a4
[   91.627970]  __asan_report_load8_noabort+0x30/0x3c
[   91.627974]  fs_lookup_key+0x60/0x94
[   91.627977]  fs_parse+0x104/0x990
[   91.627986]  hugetlbfs_parse_param+0xc4/0x5e8
[   91.651081]  vfs_parse_fs_param+0x2e4/0x378
[   91.658118]  vfs_parse_fs_string+0xbc/0x12c
[   91.658122]  do_mount+0x11f0/0x1640
[   91.658125]  ksys_mount+0xc0/0xd0
[   91.658129]  __arm64_sys_mount+0xcc/0xe4
[   91.658137]  el0_svc_handler+0x28c/0x338
[   91.681740]  el0_svc+0x8/0xc

Fixes: 2284cf59cbce ("hugetlbfs: Convert to fs_context")
Signed-off-by: Qian Cai 
---
 fs/hugetlbfs/inode.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index abf0c2eb834e..4f352743930f 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
static const struct fs_parameter_spec hugetlb_param_specs[] = {
fsparam_string("pagesize",  Opt_pagesize),
fsparam_string("size",  Opt_size),
fsparam_u32   ("uid",   Opt_uid),
+   {}
 };
 
 static const struct fs_parameter_description hugetlb_fs_parameters = {
-- 
2.17.2 (Apple Git-113)



[PATCH -next] efi/arm64: return zero from ptdump_init()

2019-02-04 Thread Qian Cai
The commit e2a2e56e4082 ("arm64: dump: no need to check return value of
debugfs_create functions") converted ptdump_debugfs_register() to return
void, but forgot to fix the EFI version of ptdump_init().

drivers/firmware/efi/arm-runtime.c: In function 'ptdump_init':
drivers/firmware/efi/arm-runtime.c:52:9: error: void value not ignored as it ought to be
   52 |  return ptdump_debugfs_register(&efi_ptdump_info, "efi_page_tables");
  | ^~~~
drivers/firmware/efi/arm-runtime.c:53:1: warning: control reaches end of non-void function [-Wreturn-type]

Fixes: e2a2e56e4082 ("arm64: dump: no need to check return value of debugfs_create functions")
Signed-off-by: Qian Cai 
---
 drivers/firmware/efi/arm-runtime.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/firmware/efi/arm-runtime.c b/drivers/firmware/efi/arm-runtime.c
index 352bd2473162..b308f5bb3737 100644
--- a/drivers/firmware/efi/arm-runtime.c
+++ b/drivers/firmware/efi/arm-runtime.c
@@ -46,10 +46,10 @@ static struct ptdump_info efi_ptdump_info = {
 
 static int __init ptdump_init(void)
 {
-   if (!efi_enabled(EFI_RUNTIME_SERVICES))
-   return 0;
+   if (efi_enabled(EFI_RUNTIME_SERVICES))
+   ptdump_debugfs_register(&efi_ptdump_info, "efi_page_tables");
 
-   return ptdump_debugfs_register(&efi_ptdump_info, "efi_page_tables");
+   return 0;
 }
 device_initcall(ptdump_init);
 
-- 
2.17.2 (Apple Git-113)



[tip:efi/urgent] efi/arm64: Fix debugfs crash by adding a terminator for ptdump marker

2019-02-02 Thread tip-bot for Qian Cai
Commit-ID:  74c953ca5f6b4d5f1daa1ef34f4317e15c1a2987
Gitweb: https://git.kernel.org/tip/74c953ca5f6b4d5f1daa1ef34f4317e15c1a2987
Author: Qian Cai 
AuthorDate: Sat, 2 Feb 2019 10:50:17 +0100
Committer:  Ingo Molnar 
CommitDate: Sat, 2 Feb 2019 11:27:29 +0100

efi/arm64: Fix debugfs crash by adding a terminator for ptdump marker

When reading 'efi_page_tables' debugfs triggers an out-of-bounds access here:

  arch/arm64/mm/dump.c: 282
  if (addr >= st->marker[1].start_address) {

called from:

  arch/arm64/mm/dump.c: 331
  note_page(st, addr, 2, pud_val(pud));

because st->marker++ is called after "UEFI runtime end", which is the
last element in addr_marker[]. Therefore, add a terminator like the one
for kernel_page_tables, so that non-existent markers are skipped instead
of printed.

Here's the KASAN bug report:

  # cat /sys/kernel/debug/efi_page_tables
  ---[ UEFI runtime start ]---
  0x2000-0x2001  64K PTE   RW NX SHD AF ...
  0x2020-0x2134   17664K PTE   RW NX SHD AF ...
  ...
  0x2192-0x2195 192K PTE   RW x  SHD AF ...
  0x2195-0x219a 320K PTE   RW NX SHD AF ...
  ---[ UEFI runtime end ]---
  ---[ (null) ]---
  ---[ (null) ]---

   BUG: KASAN: global-out-of-bounds in note_page+0x1f0/0xac0
   Read of size 8 at addr 2000123f2ac0 by task read_all/42464
   Call trace:
dump_backtrace+0x0/0x298
show_stack+0x24/0x30
dump_stack+0xb0/0xdc
print_address_description+0x64/0x2b0
kasan_report+0x150/0x1a4
__asan_report_load8_noabort+0x30/0x3c
note_page+0x1f0/0xac0
walk_pgd+0xb4/0x244
ptdump_walk_pgd+0xec/0x140
ptdump_show+0x40/0x50
seq_read+0x3f8/0xad0
full_proxy_read+0x9c/0xc0
__vfs_read+0xfc/0x4c8
vfs_read+0xec/0x208
ksys_read+0xd0/0x15c
__arm64_sys_read+0x84/0x94
el0_svc_handler+0x258/0x304
el0_svc+0x8/0xc

  The buggy address belongs to the variable:
   __compound_literal.0+0x20/0x800

  Memory state around the buggy address:
   2000123f2980: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
   2000123f2a00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 fa
  >2000123f2a80: fa fa fa fa 00 00 00 00 fa fa fa fa 00 00 00 00
^
   2000123f2b00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
   2000123f2b80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 0

[ ardb: fix up whitespace ]
[ mingo: fix up some moar ]

Signed-off-by: Qian Cai 
Signed-off-by: Ard Biesheuvel 
Cc: Linus Torvalds 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Cc: linux-...@vger.kernel.org
Fixes: 9d80448ac92b ("efi/arm64: Add debugfs node to dump UEFI runtime page tables")
Link: http://lkml.kernel.org/r/20190202095017.13799-2-ard.biesheu...@linaro.org
Signed-off-by: Ingo Molnar 
---
 drivers/firmware/efi/arm-runtime.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/firmware/efi/arm-runtime.c b/drivers/firmware/efi/arm-runtime.c
index 23ea1ed409d1..352bd2473162 100644
--- a/drivers/firmware/efi/arm-runtime.c
+++ b/drivers/firmware/efi/arm-runtime.c
@@ -37,8 +37,9 @@ extern u64 efi_system_table;
 static struct ptdump_info efi_ptdump_info = {
.mm = &efi_mm,
.markers= (struct addr_marker[]){
-   { 0,"UEFI runtime start" },
-   { DEFAULT_MAP_WINDOW_64, "UEFI runtime end" }
+   { 0,"UEFI runtime start" },
+   { DEFAULT_MAP_WINDOW_64,"UEFI runtime end" },
+   { -1,   NULL }
},
.base_addr  = 0,
 };


Re: linux-next: Fixes tags needs some work in the akpm-current tree

2019-01-30 Thread Qian Cai



On 1/30/19 7:05 PM, Stephen Rothwell wrote:
> In commit
> 
>   6e9ed490d190 ("mm/page_owner: fix for deferred struct page init")
> 
> Fixes tag
> 
>   Fixes: fe53ca54270 ("mm: use early_pfn_to_nid in page_ext_init")
> 
> has these problem(s):
> 
>   - SHA1 should be at least 12 digits long
> Can be fixed by setting core.abbrev to 12 (or more) or (for git v2.11
> or later) just making sure it is not set (or set to "auto").
> 

I think it is better to replace with this version instead,

https://lore.kernel.org/lkml/20190115202812.75820-1-...@lca.pw/


[RESEND PATCH] slab: kmemleak no scan alien caches

2019-01-29 Thread Qian Cai
Kmemleak throws endless warnings during boot because of this sequence in
__alloc_alien_cache(),

alc = kmalloc_node(memsize, gfp, node);
init_arraycache(&alc->ac, entries, batch);
kmemleak_no_scan(ac);

Kmemleak tracks the alien cache (alc), not the embedded array cache
(alc->ac), so let kmemleak_no_scan() mark the former by lifting it out
of init_arraycache() into the callers.

The only other caller of init_arraycache() is alloc_kmem_cache_cpus(),
which uses the percpu allocator, so its caches will never be reported
as leaks.

[   32.258841] kmemleak: Found object by alias at 0x8007b9aa7e38
[   32.258847] CPU: 190 PID: 1 Comm: swapper/0 Not tainted 5.0.0-rc2+ #2
[   32.258851] Call trace:
[   32.258858]  dump_backtrace+0x0/0x168
[   32.258863]  show_stack+0x24/0x30
[   32.258868]  dump_stack+0x88/0xb0
[   32.258873]  lookup_object+0x84/0xac
[   32.258877]  find_and_get_object+0x84/0xe4
[   32.258882]  kmemleak_no_scan+0x74/0xf4
[   32.258887]  setup_kmem_cache_node+0x2b4/0x35c
[   32.258892]  __do_tune_cpucache+0x250/0x2d4
[   32.258896]  do_tune_cpucache+0x4c/0xe4
[   32.258901]  enable_cpucache+0xc8/0x110
[   32.258905]  setup_cpu_cache+0x40/0x1b8
[   32.258909]  __kmem_cache_create+0x240/0x358
[   32.258913]  create_cache+0xc0/0x198
[   32.258918]  kmem_cache_create_usercopy+0x158/0x20c
[   32.258922]  kmem_cache_create+0x50/0x64
[   32.258928]  fsnotify_init+0x58/0x6c
[   32.258932]  do_one_initcall+0x194/0x388
[   32.258937]  kernel_init_freeable+0x668/0x688
[   32.258941]  kernel_init+0x18/0x124
[   32.258946]  ret_from_fork+0x10/0x18
[   32.258950] kmemleak: Object 0x8007b9aa7e00 (size 256):
[   32.258954] kmemleak:   comm "swapper/0", pid 1, jiffies 4294697137
[   32.258958] kmemleak:   min_count = 1
[   32.258962] kmemleak:   count = 0
[   32.258965] kmemleak:   flags = 0x1
[   32.258969] kmemleak:   checksum = 0
[   32.258972] kmemleak:   backtrace:
[   32.258977]  kmemleak_alloc+0x84/0xb8
[   32.258982]  kmem_cache_alloc_node_trace+0x31c/0x3a0
[   32.258987]  __kmalloc_node+0x58/0x78
[   32.258991]  setup_kmem_cache_node+0x26c/0x35c
[   32.258996]  __do_tune_cpucache+0x250/0x2d4
[   32.259001]  do_tune_cpucache+0x4c/0xe4
[   32.259005]  enable_cpucache+0xc8/0x110
[   32.259010]  setup_cpu_cache+0x40/0x1b8
[   32.259014]  __kmem_cache_create+0x240/0x358
[   32.259018]  create_cache+0xc0/0x198
[   32.259022]  kmem_cache_create_usercopy+0x158/0x20c
[   32.259026]  kmem_cache_create+0x50/0x64
[   32.259031]  fsnotify_init+0x58/0x6c
[   32.259035]  do_one_initcall+0x194/0x388
[   32.259039]  kernel_init_freeable+0x668/0x688
[   32.259043]  kernel_init+0x18/0x124
[   32.259048] kmemleak: Not scanning unknown object at 0x8007b9aa7e38
[   32.259052] CPU: 190 PID: 1 Comm: swapper/0 Not tainted 5.0.0-rc2+ #2
[   32.259056] Call trace:
[   32.259060]  dump_backtrace+0x0/0x168
[   32.259065]  show_stack+0x24/0x30
[   32.259070]  dump_stack+0x88/0xb0
[   32.259074]  kmemleak_no_scan+0x90/0xf4
[   32.259078]  setup_kmem_cache_node+0x2b4/0x35c
[   32.259083]  __do_tune_cpucache+0x250/0x2d4
[   32.259088]  do_tune_cpucache+0x4c/0xe4
[   32.259092]  enable_cpucache+0xc8/0x110
[   32.259096]  setup_cpu_cache+0x40/0x1b8
[   32.259100]  __kmem_cache_create+0x240/0x358
[   32.259104]  create_cache+0xc0/0x198
[   32.259108]  kmem_cache_create_usercopy+0x158/0x20c
[   32.259112]  kmem_cache_create+0x50/0x64
[   32.259116]  fsnotify_init+0x58/0x6c
[   32.259120]  do_one_initcall+0x194/0x388
[   32.259125]  kernel_init_freeable+0x668/0x688
[   32.259129]  kernel_init+0x18/0x124
[   32.259133]  ret_from_fork+0x10/0x18

Fixes: 1fe00d50a9e8 ("slab: factor out initialization of array cache")
Signed-off-by: Qian Cai 
---
 mm/slab.c | 17 +
 1 file changed, 9 insertions(+), 8 deletions(-)

diff --git a/mm/slab.c b/mm/slab.c
index 78eb8c5bf4e4..0aff454f007b 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -550,14 +550,6 @@ static void start_cpu_timer(int cpu)
 
 static void init_arraycache(struct array_cache *ac, int limit, int batch)
 {
-   /*
-* The array_cache structures contain pointers to free object.
-* However, when such objects are allocated or transferred to another
-* cache the pointers are not cleared and they could be counted as
-* valid references during a kmemleak scan. Therefore, kmemleak must
-* not scan such objects.
-*/
-   kmemleak_no_scan(ac);
if (ac) {
ac->avail = 0;
ac->limit = limit;
@@ -573,6 +565,14 @@ static struct array_cache *alloc_arraycache(int node, int 
entries,
struct array_cache *ac = NULL;
 
ac = kmalloc_node(memsize, gfp, node);
+   /*
+* The array_cache structures contain pointers to free object.
+* However, when such objects are allocated or transferred to another
+* cache the pointers are not cleared and they could be counted as
+* valid references during a kmemleak scan. Th

[PATCH] efi/arm64: add a terminator for ptdump marker

2019-01-29 Thread Qian Cai
Reading the efi_page_tables debugfs file triggers an out-of-bounds
access here,

arch/arm64/mm/dump.c: 282
if (addr >= st->marker[1].start_address) {

from,

arch/arm64/mm/dump.c: 331
note_page(st, addr, 2, pud_val(pud));

because st->marker++ is called after "UEFI runtime end", which is the
last element in addr_marker[]. Therefore, add a terminator like the one
for kernel_page_tables, so the walker can stop there instead of
printing non-existent markers.

 # cat /sys/kernel/debug/efi_page_tables
---[ UEFI runtime start ]---
0x2000-0x2001  64K PTE   RW NX SHD AF NG
 UXN MEM/NORMAL
0x2020-0x2134   17664K PTE   RW NX SHD AF NG
 UXN MEM/NORMAL
0x2134-0x2135  64K PTE   RW x  SHD AF NG
 UXN MEM/NORMAL
0x2135-0x2137 128K PTE   RW NX SHD AF NG
 UXN MEM/NORMAL
0x2137-0x2138  64K PTE   RW x  SHD AF NG
 UXN MEM/NORMAL
0x2138-0x213b 192K PTE   RW NX SHD AF NG
 UXN MEM/NORMAL
0x213b-0x2149 896K PTE   RW x  SHD AF NG
 UXN MEM/NORMAL
0x2149-0x214b 128K PTE   RW NX SHD AF NG
 UXN MEM/NORMAL
0x214b-0x214e 192K PTE   RW x  SHD AF NG
 UXN MEM/NORMAL
0x214e-0x214f  64K PTE   RW NX SHD AF NG
 UXN MEM/NORMAL
0x214f-0x2154 320K PTE   RW x  SHD AF NG
 UXN MEM/NORMAL
0x2154-0x2156 128K PTE   RW NX SHD AF NG
 UXN MEM/NORMAL
0x2156-0x2158 128K PTE   RW x  SHD AF NG
 UXN MEM/NORMAL
0x2158-0x215a 128K PTE   RW NX SHD AF NG
 UXN MEM/NORMAL
0x215a-0x215d 192K PTE   RW x  SHD AF NG
 UXN MEM/NORMAL
0x215d-0x215e  64K PTE   RW NX SHD AF NG
 UXN MEM/NORMAL
0x215e-0x2160 128K PTE   RW x  SHD AF NG
 UXN MEM/NORMAL
0x2160-0x2180   2M PTE   RW NX SHD AF NG
 UXN MEM/NORMAL
0x2180-0x2184 256K PTE   RW x  SHD AF NG
 UXN MEM/NORMAL
0x2184-0x218e 640K PTE   RW NX SHD AF NG
 UXN MEM/NORMAL
0x218e-0x2190 128K PTE   RW x  SHD AF NG
 UXN MEM/NORMAL
0x2190-0x2192 128K PTE   RW NX SHD AF NG
 UXN MEM/NORMAL
0x2192-0x2195 192K PTE   RW x  SHD AF NG
 UXN MEM/NORMAL
0x2195-0x219a 320K PTE   RW NX SHD AF NG
 UXN DEVICE/nGnRE
---[ UEFI runtime end ]---
---[ (null) ]---
---[ (null) ]---

[12126.163970] BUG: KASAN: global-out-of-bounds in note_page+0x1f0/0xac0
[12126.170404] Read of size 8 at addr 2000123f2ac0 by task read_all/42464
[12126.199520] Call trace:
[12126.201972]  dump_backtrace+0x0/0x298
[12126.205627]  show_stack+0x24/0x30
[12126.208944]  dump_stack+0xb0/0xdc
[12126.212258]  print_address_description+0x64/0x2b0
[12126.216954]  kasan_report+0x150/0x1a4
[12126.220610]  __asan_report_load8_noabort+0x30/0x3c
[12126.225392]  note_page+0x1f0/0xac0
[12126.228786]  walk_pgd+0xb4/0x244
[12126.232005]  ptdump_walk_pgd+0xec/0x140
[12126.235833]  ptdump_show+0x40/0x50
[12126.239237]  seq_read+0x3f8/0xad0
[12126.242548]  full_proxy_read+0x9c/0xc0
[12126.246290]  __vfs_read+0xfc/0x4c8
[12126.249684]  vfs_read+0xec/0x208
[12126.252904]  ksys_read+0xd0/0x15c
[12126.256210]  __arm64_sys_read+0x84/0x94
[12126.260048]  el0_svc_handler+0x258/0x304
[12126.263963]  el0_svc+0x8/0xc
[12126.266834]
[12126.268317] The buggy address belongs to the variable:
[12126.273458]  __compound_literal.0+0x20/0x800
[12126.277718]
[12126.279201] Memory state around the buggy address:
[12126.283987]  2000123f2980: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 
00
[12126.291200]  2000123f2a00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 
fa
[12126.298412] >2000123f2a80: fa fa fa fa 00 00 00 00 fa fa fa fa 00 00 00 
00
[12126.305624]^
[12126.310927]  2000123f2b00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 
00
[12126.318140]  2000123f2b80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 0

Fixes: 9d80448ac92b ("efi/arm64: Add debugfs node to dump UEFI runtime page tables")
Signed-off-by: Qian Cai 
---
 drivers/firmware/efi/arm-runtime.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/firmware/efi/arm-runtime.c 
b/drivers/firmware/efi/arm-runtime.c
index 23ea1ed409d1..1d28fd933297 100644
--- a/drivers/firmware/efi/arm-runtime.c
+++ b/drivers/firmware/

Re: [PATCH] backing-dev: no need to check return value of debugfs_create functions

2019-01-22 Thread Qian Cai



On 1/22/19 1:33 PM, Greg Kroah-Hartman wrote:
> On Tue, Jan 22, 2019 at 06:19:08PM +0100, Sebastian Andrzej Siewior wrote:
>> On 2019-01-22 17:25:03 [+0100], Greg Kroah-Hartman wrote:
>  }
>  
>  static void bdi_debug_unregister(struct backing_dev_info *bdi)
>  {
> - debugfs_remove(bdi->debug_stats);
> - debugfs_remove(bdi->debug_dir);
> + debugfs_remove_recursive(bdi->debug_dir);

 this won't remove it.
>>>
>>> Which is fine, you don't care.
>>
>> but if you cat the stats file then it will dereference the bdi struct
>> which has been free(), right?
> 
> Maybe, I don't know, your code is long gone, it doesn't matter :)
> 
>>> But step back, how could that original call be NULL?  That only happens
>>> if you pass it a bad parent dentry (which you didn't), or the system is
>>> totally out of memory (in which case you don't care as everything else
>>> is on fire).
>>
>> debugfs_get_inode() could do -ENOMEM and then the directory creation
>> fails with NULL.
> 
> And if that happens, your system has worse problems :)

Well, there are people running longevity tests on debug kernels that
include OOM and read-all-sysfs-files test cases.

Admittedly, the situation right now is not all that healthy, as many
things are unable to survive a low-memory situation; i.e., kmemleak,
DMA API debugging, etc. could just disable themselves.

That being said, it is certainly not necessary to make the situation
worse by triggering a NULL pointer dereference or KASAN use-after-free
warnings because of those patches.


[PATCH v4] mm/hotplug: invalid PFNs from pfn_to_online_page()

2019-01-22 Thread Qian Cai
On an arm64 ThunderX2 server, the first kmemleak scan would crash [1]
with CONFIG_DEBUG_VM_PGFLAGS=y, because page_to_nid() found a pfn that
is not directly mapped (MEMBLOCK_NOMAP) and whose page->flags is
therefore uninitialized.

This is because commit 9f1eb38e0e11 ("mm, kmemleak: little optimization
while scanning") started to use pfn_to_online_page() instead of
pfn_valid(). However, in the CONFIG_MEMORY_HOTPLUG=y case,
pfn_to_online_page() does not call memblock_is_map_memory() while
pfn_valid() does.

Historically, commit 68709f45385a ("arm64: only consider memblocks with
NOMAP cleared for linear mapping") caused pages marked nomap to no
longer be reassigned to the new zone in memmap_init_zone() via
__init_single_page().

Commit 2d070eab2e82 ("mm: consider zone which is not fully populated to
have holes") introduced pfn_to_online_page() and designed it to return
only valid pfns, but it is clearly broken on arm64.

Therefore, let pfn_to_online_page() call pfn_valid_within(), so it can
handle nomap thanks to commit f52bb98f5ade ("arm64: mm: always enable
CONFIG_HOLES_IN_ZONE"), while it will be optimized away on
architectures that have no HOLES_IN_ZONE.

[1]
[  102.195320] Unable to handle kernel NULL pointer dereference at virtual 
address 0006
[  102.204113] Mem abort info:
[  102.206921]   ESR = 0x9605
[  102.209997]   Exception class = DABT (current EL), IL = 32 bits
[  102.215926]   SET = 0, FnV = 0
[  102.218993]   EA = 0, S1PTW = 0
[  102.222150] Data abort info:
[  102.225047]   ISV = 0, ISS = 0x0005
[  102.228887]   CM = 0, WnR = 0
[  102.231866] user pgtable: 64k pages, 48-bit VAs, pgdp = (ptrval)
[  102.238572] [0006] pgd=, pud=
[  102.245448] Internal error: Oops: 9605 [#1] SMP
[  102.264062] CPU: 60 PID: 1408 Comm: kmemleak Not tainted 5.0.0-rc2+ #8
[  102.280403] pstate: 6049 (nZCv daif +PAN -UAO)
[  102.280409] pc : page_mapping+0x24/0x144
[  102.280415] lr : __dump_page+0x34/0x3dc
[  102.292923] sp : 3a5cfd10
[  102.296229] x29: 3a5cfd10 x28: 802f
[  102.301533] x27:  x26: 00277d00
[  102.306835] x25: 10791f56 x24: 7fe0
[  102.312138] x23: 10772f8b x22: 1125f670
[  102.317442] x21: 11311000 x20: 10772f8b
[  102.322747] x19: fffe x18: 
[  102.328049] x17:  x16: 
[  102.52] x15:  x14: 802698b19600
[  102.338654] x13: 802698b1a200 x12: 802698b16f00
[  102.343956] x11: 802698b1a400 x10: 1400
[  102.349260] x9 : 0001 x8 : 1121a000
[  102.354563] x7 :  x6 : 102c53b8
[  102.359868] x5 :  x4 : 0003
[  102.365173] x3 : 0100 x2 : 
[  102.370476] x1 : 10772f8b x0 : 
[  102.375782] Process kmemleak (pid: 1408, stack limit = 0x(ptrval))
[  102.382648] Call trace:
[  102.385091]  page_mapping+0x24/0x144
[  102.388659]  __dump_page+0x34/0x3dc
[  102.392140]  dump_page+0x28/0x4c
[  102.395363]  kmemleak_scan+0x4ac/0x680
[  102.399106]  kmemleak_scan_thread+0xb4/0xdc
[  102.403285]  kthread+0x12c/0x13c
[  102.406509]  ret_from_fork+0x10/0x18
[  102.410080] Code: d503201f f9400660 3640 d1000413 (f9400661)
[  102.416357] ---[ end trace 4d4bd7f573490c8e ]---
[  102.420966] Kernel panic - not syncing: Fatal exception
[  102.426293] SMP: stopping secondary CPUs
[  102.431830] Kernel Offset: disabled
[  102.435311] CPU features: 0x002,2c38
[  102.439223] Memory Limit: none
[  102.442384] ---[ end Kernel panic - not syncing: Fatal exception ]---

Fixes: 9f1eb38e0e11 ("mm, kmemleak: little optimization while scanning")
Acked-by: Michal Hocko 
Signed-off-by: Qian Cai 
---

v4: avoid unsafe macros with side effects.
v3: change the "Fixes" line.
v2: update the changelog; keep the bound check; use pfn_valid_within().

 include/linux/memory_hotplug.h | 18 ++
 1 file changed, 10 insertions(+), 8 deletions(-)

diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index 07da5c6c5ba0..368267c1b71b 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -21,14 +21,16 @@ struct vmem_altmap;
  * walkers which rely on the fully initialized page->flags and others
  * should use this rather than pfn_valid && pfn_to_page
  */
-#define pfn_to_online_page(pfn)\
-({ \
-   struct page *___page = NULL;\
-   unsigned long ___nr = pfn_to_section_nr(pfn);   \
-   \
-   if (___nr < NR_MEM_SECTIONS && online_section_nr(___nr))\
-   ___page = pfn_to_page(p

[PATCH v3] mm/hotplug: invalid PFNs from pfn_to_online_page()

2019-01-21 Thread Qian Cai
On an arm64 ThunderX2 server, the first kmemleak scan would crash [1]
with CONFIG_DEBUG_VM_PGFLAGS=y, because page_to_nid() found a pfn that
is not directly mapped (MEMBLOCK_NOMAP) and whose page->flags is
therefore uninitialized.

This is because commit 9f1eb38e0e11 ("mm, kmemleak: little optimization
while scanning") started to use pfn_to_online_page() instead of
pfn_valid(). However, in the CONFIG_MEMORY_HOTPLUG=y case,
pfn_to_online_page() does not call memblock_is_map_memory() while
pfn_valid() does.

Historically, commit 68709f45385a ("arm64: only consider memblocks with
NOMAP cleared for linear mapping") caused pages marked nomap to no
longer be reassigned to the new zone in memmap_init_zone() via
__init_single_page().

Commit 2d070eab2e82 ("mm: consider zone which is not fully populated to
have holes") introduced pfn_to_online_page() and designed it to return
only valid pfns, but it is clearly broken on arm64.

Therefore, let pfn_to_online_page() call pfn_valid_within(), so it can
handle nomap thanks to commit f52bb98f5ade ("arm64: mm: always enable
CONFIG_HOLES_IN_ZONE"), while it will be optimized away on
architectures that have no HOLES_IN_ZONE.

[1]
[  102.195320] Unable to handle kernel NULL pointer dereference at virtual 
address 0006
[  102.204113] Mem abort info:
[  102.206921]   ESR = 0x9605
[  102.209997]   Exception class = DABT (current EL), IL = 32 bits
[  102.215926]   SET = 0, FnV = 0
[  102.218993]   EA = 0, S1PTW = 0
[  102.222150] Data abort info:
[  102.225047]   ISV = 0, ISS = 0x0005
[  102.228887]   CM = 0, WnR = 0
[  102.231866] user pgtable: 64k pages, 48-bit VAs, pgdp = (ptrval)
[  102.238572] [0006] pgd=, pud=
[  102.245448] Internal error: Oops: 9605 [#1] SMP
[  102.264062] CPU: 60 PID: 1408 Comm: kmemleak Not tainted 5.0.0-rc2+ #8
[  102.280403] pstate: 6049 (nZCv daif +PAN -UAO)
[  102.280409] pc : page_mapping+0x24/0x144
[  102.280415] lr : __dump_page+0x34/0x3dc
[  102.292923] sp : 3a5cfd10
[  102.296229] x29: 3a5cfd10 x28: 802f
[  102.301533] x27:  x26: 00277d00
[  102.306835] x25: 10791f56 x24: 7fe0
[  102.312138] x23: 10772f8b x22: 1125f670
[  102.317442] x21: 11311000 x20: 10772f8b
[  102.322747] x19: fffe x18: 
[  102.328049] x17:  x16: 
[  102.52] x15:  x14: 802698b19600
[  102.338654] x13: 802698b1a200 x12: 802698b16f00
[  102.343956] x11: 802698b1a400 x10: 1400
[  102.349260] x9 : 0001 x8 : 1121a000
[  102.354563] x7 :  x6 : 102c53b8
[  102.359868] x5 :  x4 : 0003
[  102.365173] x3 : 0100 x2 : 
[  102.370476] x1 : 10772f8b x0 : 
[  102.375782] Process kmemleak (pid: 1408, stack limit = 0x(ptrval))
[  102.382648] Call trace:
[  102.385091]  page_mapping+0x24/0x144
[  102.388659]  __dump_page+0x34/0x3dc
[  102.392140]  dump_page+0x28/0x4c
[  102.395363]  kmemleak_scan+0x4ac/0x680
[  102.399106]  kmemleak_scan_thread+0xb4/0xdc
[  102.403285]  kthread+0x12c/0x13c
[  102.406509]  ret_from_fork+0x10/0x18
[  102.410080] Code: d503201f f9400660 3640 d1000413 (f9400661)
[  102.416357] ---[ end trace 4d4bd7f573490c8e ]---
[  102.420966] Kernel panic - not syncing: Fatal exception
[  102.426293] SMP: stopping secondary CPUs
[  102.431830] Kernel Offset: disabled
[  102.435311] CPU features: 0x002,2c38
[  102.439223] Memory Limit: none
[  102.442384] ---[ end Kernel panic - not syncing: Fatal exception ]---

Fixes: 9f1eb38e0e11 ("mm, kmemleak: little optimization while scanning")
Acked-by: Michal Hocko 
Signed-off-by: Qian Cai 
---

v3: change the "Fixes" line.
v2: update the changelog; keep the bound check; use pfn_valid_within().

 include/linux/memory_hotplug.h | 17 +
 1 file changed, 9 insertions(+), 8 deletions(-)

diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index 07da5c6c5ba0..cdeecd9bd87e 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -21,14 +21,15 @@ struct vmem_altmap;
  * walkers which rely on the fully initialized page->flags and others
  * should use this rather than pfn_valid && pfn_to_page
  */
-#define pfn_to_online_page(pfn)\
-({ \
-   struct page *___page = NULL;\
-   unsigned long ___nr = pfn_to_section_nr(pfn);   \
-   \
-   if (___nr < NR_MEM_SECTIONS && online_section_nr(___nr))\
-   ___page = pfn_to_page(pfn); \
-   

[PATCH v2] mm/hotplug: invalid PFNs from pfn_to_online_page()

2019-01-21 Thread Qian Cai
On an arm64 ThunderX2 server, the first kmemleak scan would crash [1]
with CONFIG_DEBUG_VM_PGFLAGS=y, because page_to_nid() found a pfn that
is not directly mapped (MEMBLOCK_NOMAP) and whose page->flags is
therefore uninitialized.

This is because commit 9f1eb38e0e11 ("mm, kmemleak: little optimization
while scanning") started to use pfn_to_online_page() instead of
pfn_valid(). However, in the CONFIG_MEMORY_HOTPLUG=y case,
pfn_to_online_page() does not call memblock_is_map_memory() while
pfn_valid() does.

Historically, commit 68709f45385a ("arm64: only consider memblocks with
NOMAP cleared for linear mapping") caused pages marked nomap to no
longer be reassigned to the new zone in memmap_init_zone() via
__init_single_page().

Commit 2d070eab2e82 ("mm: consider zone which is not fully populated to
have holes") introduced pfn_to_online_page() and designed it to return
only valid pfns, but it is clearly broken on arm64.

Therefore, let pfn_to_online_page() call pfn_valid_within(), so it can
handle nomap thanks to commit f52bb98f5ade ("arm64: mm: always enable
CONFIG_HOLES_IN_ZONE"), while it will be optimized away on
architectures that have no HOLES_IN_ZONE.

[1]
[  102.195320] Unable to handle kernel NULL pointer dereference at virtual 
address 0006
[  102.204113] Mem abort info:
[  102.206921]   ESR = 0x9605
[  102.209997]   Exception class = DABT (current EL), IL = 32 bits
[  102.215926]   SET = 0, FnV = 0
[  102.218993]   EA = 0, S1PTW = 0
[  102.222150] Data abort info:
[  102.225047]   ISV = 0, ISS = 0x0005
[  102.228887]   CM = 0, WnR = 0
[  102.231866] user pgtable: 64k pages, 48-bit VAs, pgdp = (ptrval)
[  102.238572] [0006] pgd=, pud=
[  102.245448] Internal error: Oops: 9605 [#1] SMP
[  102.264062] CPU: 60 PID: 1408 Comm: kmemleak Not tainted 5.0.0-rc2+ #8
[  102.280403] pstate: 6049 (nZCv daif +PAN -UAO)
[  102.280409] pc : page_mapping+0x24/0x144
[  102.280415] lr : __dump_page+0x34/0x3dc
[  102.292923] sp : 3a5cfd10
[  102.296229] x29: 3a5cfd10 x28: 802f
[  102.301533] x27:  x26: 00277d00
[  102.306835] x25: 10791f56 x24: 7fe0
[  102.312138] x23: 10772f8b x22: 1125f670
[  102.317442] x21: 11311000 x20: 10772f8b
[  102.322747] x19: fffe x18: 
[  102.328049] x17:  x16: 
[  102.52] x15:  x14: 802698b19600
[  102.338654] x13: 802698b1a200 x12: 802698b16f00
[  102.343956] x11: 802698b1a400 x10: 1400
[  102.349260] x9 : 0001 x8 : 1121a000
[  102.354563] x7 :  x6 : 102c53b8
[  102.359868] x5 :  x4 : 0003
[  102.365173] x3 : 0100 x2 : 
[  102.370476] x1 : 10772f8b x0 : 
[  102.375782] Process kmemleak (pid: 1408, stack limit = 0x(ptrval))
[  102.382648] Call trace:
[  102.385091]  page_mapping+0x24/0x144
[  102.388659]  __dump_page+0x34/0x3dc
[  102.392140]  dump_page+0x28/0x4c
[  102.395363]  kmemleak_scan+0x4ac/0x680
[  102.399106]  kmemleak_scan_thread+0xb4/0xdc
[  102.403285]  kthread+0x12c/0x13c
[  102.406509]  ret_from_fork+0x10/0x18
[  102.410080] Code: d503201f f9400660 3640 d1000413 (f9400661)
[  102.416357] ---[ end trace 4d4bd7f573490c8e ]---
[  102.420966] Kernel panic - not syncing: Fatal exception
[  102.426293] SMP: stopping secondary CPUs
[  102.431830] Kernel Offset: disabled
[  102.435311] CPU features: 0x002,2c38
[  102.439223] Memory Limit: none
[  102.442384] ---[ end Kernel panic - not syncing: Fatal exception ]---

Fixes: 2d070eab2e82 ("mm: consider zone which is not fully populated to have holes")
Signed-off-by: Qian Cai 
---

v2: update the changelog; keep the bound check; use pfn_valid_within().

 include/linux/memory_hotplug.h | 17 +
 1 file changed, 9 insertions(+), 8 deletions(-)

diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index 07da5c6c5ba0..cdeecd9bd87e 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -21,14 +21,15 @@ struct vmem_altmap;
  * walkers which rely on the fully initialized page->flags and others
  * should use this rather than pfn_valid && pfn_to_page
  */
-#define pfn_to_online_page(pfn)\
-({ \
-   struct page *___page = NULL;\
-   unsigned long ___nr = pfn_to_section_nr(pfn);   \
-   \
-   if (___nr < NR_MEM_SECTIONS && online_section_nr(___nr))\
-   ___page = pfn_to_page(pfn); \
-   ___pag

Re: kmemleak panic

2019-01-18 Thread Qian Cai



On 1/18/19 12:05 PM, Marc Gonzalez wrote:
> On 18/01/2019 17:14, Qian Cai wrote:
> 
>> This looks like something different from the original "invalid PFNs from
>> pfn_to_online_page()" issue. What's your .config ?
> 
> Here's my defconfig:
> 
> # CONFIG_SWAP is not set
> CONFIG_NO_HZ_IDLE=y
> CONFIG_HIGH_RES_TIMERS=y
> CONFIG_PREEMPT=y
> CONFIG_IKCONFIG=y
> CONFIG_IKCONFIG_PROC=y
> CONFIG_LOG_BUF_SHIFT=20
> CONFIG_BLK_DEV_INITRD=y
> CONFIG_ARCH_QCOM=y
> # CONFIG_EFI is not set
> # CONFIG_SUSPEND is not set
> CONFIG_PM=y
> # CONFIG_MQ_IOSCHED_DEADLINE is not set
> # CONFIG_MQ_IOSCHED_KYBER is not set
> CONFIG_UEVENT_HELPER_PATH="/sbin/hotplug"
> CONFIG_DEVTMPFS=y
> CONFIG_DEVTMPFS_MOUNT=y
> # CONFIG_BLK_DEV is not set
> # CONFIG_INPUT_KEYBOARD is not set
> # CONFIG_INPUT_MOUSE is not set
> # CONFIG_SERIO_SERPORT is not set
> CONFIG_VT_HW_CONSOLE_BINDING=y
> CONFIG_LEGACY_PTY_COUNT=16
> CONFIG_SERIAL_8250=y
> CONFIG_SERIAL_MSM=y
> CONFIG_SERIAL_MSM_CONSOLE=y
> CONFIG_SERIAL_DEV_BUS=y
> # CONFIG_HW_RANDOM is not set
> # CONFIG_HWMON is not set
> CONFIG_REGULATOR=y
> CONFIG_REGULATOR_FIXED_VOLTAGE=y
> CONFIG_REGULATOR_QCOM_SMD_RPM=y
> # CONFIG_HID is not set
> # CONFIG_USB_SUPPORT is not set
> # CONFIG_COMMON_CLK_XGENE is not set
> CONFIG_COMMON_CLK_QCOM=y
> # CONFIG_QCOM_A53PLL is not set
> # CONFIG_QCOM_CLK_APCS_MSM8916 is not set
> CONFIG_QCOM_CLK_SMD_RPM=y
> CONFIG_MSM_GCC_8998=y
> CONFIG_HWSPINLOCK=y
> CONFIG_HWSPINLOCK_QCOM=y
> CONFIG_MAILBOX=y
> CONFIG_QCOM_APCS_IPC=y
> CONFIG_RPMSG_QCOM_GLINK_RPM=y
> CONFIG_RPMSG_QCOM_GLINK_SMEM=y
> CONFIG_RPMSG_QCOM_SMD=y
> CONFIG_RPMSG_VIRTIO=y
> CONFIG_QCOM_COMMAND_DB=y
> CONFIG_QCOM_SMEM=y
> CONFIG_QCOM_SMD_RPM=y
> CONFIG_TMPFS=y
> CONFIG_DEBUG_INFO=y
> CONFIG_DEBUG_KERNEL=y
> CONFIG_DEBUG_KMEMLEAK=y
> CONFIG_DEBUG_KMEMLEAK_EARLY_LOG_SIZE=4000
> CONFIG_FUNCTION_TRACER=y
> 

This is truncated. I'll need your whole .config file, so I can try to reproduce
it here on an arm64 ThunderX2 server.


Re: [PATCH v2] rbtree: fix the red root

2019-01-18 Thread Qian Cai



On 1/16/19 9:37 AM, Esme wrote:
> I have been off but back now, I had fetch'd current again and the diagnostics 
> look a bit different, maybe I just got lucky.  Instead of fork'ng the test 
> case (which is fairly aggressive in any case), interacting from the serial 
> port with sig-int ^C tend's to trigger enough to hit something.  I'll get the 
> page_owner sorted soon.
> 
> How I'm running;
> 
> qemu-system-x86_64 -kernel /home/files/dl/linux//arch/x86/boot/bzImage 
> -append console=ttyS0 root=/dev/sda debug earlyprintk=serial slub_debug=QUZFP 
> page_owner=on -hda stretch.img -net user,hostfwd=tcp::10021-:22 -net nic 
> -enable-kvm -nographic -m 2G -smp 2
> 
> It's somewhat random I guess that in the last two CPU context dump's printed 
> out, we see RAX and CR2 off by 4 from one another.
> 
> root@syzkaller:~# gcc -o test3 test3.c
> [  392.754148] ata1: lost interrupt (Status 0x50)
> [  392.754478] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 
> frozen
> [  392.759687] ata1.00: failed command: READ DMA
> [  392.761902] ata1.00: cmd c8/00:86:00:00:00/00:00:00:00:00/e0 tag 0 dma 
> 68608 out
> [  392.761902]  res 40/00:01:00:00:00/00:00:00:00:00/a0 Emask 0x4 
> (timeout)
> [  392.768541] ata1.00: status: { DRDY }
> [  392.769532] ata1: soft resetting link
> [  392.937942] ata1.00: configured for MWDMA2
> [  392.945624] ata1: EH complete

While you are gathering page_owner data (or a kdump), it might be useful
to use the virtio storage driver instead of legacy IDE here, as it looks
like the ATA device was busted.


Re: kmemleak panic

2019-01-18 Thread Qian Cai



On 1/18/19 10:36 AM, Marc Gonzalez wrote:
> On 18/01/2019 15:34, Catalin Marinas wrote:
> 
>> On Fri, Jan 18, 2019 at 02:36:46PM +0100, Marc Gonzalez wrote:
>>
>>> Trying to diagnose a separate issue, I enabled a raft of debugging options,
>>> including kmemleak. However, it looks like kmemleak itself is crashing?
>>>
>>> We seem to be crashing on this code:
>>>
>>> kasan_disable_current();
>>> pointer = *ptr;
>>> kasan_enable_current();
>>
>> There was another regression reported recently:
>>
>> http://lkml.kernel.org/r/51e79597-21ef-3073-9036-cfc33291f...@lca.pw
>>
>> See if reverting commit 9f1eb38e0e113 (mm, kmemleak: little optimization 
>> while
>> scanning) fixes it.
>>
> 
> [Drop LAKML, add LKML, add recipients]
> 
> Bug is easy to reproduce:
> 
> boot
> mount -t debugfs nodev /sys/kernel/debug/ 
> echo scan > /sys/kernel/debug/kmemleak
> 
> Unable to handle kernel paging request at virtual address ffc021e0
> Mem abort info:
>   ESR = 0x9606
>   Exception class = DABT (current EL), IL = 32 bits
>   SET = 0, FnV = 0
>   EA = 0, S1PTW = 0
> Data abort info:
>   ISV = 0, ISS = 0x0006
>   CM = 0, WnR = 0
> swapper pgtable: 4k pages, 39-bit VAs, pgdp = (ptrval)
> [ffc021e0] pgd=00017e3ba803, pud=00017e3ba803, 
> pmd=
> Internal error: Oops: 9606 [#1] PREEMPT SMP
> Modules linked in:
> CPU: 0 PID: 635 Comm: exe Not tainted 5.0.0-rc1 #16
> Hardware name: Qualcomm Technologies, Inc. MSM8998 v1 MTP (DT)
> pstate: 8085 (Nzcv daIf -PAN -UAO)
> pc : scan_block+0x70/0x190
> lr : scan_block+0x6c/0x190
> sp : ff80133b3b20
> x29: ff80133b3b20 x28: ffc0fdbaf018 
> x27: ffc02200 x26: 0080 
> x25: ff80118bdf70 x24: ffc0f8cc8000 
> x23: ff8010bd8000 x22: ff8010bd8830 
> x21: ffc021e00ff9 x20: ffc0f8cc8050 
> x19: ffc021e0 x18: 25fd 
> x17: 0200 x16:  
> x15: ff8010c24dd8 x14: 25f9 
> x13: 445b0e6c x12: ffc0f5a96658 
> x11: 0001 x10: ff8010bae688 
> x9 : ff8010baf000 x8 : ff8010bae688 
> x7 : 0003 x6 :  
> x5 : ff801133ad60 x4 : 2878 
> x3 : ff8010c24d88 x2 : 7c512d102eca1300 
> x1 : ffc0f5a81b00 x0 :  
> Process exe (pid: 635, stack limit = 0x(ptrval))
> Call trace:
>  scan_block+0x70/0x190
>  scan_gray_list+0x108/0x1c0
>  kmemleak_scan+0x33c/0x7c0
>  kmemleak_write+0x410/0x4b0
>  full_proxy_write+0x68/0xa0
>  __vfs_write+0x60/0x190
>  vfs_write+0xac/0x1a0
>  ksys_write+0x6c/0xe0
>  __arm64_sys_write+0x24/0x30
>  el0_svc_handler+0xc0/0x160
>  el0_svc+0x8/0xc
> Code: f9000fb4 d503201f 97d2 35000580 (f9400260) 
> ---[ end trace 8797ac2fea89abd6 ]---
> note: exe[635] exited with preempt_count 2
> 
> 
> 
> Reverting 9f1eb38e0e1131e75cc4ac684391b25d70282589 does not help:

This looks like something different from the original "invalid PFNs from
pfn_to_online_page()" issue. What's your .config ?


kmemleak scan crash due to invalid PFNs

2019-01-17 Thread Qian Cai
On an arm64 ThunderX2 server, the first kmemleak scan would crash with
CONFIG_DEBUG_VM_PGFLAGS=y, because page_to_nid() found a pfn that is not
directly mapped and whose page->flags is therefore not initialized.

Reverting 9f1eb38e0e113 ("mm, kmemleak: little optimization while
scanning") fixed the problem.

[  102.195320] Unable to handle kernel NULL pointer dereference at virtual
address 0006
[  102.204113] Mem abort info:
[  102.206921]   ESR = 0x9605
[  102.209997]   Exception class = DABT (current EL), IL = 32 bits
[  102.215926]   SET = 0, FnV = 0
[  102.218993]   EA = 0, S1PTW = 0
[  102.222150] Data abort info:
[  102.225047]   ISV = 0, ISS = 0x0005
[  102.228887]   CM = 0, WnR = 0
[  102.231866] user pgtable: 64k pages, 48-bit VAs, pgdp = (ptrval)
[  102.238572] [0006] pgd=, pud=
[  102.245448] Internal error: Oops: 9605 [#1] SMP
[  102.264062] CPU: 60 PID: 1408 Comm: kmemleak Not tainted 5.0.0-rc2+ #8
[  102.280403] pstate: 6049 (nZCv daif +PAN -UAO)
[  102.280409] pc : page_mapping+0x24/0x144
[  102.280415] lr : __dump_page+0x34/0x3dc
[  102.292923] sp : 3a5cfd10
[  102.296229] x29: 3a5cfd10 x28: 802f
[  102.301533] x27:  x26: 00277d00
[  102.306835] x25: 10791f56 x24: 7fe0
[  102.312138] x23: 10772f8b x22: 1125f670
[  102.317442] x21: 11311000 x20: 10772f8b
[  102.322747] x19: fffe x18: 
[  102.328049] x17:  x16: 
[  102.52] x15:  x14: 802698b19600
[  102.338654] x13: 802698b1a200 x12: 802698b16f00
[  102.343956] x11: 802698b1a400 x10: 1400
[  102.349260] x9 : 0001 x8 : 1121a000
[  102.354563] x7 :  x6 : 102c53b8
[  102.359868] x5 :  x4 : 0003
[  102.365173] x3 : 0100 x2 : 
[  102.370476] x1 : 10772f8b x0 : 
[  102.375782] Process kmemleak (pid: 1408, stack limit = 0x(ptrval))
[  102.382648] Call trace:
[  102.385091]  page_mapping+0x24/0x144
[  102.388659]  __dump_page+0x34/0x3dc
[  102.392140]  dump_page+0x28/0x4c
[  102.395363]  kmemleak_scan+0x4ac/0x680
[  102.399106]  kmemleak_scan_thread+0xb4/0xdc
[  102.403285]  kthread+0x12c/0x13c
[  102.406509]  ret_from_fork+0x10/0x18
[  102.410080] Code: d503201f f9400660 3640 d1000413 (f9400661)
[  102.416357] ---[ end trace 4d4bd7f573490c8e ]---
[  102.420966] Kernel panic - not syncing: Fatal exception
[  102.426293] SMP: stopping secondary CPUs
[  102.431830] Kernel Offset: disabled
[  102.435311] CPU features: 0x002,2c38
[  102.439223] Memory Limit: none
[  102.442384] ---[ end Kernel panic - not syncing: Fatal exception ]---


[PATCH] slab: kmemleak no scan alien caches

2019-01-16 Thread Qian Cai
Kmemleak throws endless warnings during boot because, in
__alloc_alien_cache(),

alc = kmalloc_node(memsize, gfp, node);
init_arraycache(&alc->ac, entries, batch);
kmemleak_no_scan(ac);

Kmemleak does not track the array cache (alc->ac) but the alien cache
(alc) instead, so let it track the latter by lifting kmemleak_no_scan()
out of init_arraycache().

There is one other caller of init_arraycache(), but
alloc_kmem_cache_cpus() uses a percpu allocation, which will never be
considered a leak.

[   32.258841] kmemleak: Found object by alias at 0x8007b9aa7e38
[   32.258847] CPU: 190 PID: 1 Comm: swapper/0 Not tainted 5.0.0-rc2+ #2
[   32.258851] Call trace:
[   32.258858]  dump_backtrace+0x0/0x168
[   32.258863]  show_stack+0x24/0x30
[   32.258868]  dump_stack+0x88/0xb0
[   32.258873]  lookup_object+0x84/0xac
[   32.258877]  find_and_get_object+0x84/0xe4
[   32.258882]  kmemleak_no_scan+0x74/0xf4
[   32.258887]  setup_kmem_cache_node+0x2b4/0x35c
[   32.258892]  __do_tune_cpucache+0x250/0x2d4
[   32.258896]  do_tune_cpucache+0x4c/0xe4
[   32.258901]  enable_cpucache+0xc8/0x110
[   32.258905]  setup_cpu_cache+0x40/0x1b8
[   32.258909]  __kmem_cache_create+0x240/0x358
[   32.258913]  create_cache+0xc0/0x198
[   32.258918]  kmem_cache_create_usercopy+0x158/0x20c
[   32.258922]  kmem_cache_create+0x50/0x64
[   32.258928]  fsnotify_init+0x58/0x6c
[   32.258932]  do_one_initcall+0x194/0x388
[   32.258937]  kernel_init_freeable+0x668/0x688
[   32.258941]  kernel_init+0x18/0x124
[   32.258946]  ret_from_fork+0x10/0x18
[   32.258950] kmemleak: Object 0x8007b9aa7e00 (size 256):
[   32.258954] kmemleak:   comm "swapper/0", pid 1, jiffies 4294697137
[   32.258958] kmemleak:   min_count = 1
[   32.258962] kmemleak:   count = 0
[   32.258965] kmemleak:   flags = 0x1
[   32.258969] kmemleak:   checksum = 0
[   32.258972] kmemleak:   backtrace:
[   32.258977]  kmemleak_alloc+0x84/0xb8
[   32.258982]  kmem_cache_alloc_node_trace+0x31c/0x3a0
[   32.258987]  __kmalloc_node+0x58/0x78
[   32.258991]  setup_kmem_cache_node+0x26c/0x35c
[   32.258996]  __do_tune_cpucache+0x250/0x2d4
[   32.259001]  do_tune_cpucache+0x4c/0xe4
[   32.259005]  enable_cpucache+0xc8/0x110
[   32.259010]  setup_cpu_cache+0x40/0x1b8
[   32.259014]  __kmem_cache_create+0x240/0x358
[   32.259018]  create_cache+0xc0/0x198
[   32.259022]  kmem_cache_create_usercopy+0x158/0x20c
[   32.259026]  kmem_cache_create+0x50/0x64
[   32.259031]  fsnotify_init+0x58/0x6c
[   32.259035]  do_one_initcall+0x194/0x388
[   32.259039]  kernel_init_freeable+0x668/0x688
[   32.259043]  kernel_init+0x18/0x124
[   32.259048] kmemleak: Not scanning unknown object at 0x8007b9aa7e38
[   32.259052] CPU: 190 PID: 1 Comm: swapper/0 Not tainted 5.0.0-rc2+ #2
[   32.259056] Call trace:
[   32.259060]  dump_backtrace+0x0/0x168
[   32.259065]  show_stack+0x24/0x30
[   32.259070]  dump_stack+0x88/0xb0
[   32.259074]  kmemleak_no_scan+0x90/0xf4
[   32.259078]  setup_kmem_cache_node+0x2b4/0x35c
[   32.259083]  __do_tune_cpucache+0x250/0x2d4
[   32.259088]  do_tune_cpucache+0x4c/0xe4
[   32.259092]  enable_cpucache+0xc8/0x110
[   32.259096]  setup_cpu_cache+0x40/0x1b8
[   32.259100]  __kmem_cache_create+0x240/0x358
[   32.259104]  create_cache+0xc0/0x198
[   32.259108]  kmem_cache_create_usercopy+0x158/0x20c
[   32.259112]  kmem_cache_create+0x50/0x64
[   32.259116]  fsnotify_init+0x58/0x6c
[   32.259120]  do_one_initcall+0x194/0x388
[   32.259125]  kernel_init_freeable+0x668/0x688
[   32.259129]  kernel_init+0x18/0x124
[   32.259133]  ret_from_fork+0x10/0x18

Fixes: 1fe00d50a9e8 (slab: factor out initialization of array cache)
Signed-off-by: Qian Cai 
---
 mm/slab.c | 17 +
 1 file changed, 9 insertions(+), 8 deletions(-)

diff --git a/mm/slab.c b/mm/slab.c
index 78eb8c5bf4e4..0aff454f007b 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -550,14 +550,6 @@ static void start_cpu_timer(int cpu)
 
 static void init_arraycache(struct array_cache *ac, int limit, int batch)
 {
-   /*
-* The array_cache structures contain pointers to free object.
-* However, when such objects are allocated or transferred to another
-* cache the pointers are not cleared and they could be counted as
-* valid references during a kmemleak scan. Therefore, kmemleak must
-* not scan such objects.
-*/
-   kmemleak_no_scan(ac);
if (ac) {
ac->avail = 0;
ac->limit = limit;
@@ -573,6 +565,14 @@ static struct array_cache *alloc_arraycache(int node, int 
entries,
struct array_cache *ac = NULL;
 
ac = kmalloc_node(memsize, gfp, node);
+   /*
+* The array_cache structures contain pointers to free object.
+* However, when such objects are allocated or transferred to another
+* cache the pointers are not cleared and they could be counted as
+* valid references during a kmemleak scan. Th

[PATCH] Revert "mm: use early_pfn_to_nid in page_ext_init"

2019-01-15 Thread Qian Cai
This reverts commit fe53ca54270a757f0a28ee6bf3a54d952b550ed0.

When booting a system with "page_owner=on",

start_kernel
  page_ext_init
invoke_init_callbacks
  init_section_page_ext
init_page_owner
  init_early_allocated_pages
init_zones_in_node
  init_pages_in_zone
lookup_page_ext
  page_to_nid

The issue here is that page_to_nid() will not work yet, because with
DEFERRED_STRUCT_PAGE_INIT some page flags carry no node information until
later in page_alloc_init_late(). Hence, it can trigger an out-of-bounds
access with an invalid nid.

[8.666047] UBSAN: Undefined behaviour in ./include/linux/mm.h:1104:50
[8.672603] index 7 is out of range for type 'zone [5]'

Also, the kernel will panic since the flags were poisoned earlier with,

CONFIG_DEBUG_VM_PGFLAGS=y
CONFIG_NODE_NOT_IN_PAGE_FLAGS=n

start_kernel
  setup_arch
pagetable_init
  paging_init
sparse_init
  sparse_init_nid
memblock_alloc_try_nid_raw

init_pages_in_zone() does not handle this well either, since it also ends
up calling page_to_nid().

page:ea000420 is uninitialized and poisoned
raw:    
raw:    
page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p))
page_owner info is not active (free page?)
kernel BUG at include/linux/mm.h:990!
RIP: 0010:init_page_owner+0x486/0x520

This means that the assumptions behind commit fe53ca54270a ("mm: use
early_pfn_to_nid in page_ext_init") are incomplete. Therefore, revert
the commit for now. The proper way to move the page_owner initialization
earlier is to hook into the memmap initialization.

Acked-by: Michal Hocko 
Signed-off-by: Qian Cai 
---
 init/main.c   | 3 ++-
 mm/page_ext.c | 4 +---
 2 files changed, 3 insertions(+), 4 deletions(-)

diff --git a/init/main.c b/init/main.c
index e2e80ca3165a..c86a1c8f19f4 100644
--- a/init/main.c
+++ b/init/main.c
@@ -695,7 +695,6 @@ asmlinkage __visible void __init start_kernel(void)
initrd_start = 0;
}
 #endif
-   page_ext_init();
kmemleak_init();
setup_per_cpu_pageset();
numa_policy_init();
@@ -1131,6 +1130,8 @@ static noinline void __init kernel_init_freeable(void)
sched_init_smp();
 
page_alloc_init_late();
+   /* Initialize page ext after all struct pages are initialized. */
+   page_ext_init();
 
do_basic_setup();
 
diff --git a/mm/page_ext.c b/mm/page_ext.c
index ae44f7adbe07..8c78b8d45117 100644
--- a/mm/page_ext.c
+++ b/mm/page_ext.c
@@ -398,10 +398,8 @@ void __init page_ext_init(void)
 * We know some arch can have a nodes layout such as
 * -pfn-->
 * N0 | N1 | N2 | N0 | N1 | N2|
-*
-* Take into account DEFERRED_STRUCT_PAGE_INIT.
 */
-   if (early_pfn_to_nid(pfn) != nid)
+   if (pfn_to_nid(pfn) != nid)
continue;
if (init_section_page_ext(pfn, nid))
goto oom;
-- 
2.17.2 (Apple Git-113)



Re: [PATCH] kvm: add proper frame pointer logic for vmx

2019-01-15 Thread Qian Cai



On 1/15/19 12:49 PM, Sean Christopherson wrote:
> On Tue, Jan 15, 2019 at 09:48:45AM -0800, Sean Christopherson wrote:
>> On Tue, Jan 15, 2019 at 11:43:20AM -0500, Qian Cai wrote:
>>>
>>>
>>> On 1/15/19 2:13 AM, Paolo Bonzini wrote:
>>>> Hmm, maybe like this:
>>>>
>>>> diff --git a/arch/x86/kvm/vmx/vmenter.S b/arch/x86/kvm/vmx/vmenter.S
>>>> index bcef2c7e9bc4..33122fa9d4bd 100644
>>>> --- a/arch/x86/kvm/vmx/vmenter.S
>>>> +++ b/arch/x86/kvm/vmx/vmenter.S
>>>> @@ -26,19 +26,17 @@ ENTRY(vmx_vmenter)
>>>>ret
>>>>
>>>>  2:vmlaunch
>>>> +3:
>>>>ret
>>>>
>>>> -3:cmpb $0, kvm_rebooting
>>>> -  jne 4f
>>>> -  call kvm_spurious_fault
>>>> -4:ret
>>>> -
>>>>.pushsection .fixup, "ax"
>>>> -5:jmp 3b
>>>> +4:cmpb $0, kvm_rebooting
>>>> +  jne 3b
>>>> +  jmp kvm_spurious_fault
>>>>.popsection
>>>>
>>>> -  _ASM_EXTABLE(1b, 5b)
>>>> -  _ASM_EXTABLE(2b, 5b)
>>>> +  _ASM_EXTABLE(1b, 4b)
>>>> +  _ASM_EXTABLE(2b, 4b)
>>>>
>>>>  ENDPROC(vmx_vmenter)
>>>
>>> No, that will not work. The problem is in vmx.o where I just sent another 
>>> patch
>>> for it.
>>>
>>> I can see there are five options to solve it.
>>>
>>> 1) always inline vmx_vcpu_run()
>>> 2) always noinline vmx_vcpu_run()
>>> 3) add a -fdisable-ipa-fnsplit option to the Makefile for vmx.o
>>> 4) let STACK_FRAME_NON_STANDARD support part.* syntax.
>>
>> What is ".part." and where does it come from?  Searching for information
>> is futile, the term is too generic.
> 
> And never mind, my eyes glazed over -fdisable-ipa-fnsplit.

For example, this works too,

diff --git a/arch/x86/kvm/Makefile b/arch/x86/kvm/Makefile
index 69b3a7c30013..990dfc254e71 100644
--- a/arch/x86/kvm/Makefile
+++ b/arch/x86/kvm/Makefile
@@ -4,7 +4,7 @@ ccflags-y += -Iarch/x86/kvm

 CFLAGS_x86.o := -I.
 CFLAGS_svm.o := -I.
-CFLAGS_vmx.o := -I.
+CFLAGS_vmx.o := -I. -fdisable-tree-fnsplit

 KVM := ../../../virt/kvm



Re: [PATCH] kvm: add proper frame pointer logic for vmx

2019-01-15 Thread Qian Cai



On 1/15/19 11:43 AM, Qian Cai wrote:
> 
> 
> On 1/15/19 2:13 AM, Paolo Bonzini wrote:
>> Hmm, maybe like this:
>>
>> diff --git a/arch/x86/kvm/vmx/vmenter.S b/arch/x86/kvm/vmx/vmenter.S
>> index bcef2c7e9bc4..33122fa9d4bd 100644
>> --- a/arch/x86/kvm/vmx/vmenter.S
>> +++ b/arch/x86/kvm/vmx/vmenter.S
>> @@ -26,19 +26,17 @@ ENTRY(vmx_vmenter)
>>  ret
>>
>>  2:  vmlaunch
>> +3:
>>  ret
>>
>> -3:  cmpb $0, kvm_rebooting
>> -jne 4f
>> -call kvm_spurious_fault
>> -4:  ret
>> -
>>  .pushsection .fixup, "ax"
>> -5:  jmp 3b
>> +4:  cmpb $0, kvm_rebooting
>> +jne 3b
>> +jmp kvm_spurious_fault
>>  .popsection
>>
>> -_ASM_EXTABLE(1b, 5b)
>> -_ASM_EXTABLE(2b, 5b)
>> +_ASM_EXTABLE(1b, 4b)
>> +_ASM_EXTABLE(2b, 4b)
>>
>>  ENDPROC(vmx_vmenter)
> 
> No, that will not work. The problem is in vmx.o where I just sent another 
> patch
> for it.
> 
> I can see there are five options to solve it.
> 
> 1) always inline vmx_vcpu_run()
> 2) always noinline vmx_vcpu_run()
> 3) add a -fdisable-ipa-fnsplit option to the Makefile for vmx.o
> 4) let STACK_FRAME_NON_STANDARD support the part.* syntax.
> 5) trim down vmx_vcpu_run() even more so GCC does not split it.
> 
> Options 1) and 2) seem to give away the decision for users with
> CONFIG_CC_OPTIMIZE_FOR_(PERFORMANCE/SIZE).
> 
> Option 3) prevents the other functions there from being split for optimization.
> 
> Options 4) and 5) seem tricky to implement.
> 
> I am now leaning more toward 3), as the only other function that will miss
> splitting is vmx_segment_access_rights().
> 

Option 3) seems a bit tricky to implement too, as it needs to check the
compiler version to see whether the option is available before adding it
to CFLAGS.
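Kbuild's cc-option macro handles exactly that kind of check; a sketch, using the -fdisable-tree-fnsplit spelling shown working elsewhere in this thread:

```make
# Probe the flag with cc-option so compilers that lack it silently drop it.
CFLAGS_vmx.o := -I. $(call cc-option,-fdisable-tree-fnsplit)
```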


Re: [PATCH] kvm: add proper frame pointer logic for vmx

2019-01-15 Thread Qian Cai



On 1/15/19 11:34 AM, Sean Christopherson wrote:
> On Tue, Jan 15, 2019 at 08:13:22AM +0100, Paolo Bonzini wrote:
>> On 15/01/19 08:04, Qian Cai wrote:
>>>
>>>
>>> On 1/15/19 1:44 AM, Qian Cai wrote:
>>>> compilation warning since v5.0-rc1,
>>>>
>>>> arch/x86/kvm/vmx/vmx.o: warning: objtool: vmx_vcpu_run.part.17()+0x3171:
>>>> call without frame pointer save/setup
> 
> The warning is complaining about vmx_vcpu_run() in vmx.c, not vmenter.S.
> The rule being "broken" is that a call is made without creating a stack
> frame, and vmx_vmenter() obviously makes no calls.
> 
> E.g., manually running objtool check:
> 
> $ tools/objtool/objtool check arch/x86/kvm/vmx/vmenter.o
> $ tools/objtool/objtool check arch/x86/kvm/vmx/vmx.o
> arch/x86/kvm/vmx/vmx.o: warning: objtool: vmx_vcpu_run.part.19()+0x83e: 
> call without frame pointer save/setup
> 
> I put "broken" in quotes because AFAICT we're not actually violating the
> rule.  From tools/objtool/Documentation/stack-validation.txt:
> 
>If it's a GCC-compiled .c file, the error may be because the function
>uses an inline asm() statement which has a "call" instruction.  An
>asm() statement with a call instruction must declare the use of the
>stack pointer in its output operand.  On x86_64, this means adding
>the ASM_CALL_CONSTRAINT as an output constraint:
> 
>  asm volatile("call func" : ASM_CALL_CONSTRAINT);
> 
>Otherwise the stack frame may not get created before the call.
> 
> 
> The asm() blob that calls vmx_vmenter() uses ASM_CALL_CONSTRAINT, and
> the resulting asm output generates a frame pointer, e.g. this is from
> the vmx.o that objtool warns on:
> 
> Dump of assembler code for function vmx_vcpu_run:
>0x7440 <+0>: e8 00 00 00 00  callq  0x7445 
>0x7445 <+5>: 55  push   %rbp
>0x7446 <+6>: 48 89 e5mov%rsp,%rbp
> 
> 
> The warning only shows up in certain configs, e.g. I was only able to
> reproduce this using the .config provided by lkp.  Even explicitly
> enabling CONFIG_FRAME_POINTERS and CONFIG_STACK_VALIDATION didn't
> trigger the warning using my usual config.
> 
> And all that being said, I'm pretty sure this isn't related to the call
> to vmx_vmenter() at all, but rather is something that was exposed by
> removing __noclone from vmx_vcpu_run().
> 
> E.g. I still get the warning if I comment out the call to vmx_vmenter,
> it just shifts to something else (and continues to shift I comment out
> more calls).  The warning goes away if I re-add __noclone, regardless
> of whether or not commit 2bcbd406715d ("Revert "compiler-gcc: disable
> -ftracer for __noclone functions"") is applied.

It complained about this call near the end of vmx_vcpu_run(),

"callq 19eb6" in __read_once_size() via atomic_read()

which then jumps back into vmx_vcpu_run() again.

/root/linux-debug/arch/x86/kvm/vmx/vmx.c:6650
}
   19e94:   e8 00 00 00 00  callq  19e99 


__read_once_size():
/root/linux-debug/./include/linux/compiler.h:191
   19e99:   48 c7 c7 00 00 00 00mov$0x0,%rdi
   19ea0:   e8 00 00 00 00  callq  19ea5 

   19ea5:   e9 8b dd ff ff  jmpq   17c35 

   19eaa:   48 8b bd 48 ff ff ffmov-0xb8(%rbp),%rdi
   19eb1:   e8 00 00 00 00  callq  19eb6 

   19eb6:   e9 b8 df ff ff  jmpq   17e73 


vmx_vcpu_run():
/root/linux-debug/arch/x86/kvm/vmx/vmx.c:6621
vcpu->arch.regs_dirty = 0;
   19ebb:   48 89 f7mov%rsi,%rdi
   19ebe:   e8 00 00 00 00  callq  19ec3 

   19ec3:   e9 f1 e0 ff ff  jmpq   17fb9 



Re: [PATCH] kvm: add proper frame pointer logic for vmx

2019-01-15 Thread Qian Cai



On 1/15/19 2:13 AM, Paolo Bonzini wrote:
> Hmm, maybe like this:
> 
> diff --git a/arch/x86/kvm/vmx/vmenter.S b/arch/x86/kvm/vmx/vmenter.S
> index bcef2c7e9bc4..33122fa9d4bd 100644
> --- a/arch/x86/kvm/vmx/vmenter.S
> +++ b/arch/x86/kvm/vmx/vmenter.S
> @@ -26,19 +26,17 @@ ENTRY(vmx_vmenter)
>   ret
> 
>  2:   vmlaunch
> +3:
>   ret
> 
> -3:   cmpb $0, kvm_rebooting
> - jne 4f
> - call kvm_spurious_fault
> -4:   ret
> -
>   .pushsection .fixup, "ax"
> -5:   jmp 3b
> +4:   cmpb $0, kvm_rebooting
> + jne 3b
> + jmp kvm_spurious_fault
>   .popsection
> 
> - _ASM_EXTABLE(1b, 5b)
> - _ASM_EXTABLE(2b, 5b)
> + _ASM_EXTABLE(1b, 4b)
> + _ASM_EXTABLE(2b, 4b)
> 
>  ENDPROC(vmx_vmenter)

No, that will not work. The problem is in vmx.o where I just sent another patch
for it.

I can see there are five options to solve it.

1) always inline vmx_vcpu_run()
2) always noinline vmx_vcpu_run()
3) add a -fdisable-ipa-fnsplit option to the Makefile for vmx.o
4) let STACK_FRAME_NON_STANDARD support the part.* syntax.
5) trim down vmx_vcpu_run() even more so GCC does not split it.

Options 1) and 2) seem to give away the decision for users with
CONFIG_CC_OPTIMIZE_FOR_(PERFORMANCE/SIZE).

Option 3) prevents the other functions there from being split for optimization.

Options 4) and 5) seem tricky to implement.

I am now leaning more toward 3), as the only other function that will miss
splitting is vmx_segment_access_rights().


[PATCH] kvm: always inline vmx_vcpu_run()

2019-01-15 Thread Qian Cai
__noclone was removed from vmx_vcpu_run() in 453eafbe65f, which causes a
compilation warning with GCC 8,

arch/x86/kvm/vmx/vmx.o: warning: objtool: vmx_vcpu_run.part.17()+0x3171:
call without frame pointer save/setup

GCC decides to split the function into small parts for optimization,
which confuses STACK_FRAME_NON_STANDARD(vmx_vcpu_run). Hence, just
inline the whole function to avoid this.

Fixes: 453eafbe65f (KVM: VMX: Move VM-Enter + VM-Exit handling to
non-inline sub-routines)
Signed-off-by: Qian Cai 
---
 arch/x86/kvm/vmx/vmx.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index f6915f10e584..d38b275bdcc3 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -6362,7 +6362,7 @@ static void vmx_update_hv_timer(struct kvm_vcpu *vcpu)
vmx->loaded_vmcs->hv_timer_armed = false;
 }
 
-static void vmx_vcpu_run(struct kvm_vcpu *vcpu)
+static __always_inline void vmx_vcpu_run(struct kvm_vcpu *vcpu)
 {
struct vcpu_vmx *vmx = to_vmx(vcpu);
unsigned long cr3, cr4, evmcs_rsp;
-- 
2.17.2 (Apple Git-113)



Re: [PATCH] kvm: add proper frame pointer logic for vmx

2019-01-14 Thread Qian Cai



On 1/15/19 1:44 AM, Qian Cai wrote:
> compilation warning since v5.0-rc1,
> 
> arch/x86/kvm/vmx/vmx.o: warning: objtool: vmx_vcpu_run.part.17()+0x3171:
> call without frame pointer save/setup
> 
> Fixes: 453eafbe65f (KVM: VMX: Move VM-Enter + VM-Exit handling to
> non-inline sub-routines)

Oops, wrong fix. Back to square one.


[PATCH] kvm: add proper frame pointer logic for vmx

2019-01-14 Thread Qian Cai
compilation warning since v5.0-rc1,

arch/x86/kvm/vmx/vmx.o: warning: objtool: vmx_vcpu_run.part.17()+0x3171:
call without frame pointer save/setup

Fixes: 453eafbe65f (KVM: VMX: Move VM-Enter + VM-Exit handling to
non-inline sub-routines)
Signed-off-by: Qian Cai 
---
 arch/x86/kvm/vmx/vmenter.S | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/vmx/vmenter.S b/arch/x86/kvm/vmx/vmenter.S
index bcef2c7e9bc4..874dd3030dee 100644
--- a/arch/x86/kvm/vmx/vmenter.S
+++ b/arch/x86/kvm/vmx/vmenter.S
@@ -1,6 +1,7 @@
 /* SPDX-License-Identifier: GPL-2.0 */
 #include 
 #include 
+#include 
 
.text
 
@@ -20,18 +21,22 @@
  */
 ENTRY(vmx_vmenter)
/* EFLAGS.ZF is set if VMCS.LAUNCHED == 0 */
+   FRAME_BEGIN
je 2f
 
 1: vmresume
+   FRAME_END
ret
 
 2: vmlaunch
+   FRAME_END
ret
 
 3: cmpb $0, kvm_rebooting
jne 4f
call kvm_spurious_fault
-4: ret
+4: FRAME_END
+   ret
 
.pushsection .fixup, "ax"
 5: jmp 3b
-- 
2.17.2 (Apple Git-113)



Re: [PATCH v2] rbtree: fix the red root

2019-01-14 Thread Qian Cai


> [  114.913404] Padding 6913c65d: 00 00 00 00 00 00 00 00 00 00 00 00 
> 00 00 00 00  
> [  114.915437] Padding 2d53f25c: 00 00 00 00 00 00 00 00 00 00 00 00 
> 00 00 00 00  
> [  114.917390] Padding 78f7d621: 00 00 00 00 00 00 00 00 00 00 00 00 
> 00 00 00 00  
> [  114.919402] Padding 63547658: 00 00 00 00 00 00 00 00 00 00 00 00 
> 00 00 00 00  
> [  114.921414] Padding 1a301f4e: 00 00 00 00 00 00 00 00 00 00 00 00 
> 00 00 00 00  
> [  114.923364] Padding 46589d24: 00 00 00 00 00 00 00 00 00 00 00 00 
> 00 00 00 00  
> [  114.925340] Padding 08fb13da: 00 00 00 00 00 00 00 00 00 00 00 00 
> 00 00 00 00  
> [  114.927291] Padding ae5cc298: 00 00 00 00 00 00 00 00 00 00 00 00 
> 00 00 00 00  
> [  114.929239] Padding d49cc239: 00 00 00 00 00 00 00 00 00 00 00 00 
> 00 00 00 00  
> [  114.931177] Padding d66ad6f5: 00 00 00 00 00 00 00 00 00 00 00 00 
> 00 00 00 00  
> [  114.933110] Padding 069ad671: 00 00 00 00 00 00 00 00 00 00 00 00 
> 00 00 00 00  
> [  114.934986] Padding ffaf648c: 00 00 00 00 00 00 00 00 00 00 00 00 
> 00 00 00 00  
> [  114.936895] Padding c96d1b58: 00 00 00 00 00 00 00 00 00 00 00 00 
> 00 00 00 00  
> [  114.938848] Padding 768e4920: 00 00 00 00 00 00 00 00 00 00 00 00 
> 00 00 00 00  
> [  114.940965] Padding 0d06b43c: 00 00 00 00 00 00 00 00 00 00 00 00 
> 00 00 00 00  
> [  114.942890] Padding af5ae9fa: 00 00 00 00 00 00 00 00 00 00 00 00 
> 00 00 00 00  
> [  114.944790] Padding 6b526f1e: 00 00 00 00 00 00 00 00 00 00 00 00 
> 00 00 00 00  
> [  114.946727] Padding 9c8dffe3: 00 00 00 00 00 00 00 00 00 00 00 00 
> 00 00 00 00  

Another testing angle,

It might be that something is doing __GFP_ZERO and overwriting those slabs.
That would also match the red-root corruption, since RB_RED is 0 and the
color is kept in bit 0 of __rb_parent_color.

One thing to try is to enable page_owner=on on the command line, and then
obtain sorted_page_owner.txt before and after running the reproducer.

$ cd tools/vm
$ make page_owner_sort

$ cat /sys/kernel/debug/page_owner > page_owner_full.txt
$ grep -v ^PFN page_owner_full.txt > page_owner.txt
$ ./page_owner_sort page_owner.txt sorted_page_owner.txt


[PATCH v2] page_poison: play nicely with KASAN

2019-01-14 Thread Qian Cai
KASAN does not play well with the page poisoning
(CONFIG_PAGE_POISONING). It triggers false positives in the allocation
path,

BUG: KASAN: use-after-free in memchr_inv+0x2ea/0x330
Read of size 8 at addr 1f80 by task swapper/0
CPU: 0 PID: 0 Comm: swapper Not tainted 5.0.0-rc1+ #54
Call Trace:
 dump_stack+0xe0/0x19a
 print_address_description.cold.2+0x9/0x28b
 kasan_report.cold.3+0x7a/0xb5
 __asan_report_load8_noabort+0x19/0x20
 memchr_inv+0x2ea/0x330
 kernel_poison_pages+0x103/0x3d5
 get_page_from_freelist+0x15e7/0x4d90

because KASAN has not yet unpoisoned the shadow for the allocation when
memchr_inv() runs, so the check only finds a stale poison pattern.

Also, false positives in the free path,

BUG: KASAN: slab-out-of-bounds in kernel_poison_pages+0x29e/0x3d5
Write of size 4096 at addr 112cc000 by task swapper/0/1
CPU: 5 PID: 1 Comm: swapper/0 Not tainted 5.0.0-rc1+ #55
Call Trace:
 dump_stack+0xe0/0x19a
 print_address_description.cold.2+0x9/0x28b
 kasan_report.cold.3+0x7a/0xb5
 check_memory_region+0x22d/0x250
 memset+0x28/0x40
 kernel_poison_pages+0x29e/0x3d5
 __free_pages_ok+0x75f/0x13e0

because KASAN adds poisoned redzones around slab objects, while page
poisoning needs to poison the whole page.

Signed-off-by: Qian Cai 
---

v2: use kasan_disable/enable_current() instead.

 mm/page_alloc.c  | 2 +-
 mm/page_poison.c | 4 
 2 files changed, 5 insertions(+), 1 deletion(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d295c9bc01a8..906250a9b89c 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1945,8 +1945,8 @@ inline void post_alloc_hook(struct page *page, unsigned 
int order,
 
arch_alloc_page(page, order);
kernel_map_pages(page, 1 << order, 1);
-   kernel_poison_pages(page, 1 << order, 1);
kasan_alloc_pages(page, order);
+   kernel_poison_pages(page, 1 << order, 1);
set_page_owner(page, order, gfp_flags);
 }
 
diff --git a/mm/page_poison.c b/mm/page_poison.c
index f0c15e9017c0..21d4f97cb49b 100644
--- a/mm/page_poison.c
+++ b/mm/page_poison.c
@@ -6,6 +6,7 @@
 #include 
 #include 
 #include 
+#include 
 
 static bool want_page_poisoning __read_mostly;
 
@@ -40,7 +41,10 @@ static void poison_page(struct page *page)
 {
void *addr = kmap_atomic(page);
 
+   /* KASAN still thinks the page is in-use, so skip it. */
+   kasan_disable_current();
memset(addr, PAGE_POISON, PAGE_SIZE);
+   kasan_enable_current();
kunmap_atomic(addr);
 }
 
-- 
2.17.2 (Apple Git-113)



Re: [PATCH v2] rbtree: fix the red root

2019-01-14 Thread Qian Cai
Unfortunately, I could not trigger any of those here, both on bare metal and
in virtual machines. All I triggered were hung tasks and soft lockups due to
the fork bomb.

The only other thing I can think of is to set up kdump to capture a vmcore
when either the GPF or BUG() happens, and then share the vmcore somewhere, so
I can poke around to see what the memory corruption looks like.


Re: [PATCH v2] rbtree: fix the red root

2019-01-14 Thread Qian Cai



On 1/14/19 1:23 AM, Esme wrote:
> I did not yet verify the previous branches but did tune out kmemleak 
> (CONFIG_DEBUG_MEMLEAK no longer set) as it seemed a bit obtrusive in this 
> matter, this is what I see now (note redzone?).
> /Esme
> 
>   114.826116] 
> =
> [  114.828121] BUG kmalloc-64 (Tainted: GW): Padding 
> overwritten. 0x6913c65d-0x6e410492
> [  114.830551] 
> -
> [  114.830551]
> [  114.832755] INFO: Slab 0x54f47c55 objects=19 used=19 fp=0x 
>  (null) flags=0x1fffc010200
> [  114.835063] CPU: 0 PID: 6310 Comm: x Tainted: GB   W 5.0.0-rc2 
> #15
> [  114.836829] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
> 1.11.1-1ubuntu1 04/01/2014
> [  114.838847] Call Trace:
> [  114.839497]  dump_stack+0x1d8/0x2c6
> [  114.840274]  ? dump_stack_print_info.cold.1+0x20/0x20
> [  114.841402]  slab_err+0xab/0xcf
> [  114.842103]  ? __asan_report_load1_noabort+0x14/0x20
> [  114.843244]  ? memchr_inv+0x2c1/0x330
> [  114.844059]  slab_pad_check.part.50.cold.87+0x27/0x81

Confused again. slab_pad_check() looks like it only exists with SLUB, but you
said you used SLAB. What else did you change?


Re: [PATCH v2] rbtree: fix the red root

2019-01-13 Thread Qian Cai



On 1/13/19 9:20 PM, David Lechner wrote:
> On 1/11/19 8:58 PM, Michel Lespinasse wrote:
>> On Fri, Jan 11, 2019 at 3:47 PM David Lechner  wrote:
>>>
>>> On 1/11/19 2:58 PM, Qian Cai wrote:
>>>> A GPF was reported,
>>>>
>>>> kasan: CONFIG_KASAN_INLINE enabled
>>>> kasan: GPF could be caused by NULL-ptr deref or user memory access
>>>> general protection fault:  [#1] SMP KASAN
>>>>   kasan_die_handler.cold.22+0x11/0x31
>>>>   notifier_call_chain+0x17b/0x390
>>>>   atomic_notifier_call_chain+0xa7/0x1b0
>>>>   notify_die+0x1be/0x2e0
>>>>   do_general_protection+0x13e/0x330
>>>>   general_protection+0x1e/0x30
>>>>   rb_insert_color+0x189/0x1480
>>>>   create_object+0x785/0xca0
>>>>   kmemleak_alloc+0x2f/0x50
>>>>   kmem_cache_alloc+0x1b9/0x3c0
>>>>   getname_flags+0xdb/0x5d0
>>>>   getname+0x1e/0x20
>>>>   do_sys_open+0x3a1/0x7d0
>>>>   __x64_sys_open+0x7e/0xc0
>>>>   do_syscall_64+0x1b3/0x820
>>>>   entry_SYSCALL_64_after_hwframe+0x49/0xbe
>>>>
>>>> It turned out,
>>>>
>>>> gparent = rb_red_parent(parent);
>>>> tmp = gparent->rb_right; <-- GPF was triggered here.
>>>>
>>>> Apparently, "gparent" is NULL, which indicates "parent" is the rbtree's
>>>> root and it is red. Otherwise, it would have been handled properly a few
>>>> lines above.
>>>>
>>>> /*
>>>>    * If there is a black parent, we are done.
>>>>    * Otherwise, take some corrective action as,
>>>>    * per 4), we don't want a red root or two
>>>>    * consecutive red nodes.
>>>>    */
>>>> if(rb_is_black(parent))
>>>>    break;
>>>>
>>>> Hence, it violates rule #1 (the root can't be red) and needs a fix-up,
>>>> so correct it and also add a regression test. This looks like it was
>>>> introduced by 6d58452dc06, which no longer always paints the root black.
>>>>
>>>> Fixes: 6d58452dc06 (rbtree: adjust root color in rb_insert_color() only
>>>> when necessary)
>>>> Reported-by: Esme 
>>>> Tested-by: Joey Pabalinas 
>>>> Signed-off-by: Qian Cai 
>>>> ---
>>>
>>> Tested-by: David Lechner 
>>> FWIW, this fixed the following crash for me:
>>>
>>> Unable to handle kernel NULL pointer dereference at virtual address 0004
>>
>> Just to clarify, do you have a way to reproduce this crash without the fix ?
> 
> I am starting to suspect that my crash was caused by some new code
> in the drm-misc-next tree that might be causing a memory corruption.
> It threw me off that the stack trace didn't contain anything related
> to drm.
> 
> See: https://patchwork.freedesktop.org/patch/276719/
>

It may be useful for those who could reproduce this issue to turn on those
memory corruption debug options to narrow down a bit.

CONFIG_DEBUG_PAGEALLOC=y
CONFIG_DEBUG_PAGEALLOC_ENABLE_DEFAULT=y
CONFIG_KASAN=y
CONFIG_KASAN_GENERIC=y
CONFIG_SLUB_DEBUG_ON=y


Re: [PATCH v2] rbtree: fix the red root

2019-01-11 Thread Qian Cai



On 1/11/19 6:16 PM, Matthew Wilcox wrote:
> On Fri, Jan 11, 2019 at 03:58:43PM -0500, Qian Cai wrote:
>> diff --git a/lib/rbtree_test.c b/lib/rbtree_test.c
>> index b7055b2a07d3..afad0213a117 100644
>> --- a/lib/rbtree_test.c
>> +++ b/lib/rbtree_test.c
>> @@ -345,6 +345,17 @@ static int __init rbtree_test_init(void)
>>  check(0);
>>  }
>>  
>> +/*
>> + * a little regression test to catch a bug may be introduced by
>> + * 6d58452dc06 (rbtree: adjust root color in rb_insert_color() only when
>> + * necessary)
>> + */
>> +insert(nodes, );
>> +nodes->rb.__rb_parent_color = RB_RED;
>> +insert(nodes + 1, );
>> +erase(nodes + 1, );
>> +erase(nodes, );
> 
> That's not a fair test!  You're poking around in the data structure to
> create the situation.  This test would have failed before 6d58452dc06 too.
> How do we create a tree that has a red parent at root, only using insert()
> and erase()?
> 

If only I knew how to reproduce this myself, I might be able to figure out how
it ends up with the red root in the first place.


[PATCH v2] rbtree: fix the red root

2019-01-11 Thread Qian Cai
A GPF was reported,

kasan: CONFIG_KASAN_INLINE enabled
kasan: GPF could be caused by NULL-ptr deref or user memory access
general protection fault:  [#1] SMP KASAN
kasan_die_handler.cold.22+0x11/0x31
notifier_call_chain+0x17b/0x390
atomic_notifier_call_chain+0xa7/0x1b0
notify_die+0x1be/0x2e0
do_general_protection+0x13e/0x330
general_protection+0x1e/0x30
rb_insert_color+0x189/0x1480
create_object+0x785/0xca0
kmemleak_alloc+0x2f/0x50
kmem_cache_alloc+0x1b9/0x3c0
getname_flags+0xdb/0x5d0
getname+0x1e/0x20
do_sys_open+0x3a1/0x7d0
__x64_sys_open+0x7e/0xc0
do_syscall_64+0x1b3/0x820
entry_SYSCALL_64_after_hwframe+0x49/0xbe

It turned out,

gparent = rb_red_parent(parent);
tmp = gparent->rb_right; <-- GPF was triggered here.

Apparently, "gparent" is NULL which indicates "parent" is rbtree's root
which is red. Otherwise, it will be treated properly a few lines above.

/*
 * If there is a black parent, we are done.
 * Otherwise, take some corrective action as,
 * per 4), we don't want a red root or two
 * consecutive red nodes.
 */
if(rb_is_black(parent))
break;

Hence, it violates rule #1 (the root can't be red) and needs a fix-up, so
correct it and also add a regression test. This looks like it was introduced
by 6d58452dc06, which no longer always paints the root black.

Fixes: 6d58452dc06 (rbtree: adjust root color in rb_insert_color() only
when necessary)
Reported-by: Esme 
Tested-by: Joey Pabalinas 
Signed-off-by: Qian Cai 
---

v2: add a regression test.

 lib/rbtree.c  |  7 +++
 lib/rbtree_test.c | 11 +++
 2 files changed, 18 insertions(+)

diff --git a/lib/rbtree.c b/lib/rbtree.c
index d3ff682fd4b8..acc969ad8de9 100644
--- a/lib/rbtree.c
+++ b/lib/rbtree.c
@@ -127,6 +127,13 @@ __rb_insert(struct rb_node *node, struct rb_root *root,
break;
 
gparent = rb_red_parent(parent);
+   if (unlikely(!gparent)) {
+   /*
+* The root is red so correct it.
+*/
+   rb_set_parent_color(parent, NULL, RB_BLACK);
+   break;
+   }
 
tmp = gparent->rb_right;
if (parent != tmp) {/* parent == gparent->rb_left */
diff --git a/lib/rbtree_test.c b/lib/rbtree_test.c
index b7055b2a07d3..afad0213a117 100644
--- a/lib/rbtree_test.c
+++ b/lib/rbtree_test.c
@@ -345,6 +345,17 @@ static int __init rbtree_test_init(void)
check(0);
}
 
+   /*
+* a little regression test to catch a bug that may be introduced by
+* 6d58452dc06 (rbtree: adjust root color in rb_insert_color() only when
+* necessary)
+*/
+   insert(nodes, );
+   nodes->rb.__rb_parent_color = RB_RED;
+   insert(nodes + 1, );
+   erase(nodes + 1, );
+   erase(nodes, );
+
printk(KERN_ALERT "augmented rbtree testing");
 
init();
-- 
2.17.2 (Apple Git-113)



Re: [PATCH] rbtree: fix the red root

2019-01-11 Thread Qian Cai
On Fri, 2019-01-11 at 09:31 -0800, Matthew Wilcox wrote:
> On Fri, Jan 11, 2019 at 11:51:45AM -0500, Qian Cai wrote:
> > Reported-by: Esme 
> > Signed-off-by: Qian Cai 
> 
> What change introduced this bug?  We need a Fixes: line so the stable
> people know how far to backport this fix.

It looks like,

Fixes: 6d58452dc06 (rbtree: adjust root color in rb_insert_color() only when
necessary)

where the root is no longer always painted black.

Also, copying this fix for the original author and reviewer.

https://lore.kernel.org/lkml/2019065145.23628-1-...@lca.pw/


[PATCH] rbtree: fix the red root

2019-01-11 Thread Qian Cai
A GPF was reported,

kasan: CONFIG_KASAN_INLINE enabled
kasan: GPF could be caused by NULL-ptr deref or user memory access
general protection fault:  [#1] SMP KASAN
kasan_die_handler.cold.22+0x11/0x31
notifier_call_chain+0x17b/0x390
atomic_notifier_call_chain+0xa7/0x1b0
notify_die+0x1be/0x2e0
do_general_protection+0x13e/0x330
general_protection+0x1e/0x30
rb_insert_color+0x189/0x1480
create_object+0x785/0xca0
kmemleak_alloc+0x2f/0x50
kmem_cache_alloc+0x1b9/0x3c0
getname_flags+0xdb/0x5d0
getname+0x1e/0x20
do_sys_open+0x3a1/0x7d0
__x64_sys_open+0x7e/0xc0
do_syscall_64+0x1b3/0x820
entry_SYSCALL_64_after_hwframe+0x49/0xbe

It turned out,

gparent = rb_red_parent(parent);
tmp = gparent->rb_right; <-- GPF triggered here.

Apparently, "gparent" is NULL which indicates "parent" is rbtree's root
which is red. Otherwise, it will be treated properly a few lines above.

/*
 * If there is a black parent, we are done.
 * Otherwise, take some corrective action as,
 * per 4), we don't want a red root or two
 * consecutive red nodes.
 */
if(rb_is_black(parent))
break;

Hence, it violates rule #1 (the root can't be red) and needs a fix-up.

Reported-by: Esme 
Signed-off-by: Qian Cai 
---
 lib/rbtree.c | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/lib/rbtree.c b/lib/rbtree.c
index d3ff682fd4b8..acc969ad8de9 100644
--- a/lib/rbtree.c
+++ b/lib/rbtree.c
@@ -127,6 +127,13 @@ __rb_insert(struct rb_node *node, struct rb_root *root,
break;
 
gparent = rb_red_parent(parent);
+   if (unlikely(!gparent)) {
+   /*
+* The root is red so correct it.
+*/
+   rb_set_parent_color(parent, NULL, RB_BLACK);
+   break;
+   }
 
tmp = gparent->rb_right;
if (parent != tmp) {/* parent == gparent->rb_left */
-- 
2.17.2 (Apple Git-113)



Re: kernel BUG at kernel/sched/core.c:3490!

2019-01-11 Thread Qian Cai
On Fri, 2019-01-11 at 16:07 +0530, Kohli, Gaurav wrote:
> 
> On 1/7/2019 11:26 PM, Oleg Nesterov wrote:
> > pr_crit("XXX: %ld %d\n", current->state, current->on_rq);
> 
> Can we also add flags, this may help to know the path of problem:
> 
>   pr_crit("XXX: %ld %d 0x%x\n", current->state, current->on_rq,    
> current->flags);
> 

XXX: 0 1 0x40844c


Re: PROBLEM: syzkaller found / pool corruption-overwrite / page in user-area or NULL

2019-01-10 Thread Qian Cai



On 1/10/19 10:15 PM, Esme wrote:
>>> [ 75.793150] RIP: 0010:rb_insert_color+0x189/0x1480
>>
>> What's in that line? Try,
>>
>> $ ./scripts/faddr2line vmlinux rb_insert_color+0x189/0x1480
> 
> rb_insert_color+0x189/0x1480:
> __rb_insert at /home/files/git/linux/lib/rbtree.c:131
> (inlined by) rb_insert_color at /home/files/git/linux/lib/rbtree.c:452
> 

gparent = rb_red_parent(parent);

tmp = gparent->rb_right; <-- GPF triggered here.

It suggests gparent is NULL. It looks like a check is missing there for the case
where parent is the top node.

>>
>> What's steps to reproduce this?
> 
> The steps is the kernel config provided (proc.config) and I double checked 
> the attached C code from the qemu image (attached here).  If the kernel does 
> not immediately crash, a ^C will cause the fault to be noticed.  The report 
> from earlier is the report from the same code, my assumption was that the 
> possible pool/redzone corruption is making it a bit tricky to pin down.
> 
> If you would like alternative kernel settings please let me know, I can do 
> that, also, my current test-bench has about 256 core's on x64, 64 of them are 
> bare metal and 32 are arm64.  Any possible preferred configuration tweaks I'm 
> all ears, I'll be including some of these steps you suggested to me in 
> any/additional upcoming threads (Thank you for that so far and future 
> suggestions).
> 
> Also, there is some occasionally varying stacks depending on the corruption, 
> so this stack just now (another execution of test3.c);

I am unable to reproduce any of those here. What is the output of
/proc/cmdline in your guest when this happens?


Re: PROBLEM: syzkaller found / pool corruption-overwrite / page in user-area or NULL

2019-01-10 Thread Qian Cai



On 1/10/19 5:58 PM, Esme wrote:
> The console debug/stacks/info from just now.  The previous config, current 
> kernel from github.
> --
> Esme
> 
> [   75.783231] kasan: CONFIG_KASAN_INLINE enabled
> [   75.785870] kasan: GPF could be caused by NULL-ptr deref or user memory 
> access
> [   75.787695] general protection fault:  [#1] SMP KASAN
> [   75.789084] CPU: 0 PID: 3434 Comm: systemd-journal Not tainted 5.0.0-rc1+ 
> #5
> [   75.790938] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
> 1.11.1-1ubuntu1 04/01/2014
> [   75.793150] RIP: 0010:rb_insert_color+0x189/0x1480

What's in that line? Try,

$ ./scripts/faddr2line vmlinux rb_insert_color+0x189/0x1480

What's steps to reproduce this?


Re: PROBLEM: syzkaller found / pool corruption-overwrite / page in user-area or NULL

2019-01-10 Thread Qian Cai
On Thu, 2019-01-10 at 21:35 +, Esme wrote:
> The repro.report is from a different test system, I pulled the attached config
> from proc (attached);
> 

So, if that report is not the right one, where is the right crash stack trace
that uses the exact same config?


Re: PROBLEM: syzkaller found / pool corruption-overwrite / page in user-area or NULL

2019-01-10 Thread Qian Cai
On Thu, 2019-01-10 at 20:47 +, Esme wrote:
> Sure thing;
> 
> cmdline;
> qemu-system-x86_64 -kernel linux//arch/x86/boot/bzImage -append console=ttyS0
> root=/dev/sda debug earlyprintk=serial slub_debug=QUZ -hda stretch.img -net
> user,hostfwd=tcp::10021-:22 -net nic -enable-kvm -nographic -m 2G -smp 2
> -pidfile
> 
> CONFIG_PAGE*; (full file attached);
> 
> # CONFIG_DEBUG_PAGEALLOC is not set
> CONFIG_PAGE_POISONING=y
> CONFIG_PAGE_POISONING_NO_SANITY=y
> # CONFIG_PAGE_POISONING_ZERO is not set
> # CONFIG_DEBUG_PAGE_REF is not set
> CONFIG_FAIL_PAGE_ALLOC=y

Confused.

https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1896410.html

It said 5.0.0-rc1+

https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1896410/repro.report

It said 4.20.0+, and it also has,

"general protection fault:  [#1] SMP DEBUG_PAGEALLOC KASAN PTI"

which indicated CONFIG_DEBUG_PAGEALLOC=y but your .config said NO.

However, it looks like a mess where KASAN does not play well with all those
SLUB_DEBUG, CONFIG_DEBUG_PAGEALLOC etc. options, because they essentially step
on each other's toes by redzoning and poisoning pages on allocation and free.


Re: PROBLEM: syzkaller found / pool corruption-overwrite / page in user-area or NULL

2019-01-10 Thread Qian Cai
On Thu, 2019-01-10 at 11:58 -0800, James Bottomley wrote:
> On Thu, 2019-01-10 at 19:12 +, Esme wrote:
> > Sorry for the resend some mail servers rejected the mime type.
> > 
> > Hi, I've been getting more into Kernel stuff lately and forged ahead
> > with some syzkaller bug finding.  I played with reducing it further
> > as you can see from the attached c code but am moving on and hope to
> > get better about this process moving forward as I'm still building
> > out my test systems/debugging tools.
> > 
> > Attached is the report and C repro that still triggers on a fresh git
> > pull as of a few minutes ago, if you need anything else please let me
> > know.
> > Esme
> > 
> > Linux syzkaller 5.0.0-rc1+ #5 SMP Tue Jan 8 20:39:33 EST 2019 x86_64
> > GNU/Linux
> 
> I'm not sure I'm reading this right, but it seems that a simple
> allocation inside block/scsi_ioctl.h
> 
>   buffer = kzalloc(bytes, q->bounce_gfp | GFP_USER| __GFP_NOWARN);
> 
> (where bytes is < 4k) caused a slub padding check failure on free. 
> From the internal details, the freeing entity seems to be KASAN as part
> of its quarantine reduction (albeit triggered by this kzalloc).  I'm
> not remotely familiar with what KASAN is doing, but it seems the memory
> corruption problem is somewhere within the KASAN tracking?
> 
> I added linux-mm in case they can confirm this diagnosis or give me a
> pointer to what might be wrong in scsi.
> 

Well, we need your .config and /proc/cmdline then.


Re: PROBLEM: syzkaller found / pool corruption-overwrite / page in user-area or NULL

2019-01-10 Thread Qian Cai
On Thu, 2019-01-10 at 11:58 -0800, James Bottomley wrote:
> On Thu, 2019-01-10 at 19:12 +, Esme wrote:
> > Sorry for the resend some mail servers rejected the mime type.
> > 
> > Hi, I've been getting more into Kernel stuff lately and forged ahead
> > with some syzkaller bug finding.  I played with reducing it further
> > as you can see from the attached c code but am moving on and hope to
> > get better about this process moving forward as I'm still building
> > out my test systems/debugging tools.
> > 
> > Attached is the report and C repro that still triggers on a fresh git
> > pull as of a few minutes ago, if you need anything else please let me
> > know.
> > Esme
> > 
> > Linux syzkaller 5.0.0-rc1+ #5 SMP Tue Jan 8 20:39:33 EST 2019 x86_64
> > GNU/Linux
> 
> I'm not sure I'm reading this right, but it seems that a simple
> allocation inside block/scsi_ioctl.h
> 
>   buffer = kzalloc(bytes, q->bounce_gfp | GFP_USER| __GFP_NOWARN);
> 
> (where bytes is < 4k) caused a slub padding check failure on free. 
> From the internal details, the freeing entity seems to be KASAN as part
> of its quarantine reduction (albeit triggered by this kzalloc).  I'm
> not remotely familiar with what KASAN is doing, but it seems the memory
> corruption problem is somewhere within the KASAN tracking?
> 
> I added linux-mm in case they can confirm this diagnosis or give me a
> pointer to what might be wrong in scsi.
> 

Did you enable page_poison with PAGE_POISONING_ZERO=y ?


Re: [RESEND PATCH] x86_64: increase stack size for KASAN_EXTRA

2019-01-09 Thread Qian Cai



On 1/9/19 5:02 PM, Andrew Morton wrote:
>> --- a/arch/x86/include/asm/page_64_types.h
>> +++ b/arch/x86/include/asm/page_64_types.h
>> @@ -7,7 +7,11 @@
>>  #endif
>>  
>>  #ifdef CONFIG_KASAN
>> +#ifdef CONFIG_KASAN_EXTRA
>> +#define KASAN_STACK_ORDER 2
>> +#else
>>  #define KASAN_STACK_ORDER 1
>> +#endif
>>  #else
>>  #define KASAN_STACK_ORDER 0
>>  #endif
> 
> I'm suspecting this logic could be performed in Kconfig, for all
> architectures.  Add a new always-defined CONFIG_KASAN_STACK_ORDER?
> 

Sounds doable, but KASAN only supports x86_64, arm64, s390 and xtensa so far. I
am not sure I care about the latter two, nor do I have any way to test them if
they suffer from the same problem.

I'll keep looking as more arches start to implement KASAN; that might be a
good time to reduce all this code duplication.


[RESEND PATCH] x86_64: increase stack size for KASAN_EXTRA

2019-01-09 Thread Qian Cai
If the kernel is configured with KASAN_EXTRA, the stack size increases
significantly because enabling this option sets "-fstack-reuse" to
"none" in GCC [1]. As a result, stack overruns can be triggered quite
often with the 32k stack size when compiled using GCC 8. For example,
this reproducer

https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/\
syscalls/madvise/madvise06.c

could trigger a "corrupted stack end detected inside scheduler" very
reliably with CONFIG_SCHED_STACK_END_CHECK enabled.

There are just too many functions that can have a large stack with
KASAN_EXTRA, due to large local variables, and that are called over and
over again without being able to reuse their stack slots. Some noticeable
ones are,

size
7648 shrink_page_list
3584 xfs_rmap_convert
3312 migrate_page_move_mapping
3312 dev_ethtool
3200 migrate_misplaced_transhuge_page
3168 copy_process

There are another 49 functions over 2k in size when compiling the kernel
with "-Wframe-larger-than=" even with a relatively minimal config on this
machine. Hence, it is too much work to change the Makefile for each object
to compile without "-fsanitize-address-use-after-scope" individually.

[1] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81715#c23

Although there is a patch in GCC 9 to help the situation, GCC 9 probably
won't be released for a few more months, and it will probably take another
six months to a year for all major distros to ship it by default. Hence,
the stack usage with KASAN_EXTRA can be revisited in 2020 when GCC 9 is
everywhere. Until then, this patch will help users avoid stack overruns.

This has already been fixed for arm64 for the same reason via
6e8830674ea (arm64: kasan: Increase stack size for KASAN_EXTRA).

Signed-off-by: Qian Cai 
---
 arch/x86/include/asm/page_64_types.h | 4 
 1 file changed, 4 insertions(+)

diff --git a/arch/x86/include/asm/page_64_types.h 
b/arch/x86/include/asm/page_64_types.h
index 8f657286d599..0ce558a8150d 100644
--- a/arch/x86/include/asm/page_64_types.h
+++ b/arch/x86/include/asm/page_64_types.h
@@ -7,7 +7,11 @@
 #endif
 
 #ifdef CONFIG_KASAN
+#ifdef CONFIG_KASAN_EXTRA
+#define KASAN_STACK_ORDER 2
+#else
 #define KASAN_STACK_ORDER 1
+#endif
 #else
 #define KASAN_STACK_ORDER 0
 #endif
-- 
2.17.2 (Apple Git-113)



Re: [PATCH] xfs: silence lockdep false positives when freezing

2019-01-09 Thread Qian Cai
On Thu, 2019-01-10 at 08:01 +1100, Dave Chinner wrote:
> On Wed, Jan 09, 2019 at 03:53:29PM -0500, Qian Cai wrote:
> > Easy to reproduce:
> > 
> > 1. run LTP oom02 workload to let kswapd acquire this locking order:
> >    fs_reclaim -> sb_internal.
> > 
> >  # grep -i fs_reclaim -C 3 /proc/lockdep_chains | grep -C 5 sb_internal
> > [826b9172] >s_umount_key#27
> > [5fa8b2ac] sb_pagefaults
> > [33f1247e] sb_internal
> > [9e9a9664] fs_reclaim
> > 
> > 2. freeze XFS.
> >   # fsfreeze -f /home
> > 
> > Dave mentioned that this is due to a lockdep limitation - "IOWs, this is
> > a false positive, caused by the fact that xfs_trans_alloc() is called
> > from both above and below memory reclaim as well as within /every level/
> > of freeze processing. Lockdep is unable to describe the staged flush
> > logic in the freeze process that prevents deadlocks from occurring, and
> > hence we will pretty much always see false positives in the freeze
> > path". Hence, just temporarily disable lockdep in that path.
> 
> NACK. Turning off lockdep is not a solution, it just prevents
> lockdep from finding and reporting real issues.
> 

Well, it is a trade-off. Lockdep is turned back on right after that path. All
those false positives left unfixed are also going to render lockdep less useful.


[PATCH] xfs: silence lockdep false positives when freezing

2019-01-09 Thread Qian Cai
Easy to reproduce:

1. run LTP oom02 workload to let kswapd acquire this locking order:
   fs_reclaim -> sb_internal.

 # grep -i fs_reclaim -C 3 /proc/lockdep_chains | grep -C 5 sb_internal
[826b9172] >s_umount_key#27
[5fa8b2ac] sb_pagefaults
[33f1247e] sb_internal
[9e9a9664] fs_reclaim

2. freeze XFS.
  # fsfreeze -f /home

Dave mentioned that this is due to a lockdep limitation - "IOWs, this is
a false positive, caused by the fact that xfs_trans_alloc() is called
from both above and below memory reclaim as well as within /every level/
of freeze processing. Lockdep is unable to describe the staged flush
logic in the freeze process that prevents deadlocks from occurring, and
hence we will pretty much always see false positives in the freeze
path". Hence, just temporarily disable lockdep in that path.

==
WARNING: possible circular locking dependency detected
5.0.0-rc1+ #60 Tainted: GW
--
fsfreeze/4346 is trying to acquire lock:
26f1d784 (fs_reclaim){+.+.}, at: fs_reclaim_acquire.part.19+0x5/0x30

but task is already holding lock:
72bfc54b (sb_internal){}, at: percpu_down_write+0xb4/0x650

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #1 (sb_internal){}:
   __lock_acquire+0x728/0x1200
   lock_acquire+0x269/0x5a0
   __sb_start_write+0x17f/0x260
   xfs_trans_alloc+0x62b/0x9d0
   xfs_setfilesize_trans_alloc+0xd4/0x360
   xfs_submit_ioend+0x4af/0xa40
   xfs_vm_writepage+0x10f/0x180
   pageout.isra.2+0xbf2/0x26b0
   shrink_page_list+0x3e16/0xae70
   shrink_inactive_list+0x64f/0x1cd0
   shrink_node_memcg+0x80a/0x2490
   shrink_node+0x33d/0x13e0
   balance_pgdat+0xa8f/0x18b0
   kswapd+0x881/0x1120
   kthread+0x32c/0x3f0
   ret_from_fork+0x27/0x50

-> #0 (fs_reclaim){+.+.}:
   validate_chain.isra.14+0x11af/0x3b50
   __lock_acquire+0x728/0x1200
   lock_acquire+0x269/0x5a0
   fs_reclaim_acquire.part.19+0x29/0x30
   fs_reclaim_acquire+0x19/0x20
   kmem_cache_alloc+0x3e/0x3f0
   kmem_zone_alloc+0x79/0x150
   xfs_trans_alloc+0xfa/0x9d0
   xfs_sync_sb+0x86/0x170
   xfs_log_sbcount+0x10f/0x140
   xfs_quiesce_attr+0x134/0x270
   xfs_fs_freeze+0x4a/0x70
   freeze_super+0x1af/0x290
   do_vfs_ioctl+0xedc/0x16c0
   ksys_ioctl+0x41/0x80
   __x64_sys_ioctl+0x73/0xa9
   do_syscall_64+0x18f/0xd23
   entry_SYSCALL_64_after_hwframe+0x49/0xbe

other info that might help us debug this:

 Possible unsafe locking scenario:

   CPU0CPU1
   
  lock(sb_internal);
   lock(fs_reclaim);
   lock(sb_internal);
  lock(fs_reclaim);

 *** DEADLOCK ***

4 locks held by fsfreeze/4346:
 #0: b478ef56 (sb_writers#8){}, at: percpu_down_write+0xb4/0x650
 #1: 1ec487a9 (>s_umount_key#28){}, at: 
freeze_super+0xda/0x290
 #2: 3edbd5a0 (sb_pagefaults){}, at: percpu_down_write+0xb4/0x650
 #3: 72bfc54b (sb_internal){}, at: percpu_down_write+0xb4/0x650

stack backtrace:
Call Trace:
 dump_stack+0xe0/0x19a
 print_circular_bug.isra.10.cold.34+0x2f4/0x435
 check_prev_add.constprop.19+0xca1/0x15f0
 validate_chain.isra.14+0x11af/0x3b50
 __lock_acquire+0x728/0x1200
 lock_acquire+0x269/0x5a0
 fs_reclaim_acquire.part.19+0x29/0x30
 fs_reclaim_acquire+0x19/0x20
 kmem_cache_alloc+0x3e/0x3f0
 kmem_zone_alloc+0x79/0x150
 xfs_trans_alloc+0xfa/0x9d0
 xfs_sync_sb+0x86/0x170
 xfs_log_sbcount+0x10f/0x140
 xfs_quiesce_attr+0x134/0x270
 xfs_fs_freeze+0x4a/0x70
 freeze_super+0x1af/0x290
 do_vfs_ioctl+0xedc/0x16c0
 ksys_ioctl+0x41/0x80
 __x64_sys_ioctl+0x73/0xa9
 do_syscall_64+0x18f/0xd23
 entry_SYSCALL_64_after_hwframe+0x49/0xbe

Signed-off-by: Qian Cai 
---
 fs/xfs/libxfs/xfs_sb.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/fs/xfs/libxfs/xfs_sb.c b/fs/xfs/libxfs/xfs_sb.c
index b5a82acd7dfe..ec83cb8289fa 100644
--- a/fs/xfs/libxfs/xfs_sb.c
+++ b/fs/xfs/libxfs/xfs_sb.c
@@ -965,8 +965,11 @@ xfs_sync_sb(
struct xfs_trans*tp;
int error;
 
+   /* Silence lockdep false positives in the freeze path. */
+   lockdep_off();
error = xfs_trans_alloc(mp, _RES(mp)->tr_sb, 0, 0,
XFS_TRANS_NO_WRITECOUNT, );
+   lockdep_on();
if (error)
return error;
 
-- 
2.17.2 (Apple Git-113)



Re: lockdep warning while reading sysfs

2019-01-09 Thread Qian Cai


> You stripped out the stack trace at the bottom that shows the inversion
> :/
> 

Sorry, I thought it was the same as in #0, but here is the whole thing.

WARNING: possible circular locking dependency detected
5.0.0-rc1+ #60 Not tainted
--
read_all/2954 is trying to acquire lock:
c63ff499 (mem_hotplug_lock.rw_sem){}, at: 
show_slab_objects+0x16c/0x450

but task is already holding lock:
47ae17d7 (kn->count#70){}, at: kernfs_seq_start+0x79/0x170

which lock already depends on the new lock.


the existing dependency chain (in reverse order) is:

-> #3 (kn->count#70){}:
   __lock_acquire+0x728/0x1200
   lock_acquire+0x269/0x5a0
   __kernfs_remove+0x72f/0x9a0
   kernfs_remove_by_name_ns+0x45/0x90
   sysfs_remove_link+0x3c/0xa0
   sysfs_slab_add+0x1bd/0x330
   __kmem_cache_create+0x166/0x1c0
   create_cache+0xcf/0x1f0
   kmem_cache_create_usercopy+0x1aa/0x270
   kmem_cache_create+0x16/0x20
   mlx5_init_fs+0x195/0x1a10 [mlx5_core]
   mlx5_load_one+0x1106/0x1e90 [mlx5_core]
   init_one+0x864/0xd60 [mlx5_core]
   local_pci_probe+0xda/0x190
   work_for_cpu_fn+0x56/0xa0
   process_one_work+0xad7/0x1b80
   worker_thread+0x8ff/0x1370
   kthread+0x32c/0x3f0
   ret_from_fork+0x27/0x50

-> #2 (slab_mutex){+.+.}:
   __lock_acquire+0x728/0x1200
   lock_acquire+0x269/0x5a0
   __mutex_lock+0x168/0x1730
   mutex_lock_nested+0x1b/0x20
   kmem_cache_create_usercopy+0x45/0x270
   kmem_cache_create+0x16/0x20
   ptlock_cache_init+0x24/0x2d
   start_kernel+0x40e/0x7e0
   x86_64_start_reservations+0x24/0x26
   x86_64_start_kernel+0xef/0xf6
   secondary_startup_64+0xb6/0xc0

-> #1 (memcg_cache_ids_sem){}:
   __lock_acquire+0x728/0x1200
   lock_acquire+0x269/0x5a0
   down_read+0x92/0x130
   memcg_get_cache_ids+0x15/0x20
   kmem_cache_create_usercopy+0x37/0x270
   kmem_cache_create+0x16/0x20
   ptlock_cache_init+0x24/0x2d
   start_kernel+0x40e/0x7e0
   x86_64_start_reservations+0x24/0x26
   x86_64_start_kernel+0xef/0xf6
   secondary_startup_64+0xb6/0xc0

-> #0 (mem_hotplug_lock.rw_sem){}:
   validate_chain.isra.14+0x11af/0x3b50
   __lock_acquire+0x728/0x1200
   lock_acquire+0x269/0x5a0
   get_online_mems+0x3d/0x80
   show_slab_objects+0x16c/0x450
   total_objects_show+0x13/0x20
   slab_attr_show+0x1e/0x30
   sysfs_kf_seq_show+0x1d5/0x470
   kernfs_seq_show+0x1fa/0x2c0
   seq_read+0x3f7/0x1050
   kernfs_fop_read+0x126/0x650
   __vfs_read+0xeb/0xf20
   vfs_read+0x103/0x290
   ksys_read+0xfa/0x260
   __x64_sys_read+0x73/0xb0
   do_syscall_64+0x18f/0xd23
   entry_SYSCALL_64_after_hwframe+0x49/0xbe

other info that might help us debug this:

Chain exists of:
  lock(kn->count#70);
   lock(slab_mutex);
   lock(kn->count#70);
  lock(mem_hotplug_lock.rw_sem);

 *** DEADLOCK ***

3 locks held by read_all/2954:
 #0: e8745902 (>lock){+.+.}, at: seq_read+0x6b/0x1050
 #1: bb9fa87a (>mutex){+.+.}, at: kernfs_seq_start+0x4f/0x170
 #2: 47ae17d7 (kn->count#70){}, at: kernfs_seq_start+0x79/0x170

stack backtrace:
CPU: 100 PID: 2954 Comm: read_all Kdump: loaded Not tainted 5.0.0-rc1+ #60
Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 Gen10, BIOS A40 
09/07/2018
Call Trace:
 dump_stack+0xe0/0x19a
 print_circular_bug.isra.10.cold.34+0x2f4/0x435
 check_prev_add.constprop.19+0xca1/0x15f0
 validate_chain.isra.14+0x11af/0x3b50
 __lock_acquire+0x728/0x1200
 lock_acquire+0x269/0x5a0
 get_online_mems+0x3d/0x80
 show_slab_objects+0x16c/0x450
 total_objects_show+0x13/0x20
 slab_attr_show+0x1e/0x30
 sysfs_kf_seq_show+0x1d5/0x470
 kernfs_seq_show+0x1fa/0x2c0
 seq_read+0x3f7/0x1050
 kernfs_fop_read+0x126/0x650
 __vfs_read+0xeb/0xf20
 vfs_read+0x103/0x290
 ksys_read+0xfa/0x260
 __x64_sys_read+0x73/0xb0
 do_syscall_64+0x18f/0xd23
 entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x7f01bb12
Code: 94 20 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b6 0f 1f 80 00 00 00 00 f3
0f 1e fa 8b 05 36 d9 20 00 85 c0 75 12 31 c0 0f 05 <48> 3d 00 f0 ff ff 77 56 c3
0f 1f 44 00 00 41 54 49 89 d4 55 48 89
RSP: 002b:7ffd480fc058 EFLAGS: 0246 ORIG_RAX: 
RAX: ffda RBX: 7ffd480fc100 RCX: 7f01bb12
RDX: 03ff RSI: 7ffd480fc500 RDI: 0003
RBP: 7f01b040f000 R08: 0020 R09: 
R10:  R11: 0246 R12: 0003
R13: 7ffd480fc500 R14: 0028 R15: 0003


lockdep warning while reading sysfs

2019-01-08 Thread Qian Cai
LTP: starting read_all_sys (read_all -d /sys -q -r 10 -e 
/sys/power/wakeup_count)

I suppose simply reading files in /sys/kernel/slab/* would trigger this.
Basically, it acquires kn->count#69 in kernfs_seq_start():

mutex_lock(>mutex);
if (!kernfs_get_active(of->kn))

in kernfs_get_active():

if (kernfs_lockdep(kn))
rwsem_acquire_read(>dep_map, 0, 1, _RET_IP_);

Then, it acquires mem_hotplug_lock.rw_sem in show_slab_objects() ->
get_online_mems().

Then, another CPU acquires mem_hotplug_lock.rw_sem and calls
secondary_startup(), I guess from the CPU hotplug path, triggering a
deadlock.

==
WARNING: possible circular locking dependency detected
5.0.0-rc1+ #60 Not tainted
--
read_all/7952 is trying to acquire lock:
19f12603 (mem_hotplug_lock.rw_sem){}, at: 
show_slab_objects+0x16c/0x450

but task is already holding lock:
8804717f (kn->count#69){}, at: kernfs_seq_start+0x79/0x170

which lock already depends on the new lock.


the existing dependency chain (in reverse order) is:

-> #3 (kn->count#69){}:
   __lock_acquire+0x728/0x1200
   lock_acquire+0x269/0x5a0
   __kernfs_remove+0x72f/0x9a0
   kernfs_remove_by_name_ns+0x45/0x90
   sysfs_remove_link+0x3c/0xa0
   sysfs_slab_add+0x1bd/0x330
   __kmem_cache_create+0x166/0x1c0
   create_cache+0xcf/0x1f0
   kmem_cache_create_usercopy+0x1aa/0x270
   kmem_cache_create+0x16/0x20
   mlx5_init_fs+0x195/0x1a10 [mlx5_core]
   mlx5_load_one+0x1106/0x1e90 [mlx5_core]
   init_one+0x864/0xd60 [mlx5_core]
   local_pci_probe+0xda/0x190
   work_for_cpu_fn+0x56/0xa0
   process_one_work+0xad7/0x1b80
   worker_thread+0x8ff/0x1370
   kthread+0x32c/0x3f0
   ret_from_fork+0x27/0x50

-> #2 (slab_mutex){+.+.}:
   __lock_acquire+0x728/0x1200
   lock_acquire+0x269/0x5a0
   __mutex_lock+0x168/0x1730
   mutex_lock_nested+0x1b/0x20
   kmem_cache_create_usercopy+0x45/0x270
   kmem_cache_create+0x16/0x20
   ptlock_cache_init+0x24/0x2d
   start_kernel+0x40e/0x7e0
   x86_64_start_reservations+0x24/0x26
   x86_64_start_kernel+0xef/0xf6
   secondary_startup_64+0xb6/0xc0

-> #1 (memcg_cache_ids_sem){}:
   ptlock_cache_init+0x24/0x2d
   start_kernel+0x40e/0x7e0
   x86_64_start_reservations+0x24/0x26
   x86_64_start_kernel+0xef/0xf6
   secondary_startup_64+0xb6/0xc0

-> #0 (mem_hotplug_lock.rw_sem){}:
   validate_chain.isra.14+0x11af/0x3b50
   __lock_acquire+0x728/0x1200
   lock_acquire+0x269/0x5a0
   get_online_mems+0x3d/0x80
   show_slab_objects+0x16c/0x450
   total_objects_show+0x13/0x20
   slab_attr_show+0x1e/0x30
   sysfs_kf_seq_show+0x1d5/0x470
   kernfs_seq_show+0x1fa/0x2c0
   seq_read+0x3f7/0x1050
   kernfs_fop_read+0x126/0x650
   __vfs_read+0xeb/0xf20
   vfs_read+0x103/0x290
   ksys_read+0xfa/0x260
   __x64_sys_read+0x73/0xb0
   do_syscall_64+0x18f/0xd23
   entry_SYSCALL_64_after_hwframe+0x49/0xbe

other info that might help us debug this:

Chain exists of:
  mem_hotplug_lock.rw_sem --> slab_mutex --> kn->count#69

 Possible unsafe locking scenario:

   CPU0CPU1
   
  lock(kn->count#69);
   lock(slab_mutex);
   lock(kn->count#69);
  lock(mem_hotplug_lock.rw_sem);


3 locks held by read_all/7952:
 #0: 05c4ddec (>lock){+.+.}, at: seq_read+0x6b/0x1050
 #1: c2f2e854 (>mutex){+.+.}, at: kernfs_seq_start+0x4f/0x170
 #2: 8804717f (kn->count#69){}, at: kernfs_seq_start+0x79/0x170




Re: [PATCH v3] mm/page_owner: fix for deferred struct page init

2019-01-08 Thread Qian Cai



On 1/8/19 5:02 PM, Andrew Morton wrote:
> 
> It's unclear (to me) where we stand with this patch.  Shold we proceed
> with v3 for now, or is something else planned?

I don't have anything else planned for this right now. Michal in particular
doesn't like the 4-line ifdef, which is supposed to avoid an immediate
(arguably small) regression where existing page_owner users with
DEFERRED_STRUCT_PAGE_INIT deselected would start to miss tens of thousands of
early page allocation call sites. So, feel free to choose v2 of this, which has
no ifdef, if you agree with Michal. I am fine either way.



Re: [PATCH v3] mm/page_owner: fix for deferred struct page init

2019-01-08 Thread Qian Cai
On Tue, 2019-01-08 at 09:20 +0100, Michal Hocko wrote:
> On Mon 07-01-19 20:53:08, Qian Cai wrote:
> > 
> > 
> > On 1/7/19 1:43 PM, Michal Hocko wrote:
> > > On Fri 04-01-19 15:18:08, Qian Cai wrote:
> > > [...]
> > > > Though, I can't see any really benefit of this approach apart from
> > > > "beautify"
> > > 
> > > This is not about beautifying! This is about making the code long term
> > > maintainable. As you can see it is just too easy to break it with the
> > > current scheme. And that is bad especially when the code is broken
> > > because of an optimization.
> > > 
> > 
> > Understood, but the code is now fixed. If there is something fundamentally
> > broken in the future, it may be a good time then to create a looks like
> > hundred-line cleanup patch for long-term maintenance at the same time to fix
> > real bugs.
> 
> Yeah, so revert = fix and redisign the thing to make the code more
> robust longterm + allow to catch more allocation. I really fail to see
> why this has to be repeated several times in this thread. Really.
> 

Again, this will introduce an immediate (arguably small) regression where
existing page_owner users with DEFERRED_STRUCT_PAGE_INIT deselected would
start to miss tens of thousands of early page allocation call sites.

I think the disagreement comes from the fact that you want to deal with this
passively rather than proactively, as you said "I am pretty sure we will hear
about that when that happens. And act accordingly", but I think it is better to
fix it now rather than later with a 4-line ifdef which you don't like.

I suppose someone else needs to make a judgment call for this as we are in a
"you can't convince me and I can't convince you" situation right now.


Re: [PATCH v2] kmemleak: survive in a low-memory situation

2019-01-07 Thread Qian Cai



On 1/7/19 9:06 PM, Qian Cai wrote:
> 
> 
> On 1/7/19 5:43 AM, Catalin Marinas wrote:
>> On Thu, Jan 03, 2019 at 06:07:35PM +0100, Michal Hocko wrote:
>>>>> On Wed 02-01-19 13:06:19, Qian Cai wrote:
>>>>> [...]
>>>>>> diff --git a/mm/kmemleak.c b/mm/kmemleak.c
>>>>>> index f9d9dc250428..9e1aa3b7df75 100644
>>>>>> --- a/mm/kmemleak.c
>>>>>> +++ b/mm/kmemleak.c
>>>>>> @@ -576,6 +576,16 @@ static struct kmemleak_object 
>>>>>> *create_object(unsigned long ptr, size_t size,
>>>>>>  struct rb_node **link, *rb_parent;
>>>>>>  
>>>>>>  object = kmem_cache_alloc(object_cache, gfp_kmemleak_mask(gfp));
>>>>>> +#ifdef CONFIG_PREEMPT_COUNT
>>>>>> +if (!object) {
>>>>>> +/* last-ditch effort in a low-memory situation */
>>>>>> +if (irqs_disabled() || is_idle_task(current) || 
>>>>>> in_atomic())
>>>>>> +gfp = GFP_ATOMIC;
>>>>>> +else
>>>>>> +gfp = gfp_kmemleak_mask(gfp) | 
>>>>>> __GFP_DIRECT_RECLAIM;
>>>>>> +object = kmem_cache_alloc(object_cache, gfp);
>>>>>> +}
>>>>>> +#endif
>> [...]
>>> I will not object to this workaround but I strongly believe that
>>> kmemleak should rethink the metadata allocation strategy to be really
>>> robust.
>>
>> This would be nice indeed and it was discussed last year. I just haven't
>> got around to trying anything yet:
>>
>> https://marc.info/?l=linux-mm=152812489819532
>>
> 
> It could be helpful to apply this 10-line patch first if has no fundamental
> issue, as it survives probably 50 times running LTP oom* workloads without a
> single kmemleak allocation failure.
> 
> Of course, if someone is going to embed kmemleak metadata into slab objects
> itself soon, this workaround is not needed.
> 

Well, it is really hard to tell whether eventually redesigning kmemleak to
embed the metadata into slab objects would by itself survive the LTP oom*
workloads, because it seems kmemleak would still use separate metadata for
non-slab objects, where its allocation could fail as it does right now and
disable itself again.


Re: [PATCH v2] kmemleak: survive in a low-memory situation

2019-01-07 Thread Qian Cai



On 1/7/19 5:43 AM, Catalin Marinas wrote:
> On Thu, Jan 03, 2019 at 06:07:35PM +0100, Michal Hocko wrote:
>>>> On Wed 02-01-19 13:06:19, Qian Cai wrote:
>>>> [...]
>>>>> diff --git a/mm/kmemleak.c b/mm/kmemleak.c
>>>>> index f9d9dc250428..9e1aa3b7df75 100644
>>>>> --- a/mm/kmemleak.c
>>>>> +++ b/mm/kmemleak.c
>>>>> @@ -576,6 +576,16 @@ static struct kmemleak_object 
>>>>> *create_object(unsigned long ptr, size_t size,
>>>>>   struct rb_node **link, *rb_parent;
>>>>>  
>>>>>   object = kmem_cache_alloc(object_cache, gfp_kmemleak_mask(gfp));
>>>>> +#ifdef CONFIG_PREEMPT_COUNT
>>>>> + if (!object) {
>>>>> + /* last-ditch effort in a low-memory situation */
>>>>> + if (irqs_disabled() || is_idle_task(current) || in_atomic())
>>>>> + gfp = GFP_ATOMIC;
>>>>> + else
>>>>> + gfp = gfp_kmemleak_mask(gfp) | __GFP_DIRECT_RECLAIM;
>>>>> + object = kmem_cache_alloc(object_cache, gfp);
>>>>> + }
>>>>> +#endif
> [...]
>> I will not object to this workaround but I strongly believe that
>> kmemleak should rethink the metadata allocation strategy to be really
>> robust.
> 
> This would be nice indeed and it was discussed last year. I just haven't
> got around to trying anything yet:
> 
> https://marc.info/?l=linux-mm=152812489819532
> 

It could be helpful to apply this 10-line patch first if it has no fundamental
issue, as it has survived probably 50 runs of the LTP oom* workloads without a
single kmemleak allocation failure.

Of course, if someone is going to embed kmemleak metadata into the slab objects
themselves soon, this workaround is not needed.


Re: [PATCH v3] mm/page_owner: fix for deferred struct page init

2019-01-07 Thread Qian Cai



On 1/7/19 1:43 PM, Michal Hocko wrote:
> On Fri 04-01-19 15:18:08, Qian Cai wrote:
> [...]
>> Though, I can't see any really benefit of this approach apart from "beautify"
> 
> This is not about beautifying! This is about making the code long term
> maintainable. As you can see it is just too easy to break it with the
> current scheme. And that is bad especially when the code is broken
> because of an optimization.
> 

Understood, but the code is now fixed. If something fundamentally broken
appears in the future, that may be a good time to create what would look like a
hundred-line cleanup patch for long-term maintenance while fixing real bugs at
the same time.

