Re: [PATCH] time: Fix overwrite err unexpected in clock_adjtime32
On 2021/4/12 23:58, Richard Cochran wrote:
> On Mon, Apr 12, 2021 at 02:52:11PM +0000, chenjun (AM) wrote:
>> On 2021/4/12 22:20, Richard Cochran wrote:
>>> On Mon, Apr 12, 2021 at 12:45:51PM +0000, Chen Jun wrote:
>>>> the correct error is covered by put_old_timex32.
>>>
>>> Well, the non-negative return code (TIME_OK, TIME_INS, etc.) is
>>> clobbered by put_old_timex32().
>>>
>>>> Fixes: f1f1d5ebd10f ("posix-timers: Introduce a syscall for clock tuning.")
>>>
>>> This is not the correct commit for the "Fixes" tag.  Please find the
>>> actual commit that introduced the issue.
>>>
>>> In commit f1f1d5ebd10f the code looked like this...
>>>
>>> long compat_sys_clock_adjtime(clockid_t which_clock,
>>> 		struct compat_timex __user *utp)
>>> {
>>> 	struct timex txc;
>>> 	mm_segment_t oldfs;
>>> 	int err, ret;
>>>
>>> 	err = compat_get_timex(&txc, utp);
>>> 	if (err)
>>> 		return err;
>>>
>>> 	oldfs = get_fs();
>>> 	set_fs(KERNEL_DS);
>>> 	ret = sys_clock_adjtime(which_clock, (struct timex __user *) &txc);
>>> 	set_fs(oldfs);
>>>
>>> 	err = compat_put_timex(utp, &txc);
>>> 	if (err)
>>> 		return err;
>>>
>>> 	return ret;
>>> }
>
> Look at the code ^^^
>
>> The implementation of clock_adjtime32 is similar to compat_sys_clock_adjtime.
>> And I think f1f1d5ebd10 actually introduced the problem.
>
> See how 'ret' and 'err' are two separate variables?  It makes a difference.
>
> Thanks,
> Richard

Oh, I see. Thanks very much.

3a4d44b616 ("ntp: Move adjtimex related compat syscalls to native
counterparts") made the change.

--
Regards
Chen Jun
Re: [PATCH] time: Fix overwrite err unexpected in clock_adjtime32
On 2021/4/12 22:20, Richard Cochran wrote:
> On Mon, Apr 12, 2021 at 12:45:51PM +0000, Chen Jun wrote:
>> the correct error is covered by put_old_timex32.
>
> Well, the non-negative return code (TIME_OK, TIME_INS, etc.) is
> clobbered by put_old_timex32().
>
>> Fixes: f1f1d5ebd10f ("posix-timers: Introduce a syscall for clock tuning.")
>
> This is not the correct commit for the "Fixes" tag.  Please find the
> actual commit that introduced the issue.
>
> In commit f1f1d5ebd10f the code looked like this...
>
> long compat_sys_clock_adjtime(clockid_t which_clock,
> 		struct compat_timex __user *utp)
> {
> 	struct timex txc;
> 	mm_segment_t oldfs;
> 	int err, ret;
>
> 	err = compat_get_timex(&txc, utp);
> 	if (err)
> 		return err;
>
> 	oldfs = get_fs();
> 	set_fs(KERNEL_DS);
> 	ret = sys_clock_adjtime(which_clock, (struct timex __user *) &txc);
> 	set_fs(oldfs);
>
> 	err = compat_put_timex(utp, &txc);
> 	if (err)
> 		return err;
>
> 	return ret;
> }
>

f1f1d5ebd10: Introduce compat_sys_clock_adjtime
62a6fa97684: rename compat_sys_clock_adjtime to COMPAT_SYSCALL_DEFINE2(clock_adjtime, ...)
3a4d44b6162: move COMPAT_SYSCALL_DEFINE2(clock_adjtime, ...) from kernel/compat.c to kernel/time/posix-timers.c
8dabe7245bb: COMPAT_SYSCALL_DEFINE2(clock_adjtime, ...) -> SYSCALL_DEFINE2(clock_adjtime32, ...)

The implementation of clock_adjtime32 is similar to compat_sys_clock_adjtime.
And I think f1f1d5ebd10 actually introduced the problem.
> Thanks,
> Richard
>
>> Signed-off-by: Chen Jun
>> ---
>>  kernel/time/posix-timers.c | 4 ++--
>>  1 file changed, 2 insertions(+), 2 deletions(-)
>>
>> diff --git a/kernel/time/posix-timers.c b/kernel/time/posix-timers.c
>> index bf540f5a..dd5697d 100644
>> --- a/kernel/time/posix-timers.c
>> +++ b/kernel/time/posix-timers.c
>> @@ -1191,8 +1191,8 @@ SYSCALL_DEFINE2(clock_adjtime32, clockid_t, which_clock,
>>
>>  	err = do_clock_adjtime(which_clock, &ktx);
>>
>> -	if (err >= 0)
>> -		err = put_old_timex32(utp, &ktx);
>> +	if (err >= 0 && put_old_timex32(utp, &ktx))
>> +		return -EFAULT;
>>
>>  	return err;
>>  }
>> --
>> 2.9.4
>>

--
Regards
Chen Jun
Re: [question] insert ko failed because count_plts return 0 when CONFIG_RANDOMIZE_BASE is not set
On 2021/3/24 16:29, Ard Biesheuvel wrote:
> On Wed, 24 Mar 2021 at 08:27, chenjun (AM) wrote:
>>
>> Hi
>>
>> I built an Image for arm64 (without CONFIG_RANDOMIZE_BASE), and a 13 MB
>> .ko cannot be inserted.
>>
>
> How many large modules have you loaded already? The module region is
> only 128 MB, so if your modules are huge, you may run out of space.
>
> Please check the kernel VA address and the load address of the module,
> and check whether they are more than 128 MB apart.
>

Thanks Ard, I will check that.

One more question: why does CONFIG_ARM64_MODULE_PLTS depend on
CONFIG_RANDOMIZE_BASE?

>
>> WARNING: CPU: 2 PID: 1998 at arch/arm64/kernel/module-plts.c:39 module_emit_plt_entry+0x100/0x118
>> ...
>> Call trace:
>>  module_emit_plt_entry+0x100/0x118
>>  apply_relocate_add+0x34c/0x570
>> ...
>>
>> I think the problem is that, in apply_relocate_add:
>> 	case R_AARCH64_CALL26:
>> 		ovf = reloc_insn_imm(RELOC_OP_PREL, loc, val, 2, 26,
>> 				     AARCH64_INSN_IMM_26);
>>
>> 		if (IS_ENABLED(CONFIG_ARM64_MODULE_PLTS) &&
>> 		    ovf == -ERANGE) {
>> 			val = module_emit_plt_entry(me, sechdrs, loc, &rel[i], sym);
>>
>> reloc_insn_imm returns -ERANGE (because the .ko is too big?)
>>
>> in module_emit_plt_entry:
>> 	WARN_ON(pltsec->plt_num_entries > pltsec->plt_max_entries)
>>
>> pltsec->plt_max_entries is 0 if CONFIG_RANDOMIZE_BASE is not set.
>>
>> a257e02 ("arm64/kernel: don't ban ADRP to work around Cortex-A53 erratum #843419")
>>  static unsigned int count_plts(Elf64_Sym *syms, Elf64_Rela *rela, int num,
>> -			       Elf64_Word dstidx)
>> +			       Elf64_Word dstidx, Elf_Shdr *dstsec)
>>  {
>> ...
>> 	switch (ELF64_R_TYPE(rela[i].r_info)) {
>> 	case R_AARCH64_JUMP26:
>> 	case R_AARCH64_CALL26:
>> +		if (!IS_ENABLED(CONFIG_RANDOMIZE_BASE))
>> +			break;
>> +
>>
>> Why do we need to break if !IS_ENABLED(CONFIG_RANDOMIZE_BASE)? Or are
>> there any restrictions on the .ko?
>>
>> If I comment out this part of the code, the .ko can be inserted and
>> seems to work well. So is that an acceptable way, or is there another
>> solution for my case?
>>
>> --
>> Regards
>> Chen Jun
>>

--
Regards
Chen Jun
[question] insert ko failed because count_plts return 0 when CONFIG_RANDOMIZE_BASE is not set
Hi

I built an Image for arm64 (without CONFIG_RANDOMIZE_BASE), and a 13 MB .ko
cannot be inserted:

WARNING: CPU: 2 PID: 1998 at arch/arm64/kernel/module-plts.c:39 module_emit_plt_entry+0x100/0x118
...
Call trace:
 module_emit_plt_entry+0x100/0x118
 apply_relocate_add+0x34c/0x570
...

I think the problem is that, in apply_relocate_add:

	case R_AARCH64_CALL26:
		ovf = reloc_insn_imm(RELOC_OP_PREL, loc, val, 2, 26,
				     AARCH64_INSN_IMM_26);

		if (IS_ENABLED(CONFIG_ARM64_MODULE_PLTS) &&
		    ovf == -ERANGE) {
			val = module_emit_plt_entry(me, sechdrs, loc, &rel[i], sym);

reloc_insn_imm returns -ERANGE (because the .ko is too big?), and then in
module_emit_plt_entry:

	WARN_ON(pltsec->plt_num_entries > pltsec->plt_max_entries)

pltsec->plt_max_entries is 0 if CONFIG_RANDOMIZE_BASE is not set.

Commit a257e02 ("arm64/kernel: don't ban ADRP to work around Cortex-A53
erratum #843419") changed count_plts:

 static unsigned int count_plts(Elf64_Sym *syms, Elf64_Rela *rela, int num,
-			       Elf64_Word dstidx)
+			       Elf64_Word dstidx, Elf_Shdr *dstsec)
 {
...
	switch (ELF64_R_TYPE(rela[i].r_info)) {
	case R_AARCH64_JUMP26:
	case R_AARCH64_CALL26:
+		if (!IS_ENABLED(CONFIG_RANDOMIZE_BASE))
+			break;
+

Why do we need to break if !IS_ENABLED(CONFIG_RANDOMIZE_BASE)? Or are there
any restrictions on the .ko?

If I comment out this part of the code, the .ko can be inserted and seems to
work well. So is that an acceptable way, or is there another solution for my
case?

--
Regards
Chen Jun
[RESEND PATCH] Kconfig: Move CONFIG_DEBUG_KMEMLEAK_TEST to samples/Kconfig
From: Chen Jun

commit 1abbef4f51724fb11f09adf0e75275f7cb422a8a ("mm,kmemleak-test.c: move
kmemleak-test.c to samples dir") made CONFIG_DEBUG_KMEMLEAK_TEST depend on
CONFIG_SAMPLES implicitly, and that dependency cannot be guaranteed by
Kconfig.

Move the definition of CONFIG_DEBUG_KMEMLEAK_TEST from lib/Kconfig.debug to
samples/Kconfig.

Signed-off-by: Chen Jun
Acked-by: Catalin Marinas
---
 lib/Kconfig.debug | 8 --------
 samples/Kconfig   | 8 ++++++++
 2 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index 66d44d3..debacdc 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -678,14 +678,6 @@ config DEBUG_KMEMLEAK_MEM_POOL_SIZE
 	  fully initialised, this memory pool acts as an emergency one
 	  if slab allocations fail.
 
-config DEBUG_KMEMLEAK_TEST
-	tristate "Simple test for the kernel memory leak detector"
-	depends on DEBUG_KMEMLEAK && m
-	help
-	  This option enables a module that explicitly leaks memory.
-
-	  If unsure, say N.
-
 config DEBUG_KMEMLEAK_DEFAULT_OFF
 	bool "Default kmemleak to off"
 	depends on DEBUG_KMEMLEAK
diff --git a/samples/Kconfig b/samples/Kconfig
index 0ed6e4d..15978dd 100644
--- a/samples/Kconfig
+++ b/samples/Kconfig
@@ -216,4 +216,12 @@ config SAMPLE_WATCH_QUEUE
 	  Build example userspace program to use the new mount_notify(),
 	  sb_notify() syscalls and the KEYCTL_WATCH_KEY keyctl() function.
 
+config DEBUG_KMEMLEAK_TEST
+	tristate "Simple test for the kernel memory leak detector"
+	depends on DEBUG_KMEMLEAK && m
+	help
+	  This option enables a module that explicitly leaks memory.
+
+	  If unsure, say N.
+
 endif # SAMPLES
--
2.7.4
Re: [PATCH 2/2] arm64: stacktrace: Add skip when task == current
On 2021/3/18 11:31, chenjun (AM) wrote:
> On 2021/3/18 3:34, Mark Rutland wrote:
>> On Wed, Mar 17, 2021 at 06:36:36PM +0000, Catalin Marinas wrote:
>>> On Wed, Mar 17, 2021 at 02:20:50PM +0000, Chen Jun wrote:
>>>> On ARM64, cat /sys/kernel/debug/page_owner, all pages return the same
>>>> stack:
>>>>  stack_trace_save+0x4c/0x78
>>>>  register_early_stack+0x34/0x70
>>>>  init_page_owner+0x34/0x230
>>>>  page_ext_init+0x1bc/0x1dc
>>>>
>>>> The reason is that:
>>>> check_recursive_alloc always returns 1 because
>>>> entries[0] is always equal to ip (__set_page_owner+0x3c/0x60).
>>>>
>>>> The root cause is that:
>>>> commit 5fc57df2f6fd ("arm64: stacktrace: Convert to ARCH_STACKWALK")
>>>> made save_trace save 2 more entries.
>>>>
>>>> Add skip in arch_stack_walk when task == current.
>>>>
>>>> Fixes: 5fc57df2f6fd ("arm64: stacktrace: Convert to ARCH_STACKWALK")
>>>> Signed-off-by: Chen Jun
>>>> ---
>>>>  arch/arm64/kernel/stacktrace.c | 5 +++--
>>>>  1 file changed, 3 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/arch/arm64/kernel/stacktrace.c b/arch/arm64/kernel/stacktrace.c
>>>> index ad20981..c26b0ac 100644
>>>> --- a/arch/arm64/kernel/stacktrace.c
>>>> +++ b/arch/arm64/kernel/stacktrace.c
>>>> @@ -201,11 +201,12 @@ void arch_stack_walk(stack_trace_consume_fn consume_entry, void *cookie,
>>>>
>>>>  	if (regs)
>>>>  		start_backtrace(&frame, regs->regs[29], regs->pc);
>>>> -	else if (task == current)
>>>> +	else if (task == current) {
>>>> +		((struct stacktrace_cookie *)cookie)->skip += 2;
>>>>  		start_backtrace(&frame,
>>>>  				(unsigned long)__builtin_frame_address(0),
>>>>  				(unsigned long)arch_stack_walk);
>>>> -	else
>>>> +	} else
>>>>  		start_backtrace(&frame, thread_saved_fp(task),
>>>>  				thread_saved_pc(task));
>>>
>>> I don't like abusing the cookie here. It's void * as it's meant to be an
>>> opaque type. I'd rather skip the first two frames in walk_stackframe()
>>> instead before invoking fn().
>>
>> I agree that we shouldn't touch cookie here.
>>
>> I don't think that it's right to bodge this inside walk_stackframe(),
>> since that'll add bogus skipping for the case starting with regs in the
>> current task. If we need a bodge, it has to live in arch_stack_walk()
>> where we set up the initial unwinding state.
>>
>> In another thread, we came to the conclusion that arch_stack_walk()
>> should start at its parent, and its parent should add any skipping it
>> requires.
>>
>> Currently, arch_stack_walk() is off-by-one, and we can bodge that by
>> using __builtin_frame_address(1), though I'm waiting for some compiler
>> folk to confirm that's sound. Otherwise we need to add an assembly
>> trampoline to snapshot the FP, which is unfortunately convoluted.
>>
>> This report suggests that a caller of arch_stack_walk() is off-by-one
>> too, which suggests a larger cross-architecture semantic issue. I'll try
>> to take a look tomorrow.
>>
>> Thanks,
>> Mark.
>>
>>>
>>> Prior to the conversion to ARCH_STACKWALK, we were indeed skipping two
>>> more entries in __save_stack_trace() if tsk == current. Something like
>>> below, completely untested:
>>>
>>> diff --git a/arch/arm64/kernel/stacktrace.c b/arch/arm64/kernel/stacktrace.c
>>> index ad20981dfda4..2a9f759aa41a 100644
>>> --- a/arch/arm64/kernel/stacktrace.c
>>> +++ b/arch/arm64/kernel/stacktrace.c
>>> @@ -115,10 +115,15 @@ NOKPROBE_SYMBOL(unwind_frame);
>>>  void notrace walk_stackframe(struct task_struct *tsk, struct stackframe *frame,
>>>  			     bool (*fn)(void *, unsigned long), void *data)
>>>  {
>>> +	/* for the current task, we don't want this function nor its caller */
>>> +	int skip = tsk == current ? 2 : 0;
>>> +
>>>  	while (1) {
>>> 		int ret;
>>>
>>> -		if (!fn(data, frame->pc))
>>> +		if (skip)
>>> +			skip--;
>>> +		else if (!fn(data, frame->pc))
>>>  			break;
>>>  		ret = unwind_frame(tsk, frame);
>>>  		if (ret < 0)
>>>
>>>
>>> --
>>> Catalin
>>
>
> This change will make kmemleak broken.
> Maybe the reason is what Mark pointed out. I will try to check it out.

I made a mistake. kmemleak seems to work well. I will do more tests.

--
Regards
Chen Jun
Re: [PATCH 2/2] arm64: stacktrace: Add skip when task == current
On 2021/3/18 3:34, Mark Rutland wrote:
> On Wed, Mar 17, 2021 at 06:36:36PM +0000, Catalin Marinas wrote:
>> On Wed, Mar 17, 2021 at 02:20:50PM +0000, Chen Jun wrote:
>>> On ARM64, cat /sys/kernel/debug/page_owner, all pages return the same
>>> stack:
>>>  stack_trace_save+0x4c/0x78
>>>  register_early_stack+0x34/0x70
>>>  init_page_owner+0x34/0x230
>>>  page_ext_init+0x1bc/0x1dc
>>>
>>> The reason is that:
>>> check_recursive_alloc always returns 1 because
>>> entries[0] is always equal to ip (__set_page_owner+0x3c/0x60).
>>>
>>> The root cause is that:
>>> commit 5fc57df2f6fd ("arm64: stacktrace: Convert to ARCH_STACKWALK")
>>> made save_trace save 2 more entries.
>>>
>>> Add skip in arch_stack_walk when task == current.
>>>
>>> Fixes: 5fc57df2f6fd ("arm64: stacktrace: Convert to ARCH_STACKWALK")
>>> Signed-off-by: Chen Jun
>>> ---
>>>  arch/arm64/kernel/stacktrace.c | 5 +++--
>>>  1 file changed, 3 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/arch/arm64/kernel/stacktrace.c b/arch/arm64/kernel/stacktrace.c
>>> index ad20981..c26b0ac 100644
>>> --- a/arch/arm64/kernel/stacktrace.c
>>> +++ b/arch/arm64/kernel/stacktrace.c
>>> @@ -201,11 +201,12 @@ void arch_stack_walk(stack_trace_consume_fn consume_entry, void *cookie,
>>>
>>>  	if (regs)
>>>  		start_backtrace(&frame, regs->regs[29], regs->pc);
>>> -	else if (task == current)
>>> +	else if (task == current) {
>>> +		((struct stacktrace_cookie *)cookie)->skip += 2;
>>>  		start_backtrace(&frame,
>>>  				(unsigned long)__builtin_frame_address(0),
>>>  				(unsigned long)arch_stack_walk);
>>> -	else
>>> +	} else
>>>  		start_backtrace(&frame, thread_saved_fp(task),
>>>  				thread_saved_pc(task));
>>
>> I don't like abusing the cookie here. It's void * as it's meant to be an
>> opaque type. I'd rather skip the first two frames in walk_stackframe()
>> instead before invoking fn().
>
> I agree that we shouldn't touch cookie here.
>
> I don't think that it's right to bodge this inside walk_stackframe(),
> since that'll add bogus skipping for the case starting with regs in the
> current task. If we need a bodge, it has to live in arch_stack_walk()
> where we set up the initial unwinding state.
>
> In another thread, we came to the conclusion that arch_stack_walk()
> should start at its parent, and its parent should add any skipping it
> requires.
>
> Currently, arch_stack_walk() is off-by-one, and we can bodge that by
> using __builtin_frame_address(1), though I'm waiting for some compiler
> folk to confirm that's sound. Otherwise we need to add an assembly
> trampoline to snapshot the FP, which is unfortunately convoluted.
>
> This report suggests that a caller of arch_stack_walk() is off-by-one
> too, which suggests a larger cross-architecture semantic issue. I'll try
> to take a look tomorrow.
>
> Thanks,
> Mark.
>
>>
>> Prior to the conversion to ARCH_STACKWALK, we were indeed skipping two
>> more entries in __save_stack_trace() if tsk == current. Something like
>> below, completely untested:
>>
>> diff --git a/arch/arm64/kernel/stacktrace.c b/arch/arm64/kernel/stacktrace.c
>> index ad20981dfda4..2a9f759aa41a 100644
>> --- a/arch/arm64/kernel/stacktrace.c
>> +++ b/arch/arm64/kernel/stacktrace.c
>> @@ -115,10 +115,15 @@ NOKPROBE_SYMBOL(unwind_frame);
>>  void notrace walk_stackframe(struct task_struct *tsk, struct stackframe *frame,
>>  			     bool (*fn)(void *, unsigned long), void *data)
>>  {
>> +	/* for the current task, we don't want this function nor its caller */
>> +	int skip = tsk == current ? 2 : 0;
>> +
>>  	while (1) {
>> 		int ret;
>>
>> -		if (!fn(data, frame->pc))
>> +		if (skip)
>> +			skip--;
>> +		else if (!fn(data, frame->pc))
>>  			break;
>>  		ret = unwind_frame(tsk, frame);
>>  		if (ret < 0)
>>
>>
>> --
>> Catalin

This change will make kmemleak broken.
Maybe the reason is what Mark pointed out. I will try to check it out.

--
Regards
Chen Jun
Re: [question] Panic in dax_writeback_one
Thanks for the advice. I will check that out.

On 2021/3/17 12:55, Dan Williams wrote:
> On Tue, Mar 16, 2021 at 8:00 PM chenjun (AM) wrote:
>>
>> On 2021/3/12 1:25, Dan Williams wrote:
>>> On Thu, Mar 11, 2021 at 4:20 AM Matthew Wilcox wrote:
>>>>
>>>> On Thu, Mar 11, 2021 at 07:48:25AM +0000, chenjun (AM) wrote:
>>>>> static int dax_writeback_one(struct xa_state *xas, struct dax_device
>>>>> *dax_dev, struct address_space *mapping, void *entry)
>>>>> 	dax_flush(dax_dev, page_address(pfn_to_page(pfn)), count * PAGE_SIZE);
>>>>> The pfn is returned by the driver. In my case, the pfn does not have
>>>>> a struct page, so pfn_to_page(pfn) returns a wrong address.
>>>>
>>>> I wasn't involved, but I think the right solution here is simply to
>>>> replace page_address(pfn_to_page(pfn)) with pfn_to_virt(pfn). I don't
>>>> know why Dan decided to do this in the more complicated way.
>>>
>>> pfn_to_virt() only works for the direct-map. If pages are not mapped I
>>> don't see how pfn_to_virt() is expected to work.
>>>
>>> The real question Chenjun is why are you writing a new simulator of
>>> memory as a block-device vs reusing the pmem driver or brd?
>>>
>>
>> Hi Dan
>>
>> In my case, I do not want to spend memory to create the struct page of
>> the memory my driver uses.
>
> There are efforts happening to drastically reduce that overhead. You
> might want to check out Joao's work [1]. I think that direction holds
> more promise than trying to extend FS_DAX_LIMITED.
>
> [1]: http://lore.kernel.org/r/20201208172901.17384-1-joao.m.mart...@oracle.com
>
>> And, I think this is also a problem for DCSSBLK.
>
> If I understand correctly DAX replaced XIP for S390. There have not
> been reports about this problem, and I can only guess because XIP
> (eXecute-In-Place) is a read-only use case where dax_writeback_one()
> is never triggered, or S390 just isn't using DCSSBLK anymore. The last
> time I touched FS_DAX_LIMITED the DCSSBLK maintainers offered to just
> delete this driver to get it out of the way.
>
>>
>> So I want to go back to the older way if CONFIG_FS_DAX_LIMITED:
>>
>> diff --git a/fs/dax.c b/fs/dax.c
>> index b3d27fd..6395e84 100644
>> --- a/fs/dax.c
>> +++ b/fs/dax.c
>> @@ -867,6 +867,9 @@ static int dax_writeback_one(struct xa_state *xas, struct dax_device *dax_dev,
>>  {
>>  	unsigned long pfn, index, count;
>>  	long ret = 0;
>> +	void *kaddr;
>> +	pfn_t new_pfn_t;
>> +	pgoff_t pgoff;
>>
>>  	/*
>>  	 * A page got tagged dirty in DAX mapping? Something is seriously
>> @@ -926,7 +929,25 @@ static int dax_writeback_one(struct xa_state *xas, struct dax_device *dax_dev,
>>  	index = xas->xa_index & ~(count - 1);
>>
>>  	dax_entry_mkclean(mapping, index, pfn);
>> -	dax_flush(dax_dev, page_address(pfn_to_page(pfn)), count * PAGE_SIZE);
>> +
>> +	if (!IS_ENABLED(CONFIG_FS_DAX_LIMITED) || pfn_valid(pfn))
>> +		kaddr = page_address(pfn_to_page(pfn));
>> +	else {
>> +		ret = bdev_dax_pgoff(mapping->host->i_sb->s_bdev, pfn << PFN_SECTION_SHIFT, count << PAGE_SHIFT, &pgoff);
>
> This is broken:
>
>     mapping->host->i_sb->s_bdev
>
> ...there is no guarantee that the superblock associated with the
> mapping is hosted on the same block device associated with the passed
> in dax_device. See dax_rtdev in xfs_open_devices().

--
Regards
Chen Jun
Re: [question] Panic in dax_writeback_one
On 2021/3/12 1:25, Dan Williams wrote:
> On Thu, Mar 11, 2021 at 4:20 AM Matthew Wilcox wrote:
>>
>> On Thu, Mar 11, 2021 at 07:48:25AM +0000, chenjun (AM) wrote:
>>> static int dax_writeback_one(struct xa_state *xas, struct dax_device
>>> *dax_dev, struct address_space *mapping, void *entry)
>>> 	dax_flush(dax_dev, page_address(pfn_to_page(pfn)), count * PAGE_SIZE);
>>> The pfn is returned by the driver. In my case, the pfn does not have
>>> a struct page, so pfn_to_page(pfn) returns a wrong address.
>>
>> I wasn't involved, but I think the right solution here is simply to
>> replace page_address(pfn_to_page(pfn)) with pfn_to_virt(pfn). I don't
>> know why Dan decided to do this in the more complicated way.
>
> pfn_to_virt() only works for the direct-map. If pages are not mapped I
> don't see how pfn_to_virt() is expected to work.
>
> The real question Chenjun is why are you writing a new simulator of
> memory as a block-device vs reusing the pmem driver or brd?
>

Hi Dan

In my case, I do not want to spend memory to create the struct page of the
memory my driver uses. And, I think this is also a problem for DCSSBLK.

So I want to go back to the older way if CONFIG_FS_DAX_LIMITED:

diff --git a/fs/dax.c b/fs/dax.c
index b3d27fd..6395e84 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -867,6 +867,9 @@ static int dax_writeback_one(struct xa_state *xas, struct dax_device *dax_dev,
 {
 	unsigned long pfn, index, count;
 	long ret = 0;
+	void *kaddr;
+	pfn_t new_pfn_t;
+	pgoff_t pgoff;
 
 	/*
 	 * A page got tagged dirty in DAX mapping? Something is seriously
@@ -926,7 +929,25 @@ static int dax_writeback_one(struct xa_state *xas, struct dax_device *dax_dev,
 	index = xas->xa_index & ~(count - 1);
 
 	dax_entry_mkclean(mapping, index, pfn);
-	dax_flush(dax_dev, page_address(pfn_to_page(pfn)), count * PAGE_SIZE);
+
+	if (!IS_ENABLED(CONFIG_FS_DAX_LIMITED) || pfn_valid(pfn))
+		kaddr = page_address(pfn_to_page(pfn));
+	else {
+		ret = bdev_dax_pgoff(mapping->host->i_sb->s_bdev, pfn << PFN_SECTION_SHIFT, count << PAGE_SHIFT, &pgoff);
+		if (ret)
+			goto put_unlocked;
+
+		ret = dax_direct_access(dax_dev, pgoff, count, &kaddr, &new_pfn_t);
+		if (ret < 0)
+			goto put_unlocked;
+
+		if (WARN_ON_ONCE(ret < count) || WARN_ON_ONCE(pfn_t_to_pfn(new_pfn_t) != pfn)) {
+			ret = -EIO;
+			goto put_unlocked;
+		}
+	}
+
+	dax_flush(dax_dev, kaddr, count * PAGE_SIZE);
 	/*
 	 * After we have flushed the cache, we can clear the dirty tag. There
 	 * cannot be new dirty data in the pfn after the flush has completed as
--

--
Regards
Chen Jun
Re: [question] Panic in dax_writeback_one
On 2021/3/11 20:20, Matthew Wilcox wrote:
> On Thu, Mar 11, 2021 at 07:48:25AM +0000, chenjun (AM) wrote:
>> static int dax_writeback_one(struct xa_state *xas, struct dax_device
>> *dax_dev, struct address_space *mapping, void *entry)
>> 	dax_flush(dax_dev, page_address(pfn_to_page(pfn)), count * PAGE_SIZE);
>> The pfn is returned by the driver. In my case, the pfn does not have
>> a struct page, so pfn_to_page(pfn) returns a wrong address.
>
> I wasn't involved, but I think the right solution here is simply to
> replace page_address(pfn_to_page(pfn)) with pfn_to_virt(pfn). I don't
> know why Dan decided to do this in the more complicated way.
>

Thanks Matthew

I think pfn_to_virt() cannot work in my case, because the pfn is not in a
memory block.

--
Regards
Chen Jun
[question] Panic in dax_writeback_one
Hi

I wrote a driver that simulates memory as a block device (like a ramdisk),
and I hope the memory it uses will not be observed by the kernel, because the
struct page entries would take a lot of memory.

When I mount ext2 with dax on my ramdisk, a panic happens on fsync:

Call trace:
 dax_writeback_one+0x330/0x3e4
 dax_writeback_mapping_range+0x15c/0x318
 ext2_dax_writepages+0x38/0x44

static int dax_writeback_one(struct xa_state *xas, struct dax_device *dax_dev,
		struct address_space *mapping, void *entry)
	dax_flush(dax_dev, page_address(pfn_to_page(pfn)), count * PAGE_SIZE);

The pfn is returned by the driver. In my case, the pfn does not have a struct
page, so pfn_to_page(pfn) returns a wrong address.

I noticed the following changes may be related to my problem:

1. Before 3fe0791c295cfd3cd735de7a32cc0780949c009f ("dax: store pfns in the
   radix"), the radix tree stored sectors; the address to be flushed was
   obtained from the driver by passing the sector. That commit assumes that
   all pfns have a struct page.

2. CONFIG_FS_DAX_LIMITED (selected by DCSSBLK [=n] && BLK_DEV [=y] && S390 &&
   BLOCK [=y]) was added to avoid accessing the struct page of a pfn.

Does anyone have any idea about my problem?

--
Regards
Chen Jun
[Report] Some compile warnings on ppc dts
Hi

After running the following commands:

  make distclean
  make allmodconfig ARCH=powerpc CROSS_COMPILE=powerpc64-linux-gnu-
  make oldconfig ARCH=powerpc CROSS_COMPILE=powerpc64-linux-gnu-
  make -j64 ARCH=powerpc CROSS_COMPILE=powerpc64-linux-gnu-

I get some warnings:

arch/powerpc/boot/dts/mpc5200b.dtsi:267.20-280.4: Warning (pci_bridge): /pci@fd00: missing ranges for PCI bridge (or not a bridge)
arch/powerpc/boot/dts/o2dnt2.dtb: Warning (pci_device_bus_num): Failed prerequisite 'pci_bridge'
arch/powerpc/boot/dts/mpc5200b.dtsi:182.18-186.5: Warning (spi_bus_bridge): /soc5200@f000/psc@2000: node name for SPI buses should be 'spi'
  also defined at arch/powerpc/boot/dts/o2d.dtsi:32.12-43.5
arch/powerpc/boot/dts/o2dnt2.dtb: Warning (spi_bus_reg): Failed prerequisite 'spi_bus_bridge'
...

For the problem about "node name for SPI buses should be 'spi'":
renaming psc@2000 to spi@2000 in arch/powerpc/boot/dts/o2d.dtsi can fix it.

diff --git a/arch/powerpc/boot/dts/o2d.dtsi b/arch/powerpc/boot/dts/o2d.dtsi
index 6661955a2be4..cd3dc70cd72e 100644
--- a/arch/powerpc/boot/dts/o2d.dtsi
+++ b/arch/powerpc/boot/dts/o2d.dtsi
@@ -29,7 +29,7 @@ rtc@800 {
 			status = "disabled";
 		};
 
-		psc@2000 {		// PSC1
+		spi@2000 {		// PSC1
 			compatible = "fsl,mpc5200b-psc-spi","fsl,mpc5200-psc-spi";
 			#address-cells = <1>;
 			#size-cells = <0>;
---

For the problem about "missing ranges for PCI bridge (or not a bridge)":
ranges should be added in arch/powerpc/boot/dts/mpc5200b.dtsi.

	pci: pci@fd00 {
		#interrupt-cells = <1>;
		#size-cells = <2>;
		#address-cells = <3>;
		device_type = "pci";
		compatible = "fsl,mpc5200b-pci","fsl,mpc5200-pci";
		reg = <0xfd00 0x100>;
		// interrupt-map-mask = need to add
		// interrupt-map = need to add
		clock-frequency = <0>; // From boot loader
		interrupts = <2 8 0 2 9 0 2 10 0>;
		bus-range = <0 0>;
		// ranges = need to add
	};

I think the ranges should be added by someone who knows the mpc5200 better.
Can anyone fix this?

--
Regards
Chen Jun
Re: Re: [PATCH] recordmcount: use w8 to read relp->r_info in arm64_is_fake_mcount
On 2021/2/24 6:26, Steven Rostedt wrote:
> On Mon, 22 Feb 2021 15:50:19 +0000
> "chenjun (AM)" wrote:
>
>> Hi
>>
>> There is no problem when I use aarch64_be (gcc v10.2), because gcc v10.2
>> uses __patchable_function_entries instead of __mcount. I am not sure
>> when the change happened.
>
> But you are still saying that this patch needs to be applied, right?
>
> -- Steve
>

Yes, this patch needs to be applied.

--
Regards
Chen Jun
Re: [PATCH] recordmcount: use w8 to read relp->r_info in arm64_is_fake_mcount
Hi

There is no problem when I use aarch64_be (gcc v10.2), because gcc v10.2 uses
__patchable_function_entries instead of __mcount. I am not sure when the
change happened.

Regards,
Chen Jun

-----Original Message-----
From: Chen Jun [mailto:chenjun...@huawei.com]
Sent: 2021-02-22 21:59
To: linux-kernel@vger.kernel.org
Cc: gregory.herr...@oracle.com; catalin.mari...@arm.com; rost...@goodmis.org; Xiangrui (Euler)
Subject: [PATCH] recordmcount: use w8 to read relp->r_info in arm64_is_fake_mcount

On a little endian system, use aarch64_be (gcc v7.3) downloaded from
linaro.org to build an image with CONFIG_CPU_BIG_ENDIAN=y, CONFIG_FTRACE=y,
CONFIG_DYNAMIC_FTRACE=y. gcc will create symbols of _mcount, but recordmcount
cannot create mcount_loc for *.o.

aarch64_be-linux-gnu-objdump -r fs/namei.o | grep mcount
00d0 R_AARCH64_CALL26  _mcount
...
7190 R_AARCH64_CALL26  _mcount

The reason is that the function arm64_is_fake_mcount cannot work correctly.
A symbol of _mcount in *.o compiled with a big endian compiler looks like:

  00 00 00 2d 00 00 01 1b

w(rp->r_info) will return 0x2d instead of 0x011b, because w() takes uint32_t
as parameter, which truncates rp->r_info.

Use w8() instead of w() to read relp->r_info.

Fixes: ea0eada45632 ("recordmcount: only record relocation of type R_AARCH64_CALL26 on arm64.")
Signed-off-by: Chen Jun
---
 scripts/recordmcount.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/scripts/recordmcount.c b/scripts/recordmcount.c
index b9c2ee7ab43f..cce12e1971d8 100644
--- a/scripts/recordmcount.c
+++ b/scripts/recordmcount.c
@@ -438,7 +438,7 @@ static int arm_is_fake_mcount(Elf32_Rel const *rp)
 static int arm64_is_fake_mcount(Elf64_Rel const *rp)
 {
-	return ELF64_R_TYPE(w(rp->r_info)) != R_AARCH64_CALL26;
+	return ELF64_R_TYPE(w8(rp->r_info)) != R_AARCH64_CALL26;
 }
 
 /* 64-bit EM_MIPS has weird ELF64_Rela.r_info.
--
2.25.0
Re: [PATCH -next 3/5] mm/kmemleak: Add support for percpu memory leak detect
Hi Catalin

Thanks for your opinions.

On 2020/9/22 17:58, Catalin Marinas wrote:
> On Mon, Sep 21, 2020 at 02:00:05AM +0000, Chen Jun wrote:
>> From: Wei Yongjun
>>
>> Currently the reporting of percpu chunk leaks is not supported.
>> This patch introduces that function.
>>
>> Since a __percpu pointer does not point directly to the actual chunks,
>> this patch creates an object for the __percpu pointer, but marks it as a
>> no-scan block; we only check whether this pointer is referenced by other
>> blocks.
>
> OK, so you wanted NO_SCAN to not touch the block at all, not even update
> the checksum. Maybe better add a new flag, NO_ACCESS (and we could use
> it to track ioremap leaks, it's been on my wishlist for years).
>

I will add a new OBJECT_NO_ACCESS. The checksum of the object will not be
updated and its memory block will not be scanned if the object is marked
with OBJECT_NO_ACCESS.

>> diff --git a/mm/kmemleak.c b/mm/kmemleak.c
>> index c09c6b59eda6..feedb72f06f2 100644
>> --- a/mm/kmemleak.c
>> +++ b/mm/kmemleak.c
>> @@ -283,6 +288,9 @@ static void hex_dump_object(struct seq_file *seq,
>>  	const u8 *ptr = (const u8 *)object->pointer;
>>  	size_t len;
>>
>> +	if (object->flags & OBJECT_PERCPU)
>> +		ptr = this_cpu_ptr((void __percpu *)object->pointer);
>
> You may want to print the CPU number as well since the information is
> likely different on another CPU. Also, I think this context is
> preemptable, so it's better with a get_cpu()/put_cpu().
>

I will print the CPU number when dumping the percpu object.

>> @@ -651,6 +672,19 @@ static void create_object(unsigned long ptr, size_t size, int min_count,
>>  	raw_spin_unlock_irqrestore(&kmemleak_lock, flags);
>>  }
>>
>> +static void create_object(unsigned long ptr, size_t size, int min_count,
>> +			  gfp_t gfp)
>> +{
>> +	__create_object(ptr, size, min_count, 0, gfp);
>> +}
>> +
>> +static void create_object_percpu(unsigned long ptr, size_t size, int min_count,
>> +				 gfp_t gfp)
>> +{
>> +	__create_object(ptr, size, min_count, OBJECT_PERCPU | OBJECT_NO_SCAN,
>> +			gfp);
>> +}
>> +
>>  /*
>>   * Mark the object as not allocated and schedule RCU freeing via put_object().
>>   */
>> @@ -912,10 +946,12 @@ void __ref kmemleak_alloc_percpu(const void __percpu *ptr, size_t size,
>>  	 * Percpu allocations are only scanned and not reported as leaks
>>  	 * (min_count is set to 0).
>>  	 */
>> -	if (kmemleak_enabled && ptr && !IS_ERR(ptr))
>> +	if (kmemleak_enabled && ptr && !IS_ERR(ptr)) {
>>  		for_each_possible_cpu(cpu)
>>  			create_object((unsigned long)per_cpu_ptr(ptr, cpu),
>>  				      size, 0, gfp);
>> +		create_object_percpu((unsigned long)ptr, size, 1, gfp);
>> +	}
>>  }
>
> A concern I have here is that ptr may overlap with an existing object
> and the insertion in the rb tree will fail. For example, with !SMP,
> ptr == per_cpu_ptr(ptr, 0), so create_object() will fail and kmemleak
> gets disabled.
>
> An option would be to figure out how to allow overlapping ranges with
> the rb tree (or find a replacement for it if that is not possible).
>
> Another option would be to have an additional structure to track the
> __percpu pointers since they have their own range. If size is not
> relevant, maybe go for an xarray, otherwise another rb tree (do we have
> any instance of pointers referring to some inner member of a __percpu
> object?). The scan_object() function will have to search two trees.
>

I would like to use CONFIG_SMP to separate the code:

If SMP, we will create objects for per_cpu_ptr(ptr, cpu) and an object with
OBJECT_NO_ACCESS for ptr.

If !SMP, we will not create objects for per_cpu_ptr(ptr, cpu), but an object
without OBJECT_NO_ACCESS for ptr will be created.

What do you think about this opinion?

Waiting for your reply.

Best wishes
Jun