Re: [PATCH] time: Fix overwrite err unexpected in clock_adjtime32

2021-04-12 Thread chenjun (AM)
在 2021/4/12 23:58, Richard Cochran 写道:
> On Mon, Apr 12, 2021 at 02:52:11PM +0000, chenjun (AM) wrote:
>> 在 2021/4/12 22:20, Richard Cochran 写道:
>>> On Mon, Apr 12, 2021 at 12:45:51PM +, Chen Jun wrote:
>>>> the correct error is covered by put_old_timex32.
>>>
>>> Well, the non-negative return code (TIME_OK, TIME_INS, etc) is
>>> clobbered by put_old_timex32().
>>>
>>>> Fixes: f1f1d5ebd10f ("posix-timers: Introduce a syscall for clock tuning.")
>>>
>>> This is not the correct commit for the "Fixes" tag.  Please find the
>>> actual commit that introduced the issue.
>>>
>>> In commit f1f1d5ebd10f the code looked like this...
>>>
>>> long compat_sys_clock_adjtime(clockid_t which_clock,
>>> struct compat_timex __user *utp)
>>> {
>>> struct timex txc;
>>> mm_segment_t oldfs;
>>> int err, ret;
>>> 
>>> err = compat_get_timex(, utp);
>>> if (err)
>>> return err;
>>> 
>>> oldfs = get_fs();
>>> set_fs(KERNEL_DS);
>>> ret = sys_clock_adjtime(which_clock, (struct timex __user *) 
>>> );
>>> set_fs(oldfs);
>>> 
>>> err = compat_put_timex(utp, );
>>> if (err)
>>> return err;
>>> 
>>> return ret;
>>> }
> 
> Look at the code ^^^
> 
>> The implement of clock_adjtime32 is similar to compat_sys_clock_adjtime.
>> And I think f1f1d5ebd10 introduced the problem actually.
> 
> See how 'ret' and 'err' are two separate variables?  It makes a difference.
> 
> Thanks,
> Richard
> 

Oh, yee.. Very thanks.

3a4d44b616 ("ntp: Move adjtimex related compat syscalls to native 
counterparts") made the change.

-- 
Regards
Chen Jun


Re: [PATCH] time: Fix overwrite err unexpected in clock_adjtime32

2021-04-12 Thread chenjun (AM)
在 2021/4/12 22:20, Richard Cochran 写道:
> On Mon, Apr 12, 2021 at 12:45:51PM +, Chen Jun wrote:
>> the correct error is covered by put_old_timex32.
> 
> Well, the non-negative return code (TIME_OK, TIME_INS, etc) is
> clobbered by put_old_timex32().
> 
>> Fixes: f1f1d5ebd10f ("posix-timers: Introduce a syscall for clock tuning.")
> 
> This is not the correct commit for the "Fixes" tag.  Please find the
> actual commit that introduced the issue.
> 
> In commit f1f1d5ebd10f the code looked like this...
> 
>   long compat_sys_clock_adjtime(clockid_t which_clock,
>   struct compat_timex __user *utp)
>   {
>   struct timex txc;
>   mm_segment_t oldfs;
>   int err, ret;
>   
>   err = compat_get_timex(, utp);
>   if (err)
>   return err;
>   
>   oldfs = get_fs();
>   set_fs(KERNEL_DS);
>   ret = sys_clock_adjtime(which_clock, (struct timex __user *) 
> );
>   set_fs(oldfs);
>   
>   err = compat_put_timex(utp, );
>   if (err)
>   return err;
>   
>   return ret;
>   }
> 

f1f1d5ebd10:
Introduce compat_sys_clock_adjtime

62a6fa97684:
rename compat_sys_clock_adjtime to COMPAT_SYSCALL_DEFINE2(clock_adjtime

3a4d44b6162:
move COMPAT_SYSCALL_DEFINE2(clock_adjtime from kernel/compat.c to 
kernel/time/posix-timers.c

8dabe7245bb:
COMPAT_SYSCALL_DEFINE2(clock_adjtime, ..
-> SYSCALL_DEFINE2(clock_adjtime32, ..

The implement of clock_adjtime32 is similar to compat_sys_clock_adjtime. 
And I think f1f1d5ebd10 introduced the problem actually.

> Thanks,
> Richard
> 
> 
> 
>> Signed-off-by: Chen Jun 
>> ---
>>   kernel/time/posix-timers.c | 4 ++--
>>   1 file changed, 2 insertions(+), 2 deletions(-)
>>
>> diff --git a/kernel/time/posix-timers.c b/kernel/time/posix-timers.c
>> index bf540f5a..dd5697d 100644
>> --- a/kernel/time/posix-timers.c
>> +++ b/kernel/time/posix-timers.c
>> @@ -1191,8 +1191,8 @@ SYSCALL_DEFINE2(clock_adjtime32, clockid_t, 
>> which_clock,
>>   
>>  err = do_clock_adjtime(which_clock, );
>>   
>> -if (err >= 0)
>> -err = put_old_timex32(utp, );
>> +if (err >= 0 && put_old_timex32(utp, ))
>> +return -EFAULT;
>>   
>>  return err;
>>   }
>> -- 
>> 2.9.4
>>
> 


-- 
Regards
Chen Jun


Re: [question] insert ko failed because count_plts return 0 when CONFIG_RANDOMIZE_BASE is not set

2021-03-25 Thread chenjun (AM)
在 2021/3/24 16:29, Ard Biesheuvel 写道:
> On Wed, 24 Mar 2021 at 08:27, chenjun (AM)  wrote:
>>
>> Hi
>>
>> I make a Image for arm64 (without CONFIG_RANDOMIZE_BASE). And a ko (13M)
>> can not be inserted.
>>
> 
> How many large modules have you loaded already? The module region is
> only 128 MB, so if your modules are huge, you may run out of space.
> 
> Please check the kernel VA address and the load address of the module,
> and check whether they are more than 128 MB apart.
>

Thanks Ard

I will check it.

One more question, why is CONFIG_ARM64_MODULE_PLTS depended on 
CONFIG_RANDOMIZE_BASE?

> 
>> WARNING: CPU: 2 PID: 1998 at arch/arm64/kernel/module-plts.c:39
>> module_emit_plt_entry+0x100/0x118
>> ...
>> Call trace:
>> module_emit_plt_entry+0x100/0x118
>> apply_relocate_add+0x34c/0x570
>> ...
>>
>> I think the problem is that:
>> in apply_relocate_add:
>>case R_AARCH64_CALL26:
>>ovf = reloc_insn_imm(RELOC_OP_PREL, loc, val, 2, 26,
>> AARCH64_INSN_IMM_26);
>>
>>if (IS_ENABLED(CONFIG_ARM64_MODULE_PLTS) &&
>>ovf == -ERANGE) {
>>val = module_emit_plt_entry(me, sechdrs,
>> loc, [i], sym); realoc_insn_imm return -ERANGE (because the ko is
>> too big?)
>>
>> in module_emit_plt_entry:
>> WARN_ON(pltsec->plt_num_entries > pltsec->plt_max_entries)
>> pltsec->plt_max_entries is 0 if CONFIG_RANDOMIZE_BASE is not be set.
>>
>> a257e02 arm64/kernel: don't ban ADRP to work around Cortex-A53 erratum
>> #843419
>> static unsigned int count_plts(Elf64_Sym *syms, Elf64_Rela *rela, int
>> num,
>> -  Elf64_Word dstidx)
>> +  Elf64_Word dstidx, Elf_Shdr *dstsec)
>> {
>> ...
>>switch (ELF64_R_TYPE(rela[i].r_info)) {
>>case R_AARCH64_JUMP26:
>>case R_AARCH64_CALL26:
>> +   if (!IS_ENABLED(CONFIG_RANDOMIZE_BASE))
>> +   break;
>> +
>>
>> Why we need break if !IS_ENABLED(CONFIG_RANDOMIZE_BASE)? or any
>> restrictions on ko?
>>
>> I comment out this part of the code. the ko could be inserted, and seems
>> to work well. So is it a accepted way? or any solution for my case?
>>
>> --
>> Regards
>> Chen Jun
>>
> 


-- 
Regards
Chen Jun


[question] insert ko failed because count_plts return 0 when CONFIG_RANDOMIZE_BASE is not set

2021-03-24 Thread chenjun (AM)
Hi

I make a Image for arm64 (without CONFIG_RANDOMIZE_BASE). And a ko (13M) 
can not be inserted.

WARNING: CPU: 2 PID: 1998 at arch/arm64/kernel/module-plts.c:39
module_emit_plt_entry+0x100/0x118
..
Call trace:
module_emit_plt_entry+0x100/0x118
apply_relocate_add+0x34c/0x570
..

I think the problem is that:
in apply_relocate_add:
  case R_AARCH64_CALL26:
  ovf = reloc_insn_imm(RELOC_OP_PREL, loc, val, 2, 26,
   AARCH64_INSN_IMM_26);

  if (IS_ENABLED(CONFIG_ARM64_MODULE_PLTS) &&
  ovf == -ERANGE) {
  val = module_emit_plt_entry(me, sechdrs, 
loc, [i], sym); realoc_insn_imm return -ERANGE (because the ko is 
too big?)

in module_emit_plt_entry:
WARN_ON(pltsec->plt_num_entries > pltsec->plt_max_entries)
pltsec->plt_max_entries is 0 if CONFIG_RANDOMIZE_BASE is not be set.

a257e02 arm64/kernel: don't ban ADRP to work around Cortex-A53 erratum
#843419
   static unsigned int count_plts(Elf64_Sym *syms, Elf64_Rela *rela, int 
num,
-  Elf64_Word dstidx)
+  Elf64_Word dstidx, Elf_Shdr *dstsec)
   {
..
  switch (ELF64_R_TYPE(rela[i].r_info)) {
  case R_AARCH64_JUMP26:
  case R_AARCH64_CALL26:
+   if (!IS_ENABLED(CONFIG_RANDOMIZE_BASE))
+   break;
+

Why we need break if !IS_ENABLED(CONFIG_RANDOMIZE_BASE)? or any 
restrictions on ko?

I comment out this part of the code. the ko could be inserted, and seems 
to work well. So is it a accepted way? or any solution for my case?

--
Regards
Chen Jun



[RESEND PATCH] Kconfig: Move CONFIG_DEBUG_KMEMLEAK_TEST to samples/Kconfig

2021-03-18 Thread chenjun (AM)
From: Chen Jun 

commit 1abbef4f51724fb11f09adf0e75275f7cb422a8a
("mm,kmemleak-test.c: move kmemleak-test.c to samples dir")
make CONFIG_DEBUG_KMEMLEAK_TEST depend on CONFIG_SAMPLES implicitly.
And the dependency cannot be guaranteed by Kconfig.

move the definition of CONFIG_DEBUG_KMEMLEAK_TEST from lib/Kconfig.debug
to samples/Kconfig.

Signed-off-by: Chen Jun 
Acked-by: Catalin Marinas 
---

  lib/Kconfig.debug | 8 
  samples/Kconfig   | 8 
  2 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index 66d44d3..debacdc 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -678,14 +678,6 @@ config DEBUG_KMEMLEAK_MEM_POOL_SIZE
  fully initialised, this memory pool acts as an emergency one
  if slab allocations fail.

-config DEBUG_KMEMLEAK_TEST
-   tristate "Simple test for the kernel memory leak detector"
-   depends on DEBUG_KMEMLEAK && m
-   help
- This option enables a module that explicitly leaks memory.
-
- If unsure, say N.
-
  config DEBUG_KMEMLEAK_DEFAULT_OFF
bool "Default kmemleak to off"
depends on DEBUG_KMEMLEAK
diff --git a/samples/Kconfig b/samples/Kconfig
index 0ed6e4d..15978dd 100644
--- a/samples/Kconfig
+++ b/samples/Kconfig
@@ -216,4 +216,12 @@ config SAMPLE_WATCH_QUEUE
  Build example userspace program to use the new mount_notify(),
  sb_notify() syscalls and the KEYCTL_WATCH_KEY keyctl() function.

+config DEBUG_KMEMLEAK_TEST
+   tristate "Simple test for the kernel memory leak detector"
+   depends on DEBUG_KMEMLEAK && m
+   help
+ This option enables a module that explicitly leaks memory.
+
+ If unsure, say N.
+
  endif # SAMPLES
-- 
2.7.4



Re: [PATCH 2/2] arm64: stacktrace: Add skip when task == current

2021-03-18 Thread chenjun (AM)
在 2021/3/18 11:31, chenjun (AM) 写道:
> 在 2021/3/18 3:34, Mark Rutland 写道:
>> On Wed, Mar 17, 2021 at 06:36:36PM +, Catalin Marinas wrote:
>>> On Wed, Mar 17, 2021 at 02:20:50PM +, Chen Jun wrote:
>>>> On ARM64, cat /sys/kernel/debug/page_owner, all pages return the same
>>>> stack:
>>>>stack_trace_save+0x4c/0x78
>>>>register_early_stack+0x34/0x70
>>>>init_page_owner+0x34/0x230
>>>>page_ext_init+0x1bc/0x1dc
>>>>
>>>> The reason is that:
>>>> check_recursive_alloc always return 1 because that
>>>> entries[0] is always equal to ip (__set_page_owner+0x3c/0x60).
>>>>
>>>> The root cause is that:
>>>> commit 5fc57df2f6fd ("arm64: stacktrace: Convert to ARCH_STACKWALK")
>>>> make the save_trace save 2 more entries.
>>>>
>>>> Add skip in arch_stack_walk when task == current.
>>>>
>>>> Fixes: 5fc57df2f6fd ("arm64: stacktrace: Convert to ARCH_STACKWALK")
>>>> Signed-off-by: Chen Jun 
>>>> ---
>>>>arch/arm64/kernel/stacktrace.c | 5 +++--
>>>>1 file changed, 3 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/arch/arm64/kernel/stacktrace.c 
>>>> b/arch/arm64/kernel/stacktrace.c
>>>> index ad20981..c26b0ac 100644
>>>> --- a/arch/arm64/kernel/stacktrace.c
>>>> +++ b/arch/arm64/kernel/stacktrace.c
>>>> @@ -201,11 +201,12 @@ void arch_stack_walk(stack_trace_consume_fn 
>>>> consume_entry, void *cookie,
>>>>
>>>>if (regs)
>>>>start_backtrace(, regs->regs[29], regs->pc);
>>>> -  else if (task == current)
>>>> +  else if (task == current) {
>>>> +  ((struct stacktrace_cookie *)cookie)->skip += 2;
>>>>start_backtrace(,
>>>>(unsigned 
>>>> long)__builtin_frame_address(0),
>>>>(unsigned long)arch_stack_walk);
>>>> -  else
>>>> +  } else
>>>>start_backtrace(, thread_saved_fp(task),
>>>>thread_saved_pc(task));
>>>
>>> I don't like abusing the cookie here. It's void * as it's meant to be an
>>> opaque type. I'd rather skip the first two frames in walk_stackframe()
>>> instead before invoking fn().
>>
>> I agree that we shouldn't touch cookie here.
>>
>> I don't think that it's right to bodge this inside walk_stackframe(),
>> since that'll add bogus skipping for the case starting with regs in the
>> current task. If we need a bodge, it has to live in arch_stack_walk()
>> where we set up the initial unwinding state.
>>
>> In another thread, we came to the conclusion that arch_stack_walk()
>> should start at its parent, and its parent should add any skipping it
>> requires.
>>
>> Currently, arch_stack_walk() is off-by-one, and we can bodge that by
>> using __builtin_frame_address(1), though I'm waiting for some compiler
>> folk to confirm that's sound. Otherwise we need to add an assembly
>> trampoline to snapshot the FP, which is unfortunastely convoluted.
>>
>> This report suggests that a caller of arch_stack_walk() is off-by-one
>> too, which suggests a larger cross-architecture semantic issue. I'll try
>> to take a look tomorrow.
>>
>> Thanks,
>> Mark.
>>
>>>
>>> Prior to the conversion to ARCH_STACKWALK, we were indeed skipping two
>>> more entries in __save_stack_trace() if tsk == current. Something like
>>> below, completely untested:
>>>
>>> diff --git a/arch/arm64/kernel/stacktrace.c b/arch/arm64/kernel/stacktrace.c
>>> index ad20981dfda4..2a9f759aa41a 100644
>>> --- a/arch/arm64/kernel/stacktrace.c
>>> +++ b/arch/arm64/kernel/stacktrace.c
>>> @@ -115,10 +115,15 @@ NOKPROBE_SYMBOL(unwind_frame);
>>>void notrace walk_stackframe(struct task_struct *tsk, struct stackframe 
>>> *frame,
>>>  bool (*fn)(void *, unsigned long), void *data)
>>>{
>>> +   /* for the current task, we don't want this function nor its caller */
>>> +   int skip = tsk == current ? 2 : 0;
>>> +
>>> while (1) {
>>> int ret;
>>>
>>> -   if (!fn(data, frame->pc))
>>> +   if (skip)
>>> +   skip--;
>>> +   else if (!fn(data, frame->pc))
>>> break;
>>> ret = unwind_frame(tsk, frame);
>>> if (ret < 0)
>>>
>>>
>>> -- 
>>> Catalin
>>
> 
> This change will make kmemleak broken.
> Maybe the reason is what Mark pointed out. I will try to check out.
> 

I make a mistake. kmemleak seems to work good. I will do more tests.

-- 
Regards
Chen Jun


Re: [PATCH 2/2] arm64: stacktrace: Add skip when task == current

2021-03-17 Thread chenjun (AM)
在 2021/3/18 3:34, Mark Rutland 写道:
> On Wed, Mar 17, 2021 at 06:36:36PM +, Catalin Marinas wrote:
>> On Wed, Mar 17, 2021 at 02:20:50PM +, Chen Jun wrote:
>>> On ARM64, cat /sys/kernel/debug/page_owner, all pages return the same
>>> stack:
>>>   stack_trace_save+0x4c/0x78
>>>   register_early_stack+0x34/0x70
>>>   init_page_owner+0x34/0x230
>>>   page_ext_init+0x1bc/0x1dc
>>>
>>> The reason is that:
>>> check_recursive_alloc always return 1 because that
>>> entries[0] is always equal to ip (__set_page_owner+0x3c/0x60).
>>>
>>> The root cause is that:
>>> commit 5fc57df2f6fd ("arm64: stacktrace: Convert to ARCH_STACKWALK")
>>> make the save_trace save 2 more entries.
>>>
>>> Add skip in arch_stack_walk when task == current.
>>>
>>> Fixes: 5fc57df2f6fd ("arm64: stacktrace: Convert to ARCH_STACKWALK")
>>> Signed-off-by: Chen Jun 
>>> ---
>>>   arch/arm64/kernel/stacktrace.c | 5 +++--
>>>   1 file changed, 3 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/arch/arm64/kernel/stacktrace.c b/arch/arm64/kernel/stacktrace.c
>>> index ad20981..c26b0ac 100644
>>> --- a/arch/arm64/kernel/stacktrace.c
>>> +++ b/arch/arm64/kernel/stacktrace.c
>>> @@ -201,11 +201,12 @@ void arch_stack_walk(stack_trace_consume_fn 
>>> consume_entry, void *cookie,
>>>   
>>> if (regs)
>>> start_backtrace(, regs->regs[29], regs->pc);
>>> -   else if (task == current)
>>> +   else if (task == current) {
>>> +   ((struct stacktrace_cookie *)cookie)->skip += 2;
>>> start_backtrace(,
>>> (unsigned long)__builtin_frame_address(0),
>>> (unsigned long)arch_stack_walk);
>>> -   else
>>> +   } else
>>> start_backtrace(, thread_saved_fp(task),
>>> thread_saved_pc(task));
>>
>> I don't like abusing the cookie here. It's void * as it's meant to be an
>> opaque type. I'd rather skip the first two frames in walk_stackframe()
>> instead before invoking fn().
> 
> I agree that we shouldn't touch cookie here.
> 
> I don't think that it's right to bodge this inside walk_stackframe(),
> since that'll add bogus skipping for the case starting with regs in the
> current task. If we need a bodge, it has to live in arch_stack_walk()
> where we set up the initial unwinding state.
> 
> In another thread, we came to the conclusion that arch_stack_walk()
> should start at its parent, and its parent should add any skipping it
> requires.
> 
> Currently, arch_stack_walk() is off-by-one, and we can bodge that by
> using __builtin_frame_address(1), though I'm waiting for some compiler
> folk to confirm that's sound. Otherwise we need to add an assembly
> trampoline to snapshot the FP, which is unfortunastely convoluted.
> 
> This report suggests that a caller of arch_stack_walk() is off-by-one
> too, which suggests a larger cross-architecture semantic issue. I'll try
> to take a look tomorrow.
> 
> Thanks,
> Mark.
> 
>>
>> Prior to the conversion to ARCH_STACKWALK, we were indeed skipping two
>> more entries in __save_stack_trace() if tsk == current. Something like
>> below, completely untested:
>>
>> diff --git a/arch/arm64/kernel/stacktrace.c b/arch/arm64/kernel/stacktrace.c
>> index ad20981dfda4..2a9f759aa41a 100644
>> --- a/arch/arm64/kernel/stacktrace.c
>> +++ b/arch/arm64/kernel/stacktrace.c
>> @@ -115,10 +115,15 @@ NOKPROBE_SYMBOL(unwind_frame);
>>   void notrace walk_stackframe(struct task_struct *tsk, struct stackframe 
>> *frame,
>>   bool (*fn)(void *, unsigned long), void *data)
>>   {
>> +/* for the current task, we don't want this function nor its caller */
>> +int skip = tsk == current ? 2 : 0;
>> +
>>  while (1) {
>>  int ret;
>>   
>> -if (!fn(data, frame->pc))
>> +if (skip)
>> +skip--;
>> +else if (!fn(data, frame->pc))
>>  break;
>>  ret = unwind_frame(tsk, frame);
>>  if (ret < 0)
>>
>>
>> -- 
>> Catalin
> 

This change will make kmemleak broken.
Maybe the reason is what Mark pointed out. I will try to check out.

-- 
Regards
Chen Jun


Re: [question] Panic in dax_writeback_one

2021-03-17 Thread chenjun (AM)
Thanks for the advices. I will check out that.

在 2021/3/17 12:55, Dan Williams 写道:
> On Tue, Mar 16, 2021 at 8:00 PM chenjun (AM)  wrote:
>>
>> 在 2021/3/12 1:25, Dan Williams 写道:
>>> On Thu, Mar 11, 2021 at 4:20 AM Matthew Wilcox  wrote:
>>>>
>>>> On Thu, Mar 11, 2021 at 07:48:25AM +, chenjun (AM) wrote:
>>>>> static int dax_writeback_one(struct xa_state *xas, struct dax_device
>>>>> *dax_dev, struct address_space *mapping, void *entry)
>>>>> dax_flush(dax_dev, page_address(pfn_to_page(pfn)), count * PAGE_SIZE);
>>>>> The pfn is returned by the driver. In my case, the pfn does not have
>>>>> struct page. so pfn_to_page(pfn) return a wrong address.
>>>>
>>>> I wasn't involved, but I think the right solution here is simply to
>>>> replace page_address(pfn_to_page(pfn)) with pfn_to_virt(pfn).  I don't
>>>> know why Dan decided to do this in the more complicated way.
>>>
>>> pfn_to_virt() only works for the direct-map. If pages are not mapped I
>>> don't see how pfn_to_virt() is expected to work.
>>>
>>> The real question Chenjun is why are you writing a new simulator of
>>> memory as a block-device vs reusing the pmem driver or brd?
>>>
>>
>> Hi Dan
>>
>> In my case, I do not want to take memory to create the struct page of
>> the memory my driver used.
> 
> There are efforts happening to drastically reduce that overhead. You
> might want to check out Joao's work [1]. I think that direction holds
> more promise than trying to extend FS_DAX_LIMITED.
> 
> [1]: http://lore.kernel.org/r/20201208172901.17384-1-joao.m.mart...@oracle.com
> 
>> And, I think this is also a problem for DCSSBLK.
> 
> If I understand correctly DAX replaced XIP for S390. There have not
> been reports about this problem, and I can only guess because XIP
> (eXecute-In-Place) is a read-only use case where dax_writeback_one()
> is never triggered, or S390 just isn't using DCSSBLK anymore. The last
> time I touched FS_DAX_LIMITED the DCSSBLK maintainers offered to just
> delete this driver to get it out of the way.
> 
>>
>> So I want to go back the older way if CONFIG_FS_DAX_LIMITED
>>
>> diff --git a/fs/dax.c b/fs/dax.c
>> index b3d27fd..6395e84 100644
>> --- a/fs/dax.c
>> +++ b/fs/dax.c
>> @@ -867,6 +867,9 @@ static int dax_writeback_one(struct xa_state *xas,
>> struct dax_device *dax_dev,
>>{
>>  unsigned long pfn, index, count;
>>  long ret = 0;
>> +   void *kaddr;
>> +   pfn_t new_pfn_t;
>> +   pgoff_t pgoff;
>>
>>  /*
>>   * A page got tagged dirty in DAX mapping? Something is seriously
>> @@ -926,7 +929,25 @@ static int dax_writeback_one(struct xa_state *xas,
>> struct dax_device *dax_dev,
>>  index = xas->xa_index & ~(count - 1);
>>
>>  dax_entry_mkclean(mapping, index, pfn);
>> -   dax_flush(dax_dev, page_address(pfn_to_page(pfn)), count * 
>> PAGE_SIZE);
>> +
>> +   if (!IS_ENABLED(CONFIG_FS_DAX_LIMITED) || pfn_valid(pfn))
>> +   kaddr = page_address(pfn_to_page(pfn));
>> +   else {
>> +   ret = bdev_dax_pgoff(mapping->host->i_sb->s_bdev, pfn <<
>> PFN_SECTION_SHIFT, count << PAGE_SHIFT, );
> 
> This is broken:
> 
>  mapping->host->i_sb->s_bdev
> 
> ...there is no guarantee that the superblock associated with the
> mapping is hosted on the same block device associated with the passed
> in dax_device. See dax_rtdev in xfs_open_devices().
> 


-- 
Regards
Chen Jun


Re: [question] Panic in dax_writeback_one

2021-03-16 Thread chenjun (AM)
在 2021/3/12 1:25, Dan Williams 写道:
> On Thu, Mar 11, 2021 at 4:20 AM Matthew Wilcox  wrote:
>>
>> On Thu, Mar 11, 2021 at 07:48:25AM +0000, chenjun (AM) wrote:
>>> static int dax_writeback_one(struct xa_state *xas, struct dax_device
>>> *dax_dev, struct address_space *mapping, void *entry)
>>> dax_flush(dax_dev, page_address(pfn_to_page(pfn)), count * PAGE_SIZE);
>>> The pfn is returned by the driver. In my case, the pfn does not have
>>> struct page. so pfn_to_page(pfn) return a wrong address.
>>
>> I wasn't involved, but I think the right solution here is simply to
>> replace page_address(pfn_to_page(pfn)) with pfn_to_virt(pfn).  I don't
>> know why Dan decided to do this in the more complicated way.
> 
> pfn_to_virt() only works for the direct-map. If pages are not mapped I
> don't see how pfn_to_virt() is expected to work.
> 
> The real question Chenjun is why are you writing a new simulator of
> memory as a block-device vs reusing the pmem driver or brd?
> 

Hi Dan

In my case, I do not want to take memory to create the struct page of 
the memory my driver used.

And, I think this is also a problem for DCSSBLK.

So I want to go back the older way if CONFIG_FS_DAX_LIMITED

diff --git a/fs/dax.c b/fs/dax.c
index b3d27fd..6395e84 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -867,6 +867,9 @@ static int dax_writeback_one(struct xa_state *xas, 
struct dax_device *dax_dev,
  {
unsigned long pfn, index, count;
long ret = 0;
+   void *kaddr;
+   pfn_t new_pfn_t;
+   pgoff_t pgoff;

/*
 * A page got tagged dirty in DAX mapping? Something is seriously
@@ -926,7 +929,25 @@ static int dax_writeback_one(struct xa_state *xas, 
struct dax_device *dax_dev,
index = xas->xa_index & ~(count - 1);

dax_entry_mkclean(mapping, index, pfn);
-   dax_flush(dax_dev, page_address(pfn_to_page(pfn)), count * PAGE_SIZE);
+
+   if (!IS_ENABLED(CONFIG_FS_DAX_LIMITED) || pfn_valid(pfn))
+   kaddr = page_address(pfn_to_page(pfn));
+   else {
+   ret = bdev_dax_pgoff(mapping->host->i_sb->s_bdev, pfn << 
PFN_SECTION_SHIFT, count << PAGE_SHIFT, );
+   if (ret)
+   goto put_unlocked;
+
+   ret = dax_direct_access(dax_dev, pgoff, count, , 
_pfn_t);
+   if (ret < 0)
+   goto put_unlocked;
+
+   if (WARN_ON_ONCE(ret < count) || 
WARN_ON_ONCE(pfn_t_to_pfn(new_pfn_t) 
!= pfn)) {
+   ret = -EIO;
+   goto put_unlocked;
+   }
+   }
+
+   dax_flush(dax_dev, kaddr, count * PAGE_SIZE);
/*
 * After we have flushed the cache, we can clear the dirty tag. There
 * cannot be new dirty data in the pfn after the flush has completed as
-- 

-- 
Regards
Chen Jun


Re: [question] Panic in dax_writeback_one

2021-03-11 Thread chenjun (AM)
在 2021/3/11 20:20, Matthew Wilcox 写道:
> On Thu, Mar 11, 2021 at 07:48:25AM +0000, chenjun (AM) wrote:
>> static int dax_writeback_one(struct xa_state *xas, struct dax_device
>> *dax_dev, struct address_space *mapping, void *entry)
>> dax_flush(dax_dev, page_address(pfn_to_page(pfn)), count * PAGE_SIZE);
>> The pfn is returned by the driver. In my case, the pfn does not have
>> struct page. so pfn_to_page(pfn) return a wrong address.
> 
> I wasn't involved, but I think the right solution here is simply to
> replace page_address(pfn_to_page(pfn)) with pfn_to_virt(pfn).  I don't
> know why Dan decided to do this in the more complicated way.
> 

Thanks Mattherw

I think pfn_to_virt could not work in my case. Because of the pfn is not 
in memory block.

-- 
Regards
Chen Jun


[question] Panic in dax_writeback_one

2021-03-10 Thread chenjun (AM)
Hi

I write a driver to simulate memory as a block device (like a ramdisk). 
and I hope the memory used would not be observed by kernel, becasue of 
the struct page will take many memory.

When I mount ext2 with dax on my ramdisk. Panic will happen when fsync.
Call trace:
  dax_writeback_one+0x330/0x3e4
  dax_writeback_mapping_range+0x15c/0x318
  ext2_dax_writepages+0x38/0x44


static int dax_writeback_one(struct xa_state *xas, struct dax_device 
*dax_dev, struct address_space *mapping, void *entry)
dax_flush(dax_dev, page_address(pfn_to_page(pfn)), count * PAGE_SIZE);
The pfn is returned by the driver. In my case, the pfn does not have
struct page. so pfn_to_page(pfn) return a wrong address.

I noticed the following changes may related to my problem:
1. Before 3fe0791c295cfd3cd735de7a32cc0780949c009f (dax: store pfns in 
the radix), the radix tree stroes sectors. address which would be 
flushed is got from driver by passing sector.
And the commit assume that all pfn have struct page.

2. CONFIG_FS_DAX_LIMITED (Selected by DCSSBLK [=n] && BLK_DEV [=y] && 
S390 && BLOCK [=y]) is added to avoid access struct page of pfn.

Does anyone have any idea about my problem.

-- 
Regards
Chen Jun


[Reoprt] Some compile warning on ppc dts

2021-02-28 Thread chenjun (AM)
Hi

After run the following commands
make distclean
make allmodconfig ARCH=powerpc CROSS_COMPILE=powerpc64-linux-gnu-
make oldconfig ARCH=powerpc CROSS_COMPILE=powerpc64-linux-gnu-
make -j64 ARCH=powerpc CROSS_COMPILE=powerpc64-linux-gnu-

I get some warning:
arch/powerpc/boot/dts/mpc5200b.dtsi:267.20-280.4: Warning (pci_bridge): 
/pci@fd00: missing ranges for PCI bridg
e (or not a bridge)
arch/powerpc/boot/dts/o2dnt2.dtb: Warning (pci_device_bus_num): Failed 
prerequisite 'pci_bridge'
arch/powerpc/boot/dts/mpc5200b.dtsi:182.18-186.5: Warning 
(spi_bus_bridge): /soc5200@f000/psc@2000: node name f
or SPI buses should be 'spi'
   also defined at arch/powerpc/boot/dts/o2d.dtsi:32.12-43.5
arch/powerpc/boot/dts/o2dnt2.dtb: Warning (spi_bus_reg): Failed 
prerequisite 'spi_bus_bridge'
..

For the problem about "node name for SPI buses should be 'spi'":
Rename the psc@2000 to spi@2000 in arch/powerpc/boot/dts/o2d.dtsi can 
fix it.
diff --git a/arch/powerpc/boot/dts/o2d.dtsi b/arch/powerpc/boot/dts/o2d.dtsi
index 6661955a2be4..cd3dc70cd72e 100644
--- a/arch/powerpc/boot/dts/o2d.dtsi
+++ b/arch/powerpc/boot/dts/o2d.dtsi
@@ -29,7 +29,7 @@ rtc@800 {
  >-->--->---status = "disabled";
  >-->---};
-
->-->---psc@2000 {>->---// PSC1
+>-->---spi@2000 {>->---// PSC1
  >-->--->---compatible = 
"fsl,mpc5200b-psc-spi","fsl,mpc5200-psc-spi";
  >-->--->---#address-cells = <1>;
  >-->--->---#size-cells = <0>;
---

For the problem about "missing ranges for PCI bridge (or not a bridge)":
Ranges should be add in arch/powerpc/boot/dts/mpc5200b.dtsi.
 >---pci: pci@fd00 {
 >--->---#interrupt-cells = <1>;
 >--->---#size-cells = <2>;
 >--->---#address-cells = <3>;
 >--->---device_type = "pci";
 >--->---compatible = "fsl,mpc5200b-pci","fsl,mpc5200-pci";
 >--->---reg = <0xfd00 0x100>;
 >--->---// interrupt-map-mask = need to add
 >--->---// interrupt-map = need to add
 >--->---clock-frequency = <0>; // From boot loader
 >--->---interrupts = <2 8 0 2 9 0 2 10 0>;
 >--->---bus-range = <0 0>;
 >--->---// ranges = need to add
 >---};
I think the ranges should be add by someone who knows the mpc5200 better.

Can anyone fix this?

-- 
Regards
Chen Jun


Re: Re: [PATCH] recordmcount: use w8 to read relp->r_info in arm64_is_fake_mcount

2021-02-23 Thread chenjun (AM)
在 2021/2/24 6:26, Steven Rostedt 写道:
> On Mon, 22 Feb 2021 15:50:19 +
> "chenjun (AM)"  wrote:
> 
>> Hi
>>
>> There is no problem when I use aarch64_be(gcc v10.2). Because gcc v10.2 use 
>> __patchable_function_entries
>> instead of __mcount. I am not sure when the change happened.
> 
> But you are still saying that this patch needs to be applied, right?
> 
> -- Steve
> .
> 

Yes,this patch needs to be applied.

-- 
Regards
Chen Jun


答复: [PATCH] recordmcount: use w8 to read relp->r_info in arm64_is_fake_mcount

2021-02-22 Thread chenjun (AM)
Hi

There is no problem when I use aarch64_be(gcc v10.2). Because gcc v10.2 use 
__patchable_function_entries
instead of __mcount. I am not sure when the change happened.

Regards,
Chen Jun

-邮件原件-
发件人: Chen Jun [mailto:chenjun...@huawei.com] 
发送时间: 2021年2月22日 21:59
收件人: linux-kernel@vger.kernel.org
抄送: gregory.herr...@oracle.com; catalin.mari...@arm.com; rost...@goodmis.org; 
Xiangrui (Euler) 
主题: [PATCH] recordmcount: use w8 to read relp->r_info in arm64_is_fake_mcount

On little endian system, Use aarch64_be(gcc v7.3) downloaded from linaro.org to 
build image with CONFIG_CPU_BIG_ENDIAN = y, CONFIG_FTRACE = y, 
CONFIG_DYNAMIC_FTRACE = y.

gcc will create symbols of _mcount but recordmcount can not create mcount_loc 
for *.o.
aarch64_be-linux-gnu-objdump -r fs/namei.o | grep mcount
00d0 R_AARCH64_CALL26  _mcount ...
7190 R_AARCH64_CALL26  _mcount

The reason is than funciton arm64_is_fake_mcount can not work correctly.
A symbol of _mcount in *.o compiled with big endian compiler likes:
00 00 00 2d 00 00 01 1b
w(rp->r_info) will return 0x2d instead of 0x011b. Because w() takes uint32_t as 
parameter, which truncates rp->r_info.

Use w8() instead w() to read relp->r_info

Fixes: ea0eada45632 ("recordmcount: only record relocation of type 
R_AARCH64_CALL26 on arm64.")
Signed-off-by: Chen Jun 
---
 scripts/recordmcount.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/scripts/recordmcount.c b/scripts/recordmcount.c index 
b9c2ee7ab43f..cce12e1971d8 100644
--- a/scripts/recordmcount.c
+++ b/scripts/recordmcount.c
@@ -438,7 +438,7 @@ static int arm_is_fake_mcount(Elf32_Rel const *rp)
 
 static int arm64_is_fake_mcount(Elf64_Rel const *rp)  {
-   return ELF64_R_TYPE(w(rp->r_info)) != R_AARCH64_CALL26;
+   return ELF64_R_TYPE(w8(rp->r_info)) != R_AARCH64_CALL26;
 }
 
 /* 64-bit EM_MIPS has weird ELF64_Rela.r_info.
--
2.25.0



Re: [PATCH -next 3/5] mm/kmemleak: Add support for percpu memory leak detect

2020-09-28 Thread chenjun (AM)
Hi Catalin

Thanks for your opinions.

在 2020/9/22 17:58, Catalin Marinas 写道:
> On Mon, Sep 21, 2020 at 02:00:05AM +, Chen Jun wrote:
>> From: Wei Yongjun 
>>
>> Currently the reporting of the percpu chunks leaking problem
>> are not supported. This patch introduces this function.
>>
>> Since __percpu pointer is not pointing directly to the actual chunks,
>> this patch creates an object for __percpu pointer, but marks it as no
>> scan block, only check whether this pointer is referenced by other
>> blocks.
> 
> OK, so you wanted NO_SCAN to not touch the block at all, not even update
> the checksum. Maybe better add a new flag, NO_ACCESS (and we could use
> it to track ioremap leaks, it's been on my wishlist for years).
>

I will add a new OBJECT_NO_ACCESS.
The checksum of the object will not be updated and its memory block will 
not be scanned if the object marked with OBJECT_NO_ACCESS.

>> diff --git a/mm/kmemleak.c b/mm/kmemleak.c
>> index c09c6b59eda6..feedb72f06f2 100644
>> --- a/mm/kmemleak.c
>> +++ b/mm/kmemleak.c
>> @@ -283,6 +288,9 @@ static void hex_dump_object(struct seq_file *seq,
>>  const u8 *ptr = (const u8 *)object->pointer;
>>  size_t len;
>>   
>> +if (object->flags & OBJECT_PERCPU)
>> +ptr = this_cpu_ptr((void __percpu *)object->pointer);
> 
> You may want to print the CPU number as well since the information is
> likely different on another CPU. Also, I think this context is
> preemptable, so it's better with a get_cpu/put_cpu().
> 

I will print cpu number when dump the percpu object.

>> @@ -651,6 +672,19 @@ static void create_object(unsigned long ptr, size_t 
>> size, int min_count,
>>  raw_spin_unlock_irqrestore(_lock, flags);
>>   }
>>   
>> +static void create_object(unsigned long ptr, size_t size, int min_count,
>> +  gfp_t gfp)
>> +{
>> +__create_object(ptr, size, min_count, 0, gfp);
>> +}
>> +
>> +static void create_object_percpu(unsigned long ptr, size_t size, int 
>> min_count,
>> + gfp_t gfp)
>> +{
>> +__create_object(ptr, size, min_count, OBJECT_PERCPU | OBJECT_NO_SCAN,
>> +gfp);
>> +}
>> +
>>   /*
>>* Mark the object as not allocated and schedule RCU freeing via 
>> put_object().
>>*/
>> @@ -912,10 +946,12 @@ void __ref kmemleak_alloc_percpu(const void __percpu 
>> *ptr, size_t size,
>>   * Percpu allocations are only scanned and not reported as leaks
>>   * (min_count is set to 0).
>>   */
>> -if (kmemleak_enabled && ptr && !IS_ERR(ptr))
>> +if (kmemleak_enabled && ptr && !IS_ERR(ptr)) {
>>  for_each_possible_cpu(cpu)
>>  create_object((unsigned long)per_cpu_ptr(ptr, cpu),
>>size, 0, gfp);
>> +create_object_percpu((unsigned long)ptr, size, 1, gfp);
>> +}
>>   }
> 
> A concern I have here is that ptr may overlap with an existing object
> and the insertion in the rb tree will fail. For example, with !SMP,
> ptr == per_cpu_ptr(ptr, 0), so create_object() will fail and kmemleak
> gets disabled.
> 
> An option would to figure out how to allow overlapping ranges with rb
> tree (or find a replacement for it if not possible).
> 
> Another option would be to have an additional structure to track the
> __percpu pointers since they have their own range. If size is not
> relevant, maybe go for an xarray, otherwise another rb tree (do we have
> any instance of pointers referring some inner member of a __percpu
> object?). The scan_object() function will have to search two trees.
> 

I would like to use CONFIG_SMP to seprate code:
if SMP, we will create some objects for per_cpu_ptr(ptr, cpu) and an 
object with OBJECT_NO_ACCESS for ptr.
if !SMP, we will not create object for per_cpu_ptr(ptr,cpu), but an 
object without OBJECT_NO_ACCESS for ptr will be created.
What do you think about this opinion.

Waiting for your reply

Best wishes
Jun