Re: [PATCH] tracing/osnoise: init osnoise_instances list before registering tracer

2024-03-14 Thread Jerome Marchand

On 14/03/2024 11:49, Jerome Marchand wrote:

The kernel panics during the initialization of the osnoise tracer when
booting with the "ftrace=osnoise" or "ftrace=timerlat" option.


BTW, while this fixes the issue for osnoise and timerlat, another issue
remains with timerlat which prevents booting with the "ftrace=timerlat"
option. It is apparently related to hrtimers:

[1.489168] Starting tracer 'timerlat'
[1.506835] BUG: kernel NULL pointer dereference, address: 
[1.507355] Loading compiled-in X.509 certificates
[1.507652] #PF: supervisor write access in kernel mode
[1.509639] #PF: error_code(0x0002) - not-present page
[1.510425] PGD 0 P4D 0
[1.510888] Oops: 0002 [#1] PREEMPT SMP NOPTI
[1.511582] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 
6.8.0-63.osnoiseinit.fc41.x86_64 #1
[1.512819] Hardware name: Red Hat KVM/RHEL-AV, BIOS 
1.13.0-2.module+el8.3.0+7353+9de0a3cc 04/01/2014
[1.514175] RIP: 0010:rb_erase+0x18b/0x3a0
[1.514871] Code: 48 83 c0 01 48 89 01 e9 b3 30 03 00 e9 ae 30 03 00 48 89 46 10 
e9 17 ff ff ff 48 8b 56 10 48 8d 41 01 48 89 51 08 48 89 4e 10 <48> 89 02 48 8b 
01 48 89 06 48 89 31 48 83 f8 03 0f 86 96 00 00 00
[1.517486] RSP: 0018:99403e08 EFLAGS: 00010046
[1.521008] RAX: 95e06fa2de49 RBX: 95e06fa25050 RCX: 95e06fa2de48
[1.522067] RDX:  RSI: 95e06fa2de48 RDI: 95e06fa25050
[1.523111] RBP: 95e06fa24b20 R08: 95e06fa24b20 R09: 
[1.524151] R10:  R11: 99464880 R12: 0001
[1.525195] R13: 95e06fa24ac0 R14: 95e06fa24b00 R15: 00024ac0
[1.526230] FS:  () GS:95e06fa0() 
knlGS:
[1.527433] CS:  0010 DS:  ES:  CR0: 80050033
[1.528295] CR2:  CR3: 00014a422000 CR4: 00350ef0
[1.529331] Call Trace:
[1.529783]  
[1.530192]  ? __die+0x23/0x70
[1.530736]  ? page_fault_oops+0x174/0x530
[1.531394]  ? srso_return_thunk+0x5/0x5f
[1.532040]  ? srso_return_thunk+0x5/0x5f
[1.532681]  ? ttwu_queue_wakelist+0xf1/0x110
[1.533377]  ? exc_page_fault+0x7f/0x180
[1.534007]  ? asm_exc_page_fault+0x26/0x30
[1.534677]  ? rb_erase+0x18b/0x3a0
[1.535251]  ? __schedule+0x3e9/0x1530
[1.535863]  timerqueue_del+0x2e/0x50
[1.536457]  __remove_hrtimer+0x39/0x90
[1.537095]  hrtimer_start_range_ns+0x2a6/0x350
[1.537918]  tick_nohz_idle_stop_tick+0x69/0xd0
[1.538646]  do_idle+0x221/0x270
[1.539195]  cpu_startup_entry+0x29/0x30
[1.539829]  rest_init+0xcf/0xd0
[1.540374]  arch_call_rest_init+0xe/0x30
[1.541031]  start_kernel+0x717/0xa70
[1.541623]  x86_64_start_reservations+0x18/0x30
[1.542352]  x86_64_start_kernel+0x94/0xa0
[1.543007]  secondary_startup_64_no_verify+0x184/0x18b
[1.543813]  
[1.544217] Modules linked in:
[1.544751] CR2: 
[1.545300] ---[ end trace  ]---
[1.546015] RIP: 0010:rb_erase+0x18b/0x3a0
[1.546674] Code: 48 83 c0 01 48 89 01 e9 b3 30 03 00 e9 ae 30 03 00 48 89 46 10 
e9 17 ff ff ff 48 8b 56 10 48 8d 41 01 48 89 51 08 48 89 4e 10 <48> 89 02 48 8b 
01 48 89 06 48 89 31 48 83 f8 03 0f 86 96 00 00 00
[1.549271] RSP: 0018:99403e08 EFLAGS: 00010046
[1.550063] RAX: 95e06fa2de49 RBX: 95e06fa25050 RCX: 95e06fa2de48
[1.551091] RDX:  RSI: 95e06fa2de48 RDI: 95e06fa25050
[1.552125] RBP: 95e06fa24b20 R08: 95e06fa24b20 R09: 
[1.553157] R10:  R11: 99464880 R12: 0001
[1.554191] R13: 95e06fa24ac0 R14: 95e06fa24b00 R15: 00024ac0
[1.555226] FS:  () GS:95e06fa0() 
knlGS:
[1.556433] CS:  0010 DS:  ES:  CR0: 80050033
[1.557291] CR2:  CR3: 00014a422000 CR4: 00350ef0
[1.558322] Kernel panic - not syncing: Attempted to kill the idle task!
[1.559602] Kernel Offset: 0x1600 from 0x8100 (relocation 
range: 0x8000-0xbfff)
[1.561164] ---[ end Kernel panic - not syncing: Attempted to kill the idle 
task! ]---

Maybe it's a similar initialization issue that's easily fixed, or maybe the
timerlat option should be ignored at boot time, as mmiotrace is.
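
For reference, mmiotrace opts out of boot-time enabling through the tracer's
noboot flag, which register_tracer() checks when it handles the "ftrace="
boot parameter. A sketch of the same opt-out applied to timerlat (assuming
the timerlat_tracer definition in kernel/trace/trace_osnoise.c; whether this
is the right fix is exactly the open question):

	/* Sketch only, not a tested patch. */
	static struct tracer timerlat_tracer __read_mostly = {
		.name		= "timerlat",
		/* ... existing init/reset callbacks unchanged ... */
		.noboot		= true,	/* as mmio_tracer does */
	};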

Regards,
Jerome Marchand



[1.505256] Starting tracer 'osnoise'
[1.510214] BUG: kernel NULL pointer dereference, address: 0010
[1.511189] #PF: supervisor read access in kernel mode
[1.511938] #PF: error_code(0x) - not-present page
[1.512676] PGD 0 P4D 0
[1.513105] Oops:  [#1] PREEMPT SMP NOPTI
[1.513763] CPU: 9 PID: 1 Comm: swapper/0 Not tainted 6.8.0-63.fc41.x86_64 #1
[1.515485] Hardware name: Red Hat KVM/RHEL-AV, BIOS 
1.13.0-2.module+el8.3.0+7353+9de0a3cc 04/01/2014
[1.516765

[PATCH v4] arm64: fix unwind_frame() for filtered out fn for function graph tracing

2018-02-05 Thread Jerome Marchand
From: Pratyush Anand <pan...@redhat.com>

Changes since v3:
- Exit unwind_frame() if frame->graph == -1

Changes since v2:
- Set frame.graph = current->curr_ret_stack in profile_pc
- Check that frame->graph != -1 in unwind_frame

do_task_stat() calls get_wchan(), which in turn calls unwind_frame().
unwind_frame() restores frame->pc to its original value in case the
function graph tracer has modified a return address (LR) in a stack frame
to hook a function return. However, if the function graph tracer has hit a
filtered function, then we can't unwind it, as ftrace_push_return_trace()
has biased the index (frame->graph) with a 'huge negative' offset
(-FTRACE_NOTRACE_DEPTH).

Moreover, the arm64 stack walker defines the index (frame->graph) as an
unsigned int, which cannot hold a negative number.

A similar problem can occur when walk_stackframe() is called from
save_stack_trace_tsk() or dump_backtrace().

This patch fixes unwind_frame() to test the index for a negative value
and to restore the index accordingly before restoring frame->pc.
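
To make the index arithmetic concrete, here is a minimal sketch of the bias
and of what unwind_frame() must therefore do; FTRACE_NOTRACE_DEPTH is the
'huge' constant (65536) in the ftrace code of this era, and the last three
statements mirror the hunk below:

	/*
	 * ftrace_push_return_trace() marks entries below a filtered
	 * (notrace) function by biasing the ret_stack index:
	 *
	 *	current->curr_ret_stack -= FTRACE_NOTRACE_DEPTH;
	 *	e.g. 2 - 65536 = -65534
	 *
	 * so a stack walker must undo the bias before using frame->graph
	 * as an array index, and must bail out when there is no task
	 * info at all (graph == -1):
	 */
	if (frame->graph == -1)
		return -EINVAL;			/* cannot unwind safely */
	if (frame->graph < -1)
		frame->graph += FTRACE_NOTRACE_DEPTH;	/* undo the bias */
	frame->pc = tsk->ret_stack[frame->graph--].ret;	/* valid index now */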

Reproducer:

cd /sys/kernel/debug/tracing/
echo schedule > set_graph_notrace
echo 1 > options/display-graph
echo wakeup > current_tracer
ps -ef | grep -i agent

Above commands result in:
Unable to handle kernel paging request at virtual address 801bd3d1e000
pgd = 8003cbe97c00
[801bd3d1e000] *pgd=, *pud=
Internal error: Oops: 9606 [#1] SMP
[...]
CPU: 5 PID: 11696 Comm: ps Not tainted 4.11.0+ #33
[...]
task: 8003c21ba000 task.stack: 8003cc6c
PC is at unwind_frame+0x12c/0x180
LR is at get_wchan+0xd4/0x134
pc : [] lr : [] pstate: 6145
sp : 8003cc6c3ab0
x29: 8003cc6c3ab0 x28: 0001
x27: 0026 x26: 0026
x25: 12d8 x24: 
x23: 8003c1c04000 x22: 08c83000
x21: 8003c1c0 x20: 000f
x19: 8003c1bc x18: fc593690
x17:  x16: 0001
x15: b855670e2b60 x14: 0003e97f22cf1d0f
x13: 0001 x12: 
x11: e8f4883e x10: 000154f47ec8
x9 : 70f367c0 x8 : 
x7 : 8003f729 x6 : 0018
x5 :  x4 : 8003c1c03cb0
x3 : 8003c1c03ca0 x2 : 0017ffe8
x1 : 8003cc6c3af8 x0 : 8003d3e9e000

Process ps (pid: 11696, stack limit = 0x8003cc6c)
Stack: (0x8003cc6c3ab0 to 0x8003cc6c4000)
[...]
[] unwind_frame+0x12c/0x180
[] do_task_stat+0x864/0x870
[] proc_tgid_stat+0x3c/0x48
[] proc_single_show+0x5c/0xb8
[] seq_read+0x160/0x414
[] __vfs_read+0x58/0x164
[] vfs_read+0x88/0x144
[] SyS_read+0x60/0xc0
[] __sys_trace_return+0x0/0x4

Fixes: 20380bb390a4 ("arm64: ftrace: fix a stack tracer's output under function graph tracer")
Signed-off-by: Pratyush Anand <pan...@redhat.com>
Signed-off-by: Jerome Marchand <jmarc...@redhat.com>
---
 arch/arm64/include/asm/stacktrace.h | 2 +-
 arch/arm64/kernel/stacktrace.c  | 7 +++++++
 arch/arm64/kernel/time.c| 2 +-
 3 files changed, 9 insertions(+), 2 deletions(-)

diff --git a/arch/arm64/include/asm/stacktrace.h 
b/arch/arm64/include/asm/stacktrace.h
index 6ad30776e984..99390755c0c4 100644
--- a/arch/arm64/include/asm/stacktrace.h
+++ b/arch/arm64/include/asm/stacktrace.h
@@ -27,7 +27,7 @@ struct stackframe {
unsigned long fp;
unsigned long pc;
 #ifdef CONFIG_FUNCTION_GRAPH_TRACER
-   unsigned int graph;
+   int graph;
 #endif
 };
 
diff --git a/arch/arm64/kernel/stacktrace.c b/arch/arm64/kernel/stacktrace.c
index 76809ccd309c..2c74ff76b0cd 100644
--- a/arch/arm64/kernel/stacktrace.c
+++ b/arch/arm64/kernel/stacktrace.c
@@ -59,6 +59,13 @@ int notrace unwind_frame(struct task_struct *tsk, struct 
stackframe *frame)
 #ifdef CONFIG_FUNCTION_GRAPH_TRACER
if (tsk->ret_stack &&
(frame->pc == (unsigned long)return_to_handler)) {
+   if (frame->graph == -1) {
+   WARN_ON(1);
+   return -EINVAL;
+   }
+   if (frame->graph < -1)
+   frame->graph += FTRACE_NOTRACE_DEPTH;
+
/*
 * This is a case where function graph tracer has
 * modified a return address (LR) in a stack frame
diff --git a/arch/arm64/kernel/time.c b/arch/arm64/kernel/time.c
index a4391280fba9..f258636273c9 100644
--- a/arch/arm64/kernel/time.c
+++ b/arch/arm64/kernel/time.c
@@ -52,7 +52,7 @@ unsigned long profile_pc(struct pt_regs *regs)
frame.fp = regs->regs[29];
frame.pc = regs->pc;
 #ifdef CONFIG_FUNCTION_GRAPH_TRACER
-   frame.graph = -1; /* no task info */
+   frame.graph = current->curr_ret_stack;
 #endif
do {
int ret = unwind_frame(NULL, &frame);
-- 
2.13.6



Re: [PATCH v3] arm64: fix unwind_frame() for filtered out fn for function graph tracing

2018-01-17 Thread Jerome Marchand
On 16/01/18 18:57, Catalin Marinas wrote:
> On Fri, Jan 12, 2018 at 11:48:32AM +0100, Jerome Marchand wrote:
>> diff --git a/arch/arm64/kernel/stacktrace.c b/arch/arm64/kernel/stacktrace.c
>> index 76809ccd309c..5a528c58ef68 100644
>> --- a/arch/arm64/kernel/stacktrace.c
>> +++ b/arch/arm64/kernel/stacktrace.c
>> @@ -59,6 +59,10 @@ int notrace unwind_frame(struct task_struct *tsk, struct 
>> stackframe *frame)
>>  #ifdef CONFIG_FUNCTION_GRAPH_TRACER
>>  if (tsk->ret_stack &&
>>  (frame->pc == (unsigned long)return_to_handler)) {
>> +WARN_ON(frame->graph == -1);
>> +if (frame->graph < -1)
>> +frame->graph += FTRACE_NOTRACE_DEPTH;
>> +
>>  /*
>>   * This is a case where function graph tracer has
>>   * modified a return address (LR) in a stack frame
> 
> So do we still allow this to continue if graph == -1? The following line
> doesn't seem safe:
> 
>   frame->pc = tsk->ret_stack[frame->graph--].ret;
> 

You're right. We should probably return an error (-EINVAL, I guess) if
this happens. Note that this shouldn't happen here, and if we're
confident enough that profile_pc() was the only faulty caller, we could
just drop the warning.

Jerome




[PATCH v3] arm64: fix unwind_frame() for filtered out fn for function graph tracing

2018-01-12 Thread Jerome Marchand
From: Pratyush Anand <pan...@redhat.com>

Changes since v2:
- Set frame.graph = current->curr_ret_stack in profile_pc
- Check that frame->graph != -1 in unwind_frame

do_task_stat() calls get_wchan(), which in turn calls unwind_frame().
unwind_frame() restores frame->pc to its original value in case the
function graph tracer has modified a return address (LR) in a stack frame
to hook a function return. However, if the function graph tracer has hit a
filtered function, then we can't unwind it, as ftrace_push_return_trace()
has biased the index (frame->graph) with a 'huge negative' offset
(-FTRACE_NOTRACE_DEPTH).

Moreover, the arm64 stack walker defines the index (frame->graph) as an
unsigned int, which cannot hold a negative number.

A similar problem can occur when walk_stackframe() is called from
save_stack_trace_tsk() or dump_backtrace().

This patch fixes unwind_frame() to test the index for a negative value
and to restore the index accordingly before restoring frame->pc.

Reproducer:

cd /sys/kernel/debug/tracing/
echo schedule > set_graph_notrace
echo 1 > options/display-graph
echo wakeup > current_tracer
ps -ef | grep -i agent

Above commands result in:
Unable to handle kernel paging request at virtual address 801bd3d1e000
pgd = 8003cbe97c00
[801bd3d1e000] *pgd=, *pud=
Internal error: Oops: 9606 [#1] SMP
[...]
CPU: 5 PID: 11696 Comm: ps Not tainted 4.11.0+ #33
[...]
task: 8003c21ba000 task.stack: 8003cc6c
PC is at unwind_frame+0x12c/0x180
LR is at get_wchan+0xd4/0x134
pc : [] lr : [] pstate: 6145
sp : 8003cc6c3ab0
x29: 8003cc6c3ab0 x28: 0001
x27: 0026 x26: 0026
x25: 12d8 x24: 
x23: 8003c1c04000 x22: 08c83000
x21: 8003c1c0 x20: 000f
x19: 8003c1bc x18: fc593690
x17:  x16: 0001
x15: b855670e2b60 x14: 0003e97f22cf1d0f
x13: 0001 x12: 
x11: e8f4883e x10: 000154f47ec8
x9 : 70f367c0 x8 : 
x7 : 8003f729 x6 : 0018
x5 :  x4 : 8003c1c03cb0
x3 : 8003c1c03ca0 x2 : 0017ffe8
x1 : 8003cc6c3af8 x0 : 8003d3e9e000

Process ps (pid: 11696, stack limit = 0x8003cc6c)
Stack: (0x8003cc6c3ab0 to 0x8003cc6c4000)
[...]
[] unwind_frame+0x12c/0x180
[] do_task_stat+0x864/0x870
[] proc_tgid_stat+0x3c/0x48
[] proc_single_show+0x5c/0xb8
[] seq_read+0x160/0x414
[] __vfs_read+0x58/0x164
[] vfs_read+0x88/0x144
[] SyS_read+0x60/0xc0
[] __sys_trace_return+0x0/0x4

Fixes: 20380bb390a4 ("arm64: ftrace: fix a stack tracer's output under function graph tracer")
Signed-off-by: Pratyush Anand <pan...@redhat.com>
Signed-off-by: Jerome Marchand <jmarc...@redhat.com>
---
 arch/arm64/include/asm/stacktrace.h | 2 +-
 arch/arm64/kernel/stacktrace.c  | 4 ++++
 arch/arm64/kernel/time.c| 2 +-
 3 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/arch/arm64/include/asm/stacktrace.h 
b/arch/arm64/include/asm/stacktrace.h
index 6ad30776e984..99390755c0c4 100644
--- a/arch/arm64/include/asm/stacktrace.h
+++ b/arch/arm64/include/asm/stacktrace.h
@@ -27,7 +27,7 @@ struct stackframe {
unsigned long fp;
unsigned long pc;
 #ifdef CONFIG_FUNCTION_GRAPH_TRACER
-   unsigned int graph;
+   int graph;
 #endif
 };
 
diff --git a/arch/arm64/kernel/stacktrace.c b/arch/arm64/kernel/stacktrace.c
index 76809ccd309c..5a528c58ef68 100644
--- a/arch/arm64/kernel/stacktrace.c
+++ b/arch/arm64/kernel/stacktrace.c
@@ -59,6 +59,10 @@ int notrace unwind_frame(struct task_struct *tsk, struct 
stackframe *frame)
 #ifdef CONFIG_FUNCTION_GRAPH_TRACER
if (tsk->ret_stack &&
(frame->pc == (unsigned long)return_to_handler)) {
+   WARN_ON(frame->graph == -1);
+   if (frame->graph < -1)
+   frame->graph += FTRACE_NOTRACE_DEPTH;
+
/*
 * This is a case where function graph tracer has
 * modified a return address (LR) in a stack frame
diff --git a/arch/arm64/kernel/time.c b/arch/arm64/kernel/time.c
index a4391280fba9..f258636273c9 100644
--- a/arch/arm64/kernel/time.c
+++ b/arch/arm64/kernel/time.c
@@ -52,7 +52,7 @@ unsigned long profile_pc(struct pt_regs *regs)
frame.fp = regs->regs[29];
frame.pc = regs->pc;
 #ifdef CONFIG_FUNCTION_GRAPH_TRACER
-   frame.graph = -1; /* no task info */
+   frame.graph = current->curr_ret_stack;
 #endif
do {
int ret = unwind_frame(NULL, &frame);
-- 
2.13.6



[PATCH v2] mm/zsmalloc: simplify zs_max_alloc_size handling

2017-06-30 Thread Jerome Marchand
Commit 40f9fb8cffc6 ("mm/zsmalloc: support allocating obj with size of
ZS_MAX_ALLOC_SIZE") fixes a size calculation error that prevented
zsmalloc from allocating an object of the maximal size
(ZS_MAX_ALLOC_SIZE). I think, however, that the fix is needlessly
complicated.

This patch replaces the dynamic calculation of zs_size_classes at init
time by a compile time calculation that uses the DIV_ROUND_UP() macro
already used in get_size_class_index().
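
To see that the compile-time macro agrees with the init-time computation it
replaces, here is the arithmetic worked out for a common configuration
(4 KiB pages, so ZS_MAX_ALLOC_SIZE = PAGE_SIZE = 4096, ZS_MIN_ALLOC_SIZE = 32
and ZS_SIZE_CLASS_DELTA = 4096 >> 8 = 16; the numbers are illustrative):

	/* Old init_zs_size_classes():
	 *	nr = (4096 - 32) / 16 + 1 = 254 + 1 = 255;
	 *	(4096 - 32) % 16 == 0, so no extra increment.
	 *
	 * New compile-time macro:
	 *	ZS_SIZE_CLASSES = DIV_ROUND_UP(4096 - 32, 16) + 1 = 255
	 *
	 * Same 255 size classes, but now usable as a static array bound.
	 */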

Signed-off-by: Jerome Marchand <jmarc...@redhat.com>
---
 mm/zsmalloc.c | 52 +++-
 1 file changed, 15 insertions(+), 37 deletions(-)

diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
index d41edd2..cc98937 100644
--- a/mm/zsmalloc.c
+++ b/mm/zsmalloc.c
@@ -116,6 +116,11 @@
 #define OBJ_INDEX_BITS (BITS_PER_LONG - _PFN_BITS - OBJ_TAG_BITS)
 #define OBJ_INDEX_MASK ((_AC(1, UL) << OBJ_INDEX_BITS) - 1)
 
+#define FULLNESS_BITS  2
+#define CLASS_BITS 8
+#define ISOLATED_BITS  3
+#define MAGIC_VAL_BITS 8
+
 #define MAX(a, b) ((a) >= (b) ? (a) : (b))
 /* ZS_MIN_ALLOC_SIZE must be multiple of ZS_ALIGN */
 #define ZS_MIN_ALLOC_SIZE \
@@ -137,6 +142,8 @@
  *  (reason above)
  */
 #define ZS_SIZE_CLASS_DELTA	(PAGE_SIZE >> CLASS_BITS)
+#define ZS_SIZE_CLASSES	(DIV_ROUND_UP(ZS_MAX_ALLOC_SIZE - ZS_MIN_ALLOC_SIZE, \
+				ZS_SIZE_CLASS_DELTA) + 1)
 
 enum fullness_group {
ZS_EMPTY,
@@ -169,11 +176,6 @@ static struct vfsmount *zsmalloc_mnt;
 #endif
 
 /*
- * number of size_classes
- */
-static int zs_size_classes;
-
-/*
  * We assign a page to ZS_ALMOST_EMPTY fullness group when:
  * n <= N / f, where
  * n = number of allocated objects
@@ -244,7 +246,7 @@ struct link_free {
 struct zs_pool {
const char *name;
 
-   struct size_class **size_class;
+   struct size_class *size_class[ZS_SIZE_CLASSES];
struct kmem_cache *handle_cachep;
struct kmem_cache *zspage_cachep;
 
@@ -268,11 +270,6 @@ struct zs_pool {
 #endif
 };
 
-#define FULLNESS_BITS  2
-#define CLASS_BITS 8
-#define ISOLATED_BITS  3
-#define MAGIC_VAL_BITS 8
-
 struct zspage {
struct {
unsigned int fullness:FULLNESS_BITS;
@@ -551,7 +548,7 @@ static int get_size_class_index(int size)
idx = DIV_ROUND_UP(size - ZS_MIN_ALLOC_SIZE,
ZS_SIZE_CLASS_DELTA);
 
-   return min(zs_size_classes - 1, idx);
+   return min((int)ZS_SIZE_CLASSES - 1, idx);
 }
 
 static inline void zs_stat_inc(struct size_class *class,
@@ -610,7 +607,7 @@ static int zs_stats_size_show(struct seq_file *s, void *v)
"obj_allocated", "obj_used", "pages_used",
"pages_per_zspage", "freeable");
 
-   for (i = 0; i < zs_size_classes; i++) {
+   for (i = 0; i < ZS_SIZE_CLASSES; i++) {
class = pool->size_class[i];
 
if (class->index != i)
@@ -1294,17 +1291,6 @@ static int zs_cpu_dead(unsigned int cpu)
return 0;
 }
 
-static void __init init_zs_size_classes(void)
-{
-   int nr;
-
-   nr = (ZS_MAX_ALLOC_SIZE - ZS_MIN_ALLOC_SIZE) / ZS_SIZE_CLASS_DELTA + 1;
-   if ((ZS_MAX_ALLOC_SIZE - ZS_MIN_ALLOC_SIZE) % ZS_SIZE_CLASS_DELTA)
-   nr += 1;
-
-   zs_size_classes = nr;
-}
-
 static bool can_merge(struct size_class *prev, int pages_per_zspage,
int objs_per_zspage)
 {
@@ -2145,7 +2131,7 @@ static void async_free_zspage(struct work_struct *work)
struct zs_pool *pool = container_of(work, struct zs_pool,
free_work);
 
-   for (i = 0; i < zs_size_classes; i++) {
+   for (i = 0; i < ZS_SIZE_CLASSES; i++) {
class = pool->size_class[i];
if (class->index != i)
continue;
@@ -2263,7 +2249,7 @@ unsigned long zs_compact(struct zs_pool *pool)
int i;
struct size_class *class;
 
-   for (i = zs_size_classes - 1; i >= 0; i--) {
+   for (i = ZS_SIZE_CLASSES - 1; i >= 0; i--) {
class = pool->size_class[i];
if (!class)
continue;
@@ -2309,7 +2295,7 @@ static unsigned long zs_shrinker_count(struct shrinker 
*shrinker,
struct zs_pool *pool = container_of(shrinker, struct zs_pool,
shrinker);
 
-   for (i = zs_size_classes - 1; i >= 0; i--) {
+   for (i = ZS_SIZE_CLASSES - 1; i >= 0; i--) {
class = pool->size_class[i];
if (!class)
continue;
@@ -2361,12 +2347,6 @@ struct zs_pool *zs_create_pool(const char *name)
return NULL;
 
init_deferred_free(pool);
-   pool->size_class = kcalloc(zs_size_classes, sizeof(struct size_class *),
-   GFP_KERNEL);
-   if (!pool->size_class) {

Re: [PATCH] mm/zsmalloc: simplify zs_max_alloc_size handling

2017-06-30 Thread Jerome Marchand
On 06/30/2017 03:24 AM, Minchan Kim wrote:
>> @@ -137,6 +142,8 @@
>>   *  (reason above)
>>   */
>>  #define ZS_SIZE_CLASS_DELTA (PAGE_SIZE >> CLASS_BITS)
>> +#define ZS_SIZE_CLASSES DIV_ROUND_UP(ZS_MAX_ALLOC_SIZE - 
>> ZS_MIN_ALLOC_SIZE, \
>> + ZS_SIZE_CLASS_DELTA)
> 
> #define ZS_SIZE_CLASSES   (DIV_ROUND_UP(ZS_MAX_ALLOC_SIZE - 
> ZS_MIN_ALLOC_SIZE, \
>ZS_SIZE_CLASS_DELTA) + 1)
> 
> 
> I think it should add +1 to cover ZS_MIN_ALLOC_SIZE.

Yes, obviously. Sorry about that.
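
The +1 matters because DIV_ROUND_UP() yields the index of the largest
class, not the number of classes. With the 4 KiB-page values (min 32,
delta 16, max 4096; illustrative):

	get_size_class_index(ZS_MAX_ALLOC_SIZE)
		= DIV_ROUND_UP(4096 - 32, 16) = 254

	/* Valid indices run 0..254 (index 0 covers ZS_MIN_ALLOC_SIZE
	 * itself), so the size_class array needs 254 + 1 = 255 slots.
	 * Without the +1, the min() clamp in get_size_class_index()
	 * would squeeze the largest objects into class 253.
	 */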

> Otherwise, looks good to me.
> 
> Thanks.
> 




[PATCH] mm/zsmalloc: simplify zs_max_alloc_size handling

2017-06-28 Thread Jerome Marchand
Commit 40f9fb8cffc6 ("mm/zsmalloc: support allocating obj with size of
ZS_MAX_ALLOC_SIZE") fixes a size calculation error that prevented
zsmalloc from allocating an object of the maximal size
(ZS_MAX_ALLOC_SIZE). I think, however, that the fix is needlessly
complicated.

This patch replaces the dynamic calculation of zs_size_classes at init
time by a compile time calculation that uses the DIV_ROUND_UP() macro
already used in get_size_class_index().

Signed-off-by: Jerome Marchand <jmarc...@redhat.com>
---
 mm/zsmalloc.c | 52 +++-
 1 file changed, 15 insertions(+), 37 deletions(-)

diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
index d41edd2..134024b 100644
--- a/mm/zsmalloc.c
+++ b/mm/zsmalloc.c
@@ -116,6 +116,11 @@
 #define OBJ_INDEX_BITS (BITS_PER_LONG - _PFN_BITS - OBJ_TAG_BITS)
 #define OBJ_INDEX_MASK ((_AC(1, UL) << OBJ_INDEX_BITS) - 1)
 
+#define FULLNESS_BITS  2
+#define CLASS_BITS 8
+#define ISOLATED_BITS  3
+#define MAGIC_VAL_BITS 8
+
 #define MAX(a, b) ((a) >= (b) ? (a) : (b))
 /* ZS_MIN_ALLOC_SIZE must be multiple of ZS_ALIGN */
 #define ZS_MIN_ALLOC_SIZE \
@@ -137,6 +142,8 @@
  *  (reason above)
  */
 #define ZS_SIZE_CLASS_DELTA	(PAGE_SIZE >> CLASS_BITS)
+#define ZS_SIZE_CLASSES	DIV_ROUND_UP(ZS_MAX_ALLOC_SIZE - ZS_MIN_ALLOC_SIZE, \
+				ZS_SIZE_CLASS_DELTA)
 
 enum fullness_group {
ZS_EMPTY,
@@ -169,11 +176,6 @@ static struct vfsmount *zsmalloc_mnt;
 #endif
 
 /*
- * number of size_classes
- */
-static int zs_size_classes;
-
-/*
  * We assign a page to ZS_ALMOST_EMPTY fullness group when:
  * n <= N / f, where
  * n = number of allocated objects
@@ -244,7 +246,7 @@ struct link_free {
 struct zs_pool {
const char *name;
 
-   struct size_class **size_class;
+   struct size_class *size_class[ZS_SIZE_CLASSES];
struct kmem_cache *handle_cachep;
struct kmem_cache *zspage_cachep;
 
@@ -268,11 +270,6 @@ struct zs_pool {
 #endif
 };
 
-#define FULLNESS_BITS  2
-#define CLASS_BITS 8
-#define ISOLATED_BITS  3
-#define MAGIC_VAL_BITS 8
-
 struct zspage {
struct {
unsigned int fullness:FULLNESS_BITS;
@@ -551,7 +548,7 @@ static int get_size_class_index(int size)
idx = DIV_ROUND_UP(size - ZS_MIN_ALLOC_SIZE,
ZS_SIZE_CLASS_DELTA);
 
-   return min(zs_size_classes - 1, idx);
+   return min((int)ZS_SIZE_CLASSES - 1, idx);
 }
 
 static inline void zs_stat_inc(struct size_class *class,
@@ -610,7 +607,7 @@ static int zs_stats_size_show(struct seq_file *s, void *v)
"obj_allocated", "obj_used", "pages_used",
"pages_per_zspage", "freeable");
 
-   for (i = 0; i < zs_size_classes; i++) {
+   for (i = 0; i < ZS_SIZE_CLASSES; i++) {
class = pool->size_class[i];
 
if (class->index != i)
@@ -1294,17 +1291,6 @@ static int zs_cpu_dead(unsigned int cpu)
return 0;
 }
 
-static void __init init_zs_size_classes(void)
-{
-   int nr;
-
-   nr = (ZS_MAX_ALLOC_SIZE - ZS_MIN_ALLOC_SIZE) / ZS_SIZE_CLASS_DELTA + 1;
-   if ((ZS_MAX_ALLOC_SIZE - ZS_MIN_ALLOC_SIZE) % ZS_SIZE_CLASS_DELTA)
-   nr += 1;
-
-   zs_size_classes = nr;
-}
-
 static bool can_merge(struct size_class *prev, int pages_per_zspage,
int objs_per_zspage)
 {
@@ -2145,7 +2131,7 @@ static void async_free_zspage(struct work_struct *work)
struct zs_pool *pool = container_of(work, struct zs_pool,
free_work);
 
-   for (i = 0; i < zs_size_classes; i++) {
+   for (i = 0; i < ZS_SIZE_CLASSES; i++) {
class = pool->size_class[i];
if (class->index != i)
continue;
@@ -2263,7 +2249,7 @@ unsigned long zs_compact(struct zs_pool *pool)
int i;
struct size_class *class;
 
-   for (i = zs_size_classes - 1; i >= 0; i--) {
+   for (i = ZS_SIZE_CLASSES - 1; i >= 0; i--) {
class = pool->size_class[i];
if (!class)
continue;
@@ -2309,7 +2295,7 @@ static unsigned long zs_shrinker_count(struct shrinker 
*shrinker,
struct zs_pool *pool = container_of(shrinker, struct zs_pool,
shrinker);
 
-   for (i = zs_size_classes - 1; i >= 0; i--) {
+   for (i = ZS_SIZE_CLASSES - 1; i >= 0; i--) {
class = pool->size_class[i];
if (!class)
continue;
@@ -2361,12 +2347,6 @@ struct zs_pool *zs_create_pool(const char *name)
return NULL;
 
init_deferred_free(pool);
-   pool->size_class = kcalloc(zs_size_classes, sizeof(struct size_class *),
-   GFP_KERNEL);
-   if (!pool->size_class) {
-   kfree(pool);

[RFC PATCH] netxen_nic: null-terminate serial number string in netxen_check_options()

2017-04-25 Thread Jerome Marchand
The serial_num string in netxen_check_options() is not always properly
null-terminated. I couldn't find the documentation on the serial number
format and I suspect a proper integer-to-string conversion is in
order, but this patch at least prevents the out-of-bounds access.
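
The failure mode is the classic missing-NUL one; a minimal illustration of
the shape of the bug (not the driver code itself):

	char buf[32];
	memset(buf, 'A', sizeof(buf));	/* all 32 bytes used, no '\0' left */
	printk("%s\n", buf);		/* strnlen() reads past buf[31]: the
					 * stack-out-of-bounds access KASAN
					 * reports below; a 33-byte buffer
					 * keeps one byte for the terminator */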

It solves the following KASAN warning:
[   36.127074] 
==
[   36.168472] BUG: KASAN: stack-out-of-bounds in strnlen+0x38/0x60 at addr 
8800360e7a50
[   36.216956] Read of size 1 by task kworker/0:1/188
[   36.244451] page:ead839c0 count:0 mapcount:0 mapping:  
(null) index:0x2
[   36.291475] page flags: 0x1f()
[   36.314980] page dumped because: kasan: bad access detected
[   36.348117] CPU: 0 PID: 188 Comm: kworker/0:1 Not tainted 
3.10.0-650.el7.test.kasan.x86_64 #1
[   36.397065] Hardware name: HP ProLiant DL585 G7, BIOS A16 03/19/2012
[   36.434443] Workqueue: events work_for_cpu_fn
[   36.459452]  8800360e7a30 e4708e04 8800360e7538 
b37748bf
[   36.503442]  8800360e75c0 b2f4a7e7 8800360d8948 
00060007
[   36.546616]  8800360d8950 0086 b3782086 
0004
[   36.589439] Call Trace:
[   36.603611]  [] dump_stack+0x19/0x1b
[   36.633970]  [] kasan_report_error+0x507/0x540
[   36.668472]  [] ? _raw_spin_unlock_irqrestore+0x36/0x80
[   36.708967]  [] kasan_report+0x58/0x60
[   36.740311]  [] ? cpu_clock+0x10/0x20
[   36.771532]  [] ? strnlen+0x38/0x60
[   36.800430]  [] __asan_load1+0x4d/0x50
[   36.831977]  [] strnlen+0x38/0x60
[   36.859995]  [] string.isra.7+0x3f/0x130
[   36.891531]  [] vsnprintf+0x620/0xd70
[   36.922997]  [] ? __free_pages_ok+0xe9/0x160
[   36.956467]  [] ? pointer.isra.19+0x780/0x780
[   36.991095]  [] ? vprintk_emit+0x12f/0x730
[   37.023440]  [] vscnprintf+0xd/0x40
[   37.053146]  [] vprintk_emit+0x15d/0x730
[   37.084983]  [] ? netxen_setup_minidump+0x621/0x780 
[netxen_nic]
[   37.129435]  [] vprintk_default+0x3e/0x60
[   37.161962]  [] printk+0xa1/0xc8
[   37.189446]  [] ? panic+0x28d/0x28d
[   37.219447]  [] netxen_start_firmware+0x1124/0x1170 
[netxen_nic]
[   37.262989]  [] ? netxen_show_diag_mode+0x50/0x50 
[netxen_nic]
[   37.306968]  [] ? netxen_nic_hw_write_wx_2M+0x180/0x180 
[netxen_nic]
[   37.352621]  [] ? netxen_nic_hw_read_wx_2M+0x7c/0x180 
[netxen_nic]
[   37.397967]  [] netxen_nic_probe+0x6f3/0x15f0 [netxen_nic]
[   37.439351]  [] ? native_sched_clock+0xf7/0x190
[   37.474980]  [] ? mark_lock+0xd6/0x860
[   37.505439]  [] ? netxen_nic_open+0xc0/0xc0 [netxen_nic]
[   37.545988]  [] ? _raw_spin_unlock_irqrestore+0x36/0x80
[   37.584974]  [] ? trace_hardirqs_on_caller+0x187/0x2b0
[   37.625444]  [] ? trace_hardirqs_on+0xd/0x10
[   37.658978]  [] ? _raw_spin_unlock_irqrestore+0x59/0x80
[   37.698937]  [] ? netxen_nic_open+0xc0/0xc0 [netxen_nic]
[   37.738975]  [] local_pci_probe+0x7a/0xd0
[   37.771447]  [] ? process_one_work+0x36f/0xb80
[   37.806447]  [] ? pci_device_shutdown+0xa0/0xa0
[   37.841483]  [] work_for_cpu_fn+0x2c/0x50
[   37.873443]  [] process_one_work+0x416/0xb80
[   37.908116]  [] ? process_one_work+0x36f/0xb80
[   37.943456]  [] ? flush_delayed_work+0x80/0x80
[   37.977968]  [] ? move_linked_works+0x83/0xb0
[   38.013461]  [] worker_thread+0x3cc/0x580
[   38.045479]  [] ? process_one_work+0xb80/0xb80
[   38.081445]  [] kthread+0x16e/0x180
[   38.110450]  [] ? flush_kthread_work+0x280/0x280
[   38.145996]  [] ? sched_clock+0x9/0x10
[   38.177466]  [] ? finish_task_switch+0x59/0x200
[   38.212477]  [] ? flush_kthread_work+0x280/0x280
[   38.248158]  [] ret_from_fork+0x58/0x90
[   38.279982]  [] ? flush_kthread_work+0x280/0x280
[   38.315480] Memory state around the buggy address:
[   38.344557]  8800360e7900: 00 00 00 00 00 00 00 00 00 00 f1 f1 f1 f1 04 
f4
[   38.386125]  8800360e7980: f4 f4 f2 f2 f2 f2 04 f4 f4 f4 f2 f2 f2 f2 00 
00
[   38.428978] >8800360e7a00: 00 00 f2 f2 f2 f2 00 00 00 00 f3 f3 f3 f3 00 
00
[   38.470442]  ^
[   38.505984]  8800360e7a80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 
00
[   38.547465]  8800360e7b00: 00 00 00 00 00 f1 f1 f1 f1 04 f4 f4 f4 f2 f2 
f2
[   38.590467] 
==

Signed-off-by: Jerome Marchand <jmarc...@redhat.com>
---
 drivers/net/ethernet/qlogic/netxen/netxen_nic_main.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/qlogic/netxen/netxen_nic_main.c 
b/drivers/net/ethernet/qlogic/netxen/netxen_nic_main.c
index 827de83..4d9cefc 100644
--- a/drivers/net/ethernet/qlogic/netxen/netxen_nic_main.c
+++ b/drivers/net/ethernet/qlogic/netxen/netxen_nic_main.c
@@ -842,7 +842,7 @@ netxen_check_options(struct netxen_adapter *adapter)
 {
u32 fw_major, fw_minor, fw_build, prev_fw_version;
char brd_name[NETXEN_MAX_SHORT_NAME];
-   char serial_num[32];
+   char serial_num[33];

Re: [PATCH v6 2/4] mm: Add functions to support extra actions on swap in/out

2017-03-01 Thread Jerome Marchand
On 02/28/2017 07:35 PM, Khalid Aziz wrote:
> If a processor supports special metadata for a page, for example ADI
> version tags on SPARC M7, this metadata must be saved when the page is
> swapped out. The same metadata must be restored when the page is swapped
> back in. This patch adds two new architecture specific functions -
> arch_do_swap_page() to be called when a page is swapped in,
> arch_unmap_one() to be called when a page is being unmapped for swap
> out.
> 
> Signed-off-by: Khalid Aziz <khalid.a...@oracle.com>
> Cc: Khalid Aziz <kha...@gonehiking.org>

This looks much better than your original version.

Acked-by: Jerome Marchand <jmarc...@redhat.com>

Thanks,
Jerome

> ---
> v5:
>   - Replaced set_swp_pte() function with new architecture
> functions arch_do_swap_page() and arch_unmap_one()
> 
>  include/asm-generic/pgtable.h | 16 
>  mm/memory.c   |  1 +
>  mm/rmap.c |  2 ++
>  3 files changed, 19 insertions(+)
> 
> diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
> index 18af2bc..5764d8f 100644
> --- a/include/asm-generic/pgtable.h
> +++ b/include/asm-generic/pgtable.h
> @@ -282,6 +282,22 @@ static inline int pmd_same(pmd_t pmd_a, pmd_t pmd_b)
>  #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
>  #endif
>  
> +#ifndef __HAVE_ARCH_DO_SWAP_PAGE
> +static inline void arch_do_swap_page(struct mm_struct *mm, unsigned long 
> addr,
> +  pte_t pte, pte_t orig_pte)
> +{
> +
> +}
> +#endif
> +
> +#ifndef __HAVE_ARCH_UNMAP_ONE
> +static inline void arch_unmap_one(struct mm_struct *mm, unsigned long addr,
> +   pte_t pte, pte_t orig_pte)
> +{
> +
> +}
> +#endif
> +
>  #ifndef __HAVE_ARCH_PGD_OFFSET_GATE
>  #define pgd_offset_gate(mm, addr)pgd_offset(mm, addr)
>  #endif
> diff --git a/mm/memory.c b/mm/memory.c
> index 6bf2b47..b086c76 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -2658,6 +2658,7 @@ int do_swap_page(struct vm_fault *vmf)
>   if (pte_swp_soft_dirty(vmf->orig_pte))
>   pte = pte_mksoft_dirty(pte);
>   set_pte_at(vma->vm_mm, vmf->address, vmf->pte, pte);
> + arch_do_swap_page(vma->vm_mm, vmf->address, pte, vmf->orig_pte);
>   vmf->orig_pte = pte;
>   if (page == swapcache) {
>   do_page_add_anon_rmap(page, vma, vmf->address, exclusive);
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 91619fd..192c41a 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1538,6 +1538,7 @@ static int try_to_unmap_one(struct page *page, struct 
> vm_area_struct *vma,
>   swp_pte = swp_entry_to_pte(entry);
>   if (pte_soft_dirty(pteval))
>   swp_pte = pte_swp_mksoft_dirty(swp_pte);
> + arch_unmap_one(mm, address, swp_pte, pteval);
>   set_pte_at(mm, address, pte, swp_pte);
>   } else if (PageAnon(page)) {
>   swp_entry_t entry = { .val = page_private(page) };
> @@ -1571,6 +1572,7 @@ static int try_to_unmap_one(struct page *page, struct 
> vm_area_struct *vma,
>   swp_pte = swp_entry_to_pte(entry);
>   if (pte_soft_dirty(pteval))
>   swp_pte = pte_swp_mksoft_dirty(swp_pte);
> + arch_unmap_one(mm, address, swp_pte, pteval);
>   set_pte_at(mm, address, pte, swp_pte);
>   } else
>   dec_mm_counter(mm, mm_counter_file(page));
> 




Re: [RFC] blk: increase logical_block_size to unsigned int

2017-01-09 Thread Jerome Marchand


- Original Message -
> From: "Sergey Senozhatsky" <sergey.senozhat...@gmail.com>
> To: "Minchan Kim" <minc...@kernel.org>
> Cc: "Jens Axboe" <ax...@kernel.dk>, "Hyeoncheol Lee" <cheol@lge.com>, 
> linux-bl...@vger.kernel.org,
> linux-kernel@vger.kernel.org, "Andrew Morton" <a...@linux-foundation.org>, 
> "Sergey Senozhatsky"
> <sergey.senozhat...@gmail.com>, "Robert Jennings" <r...@linux.vnet.ibm.com>, 
> "Jerome Marchand" <jmarc...@redhat.com>
> Sent: Monday, January 9, 2017 3:33:44 PM
> Subject: Re: [RFC] blk: increase logical_block_size to unsigned int
> 
> On (01/09/17 14:04), Minchan Kim wrote:
> > Mostly, zram is used as swap system on embedded world so it want to do IO
> > as PAGE_SIZE aligned/size IO unit. For that, one of the problem was
> > blk_queue_logical_block_size(zram->disk->queue, PAGE_SIZE) made overflow
> > in *64K page system* so [1] changed it to constant 4096.
> > Since that, partial IO can happen so zram should handle it which makes zram
> > complicated[2].
> > 
> 
> I thought that zram partial IO support is there because some file
> systems cannot cope with large logical_block_size. like FAT, for
> example. am I wrong?

Yes indeed. When we discussed the patch adding partial I/O support, increasing
the size of logical_block_size was considered. The reason we didn't take the
easy path was that not all block device users can handle 64k blocks. FAT is
one of them.

Jerome

> 
>   -ss
> 
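
For reference, the overflow mentioned above comes from logical_block_size
being an unsigned short in struct queue_limits (which is exactly what this
RFC proposes to widen): assigning a 64K PAGE_SIZE silently truncates to 0.
A minimal userspace sketch of the truncation:

#include <stdio.h>

int main(void)
{
	unsigned long page_size = 65536;		/* 64K page system */
	unsigned short logical_block_size = page_size;	/* truncates */

	/* Prints 0: blk_queue_logical_block_size(q, PAGE_SIZE) ended up
	 * storing 0 on 64K page systems, hence the constant 4096. */
	printf("%u\n", logical_block_size);
	return 0;
}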



Re: [PATCH] proc: Fix integer overflow of VmLib

2017-01-05 Thread Jerome Marchand
On 01/05/2017 12:29 AM, Richard Weinberger wrote:
> /proc/<pid>/status can report extremely high VmLib values which
> will confuse monitoring tools.
> VmLib is mm->exec_vm minus text size, where exec_vm is the number of
> bytes backed by an executable memory mapping and text size is
> mm->end_code - mm->start_code as set up by binfmt.
> 
> For the vast majority of all programs text size is smaller than exec_vm.
> But if a program interprets binaries on its own the calculation result
> can be negative.
> UserModeLinux is such an example. It installs and removes lots of PROT_EXEC
> mappings but mm->start_code and mm->end_code remain set and VmLib turns
> negative.
> 
> Fix this by detecting the overflow and just return 0.
> For interpreting the value reported by VmLib is anyway useless but
> returning 0 does at least not confuse userspace.

I guess returning 0 in such a case is good enough, but the description of
VmLib in Documentation/filesystems/proc.txt should be updated to warn
users not to rely too much on this value.

Jerome

> 
> Signed-off-by: Richard Weinberger 
> ---
>  fs/proc/task_mmu.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index 8f96a49178d0..220091c29aa6 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -46,6 +46,8 @@ void task_mem(struct seq_file *m, struct mm_struct *mm)
>  
>   text = (PAGE_ALIGN(mm->end_code) - (mm->start_code & PAGE_MASK)) >> 10;
>   lib = (mm->exec_vm << (PAGE_SHIFT-10)) - text;
> + if ((long)lib < 0)
> + lib = 0;
>   swap = get_mm_counter(mm, MM_SWAPENTS);
>   ptes = PTRS_PER_PTE * sizeof(pte_t) * atomic_long_read(>nr_ptes);
>   pmds = PTRS_PER_PMD * sizeof(pmd_t) * mm_nr_pmds(mm);
> 
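
To spell out the overflow: lib is unsigned, so whenever the text size
exceeds exec_vm the subtraction wraps around to a huge positive value,
and the (long) cast in the patch catches exactly that. A minimal
userspace sketch of the same arithmetic:

#include <stdio.h>

int main(void)
{
	unsigned long exec_vm_kb = 4;	/* tiny exec_vm after unmaps */
	unsigned long text_kb = 128;	/* end_code - start_code stays big */
	unsigned long lib = exec_vm_kb - text_kb;	/* wraps around */

	printf("raw VmLib: %lu kB\n", lib);	/* ~1.8e19, nonsense */
	if ((long)lib < 0)			/* the patch's check */
		lib = 0;
	printf("fixed VmLib: %lu kB\n", lib);
	return 0;
}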






Re: [RFC PATCH v3] sparc64: Add support for Application Data Integrity (ADI)

2017-01-05 Thread Jerome Marchand
On 01/04/2017 11:46 PM, Khalid Aziz wrote:
> ADI is a new feature supported on sparc M7 and newer processors to allow
> hardware to catch rogue accesses to memory. ADI is supported for data
> fetches only and not instruction fetches. An app can enable ADI on its
> data pages, set version tags on them and use versioned addresses to
> access the data pages. Upper bits of the address contain the version
> tag. On M7 processors, upper four bits (bits 63-60) contain the version
> tag. If a rogue app attempts to access ADI enabled data pages, its
> access is blocked and processor generates an exception.
> 
> This patch extends mprotect to enable ADI (TSTATE.mcde), enable/disable
> MCD (Memory Corruption Detection) on selected memory ranges, enable
> TTE.mcd in PTEs, return ADI parameters to userspace and save/restore ADI
> version tags on page swap out/in.  It also adds handlers for all traps
> related to MCD. ADI is not enabled by default for any task. A task must
> explicitly enable ADI on a memory range and set version tag for ADI to
> be effective for the task.
> 
> Signed-off-by: Khalid Aziz 
> Cc: Khalid Aziz 
> ---
> v2:
>   - Fixed a build error
> 
> v3:
>   - Removed CONFIG_SPARC_ADI
>   - Replaced prctl commands with mprotect
>   - Added auxiliary vectors for ADI parameters
>   - Enabled ADI for swappable pages
> 
>  Documentation/sparc/adi.txt | 239 
> 
>  arch/sparc/include/asm/adi.h|   6 +
>  arch/sparc/include/asm/adi_64.h |  46 ++
>  arch/sparc/include/asm/elf_64.h |   8 ++
>  arch/sparc/include/asm/hugetlb.h|  13 ++
>  arch/sparc/include/asm/hypervisor.h |   2 +
>  arch/sparc/include/asm/mman.h   |  40 +-
>  arch/sparc/include/asm/mmu_64.h |   2 +
>  arch/sparc/include/asm/mmu_context_64.h |  32 +
>  arch/sparc/include/asm/pgtable_64.h |  97 -
>  arch/sparc/include/asm/ttable.h |  10 ++
>  arch/sparc/include/asm/uaccess_64.h | 120 +++-
>  arch/sparc/include/uapi/asm/asi.h   |   5 +
>  arch/sparc/include/uapi/asm/auxvec.h|   8 ++
>  arch/sparc/include/uapi/asm/mman.h  |   2 +
>  arch/sparc/include/uapi/asm/pstate.h|  10 ++
>  arch/sparc/kernel/Makefile  |   1 +
>  arch/sparc/kernel/adi_64.c  |  93 +
>  arch/sparc/kernel/entry.h   |   3 +
>  arch/sparc/kernel/head_64.S |   1 +
>  arch/sparc/kernel/mdesc.c   |   4 +
>  arch/sparc/kernel/process_64.c  |  21 +++
>  arch/sparc/kernel/sun4v_mcd.S   |  16 +++
>  arch/sparc/kernel/traps_64.c| 142 ++-
>  arch/sparc/kernel/ttable_64.S   |   6 +-
>  arch/sparc/mm/gup.c |  37 +
>  arch/sparc/mm/tlb.c |  28 
>  arch/x86/kernel/signal_compat.c |   2 +-
>  include/asm-generic/pgtable.h   |   5 +
>  include/linux/mm.h  |   2 +
>  include/uapi/asm-generic/siginfo.h  |   5 +-
>  mm/memory.c |   2 +-
>  mm/rmap.c   |   4 +-

I haven't actually reviewed the code and looked at why you need the
set_swp_pte_at() function, but the code that adds the generic version of
this function needs to be separated from the rest of the patch. Also,
given the size of this patch, I suspect the rest needs to be broken up
into more patches too.

Jerome
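
For readers unfamiliar with the model described in the changelog, the
intended userspace flow is roughly the sketch below. This is only an
illustration: PROT_ADI and its value are assumptions based on this series
(check arch/sparc/include/uapi/asm/mman.h in the actual patch), and the
M7-specific "set tag" store is omitted.

#include <stdint.h>
#include <sys/mman.h>

#ifndef PROT_ADI
#define PROT_ADI 0x10	/* assumed value, for illustration only */
#endif

/* Put a version tag in bits 63-60 of an address, per the changelog. */
static void *versioned(void *addr, unsigned long tag)
{
	return (void *)(((uintptr_t)addr & ~(0xfUL << 60)) | (tag << 60));
}

int main(void)
{
	char *p = mmap(NULL, 8192, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	char *vp;

	if (p == MAP_FAILED)
		return 1;
	/* Enable ADI on the range... */
	if (mprotect(p, 8192, PROT_READ | PROT_WRITE | PROT_ADI))
		return 1;
	/* ...set the hardware version tag (M7-specific store omitted),
	 * then access the pages only through the versioned address. */
	vp = versioned(p, 0x5);
	vp[0] = 1;
	return 0;
}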





Re: [PATCH] swapfile: fix memory corruption via malformed swapfile

2016-11-01 Thread Jerome Marchand
On 10/31/2016 10:32 PM, Jann Horn wrote:
> When root activates a swap partition whose header has the wrong endianness,
> nr_badpages elements of badpages are swabbed before nr_badpages has been
> checked, leading to a buffer overrun of up to 8GB.
> 
> This normally is not a security issue because it can only be exploited by
> root (more specifically, a process with CAP_SYS_ADMIN or the ability to
> modify a swap file/partition), and such a process can already e.g. modify
> swapped-out memory of any other userspace process on the system.
> 
> Testcase for reproducing the bug (must be run as root, should crash your
> kernel):
> =
> #include <unistd.h>
> #include <stdlib.h>
> #include <stdio.h>
> #include <sys/swap.h>
> #include <string.h>
> #include <limits.h>
> #include <err.h>
> 
> #define PAGE_SIZE 4096
> #define __u32 unsigned int
> 
> 
> // from include/linux/swap.h
> union swap_header {
>   struct {
> char reserved[PAGE_SIZE - 10];
> char magic[10]; /* SWAP-SPACE or SWAPSPACE2 */
>   } magic;
>   struct {
> charbootbits[1024]; /* Space for disklabel etc. */
> __u32   version;
> __u32   last_page;
> __u32   nr_badpages;
> unsigned char sws_uuid[16];
> unsigned char sws_volume[16];
> __u32   padding[117];
> __u32   badpages[1];
>   } info;
> };
> 
> int main(void) {
>   char file[] = "/tmp/swapfile.XXXXXX";
>   int file_fd = mkstemp(file);
>   if (file_fd == -1)
> err(1, "mkstemp");
>   if (ftruncate(file_fd, PAGE_SIZE))
> err(1, "ftruncate");
>   union swap_header swap_header = {
> .info = {
>   .version = __builtin_bswap32(1),
>   .nr_badpages = __builtin_bswap32(INT_MAX)
> }
>   };
>   memcpy(swap_header.magic.magic, "SWAPSPACE2", 10);
>   if (write(file_fd, &swap_header, sizeof(swap_header)) !=
>   sizeof(swap_header))
> err(1, "write");
> 
>   // not because the attack needs it, just in case you forgot to
>   // sync yourself before crashing your machine
>   sync();
> 
>   // now die
>   if (swapon(file, 0))
> err(1, "swapon");
>   puts("huh, we survived");
>   if (swapoff(file))
> err(1, "swapoff");
>   unlink(file);
> }
> =
> 
> Cc: sta...@vger.kernel.org
> Signed-off-by: Jann Horn <j...@thejh.net>
> ---
>  mm/swapfile.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 2210de290b54..f30438970cd1 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -2224,6 +2224,8 @@ static unsigned long read_swap_header(struct 
> swap_info_struct *p,
>   swab32s(&swap_header->info.version);
>   swab32s(&swap_header->info.last_page);
>   swab32s(&swap_header->info.nr_badpages);
> + if (swap_header->info.nr_badpages > MAX_SWAP_BADPAGES)
> + return 0;
>   for (i = 0; i < swap_header->info.nr_badpages; i++)
>   swab32s(&swap_header->info.badpages[i]);
>   }
> 

Nice catch!

Acked-by: Jerome Marchand <jmarc...@redhat.com>







[PATCH] platform driver: fix use-after-free in platform_device_del()

2016-07-25 Thread Jerome Marchand
In platform_device_del(), the device is still used after a call to
device_del(). At this point there is no guarantee that the device is
still there and there could be a use-after-free access. Move the
call to device_remove_properties() before device_del() to fix that.

Signed-off-by: Jerome Marchand <jmarc...@redhat.com>
---
 drivers/base/platform.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/base/platform.c b/drivers/base/platform.c
index 6482d47..f57fff3 100644
--- a/drivers/base/platform.c
+++ b/drivers/base/platform.c
@@ -434,6 +434,7 @@ void platform_device_del(struct platform_device *pdev)
int i;
 
if (pdev) {
+   device_remove_properties(&pdev->dev);
device_del(&pdev->dev);
 
if (pdev->id_auto) {
@@ -446,8 +447,6 @@ void platform_device_del(struct platform_device *pdev)
if (r->parent)
release_resource(r);
}
-
-   device_remove_properties(&pdev->dev);
}
 }
 EXPORT_SYMBOL_GPL(platform_device_del);
-- 
2.5.5





Re: System freezes after OOM

2016-07-13 Thread Jerome Marchand
On 07/13/2016 01:44 AM, Mikulas Patocka wrote:
> The problem of swapping to dm-crypt is this.
> 
> The free memory goes low, kswapd decides that some page should be swapped 
> out. However, when you swap to an ecrypted device, writeback of each page 
> requires another page to hold the encrypted data. dm-crypt uses mempools 
> for all its structures and pages, so that it can make forward progress 
> even if there is no memory free. However, the mempool code first allocates 
> from general memory allocator and resorts to the mempool only if the 
> memory is below limit.
> 
> So every attempt to swap out some page allocates another page.
> 
> As long as swapping is in progress, the free memory is below the limit 
> (because the swapping activity itself consumes any memory over the limit). 
> And that triggered the OOM killer prematurely.

There is a fairly recent vm sysctl knob that I believe can help in this
case: watermark_scale_factor. If you increase this value, kswapd will
start paging out earlier, while there is still enough free memory.

Ondrej, have you tried to increase /proc/sys/vm/watermark_scale_factor?

Jerome
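
For reference, the knob lives in /proc/sys/vm/watermark_scale_factor. The
default of 10 keeps the watermarks 0.1% of memory apart; raising it widens
the gap (e.g. 200 means 2%), so kswapd wakes up with more headroom. A
minimal sketch of bumping it from a program:

#include <stdio.h>

int main(void)
{
	FILE *f = fopen("/proc/sys/vm/watermark_scale_factor", "w");

	if (!f) {
		perror("watermark_scale_factor");
		return 1;
	}
	/* 200 = 2% of memory between watermarks (default 10 = 0.1%). */
	fprintf(f, "200\n");
	fclose(f);
	return 0;
}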

> 
> 
> On Tue, 12 Jul 2016, Michal Hocko wrote:
> 
>> On Mon 11-07-16 11:43:02, Mikulas Patocka wrote:
>> [...]
>>> The general problem is that the memory allocator does 16 retries to 
>>> allocate a page and then triggers the OOM killer (and it doesn't take into 
>>> account how much swap space is free or how many dirty pages were really 
>>> swapped out while it waited).
>>
>> Well, that is not how it works exactly. We retry as long as there is a
>> reclaim progress (at least one page freed) back off only if the
>> reclaimable memory can exceed watermks which is scaled down in 16
>> retries. The overal size of free swap is not really that important if we
>> cannot swap out like here due to complete memory reserves depletion:
>> https://okozina.fedorapeople.org/bugs/swap_on_dmcrypt/vmlog-1462458369-0/sample-00011/dmesg:
>> [   90.491276] Node 0 DMA free:0kB min:60kB low:72kB high:84kB 
>> active_anon:4096kB inactive_anon:4636kB active_file:212kB 
>> inactive_file:280kB unevictable:488kB isolated(anon):0kB isolated(file):0kB 
>> present:15992kB managed:15908kB mlocked:488kB dirty:276kB writeback:4636kB 
>> mapped:476kB shmem:12kB slab_reclaimable:204kB slab_unreclaimable:4700kB 
>> kernel_stack:48kB pagetables:120kB unstable:0kB bounce:0kB free_pcp:0kB 
>> local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:61132 
>> all_unreclaimable? yes
>> [   90.491283] lowmem_reserve[]: 0 977 977 977
>> [   90.491286] Node 0 DMA32 free:0kB min:3828kB low:4824kB high:5820kB 
>> active_anon:423820kB inactive_anon:424916kB active_file:17996kB 
>> inactive_file:21800kB unevictable:20724kB isolated(anon):384kB 
>> isolated(file):0kB present:1032184kB managed:1001260kB mlocked:20724kB 
>> dirty:25236kB writeback:49972kB mapped:23076kB shmem:1364kB 
>> slab_reclaimable:13796kB slab_unreclaimable:43008kB kernel_stack:2816kB 
>> pagetables:7320kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB 
>> free_cma:0kB writeback_tmp:0kB pages_scanned:5635400 all_unreclaimable? yes
>>
>> Look at the amount of free memory. It is completely depleted. So it
>> smells like a process which has access to memory reserves has consumed
>> all of it. I suspect a __GFP_MEMALLOC resp. PF_MEMALLOC from softirq
>> context user which went off the leash.
> 
> It is caused by the commit f9054c70d28bc214b2857cf8db8269f4f45a5e23. Prior 
> to this commit, mempool allocations set __GFP_NOMEMALLOC, so they never 
> exhausted reserved memory. With this commit, mempool allocations drop 
> __GFP_NOMEMALLOC, so they can dig deeper (if the process has PF_MEMALLOC, 
> they can bypass all limits).
> 
> But swapping should proceed even if there is no memory free. There is a 
> comment "TODO: this could cause a theoretical memory reclaim deadlock in 
> the swap out path." in the function add_to_swap - but apart from that, 
> swap should proceed even with no available memory, as long as all the 
> drivers in the block layer use mempools.
> 
>>> So, it could prematurely trigger OOM killer on any slow swapping device 
>>> (including dm-crypt). Michal Hocko reworked the OOM killer in the patch 
>>> 0a0337e0d1d134465778a16f5cbea95086e8e9e0, but it still has the flaw that 
>>> it triggers OOM if there is plenty of free swap space free.
>>>
>>> Michal, would you accept a change to the OOM killer, to prevent it from 
>>> triggerring when there is free swap space?
>>
>> No this doesn't sound like a proper solution. The current decision
>> logic, as explained above relies on the feedback from the reclaim. A
>> free swap space doesn't really mean we can make a forward progress.
> 
> I'm interested - why would you need to trigger the OOM killer if there is 
> free swap space?
> 
> The only possibility is that all the memory is filled with unswappable 
> kernel pages - but that condition could be detected if there is unusually 
> low number 


[PATCH] Documentation: add watermark_scale_factor to the list of vm sysctl files

2016-07-12 Thread Jerome Marchand
Commit 795ae7a0de6b ("mm: scale kswapd watermarks in proportion to
memory") properly added the description of the new knob to
Documentation/sysctl/vm.txt, but forgot to add it to the list of files
in /proc/sys/vm. Let's fix that.

Signed-off-by: Jerome Marchand <jmarc...@redhat.com>
---
 Documentation/sysctl/vm.txt | 1 +
 1 file changed, 1 insertion(+)

diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
index 720355c..95ccbe6 100644
--- a/Documentation/sysctl/vm.txt
+++ b/Documentation/sysctl/vm.txt
@@ -61,6 +61,7 @@ files can be found in mm/swap.c.
 - swappiness
 - user_reserve_kbytes
 - vfs_cache_pressure
+- watermark_scale_factor
 - zone_reclaim_mode
 
 ==
-- 
2.5.5





[PATCH 1/2] cifs: use CIFS_MAX_DOMAINNAME_LEN when converting the domain name

2016-05-26 Thread Jerome Marchand
Currently in build_ntlmssp_auth_blob(), when converting the domain
name to UTF16, the CIFS_MAX_USERNAME_LEN limit is used. It should be
CIFS_MAX_DOMAINNAME_LEN. This patch fixes that.

Signed-off-by: Jerome Marchand <jmarc...@redhat.com>
---
 fs/cifs/sess.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/cifs/sess.c b/fs/cifs/sess.c
index af0ec2d..c3d086e 100644
--- a/fs/cifs/sess.c
+++ b/fs/cifs/sess.c
@@ -430,7 +430,7 @@ int build_ntlmssp_auth_blob(unsigned char *pbuffer,
} else {
int len;
len = cifs_strtoUTF16((__le16 *)tmp, ses->domainName,
- CIFS_MAX_USERNAME_LEN, nls_cp);
+ CIFS_MAX_DOMAINNAME_LEN, nls_cp);
len *= 2; /* unicode is 2 bytes each */
sec_blob->DomainName.BufferOffset = cpu_to_le32(tmp - pbuffer);
sec_blob->DomainName.Length = cpu_to_le16(len);
-- 
2.5.5





[PATCH 2/2] cifs: dynamic allocation of ntlmssp blob

2016-05-26 Thread Jerome Marchand
In sess_auth_rawntlmssp_authenticate(), the ntlmssp blob is allocated
statically and its size is an "empirical" 5*sizeof(struct
_AUTHENTICATE_MESSAGE) (320B on x86_64). I don't know where this value
comes from or if it was ever appropriate, but it is currently
insufficient: the user and domain name in UTF16 could take 1kB by
themselves. Because of that, build_ntlmssp_auth_blob() might corrupt
memory (out-of-bounds write). The size of ntlmssp_blob in
SMB2_sess_setup() is too small too (sizeof(struct _NEGOTIATE_MESSAGE)
+ 500).

This patch allocates the blob dynamically in
build_ntlmssp_auth_blob().
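
To put numbers on the "1kB by themselves" claim, here is a
back-of-the-envelope sketch using the limits from cifsglob.h (both
CIFS_MAX_USERNAME_LEN and CIFS_MAX_DOMAINNAME_LEN are 256):

/* Worst-case UTF16 payload of the two names alone, 2 bytes per char: */
#define CIFS_MAX_USERNAME_LEN	256
#define CIFS_MAX_DOMAINNAME_LEN	256

static int worst_case_name_bytes(void)
{
	return 2 * CIFS_MAX_USERNAME_LEN +
	       2 * CIFS_MAX_DOMAINNAME_LEN;	/* 1024 bytes */
}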

Signed-off-by: Jerome Marchand <jmarc...@redhat.com>
---
 fs/cifs/ntlmssp.h |  2 +-
 fs/cifs/sess.c| 76 ++-
 fs/cifs/smb2pdu.c | 10 ++--
 3 files changed, 45 insertions(+), 43 deletions(-)

diff --git a/fs/cifs/ntlmssp.h b/fs/cifs/ntlmssp.h
index 848249f..3079b38 100644
--- a/fs/cifs/ntlmssp.h
+++ b/fs/cifs/ntlmssp.h
@@ -133,6 +133,6 @@ typedef struct _AUTHENTICATE_MESSAGE {
 
 int decode_ntlmssp_challenge(char *bcc_ptr, int blob_len, struct cifs_ses 
*ses);
 void build_ntlmssp_negotiate_blob(unsigned char *pbuffer, struct cifs_ses 
*ses);
-int build_ntlmssp_auth_blob(unsigned char *pbuffer, u16 *buflen,
+int build_ntlmssp_auth_blob(unsigned char **pbuffer, u16 *buflen,
struct cifs_ses *ses,
const struct nls_table *nls_cp);
diff --git a/fs/cifs/sess.c b/fs/cifs/sess.c
index c3d086e..463bed7 100644
--- a/fs/cifs/sess.c
+++ b/fs/cifs/sess.c
@@ -364,19 +364,43 @@ void build_ntlmssp_negotiate_blob(unsigned char *pbuffer,
sec_blob->DomainName.MaximumLength = 0;
 }
 
-/* We do not malloc the blob, it is passed in pbuffer, because its
-   maximum possible size is fixed and small, making this approach cleaner.
-   This function returns the length of the data in the blob */
-int build_ntlmssp_auth_blob(unsigned char *pbuffer,
+int size_of_ntlmssp_blob(struct cifs_ses *ses)
+{
+   int sz = sizeof(AUTHENTICATE_MESSAGE) + ses->auth_key.len
+   - CIFS_SESS_KEY_SIZE + CIFS_CPHTXT_SIZE + 2;
+
+   if (ses->domainName)
+   sz += 2 * strnlen(ses->domainName, CIFS_MAX_DOMAINNAME_LEN);
+   else
+   sz += 2;
+
+   if (ses->user_name)
+   sz += 2 * strnlen(ses->user_name, CIFS_MAX_USERNAME_LEN);
+   else
+   sz += 2;
+
+   return sz;
+}
+
+int build_ntlmssp_auth_blob(unsigned char **pbuffer,
u16 *buflen,
   struct cifs_ses *ses,
   const struct nls_table *nls_cp)
 {
int rc;
-   AUTHENTICATE_MESSAGE *sec_blob = (AUTHENTICATE_MESSAGE *)pbuffer;
+   AUTHENTICATE_MESSAGE *sec_blob;
__u32 flags;
unsigned char *tmp;
 
+   rc = setup_ntlmv2_rsp(ses, nls_cp);
+   if (rc) {
+   cifs_dbg(VFS, "Error %d during NTLMSSP authentication\n", rc);
+   *buflen = 0;
+   goto setup_ntlmv2_ret;
+   }
+   *pbuffer = kmalloc(size_of_ntlmssp_blob(ses), GFP_KERNEL);
+   sec_blob = (AUTHENTICATE_MESSAGE *)*pbuffer;
+
memcpy(sec_blob->Signature, NTLMSSP_SIGNATURE, 8);
sec_blob->MessageType = NtLmAuthenticate;
 
@@ -391,7 +415,7 @@ int build_ntlmssp_auth_blob(unsigned char *pbuffer,
flags |= NTLMSSP_NEGOTIATE_KEY_XCH;
}
 
-   tmp = pbuffer + sizeof(AUTHENTICATE_MESSAGE);
+   tmp = *pbuffer + sizeof(AUTHENTICATE_MESSAGE);
sec_blob->NegotiateFlags = cpu_to_le32(flags);
 
sec_blob->LmChallengeResponse.BufferOffset =
@@ -399,13 +423,9 @@ int build_ntlmssp_auth_blob(unsigned char *pbuffer,
sec_blob->LmChallengeResponse.Length = 0;
sec_blob->LmChallengeResponse.MaximumLength = 0;
 
-   sec_blob->NtChallengeResponse.BufferOffset = cpu_to_le32(tmp - pbuffer);
+   sec_blob->NtChallengeResponse.BufferOffset =
+   cpu_to_le32(tmp - *pbuffer);
if (ses->user_name != NULL) {
-   rc = setup_ntlmv2_rsp(ses, nls_cp);
-   if (rc) {
-   cifs_dbg(VFS, "Error %d during NTLMSSP 
authentication\n", rc);
-   goto setup_ntlmv2_ret;
-   }
memcpy(tmp, ses->auth_key.response + CIFS_SESS_KEY_SIZE,
ses->auth_key.len - CIFS_SESS_KEY_SIZE);
tmp += ses->auth_key.len - CIFS_SESS_KEY_SIZE;
@@ -423,7 +443,7 @@ int build_ntlmssp_auth_blob(unsigned char *pbuffer,
}
 
if (ses->domainName == NULL) {
-   sec_blob->DomainName.BufferOffset = cpu_to_le32(tmp - pbuffer);
+   sec_blob->DomainName.BufferOffset = cpu_to_le32(tmp - *pbuffer);
sec_blob->DomainName.Length = 0;
 


BUG: drm, nouveau: slab-out-of-bounds read access in nv50_fbcon_imageblit()

2016-05-24 Thread Jerome Marchand

While testing a kernel with KASan enabled I've encountered several
out-of-bounds read warnings in the nouveau driver. They seem to be
caused by inconsistent alignment requirements.

The function soft_cursor() (which I assume draws the cursor on the
console) calls fb_get_buffer_offset(), which makes sure there is still
room in the pixmap buffer (allocated in do_register_framebuffer()) to
copy the data (I assume a pixmap of the cursor). After the copy,
soft_cursor() sets image.data to point to the copied data in the
buffer (buf + offset) and calls nouveau_fbcon_imageblit(), which in
turn calls nv50_fbcon_imageblit(). However, in soft_cursor() the data
is only aligned to 8 bits, while in nv50_fbcon_imageblit() the
alignment requirement is 32 bits. For an 8x16 cursor, the data copied
to the buffer in soft_cursor() is only 16 bytes, while
nv50_fbcon_imageblit() tries to read 64 bytes.

Someone has already reported the same issue on nvc0_fbcon_imageblit():
https://lists.freedesktop.org/archives/dri-devel/2015-November/095100.html
nv04_fbcon_imageblit() is probably affected too.

Here is the KASan report. It's from a modified RHEL7 kernel, but the
relevant code is the same as upstream.

[   38.367524] 
== 
[   38.367538] BUG: KASAN: slab-out-of-bounds in memcpy+0x1d/0x40 at addr 
8800957f6230 
[   38.367542] Read of size 64 by task kworker/0:2/68 
[   38.367545] 
= 
[   38.367549] BUG kmalloc-8192 (Tainted: G  I  ): 
kasan: bad access detected 
[   38.367551] 
- 
[   38.367551]  
[   38.367552] Disabling lock debugging due to kernel taint 
[   38.367562] INFO: Allocated in register_framebuffer+0x4b9/0x5a0 age=25205 
cpu=0 pid=267 
[   38.367566]  __slab_alloc+0x248/0x5f0 
[   38.367571]  kmem_cache_alloc_trace+0x278/0x390 
[   38.367575]  register_framebuffer+0x4b9/0x5a0 
[   38.367597]  drm_fb_helper_initial_config+0x54c/0x810 [drm_kms_helper] 
[   38.367725]  nouveau_fbcon_init+0x154/0x190 [nouveau] 
[   38.367841]  nouveau_drm_load+0x6bf/0x9f0 [nouveau] 
[   38.367883]  drm_dev_register+0xc9/0x160 [drm] 
[   38.367923]  drm_get_pci_dev+0xcc/0x3a0 [drm] 
[   38.368039]  nouveau_drm_probe+0x3bb/0x4f0 [nouveau] 
[   38.368043]  local_pci_probe+0x7a/0xd0 
[   38.368047]  pci_device_probe+0x1b9/0x210 
[   38.368054]  driver_probe_device+0xc6/0x530 
[   38.368059]  __driver_attach+0xcb/0xd0 
[   38.368063]  bus_for_each_dev+0xfc/0x180 
[   38.368068]  driver_attach+0x2b/0x30 
[   38.368072]  bus_add_driver+0x348/0x440 
[   38.368077] INFO: Slab 0xea000255fc00 objects=3 used=3 fp=0x  
(null) flags=0x1f4080 
[   38.368080] INFO: Object 0x8800957f4260 @offset=16992 fp=0x  
(null) 
[   38.368080]  
[   38.368086] Bytes b4 8800957f4250: 02 00 00 00 a6 01 00 00 cd a8 fb ff 
00 00 00 00   
[   38.368091] Object 8800957f4260: 00 00 00 00 00 00 00 00 00 00 00 00 00 
00 00 00   
[   38.368095] Object 8800957f4270: 00 00 00 00 00 00 00 00 00 00 00 00 00 
00 00 00   
[   38.368100] Object 8800957f4280: 00 00 00 00 00 00 00 00 00 00 00 00 00 
00 00 00   
[   38.368104] Object 8800957f4290: 00 00 00 00 00 00 00 00 00 00 00 00 00 
00 00 00   
[   38.368109] Object 8800957f42a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 
00 00 00   
[   38.368113] Object 8800957f42b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 
00 00 00   
[   38.368117] Object 8800957f42c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 
00 00 00   
[   38.368122] Object 8800957f42d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 
00 00 00   
[   38.368126] Object 8800957f42e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 
00 00 00   
[   38.368131] Object 8800957f42f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 
00 00 00   
[   38.368135] Object 8800957f4300: 00 00 00 00 00 00 00 00 00 00 00 00 00 
00 00 00   
[   38.368139] Object 8800957f4310: 00 00 00 00 00 00 00 00 00 00 00 00 00 
00 00 00   
[   38.368144] Object 8800957f4320: 00 00 00 00 00 00 00 00 00 00 00 00 00 
00 00 00   
[   38.368148] Object 8800957f4330: 00 00 00 00 00 00 00 00 00 00 00 00 00 
00 00 00   
[   38.368153] Object 8800957f4340: 00 00 00 00 00 00 00 00 00 00 00 00 00 
00 00 00   
[   38.368157] Object 8800957f4350: 00 00 00 00 00 00 00 00 00 00 00 00 00 
00 00 00   
[   38.368162] Object 8800957f4360: 00 00 00 00 00 00 00 00 00 00 00 00 00 
00 00 00   
[   38.368166] Object 8800957f4370: 00 00 00 00 00 00 00 00 00 00 00 00 00 
00 00 00   
[   38.368170] Object 8800957f4380: 00 


Re: [PATCHv7 00/29] THP-enabled tmpfs/shmem using compound pages

2016-04-19 Thread Jerome Marchand
On 04/19/2016 12:55 AM, Shi, Yang wrote:
> 2. I ran my THP test (generated a program with 4MB text section) on both
> x86-64 and ARM64 with yours and Hugh's patches (linux-next tree), I got
> the program execution time reduced by ~12% on x86-64, it looks very
> impressive.
> 
> But, on ARM64, there is just ~3% change, and sometimes huge tmpfs may
> show even worse data than non-hugepage.
> 
> Both yours and Hugh's patches has the same behavior.
> 
> Any idea?

Just a shot in the dark, but what page size do you use? If you use 4k
pages, then the hugepage size should be the same as on x86 (2MB) and
similar behavior could be expected. Otherwise (with 64k base pages a
PMD-level hugepage is 512MB), hugepages would be too big for your test
program to take advantage of.

Jerome
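
A quick way to confirm which case the test machine is in (sketch):

#include <stdio.h>
#include <unistd.h>

int main(void)
{
	long page = sysconf(_SC_PAGESIZE);

	/* 4096 -> 2MB PMD-level hugepages as on x86-64; 65536 -> 512MB. */
	printf("base page size: %ld bytes\n", page);
	return 0;
}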





Re: [PATCH v2] procfs: expose umask in /proc/<pid>/status

2016-04-14 Thread Jerome Marchand
On 04/14/2016 01:08 PM, Richard W.M. Jones wrote:
> It's not possible to read the process umask without also modifying it,
> which is what umask(2) does.  A library cannot read umask safely,
> especially if the main program might be multithreaded.
> 
> Add a new status line ("Umask") in /proc/<pid>/status.  It contains
> the file mode creation mask (umask) in octal.  It is only shown for
> tasks which have task->fs.
> 
> This patch is adapted from one originally written by Pierre Carrier.
> 
> Signed-off-by: Richard W.M. Jones <rjo...@redhat.com>

Acked-by: Jerome Marchand <jmarc...@redhat.com>

> ---
>  Documentation/filesystems/proc.txt |  1 +
>  fs/proc/array.c| 20 +++-
>  2 files changed, 20 insertions(+), 1 deletion(-)
> 
> diff --git a/Documentation/filesystems/proc.txt 
> b/Documentation/filesystems/proc.txt
> index 7f5607a..e8d0075 100644
> --- a/Documentation/filesystems/proc.txt
> +++ b/Documentation/filesystems/proc.txt
> @@ -225,6 +225,7 @@ Table 1-2: Contents of the status files (as of 4.1)
>   TracerPid   PID of process tracing this process (0 if not)
>   Uid Real, effective, saved set, and  file system 
> UIDs
>   Gid Real, effective, saved set, and  file system 
> GIDs
> + Umask   file mode creation mask
>   FDSize  number of file descriptor slots currently 
> allocated
>   Groups  supplementary group list
>   NStgid  descendant namespace thread group ID hierarchy
> diff --git a/fs/proc/array.c b/fs/proc/array.c
> index b6c00ce..88c7de1 100644
> --- a/fs/proc/array.c
> +++ b/fs/proc/array.c
> @@ -83,6 +83,7 @@
>  #include 
>  #include 
> +#include <linux/fs_struct.h>
> +#include 
>  
>  #include 
>  #include 
> @@ -139,12 +140,25 @@ static inline const char *get_task_state(struct 
> task_struct *tsk)
>   return task_state_array[fls(state)];
>  }
>  
> +static inline int get_task_umask(struct task_struct *tsk)
> +{
> + struct fs_struct *fs;
> + int umask = -ENOENT;
> +
> + task_lock(tsk);
> + fs = tsk->fs;
> + if (fs)
> + umask = fs->umask;
> + task_unlock(tsk);
> + return umask;
> +}
> +
>  static inline void task_state(struct seq_file *m, struct pid_namespace *ns,
>   struct pid *pid, struct task_struct *p)
>  {
>   struct user_namespace *user_ns = seq_user_ns(m);
>   struct group_info *group_info;
> - int g;
> + int g, umask;
>   struct task_struct *tracer;
>   const struct cred *cred;
>   pid_t ppid, tpid = 0, tgid, ngid;
> @@ -162,6 +176,10 @@ static inline void task_state(struct seq_file *m, struct 
> pid_namespace *ns,
>   ngid = task_numa_group_id(p);
>   cred = get_task_cred(p);
>  
> + umask = get_task_umask(p);
> + if (umask >= 0)
> + seq_printf(m, "Umask:\t%#04o\n", umask);
> +
>   task_lock(p);
>   if (p->files)
>   max_fds = files_fdtable(p->files)->max_fds;
> 








Re: [PATCH] procfs: expose umask in /proc/<pid>/status

2016-04-14 Thread Jerome Marchand
On 04/14/2016 11:34 AM, Richard W.M. Jones wrote:
> It's not possible to read the process umask without also modifying it,
> which is what umask(2) does.  A library cannot read umask safely,
> especially if the main program might be multithreaded.
> 
> Add a new status line ("Umask") in /proc/<pid>/status.  It contains
> the file mode creation mask (umask) in octal.  It is only shown for
> tasks which have task->fs.
> 
> This patch is adapted from one originally written by Pierre Carrier.
> 
> Signed-off-by: Richard W.M. Jones 
> ---
>  Documentation/filesystems/proc.txt |  1 +
>  fs/proc/array.c| 20 +++-
>  2 files changed, 20 insertions(+), 1 deletion(-)
> 
> diff --git a/Documentation/filesystems/proc.txt 
> b/Documentation/filesystems/proc.txt
> index 7f5607a..e8d0075 100644
> --- a/Documentation/filesystems/proc.txt
> +++ b/Documentation/filesystems/proc.txt
> @@ -225,6 +225,7 @@ Table 1-2: Contents of the status files (as of 4.1)
>   TracerPid   PID of process tracing this process (0 if not)
>   Uid Real, effective, saved set, and  file system 
> UIDs
>   Gid Real, effective, saved set, and  file system 
> GIDs
> + Umask   file mode creation mask
>   FDSize  number of file descriptor slots currently 
> allocated
>   Groups  supplementary group list
>   NStgid  descendant namespace thread group ID hierarchy
> diff --git a/fs/proc/array.c b/fs/proc/array.c
> index b6c00ce..03e8d3f 100644
> --- a/fs/proc/array.c
> +++ b/fs/proc/array.c
> @@ -83,6 +83,7 @@
>  #include 
>  #include 
>  #include 
> +#include <linux/fs_struct.h>
>  
>  #include 
>  #include 
> @@ -139,12 +140,25 @@ static inline const char *get_task_state(struct 
> task_struct *tsk)
>   return task_state_array[fls(state)];
>  }
>  
> +static inline int get_task_umask(struct task_struct *tsk)
> +{
> + struct fs_struct *fs;
> + int umask = -ENOENT;
> +
> + task_lock(tsk);
> + fs = tsk->fs;
> + if (fs)
> + umask = fs->umask;
> + task_unlock(tsk);
> + return umask;
> +}
> +
>  static inline void task_state(struct seq_file *m, struct pid_namespace *ns,
>   struct pid *pid, struct task_struct *p)
>  {
>   struct user_namespace *user_ns = seq_user_ns(m);
>   struct group_info *group_info;
> - int g;
> + int g, umask;
>   struct task_struct *tracer;
>   const struct cred *cred;
>   pid_t ppid, tpid = 0, tgid, ngid;
> @@ -162,6 +176,10 @@ static inline void task_state(struct seq_file *m, struct 
> pid_namespace *ns,
>   ngid = task_numa_group_id(p);
>   cred = get_task_cred(p);
>  
> + umask = get_task_umask(p);
> + if (umask >= 0)
> + seq_printf(m, "Umask:\t0%o\n", umask);

It seems to me that umasks are usually displayed in the form 0XXX, such
as in the output of the umask command. So what about:

seq_printf(m, "Umask:\t%#04o\n", umask);

Provided printk() supports those flags, of course.
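
A quick userspace check of the two formats (glibc printf accepts the
same %o flags being discussed; this is an illustration, not the kernel
code):

#include <stdio.h>

int main(void)
{
	unsigned int mask = 022;		/* a typical umask */

	printf("Umask:\t0%o\n", mask);		/* prints "Umask:\t022"  */
	printf("Umask:\t%#04o\n", mask);	/* prints "Umask:\t0022" */
	return 0;
}

The # flag supplies the leading 0 and the 04 zero-pads to the familiar
four-digit form.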

Thanks,
Jerome

> +
>   task_lock(p);
>   if (p->files)
>   max_fds = files_fdtable(p->files)->max_fds;
> 








[PATCH V2] assoc_array: don't call compare_object() on a node

2016-02-23 Thread Jerome Marchand
Changes since V1: fixed the description and added KASan warning.

In assoc_array_insert_into_terminal_node(), we call the
compare_object() method on all non-empty slots, even when they're
not leaves, passing a pointer to an unexpected structure to
compare_object(). Currently it causes an out-of-bounds read access
in keyring_compare_object() detected by KASan (see below). The issue
is easily reproduced with the keyutils testsuite.
Only call compare_object() when the slot is a leaf.

KASan warning:
==
BUG: KASAN: slab-out-of-bounds in keyring_compare_object+0x213/0x240 at addr 
880060a6f838
Read of size 8 by task keyctl/1655
=
BUG kmalloc-192 (Not tainted): kasan: bad access detected
-

Disabling lock debugging due to kernel taint
INFO: Allocated in assoc_array_insert+0xfd0/0x3a60 age=69 cpu=1 pid=1647
___slab_alloc+0x563/0x5c0
__slab_alloc+0x51/0x90
kmem_cache_alloc_trace+0x263/0x300
assoc_array_insert+0xfd0/0x3a60
__key_link_begin+0xfc/0x270
key_create_or_update+0x459/0xaf0
SyS_add_key+0x1ba/0x350
entry_SYSCALL_64_fastpath+0x12/0x76
INFO: Slab 0xea0001829b80 objects=16 used=8 fp=0x880060a6f550 
flags=0x3fff804080
INFO: Object 0x880060a6f740 @offset=5952 fp=0x880060a6e5d1

Bytes b4 880060a6f730: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  

Object 880060a6f740: d1 e5 a6 60 00 88 ff ff 0e 00 00 00 00 00 00 00  
...`
Object 880060a6f750: 02 cf 8e 60 00 88 ff ff 02 c0 8e 60 00 88 ff ff  
...`...`
Object 880060a6f760: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  

Object 880060a6f770: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  

Object 880060a6f780: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  

Object 880060a6f790: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  

Object 880060a6f7a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  

Object 880060a6f7b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  

Object 880060a6f7c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  

Object 880060a6f7d0: 02 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  

Object 880060a6f7e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  

Object 880060a6f7f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  

CPU: 0 PID: 1655 Comm: keyctl Tainted: GB   4.5.0-rc4-kasan+ #291
Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
  1b2800b4 880060a179e0 81b60491
 88006c802900 880060a6f740 880060a17a10 815e2969
 88006c802900 ea0001829b80 880060a6f740 880060a6e650
Call Trace:
 [] dump_stack+0x85/0xc4
 [] print_trailer+0xf9/0x150
 [] object_err+0x34/0x40
 [] kasan_report_error+0x230/0x550
 [] ? keyring_get_key_chunk+0x13e/0x210
 [] __asan_report_load_n_noabort+0x5d/0x70
 [] ? keyring_compare_object+0x213/0x240
 [] keyring_compare_object+0x213/0x240
 [] assoc_array_insert+0x86c/0x3a60
 [] ? assoc_array_cancel_edit+0x70/0x70
 [] ? __key_link_begin+0x20d/0x270
 [] __key_link_begin+0xfc/0x270
 [] key_create_or_update+0x459/0xaf0
 [] ? trace_hardirqs_on+0xd/0x10
 [] ? key_type_lookup+0xc0/0xc0
 [] ? lookup_user_key+0x13d/0xcd0
 [] ? memdup_user+0x53/0x80
 [] SyS_add_key+0x1ba/0x350
 [] ? key_get_type_from_user.constprop.6+0xa0/0xa0
 [] ? retint_user+0x18/0x23
 [] ? trace_hardirqs_on_caller+0x3fe/0x580
 [] ? trace_hardirqs_on_thunk+0x17/0x19
 [] entry_SYSCALL_64_fastpath+0x12/0x76
Memory state around the buggy address:
 880060a6f700: fc fc fc fc fc fc fc fc 00 00 00 00 00 00 00 00
 880060a6f780: 00 00 00 00 00 00 00 00 00 00 00 fc fc fc fc fc
>880060a6f800: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
^
 880060a6f880: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
 880060a6f900: fc fc fc fc fc fc 00 00 00 00 00 00 00 00 00 00
==

Signed-off-by: Jerome Marchand <jmarc...@redhat.com>
---
 lib/assoc_array.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/lib/assoc_array.c b/lib/assoc_array.c
index 03dd576..59fd7c0 100644
--- a/lib/assoc_array.c
+++ b/lib/assoc_array.c
@@ -524,7 +524,9 @@ static bool assoc_array_insert_into_terminal_node(struct 
assoc_array_edit *edit,
free_slot = i;
continue;
}
-   if (ops->compare_object(assoc_array_ptr_to_leaf(ptr), 
index_key)) {
+   if (assoc_array_ptr_is_leaf(ptr) &&
+   ops->compare_object(assoc_array_ptr_to_leaf(ptr),
+   index_key)) {
pr_devel("replace in slot %d\n", i);
edit->leaf_p = &node->slots[i];
edit->dead_leaf = node->slots[i];
-- 
2.5.0
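
To restate the invariant the fix enforces (a simplified sketch using
the helpers internal to lib/assoc_array.c, not code from the patch):
slot pointers carry a type tag in their low bits, so only slots whose
tag says "leaf" may be handed to the compare_object() method.

static bool slot_matches_key(const struct assoc_array_ops *ops,
			     const struct assoc_array_ptr *ptr,
			     const void *index_key)
{
	if (!ptr)				/* empty slot */
		return false;
	if (!assoc_array_ptr_is_leaf(ptr))	/* metadata: node or shortcut */
		return false;
	return ops->compare_object(assoc_array_ptr_to_leaf(ptr),
				   index_key);
}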


Re: [PATCH] assoc_array: don't call compare_object() on a node

2016-02-22 Thread Jerome Marchand
On 02/22/2016 05:37 PM, David Howells wrote:
> Jerome Marchand <jmarc...@redhat.com> wrote:
> 
>> In assoc_array_insert_into_terminal_node(), we call the
>> compare_object() method on all empty slots,

Sorry, this is a typo. It should be "on all non-empty slots".

> 
> Ummm...  That shouldn't happen - the:
> 
>   if (!ptr) {
>   free_slot = i;
>   continue;
>   }
> 
> preceding the line you modified should cause the comparison to be skipped on a
> slot if it's empty.
> 
>> even when they're not leaves, passing a pointer to an unexpected structure
>> to compare_object().
> 
> Do you instead mean a metadata pointer rather than an empty slot?

Yes. In the cases I debugged, it was a node, but I guess we could
encounter a shortcut here too.

> 
>> Currently it causes an out-of-bound read access in keyring_compare_object
>> detected by KASan.  The issue is easily reproduced with keyutils testsuite.
> 
> I don't see it.  Did you modify the testsuite, or is it a matter of running it
> often enough?

Do you have KASan enabled? In my experience, the reproduction is
systematic on some tests (e.g. keyctl/unlink/all). AFAIK, the testsuite
isn't modified (it's from the Red Hat test infrastructure).

> 
> Also, can you include the oops output you get in the patch description,
> please?

Sure.

> 
> That said, I can see that there is probably an issue that your patch fixes -
> but it's not quite the one you describe (see above).

Does the description sound correct if you add the missing negation?

Jerome
> 
> David
> 








[PATCH] assoc_array: don't call compare_object() on a node

2016-02-22 Thread Jerome Marchand
In assoc_array_insert_into_terminal_node(), we call the
compare_object() method on all empty slots, even when they're not
leaves, passing a pointer to an unexpected structure to
compare_object(). Currently it causes an out-of-bounds read access in
keyring_compare_object() detected by KASan. The issue is easily
reproduced with the keyutils testsuite.
Only call compare_object() when the slot is a leaf.

Signed-off-by: Jerome Marchand <jmarc...@redhat.com>
---
 lib/assoc_array.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/lib/assoc_array.c b/lib/assoc_array.c
index 03dd576..59fd7c0 100644
--- a/lib/assoc_array.c
+++ b/lib/assoc_array.c
@@ -524,7 +524,9 @@ static bool assoc_array_insert_into_terminal_node(struct 
assoc_array_edit *edit,
free_slot = i;
continue;
}
-   if (ops->compare_object(assoc_array_ptr_to_leaf(ptr), 
index_key)) {
+   if (assoc_array_ptr_is_leaf(ptr) &&
+   ops->compare_object(assoc_array_ptr_to_leaf(ptr),
+   index_key)) {
pr_devel("replace in slot %d\n", i);
edit->leaf_p = &node->slots[i];
edit->dead_leaf = node->slots[i];
-- 
2.5.0




Re: Should snd_card_free() check for null pointer?

2016-02-09 Thread Jerome Marchand
- Original Message -
> From: "Takashi Iwai" 
> To: "Jerome Marchand" 
> Cc: "Jaroslav Kysela" , alsa-de...@alsa-project.org, 
> linux-kernel@vger.kernel.org
> Sent: Tuesday, February 9, 2016 10:56:39 PM
> Subject: Re: Should snd_card_free() check for null pointer?
> 
> On Tue, 09 Feb 2016 15:30:16 +0100,
> Jerome Marchand wrote:
> > 
> > Hi,
> > 
> > Before commit f24640648186b (ALSA: Use standard device refcount for card
> > accounting), snd_card_free() would return -EINVAL on a null pointer. Now
> > it ends up in a null pointer dereference. There is at least one driver
> > that can call snd_card_free() with null argument: saa7134_alsa. It can
> > easily be triggered by just inserting and removing the module (no need
> > to have the hardware).
> > I don't think that is a rule, but it seems that the standard behavior of
> > *_free() functions is to check for null pointer. What do you think?
> 
> Well, I have a mixed feeling about this.  Allowing NULL sometimes
> makes the code easier.  OTOH, calling snd_card_free() with NULL is
> really an unexpected situation, and if a driver does it, most likely
> it does something weird.
> 
> So, at this moment, I would fix the caller side.  But, it's not a
> final call, just my gut feeling.

I have no strong opinion either way and I have a patch that fixes the
saa7134 driver ready to be sent if that is your preference.

Thanks,
Jerome

> 
> 
> thanks,
> 
> Takashi
> 


Should snd_card_free() check for null pointer?

2016-02-09 Thread Jerome Marchand
Hi,

Before commit f24640648186b (ALSA: Use standard device refcount for card
accounting), snd_card_free() would return -EINVAL on a null pointer. Now
it ends up in a null pointer dereference. There is at least one driver
that can call snd_card_free() with null argument: saa7134_alsa. It can
easily be triggered by just inserting and removing the module (no need
to have the hardware).
I don't think that is a rule, but it seems that the standard behavior of
*_free() functions is to check for null pointer. What do you think?
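
The convention in miniature (a sketch of the pre-f24640648186b
behavior described above, not the exact old code):

int snd_card_free(struct snd_card *card)
{
	if (!card)
		return -EINVAL;	/* tolerate NULL, the way kfree(NULL) is a no-op */
	/* ... actual card teardown ... */
	return 0;
}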

Thanks,
Jerome







cifs: out-of-bound write in build_ntlmssp_auth_blob()

2016-02-03 Thread Jerome Marchand
Hi,

While running some tests, KASan detected several out-of-bounds write
accesses to the ntlmssp blob in build_ntlmssp_auth_blob(). In this case,
the ntlmssp blob was allocated in sess_auth_rawntlmssp_authenticate().
Its size is an "empirical" 5*sizeof(struct _AUTHENTICATE_MESSAGE) (320B
on x86_64). I don't know where this value comes from or if it was ever
appropriate, but it is currently insufficient: the user and domain name
in UTF16 could take 1kB by themselves. I'm not sure what's the proper
way to fix this. Naively I'd say to allocate the blob dynamically in
build_ntlmssp_auth_blob().
While I haven't run into the issue, the size of ntlmssp_blob in
SMB2_sess_setup is too small too (sizeof(struct _NEGOTIATE_MESSAGE) + 500).
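
A sketch of the dynamic-allocation idea (illustrative only, not a
tested fix; ses->user_name and ses->domainName are the session fields
as I recall them): size the blob from the actual name lengths, since
UTF-16 conversion doubles each name's byte count.

	unsigned int len = sizeof(AUTHENTICATE_MESSAGE) +
			   2 * (strlen(ses->user_name) +
				strlen(ses->domainName));
	char *blob = kmalloc(len, GFP_KERNEL);

	if (!blob)
		return -ENOMEM;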

Regards,
Jerome Marchand





[PATCH v2] fix out of bound read in __test_aead()

2016-02-03 Thread Jerome Marchand
__test_aead() reads MAX_IVLEN bytes from template[i].iv, but the
actual length of the initialisation vector can be shorter.
The length of the IV is already calculated earlier in the
function. Let's just reuse that. Also, the IV length is currently
calculated several times for no reason. Let's fix that too.
This fixes an out-of-bounds error detected by KASan.

Signed-off-by: Jerome Marchand 
---
 crypto/testmgr.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/crypto/testmgr.c b/crypto/testmgr.c
index ae8c57fd..6691756 100644
--- a/crypto/testmgr.c
+++ b/crypto/testmgr.c
@@ -488,6 +488,8 @@ static int __test_aead(struct crypto_aead *tfm, int enc,
aead_request_set_callback(req, CRYPTO_TFM_REQ_MAY_BACKLOG,
  tcrypt_complete, &result);
 
+   iv_len = crypto_aead_ivsize(tfm);
+
for (i = 0, j = 0; i < tcount; i++) {
if (template[i].np)
continue;
@@ -508,7 +510,6 @@ static int __test_aead(struct crypto_aead *tfm, int enc,
 
memcpy(input, template[i].input, template[i].ilen);
memcpy(assoc, template[i].assoc, template[i].alen);
-   iv_len = crypto_aead_ivsize(tfm);
if (template[i].iv)
memcpy(iv, template[i].iv, iv_len);
else
@@ -617,7 +618,7 @@ static int __test_aead(struct crypto_aead *tfm, int enc,
j++;
 
if (template[i].iv)
-   memcpy(iv, template[i].iv, MAX_IVLEN);
+   memcpy(iv, template[i].iv, iv_len);
else
memset(iv, 0, MAX_IVLEN);
 
-- 
2.5.0
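
Why the old copy over-reads, spelled out (an illustration; MAX_IVLEN
is 32 in crypto/testmgr.c):

/*
 * A template for an algorithm with, say, a 12-byte IV declares only
 * those 12 bytes, so
 *
 *	memcpy(iv, template[i].iv, MAX_IVLEN);
 *
 * reads 20 bytes past the end of the template buffer, while
 *
 *	memcpy(iv, template[i].iv, iv_len);
 *
 * copies exactly what crypto_aead_ivsize(tfm) reports the cipher
 * needs.
 */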





Re: [PATCH] fix out of bound read in __test_aead()

2016-02-01 Thread Jerome Marchand
On 02/01/2016 03:26 PM, Herbert Xu wrote:
> On Fri, Jan 29, 2016 at 02:10:09PM +0100, Jerome Marchand wrote:
>> __test_aead() reads MAX_IVLEN bytes from template[i].iv, but the
>> actual length of the initialisation vector can be shorter.
>> The length of the IV is already calculated earlier in the
>> function. Let's just reuse that.
>> This fixes an out-of-bounds error detected by KASan.
>>
>> Signed-off-by: Jerome Marchand 
> 
> This patch creates a new warning that iv_len may be uninitialised.

I see. iv_len is set for each template. I don't see why we would want
to call crypto_aead_ivsize() more than once. Moving the initialization
of iv_len out of the loop should solve the warning.

> 
> Please fix this and resubmit.

Will do.

Jerome

> 
> Thanks,
> 








[PATCH] fix out of bound read in __test_aead()

2016-01-29 Thread Jerome Marchand
__test_aead() reads MAX_IVLEN bytes from template[i].iv, but the
actual length of the initialisation vector can be shorter.
The length of the IV is already calculated earlier in the
function. Let's just reuse that.
This fixes an out-of-bounds error detected by KASan.

Signed-off-by: Jerome Marchand 
---
 crypto/testmgr.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/crypto/testmgr.c b/crypto/testmgr.c
index ae8c57fd..d3587d5 100644
--- a/crypto/testmgr.c
+++ b/crypto/testmgr.c
@@ -617,7 +617,7 @@ static int __test_aead(struct crypto_aead *tfm, int enc,
j++;
 
if (template[i].iv)
-   memcpy(iv, template[i].iv, MAX_IVLEN);
+   memcpy(iv, template[i].iv, iv_len);
else
memset(iv, 0, MAX_IVLEN);
 
-- 
2.5.0






[PATCH V2] mm: vmalloc: don't remove inexistent guard hole in remove_vm_area()

2015-11-12 Thread Jerome Marchand
Commit 71394fe50146 ("mm: vmalloc: add flag preventing guard hole
allocation") missed a spot. Currently remove_vm_area() decreases
vm->size to "remove" the guard hole page, even when it isn't present.
All but one user just frees the vm_struct right away and never accesses
vm->size anyway.
Don't touch the size in remove_vm_area() and have __vunmap() use the
proper get_vm_area_size() helper.

Signed-off-by: Jerome Marchand 
---
 mm/vmalloc.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index d045634..8e3c9c5 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -1443,7 +1443,6 @@ struct vm_struct *remove_vm_area(const void *addr)
vmap_debug_free_range(va->va_start, va->va_end);
kasan_free_shadow(vm);
free_unmap_vmap_area(va);
-   vm->size -= PAGE_SIZE;
 
return vm;
}
@@ -1468,8 +1467,8 @@ static void __vunmap(const void *addr, int 
deallocate_pages)
return;
}
 
-   debug_check_no_locks_freed(addr, area->size);
-   debug_check_no_obj_freed(addr, area->size);
+   debug_check_no_locks_freed(addr, get_vm_area_size(area));
+   debug_check_no_obj_freed(addr, get_vm_area_size(area));
 
if (deallocate_pages) {
int i;
-- 
2.4.3
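
For context, the helper the changelog refers to (as it reads in
include/linux/vmalloc.h around this time, quoted from memory): it
already subtracts the guard page only when one exists.

static inline size_t get_vm_area_size(const struct vm_struct *area)
{
	if (!(area->flags & VM_NO_GUARD))
		/* return actual size without guard page */
		return area->size - PAGE_SIZE;
	else
		return area->size;
}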



Re: [PATCH] mm: vmalloc: don't remove inexistent guard hole in remove_vm_area()

2015-11-12 Thread Jerome Marchand
On 11/12/2015 04:55 PM, Andrey Ryabinin wrote:
> 2015-11-12 18:17 GMT+03:00 Jerome Marchand :
>> Commit 71394fe50146 ("mm: vmalloc: add flag preventing guard hole
>> allocation") missed a spot. Currently remove_vm_area() decreases
>> vm->size to remove the guard hole page, even when it isn't present.
>> This patch only decreases vm->size when VM_NO_GUARD isn't set.
>>
>> Signed-off-by: Jerome Marchand 
>> ---
>>  mm/vmalloc.c | 3 ++-
>>  1 file changed, 2 insertions(+), 1 deletion(-)
>>
>> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
>> index d045634..1388c3d 100644
>> --- a/mm/vmalloc.c
>> +++ b/mm/vmalloc.c
>> @@ -1443,7 +1443,8 @@ struct vm_struct *remove_vm_area(const void *addr)
>> vmap_debug_free_range(va->va_start, va->va_end);
>> kasan_free_shadow(vm);
>> free_unmap_vmap_area(va);
>> -   vm->size -= PAGE_SIZE;
>> +   if (!(vm->flags & VM_NO_GUARD))
>> +   vm->size -= PAGE_SIZE;
>>
> 
> I'd fix this in another way. I think that remove_vm_area() shouldn't
> change vm's size, IMO it doesn't make sense.
> The only caller who cares about vm's size after removing is __vunmap():
>  area = remove_vm_area(addr);
>  
>  debug_check_no_locks_freed(addr, area->size);
>  debug_check_no_obj_freed(addr, area->size);
> 
> We already have proper get_vm_area_size() helper which takes
> VM_NO_GUARD into account.
> So I think we should use that helper for debug_check_no_*() and just
> remove 'vm->size -= PAGE_SIZE;' line
> from remove_vm_area()

Sure, that would be cleaner.

Btw, there might be a leak in sq_unmap() (arch/sh/kernel/cpu/sh4/sq.c)
as the vm_struct doesn't seem to be freed. CCed the SuperH folks.

Thanks,
Jerome

> 
> 
> 
>> return vm;
>> }
>> --
>> 2.4.3
>>






[PATCH] mm: vmalloc: don't remove inexistent guard hole in remove_vm_area()

2015-11-12 Thread Jerome Marchand
Commit 71394fe50146 ("mm: vmalloc: add flag preventing guard hole
allocation") missed a spot. Currently remove_vm_area() decreases
vm->size to remove the guard hole page, even when it isn't present.
This patch only decreases vm->size when VM_NO_GUARD isn't set.

Signed-off-by: Jerome Marchand 
---
 mm/vmalloc.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index d045634..1388c3d 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -1443,7 +1443,8 @@ struct vm_struct *remove_vm_area(const void *addr)
vmap_debug_free_range(va->va_start, va->va_end);
kasan_free_shadow(vm);
free_unmap_vmap_area(va);
-   vm->size -= PAGE_SIZE;
+   if (!(vm->flags & VM_NO_GUARD))
+   vm->size -= PAGE_SIZE;
 
return vm;
}
-- 
2.4.3





Re: [PATCH v4 2/4] mm, proc: account for shmem swap in /proc/pid/smaps

2015-10-26 Thread Jerome Marchand
On 10/21/2015 04:39 PM, Vlastimil Babka wrote:
> On 10/05/2015 05:01 AM, Hugh Dickins wrote:
>> On Fri, 2 Oct 2015, Vlastimil Babka wrote:

>> As you acknowledge in the commit message, if a file of 100 pages
>> were copied to tmpfs, and 100 tasks map its full extent, but they
>> all mess around with the first 50 pages and take no interest in
>> the last 50, then it's quite likely that that last 50 will get
>> swapped out; then with your patch, 100 tasks are each shown as
>> using 50 pages of swap, when none of them are actually using any.
> 
> Yeah, but isn't it the same with private memory which was swapped out at
> some point and we don't know if it will be touched or not? The
> difference is in private case we know the process touched it at least
> once, but that can also mean nothing for the future (or maybe it just
> mmapped with MAP_POPULATE and didn't care about half of it).
> 
> That's basically what I was trying to say in the changelog. I interpret
> the Swap: value as the amount of swap-in potential, if the process was
> going to access it, which is what the particular customer also expects
> (see below). In that case showing zero is IMHO wrong and inconsistent
> with the anonymous private mappings.

I didn't understand the changelog that way and IMHO it's a pretty
specific interpretation. I've always understood memory accounting as
being primarily the answer to the question: how many resources does a
process use? I guess its meaning has been overloaded with corollaries
that are only true in the most simple non-shared cases, such as yours
or "how much memory would be freed if this process goes away?", but I
don't think it should ever be used as a definition.

I suppose the reason I didn't understand the changelog the way you
intended is because I think that sometimes it's correct to blame a
process for pages it never accessed (and I also believe that
over-accounting is better than under-accounting, but I must admit
that it is a quite arbitrary point of view). For instance, what if a
process has a shared anonymous mapping that includes pages that it
never accessed, but which have been populated by another process that
has already exited or munmapped the range? That process is not to
blame for the appearance of these pages, but it's the only reason why
they stay.

I'll offer a lollipop to anyone who comes up with a simple consistent
model on how to account shmem pages for all the possible cases, a
"Grand Unified Theory of Memory Accounting" so to speak.

> Other non-perfect solutions that come to mind:
> 
> 1) For private mappings, count only the swapents. "Swap:" is no longer
> showing full swap-in potential though.
> 2) For private mappings, do not count swapents. Ditto.
> 3) Provide two separate counters. The user won't know how much they
> overlap, though.
> 
> From these I would be inclined towards 3) as being more universal,
> although then it's no longer a simple "we're fixing a Swap: 0 value
> which is wrong", but closer to original Jerome's versions, which IIRC
> introduced several shmem-specific counters.

You remember correctly. Given all the controversy around shmem
accounting, maybe it would indeed be better to leave the existing
counters, which are relatively well defined and understood, untouched
and add specific counters to mess around with instead.

Thanks,
Jerome







Re: [PATCH v4 2/4] mm, proc: account for shmem swap in /proc/pid/smaps

2015-10-02 Thread Jerome Marchand
On 10/02/2015 03:35 PM, Vlastimil Babka wrote:
> Currently, /proc/pid/smaps will always show "Swap: 0 kB" for shmem-backed
> mappings, even if the mapped portion does contain pages that were swapped out.
> This is because unlike private anonymous mappings, shmem does not change pte
> to swap entry, but pte_none when swapping the page out. In the smaps page
> walk, such page thus looks like it was never faulted in.
> 
> This patch changes smaps_pte_entry() to determine the swap status for such
> pte_none entries for shmem mappings, similarly to how mincore_page() does it.
> Swapped out pages are thus accounted for.
> 
> The accounting is arguably still not as precise as for private anonymous
> mappings, since now we will count also pages that the process in question 
> never
> accessed, but only another process populated them and then let them become
> swapped out. I believe it is still less confusing and subtle than not showing
> any swap usage by shmem mappings at all. Also, swapped out pages only
> become a performance issue for future accesses, and we cannot predict
> those for either kind of mapping.

Agreed, this is much better than the current situation. I don't think
there is such a thing as a perfect accounting of shared pages anyway.
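
For readers following along, a sketch of the mincore_page()-style
lookup the patch borrows (simplified from mm/mincore.c of this era,
as I recall it; not the patch itself): a pte_none hole in a shmem
mapping may still have its page in swap, which shows up as an
exceptional entry in the page cache radix tree.

	struct page *page = find_get_entry(mapping, pgoff);

	if (radix_tree_exceptional_entry(page))
		mss->swap += PAGE_SIZE;		/* swapped-out shmem page */
	else if (page)
		page_cache_release(page);	/* drop find_get_entry() ref */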

> 
> Signed-off-by: Vlastimil Babka 
> Acked-by: Konstantin Khlebnikov 

Acked-by: Jerome Marchand 







Re: [PATCH v4 1/4] mm, documentation: clarify /proc/pid/status VmSwap limitations

2015-10-02 Thread Jerome Marchand
On 10/02/2015 03:35 PM, Vlastimil Babka wrote:
> The documentation for /proc/pid/status does not mention that the value of
> VmSwap counts only swapped out anonymous private pages and not shmem. This is
> not obvious, so document this limitation.
> 
> Signed-off-by: Vlastimil Babka 
> Acked-by: Konstantin Khlebnikov 
> Acked-by: Michal Hocko 

Acked-by: Jerome Marchand 









[PATCH] mm: memcontrol: fix order calculation in try_charge()

2015-09-15 Thread Jerome Marchand
Since commit 6539cc05386 ("mm: memcontrol: fold mem_cgroup_do_charge()"),
the order to pass to mem_cgroup_oom() is calculated by passing the number
of pages to get_order() instead of the expected size in bytes. AFAICT,
it only affects the value displayed in the oom warning message.
This patch fixes that.

Signed-off-by: Jerome Marchand 
---
 mm/memcontrol.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 1742a2d..91bf094 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2032,7 +2032,8 @@ retry:
 
mem_cgroup_events(mem_over_limit, MEMCG_OOM, 1);
 
-   mem_cgroup_oom(mem_over_limit, gfp_mask, get_order(nr_pages));
+   mem_cgroup_oom(mem_over_limit, gfp_mask,
+  get_order(nr_pages * PAGE_SIZE));
 nomem:
if (!(gfp_mask & __GFP_NOFAIL))
return -ENOMEM;
-- 
1.9.3
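
The mistake is easiest to see with the math written out; a standalone model
of get_order(), which expects a size in bytes (assuming 4 KiB pages):

#include <stdio.h>

#define PAGE_SHIFT 12
#define PAGE_SIZE  (1UL << PAGE_SHIFT)

/* Same math as the kernel's get_order(): the smallest order such that
 * PAGE_SIZE << order covers the given size in bytes. */
static int get_order(unsigned long size)
{
        int order = 0;

        size = (size - 1) >> PAGE_SHIFT;
        while (size) {
                order++;
                size >>= 1;
        }
        return order;
}

int main(void)
{
        unsigned long nr_pages = 512;   /* e.g. one 2 MiB THP charge */

        printf("get_order(nr_pages)             = %d\n",
               get_order(nr_pages));                    /* 0: wrong */
        printf("get_order(nr_pages * PAGE_SIZE) = %d\n",
               get_order(nr_pages * PAGE_SIZE));        /* 9: right */
        return 0;
}

For a 512-page (2 MiB) charge the first call reports order 0 and the second
order 9, which is the value the OOM message should carry.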





Re: [RFC PATCH] nfs: avoid swap-over-NFS deadlock

2015-09-01 Thread Jerome Marchand
On 08/20/2015 02:23 PM, Mel Gorman wrote:
> On Mon, Jul 27, 2015 at 01:25:47PM +0200, Jerome Marchand wrote:
>> On 07/27/2015 12:52 PM, Mel Gorman wrote:
>>> On Wed, Jul 22, 2015 at 03:46:16PM +0200, Jerome Marchand wrote:
>>>> On 07/22/2015 02:23 PM, Trond Myklebust wrote:
>>>>> On Wed, Jul 22, 2015 at 4:10 AM, Jerome Marchand  
>>>>> wrote:
>>>>>>
>>>>>> Lockdep warns about an inconsistent {RECLAIM_FS-ON-W} ->
>>>>>> {IN-RECLAIM_FS-W} usage. The culprit is the inode->i_mutex taken in
>>>>>> nfs_file_direct_write(). This code was introduced by commit a9ab5e840669
>>>>>> ("nfs: page cache invalidation for dio").
>>>>>> This naive test patch avoids taking the mutex on a swapfile and makes
>>>>>> lockdep happy again. However, I don't know much about the NFS code and I
>>>>>> assume it's probably not the proper solution. Any thoughts?
>>>>>>
>>>>>> Signed-off-by: Jerome Marchand 
>>>>>
>>>>> NFS is not the only O_DIRECT implementation to set the inode->i_mutex.
>>>>> Why can't this be fixed in the generic swap code instead of adding
>>>>> yet-another-exception-for-IS_SWAPFILE?
>>>>
>>>> I meant to cc Mel. Just added him.
>>>>
>>>
>>> Can the full lockdep warning be included, as it'll then be easier to see if
>>> the generic swap code can somehow special-case this? Currently, generic
>>> swapping does not need to care about how the filesystem is locked.
>>> For most filesystems, it's writing directly to the blocks on disk and
>>> bypassing the FS. In the NFS case it'd be surprising to find that there
>>> are also dirty pages in the page cache that belong to the swap file, as
>>> that's going to cause corruption. If there is any special casing, it would
>>> be to only attempt the invalidation in the !swap case and warn if
>>> mapping->nrpages. It would still look a bit weird, but safer than just not
>>> acquiring the mutex and then potentially attempting an invalidation.
>>>
>>
>> [ 6819.501009] =================================
>> [ 6819.501009] [ INFO: inconsistent lock state ]
>> [ 6819.501009] 4.2.0-rc1-shmacct-babka-v2-next-20150709+ #255 Not tainted
>> [ 6819.501009] ---------------------------------
> 
> Thanks. Sorry for the long delay but I finally got back to the bug this
> week. NFS can be modified to special case the swapfile but I was not happy
> with the result for multiple reasons. It took me a while to see a way for
> the core VM to deal with it. What do you think of the following
> approach?

Seems sound to me.

> More importantly, does it work for you?

Yes.

> 
> ---8<---
> nfs: Use swap_lock to prevent parallel swapon activations
> 
> Jerome Marchand reported a lockdep warning as follows
> 
> [ 6819.501009] =================================
> [ 6819.501009] [ INFO: inconsistent lock state ]
> [ 6819.501009] 4.2.0-rc1-shmacct-babka-v2-next-20150709+ #255 Not tainted
> [ 6819.501009] ---------------------------------
> [ 6819.501009] inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-W} usage.
> [ 6819.501009] kswapd0/38 [HC0[0]:SC0[0]:HE1:SE1] takes:
> [ 6819.501009]  (&sb->s_type->i_mutex_key#17){+.+.?.}, at: 
> [] nfs_file_direct_write+0x85/0x3f0 [nfs]
> [ 6819.501009] {RECLAIM_FS-ON-W} state was registered at:
> [ 6819.501009]   [] mark_held_locks+0x71/0x90
> [ 6819.501009]   [] lockdep_trace_alloc+0x75/0xe0
> [ 6819.501009]   [] 
> kmem_cache_alloc_node_trace+0x39/0x440
> [ 6819.501009]   [] __get_vm_area_node+0x7f/0x160
> [ 6819.501009]   [] __vmalloc_node_range+0x72/0x2c0
> [ 6819.501009]   [] vzalloc+0x54/0x60
> [ 6819.501009]   [] SyS_swapon+0x628/0xfc0
> [ 6819.501009]   [] entry_SYSCALL_64_fastpath+0x12/0x76
> 
> It's due to NFS acquiring i_mutex since a9ab5e840669 ("nfs: page
> cache invalidation for dio") to invalidate page cache before direct I/O.
> Filesystems may safely acquire i_mutex during direct writes but NFS is unique
> in its treatment of swap files. Ordinarily swap files are supported by the
> core VM looking up the physical block for a given offset in advance. There
> is no physical block for NFS and the direct write paths are used after
> calling mapping->swap_activate.
> 
> The lockdep warning is triggered by swapon(), which is not in reclaim
> context, acquiring the i_mutex to ensure a swapfile is not activated twice.
> 
> swapon does not need the i_mutex for this purpose.  There is a 
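
For context, a sketch of the hook in play, against the 4.2-era
address_space_operations layout; the myfs_* names are hypothetical:

#include <linux/fs.h>
#include <linux/swap.h>

/* Called from swapon(): a filesystem with no stable block mapping (like
 * NFS) claims the file here, and the core VM then routes swap I/O through
 * the filesystem's direct paths instead of raw bios. */
static int myfs_swap_activate(struct swap_info_struct *sis,
                              struct file *file, sector_t *span)
{
        *span = sis->pages;     /* the whole file backs the swap area */
        return 0;               /* 0 => the VM takes the SWP_FILE paths */
}

static void myfs_swap_deactivate(struct file *file)
{
        /* undo whatever pinning swap_activate did */
}

static const struct address_space_operations myfs_aops = {
        .swap_activate   = myfs_swap_activate,
        .swap_deactivate = myfs_swap_deactivate,
};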


Re: [PATCH v3 0/4] enhance shmem process and swap accounting

2015-08-07 Thread Jerome Marchand
On 08/05/2015 03:01 PM, Vlastimil Babka wrote:
> Reposting due to lack of feedback in May. I hope at least patches 1 and 2
> could be merged as they are IMHO bugfixes. 3 and 4 are optional but IMHO
> useful.
> 
> Changes since v2:
> o Rebase on next-20150805.
> o This means that /proc/pid/maps has the proportional swap share (SwapPss:)
>   field as per https://lkml.org/lkml/2015/6/15/274
>   It's not clear what to do with shmem here so it's 0 for now.
>   - swapped out shmem doesn't have swap entries, so we would have to look at
>     who else has the shmem object (partially) mapped
>   - to be more precise we should also check if his range actually includes
>     the offset in question, which could get rather involved
>   - or is there some easy way I don't see?

Hmm... This is much more difficult than I envisioned when commenting on
Minchan's patch. One possibility could be to have the PTEs of paged-out
shmem pages set in a similar way to regular swap entries. But that
would use up some very precious real estate in the PTE.
As it is, a zero value, while obviously wrong, has the advantage of not
being misleading the way a bad approximation would be (like the kind that
doesn't properly account for partial mappings).

Jerome

> o Konstantin suggested for patch 3/4 that I drop the CONFIG_SHMEM #ifdefs
>   I didn't see the point in going against tinyfication when the work is
>   already done, but I can do that if more people think it's better and it
>   would block the series.
> 
> Changes since v1:
> o In Patch 2, rely on SHMEM_I(inode)->swapped if possible, and fallback to
>   radix tree iterator on partially mapped shmem objects, i.e. decouple shmem
>   swap usage determination from the page walk, for performance reasons.
>   Thanks to Jerome and Konstantin for the tips.
>   The downside is that mm/shmem.c had to be touched.
> 
> This series is based on Jerome Marchand's [1] so let me quote the first
> paragraph from there:
> 
> There are several shortcomings with the accounting of shared memory
> (sysV shm, shared anonymous mappings, mappings to a tmpfs file). The
> values in /proc/<pid>/status and statm don't allow distinguishing
> between shmem memory and a shared mapping to a regular file, even
> though their implications on memory usage are quite different: at
> reclaim, a file mapping can be dropped or written back to disk, while shmem
> needs a place in swap. As for shmem pages that are swapped out or in
> swap cache, they aren't accounted at all.
> 
> The original motivation for myself is that a customer found (IMHO rightfully)
> confusing that e.g. top output for process swap usage is unreliable with
> respect to swapped out shmem pages, which are not accounted for.
> 
> The fundamental difference between private anonymous and shmem pages is that
> the latter has PTEs converted to pte_none, and not swapents. As such, they are
> not accounted in the number of swapents visible e.g. in the /proc/pid/status
> VmSwap row. It might be theoretically possible to use swapents when swapping
> out shmem (without extra cost, as one has to change all mappers anyway), and
> on swap-in only convert the swapent for the faulting process, leaving swapents
> in other processes until they also fault (so again no extra cost). But I don't
> know how many assumptions this would break, and it would be too disruptive a
> change for a relatively small benefit.
> 
> Instead, my approach is to document the limitation of VmSwap, and provide
> means to determine the swap usage for shmem areas for those who are interested
> and willing to pay the price, using /proc/pid/smaps. Because outside of ipcs,
> I don't think it's currently possible to determine the usage at all. The
> previous patchset [1] did introduce new shmem-specific fields into smaps
> output, and functions to determine the values. I take a simpler approach,
> noting that smaps output already has a "Swap: X kB" line, where currently X ==
> 0 always for shmem areas. I think we can just consider this a bug and provide
> the proper value by consulting the radix tree, as e.g. mincore_page() does. In
> the patch changelog I explain why this is also not perfect (and cannot be
> without swapents), but still arguably much better than showing a 0.
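
The userspace analogue of that lookup is mincore(), which reports per-page
residency for a mapping; a minimal sketch (the default path is a
placeholder, error handling trimmed):

#define _DEFAULT_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/stat.h>

/* Map a file MAP_SHARED and count pages not resident in page cache; for
 * a tmpfs file, those may be sitting in swap even though the mapping's
 * VmSwap contribution reads 0 without this series. */
int main(int argc, char **argv)
{
        const char *path = argc > 1 ? argv[1] : "/dev/shm/example";
        long page = sysconf(_SC_PAGESIZE);
        struct stat st;
        int fd = open(path, O_RDONLY);

        if (fd < 0 || fstat(fd, &st) < 0 || st.st_size == 0)
                return 1;
        size_t len = (st.st_size + page - 1) & ~(page - 1);
        void *p = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
        unsigned char *vec = malloc(len / page);
        if (p == MAP_FAILED || !vec || mincore(p, len, vec) < 0)
                return 1;
        size_t i, absent = 0;
        for (i = 0; i < len / page; i++)
                if (!(vec[i] & 1))
                        absent++;
        printf("%zu of %zu pages not resident\n", absent, len / page);
        return 0;
}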
> 
> The last two patches are adapted from Jerome's patchset and provide a VmRSS
> breakdown to VmAnon, VmFile and VmShm in /proc/pid/status. Hugh noted that
> this is a welcome addition, and I agree that it might help e.g. when
> debugging process memory usage, at an albeit non-zero, but still rather low,
> cost of an extra per-mm counter and some page flag checks. I updated these
> patches to 4.0-rc1,
> made them respect !CONFIG_SHMEM so that tiny systems don't pay the cost, and
> optimized the page flag checking somewhat.
> 
> [1] http://lwn.net/Articles/611966/
> 
> Jerome Marchand (2):
>   mm, shmem: Add shmem resident memory accounting
>   mm, procfs: Display VmAnon, VmFile and VmShm in /proc/pid/status
> 
> Vlastimil Babka (2):
>   mm, documentation: clarify /proc/pid/status VmSwap limitations
>   mm, proc: account for shmem swap in /proc/pid/smaps


Re: [PATCHv9 26/36] mm: rework mapcount accounting to enable 4k mapping of THPs

2015-08-03 Thread Jerome Marchand
On 08/03/2015 12:43 PM, Kirill A. Shutemov wrote:
> On Fri, Jul 31, 2015 at 05:04:18PM +0200, Jerome Marchand wrote:
>> On 07/20/2015 04:20 PM, Kirill A. Shutemov wrote:
>>> We're going to allow mapping of individual 4k pages of a THP compound.
>>> It means we need to track the mapcount on a per-small-page basis.
>>>
>>> The straightforward approach is to use ->_mapcount in all subpages to track
>>> how many times each subpage is mapped with PMDs or PTEs combined. But
>>> this is rather expensive: mapping or unmapping of a THP page with a PMD
>>> would require HPAGE_PMD_NR atomic operations instead of the single one we
>>> have now.
>>>
>>> The idea is to store separately how many times the page was mapped as a
>>> whole -- compound_mapcount. This frees up ->_mapcount in subpages to
>>> track the PTE mapcount.
>>>
>>> We use the same approach as with the compound page destructor and compound
>>> order to store compound_mapcount: use space in the first tail page,
>>> ->mapping this time.
>>>
>>> Any time we map/unmap a whole compound page (THP or hugetlb) -- we
>>> increment/decrement compound_mapcount. When we map part of a compound page
>>> with a PTE we operate on ->_mapcount of the subpage.
>>>
>>> page_mapcount() counts both: PTE and PMD mappings of the page.
>>>
>>> Basically, we have the mapcount for a subpage spread over two counters.
>>> It makes it tricky to detect when the last mapcount for a page goes away.
>>>
>>> We introduced PageDoubleMap() for this. When we split a THP PMD for the
>>> first time and there's another PMD mapping left, we offset ->_mapcount up
>>> in all subpages by one and set PG_double_map on the compound page.
>>> These additional references go away with the last compound_mapcount.
>>
>> So this stays even if all PTE mappings go and the page is again mapped
>> only with a PMD. I'm not sure how often that happens or if it's an issue
>> worth caring about.
> 
> We don't have a cheap way to detect this situation and it shouldn't
> happen often enough to care.
> 

I thought so.





Re: [PATCHv9 25/36] mm, thp: remove infrastructure for handling splitting PMDs

2015-08-03 Thread Jerome Marchand
On 08/03/2015 12:41 PM, Kirill A. Shutemov wrote:
> On Fri, Jul 31, 2015 at 05:01:06PM +0200, Jerome Marchand wrote:
>> On 07/20/2015 04:20 PM, Kirill A. Shutemov wrote:
>>> @@ -1616,23 +1605,14 @@ int change_huge_pmd(struct vm_area_struct *vma, 
>>> pmd_t *pmd,
>>>   * Note that if it returns 1, this routine returns without unlocking page
>>>   * table locks. So callers must unlock them.
>>>   */
>>
>> The comment above should be updated.
> 
> Like this?

Yes. Thanks.

> 
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index d32277463932..78a6c7cdf8f7 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1627,11 +1627,10 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t 
> *pmd,
>  }
>  
>  /*
> - * Returns 1 if a given pmd maps a stable (not under splitting) thp.
> - * Returns -1 if it maps a thp under splitting. Returns 0 otherwise.
> + * Returns true if a given pmd maps a thp, false otherwise.
>   *
> - * Note that if it returns 1, this routine returns without unlocking page
> - * table locks. So callers must unlock them.
> + * Note that if it returns true, this routine returns without unlocking page
> + * table lock. So callers must unlock it.
>   */
>  bool __pmd_trans_huge_lock(pmd_t *pmd, struct vm_area_struct *vma,
>   spinlock_t **ptl)
> 








Re: [PATCHv9 35/36] mm: re-enable THP

2015-07-31 Thread Jerome Marchand
On 07/20/2015 04:21 PM, Kirill A. Shutemov wrote:
> All parts of THP with the new refcounting are now in place. We can now
> allow THP to be enabled.
> 
> Signed-off-by: Kirill A. Shutemov 
> Tested-by: Sasha Levin 
> Tested-by: Aneesh Kumar K.V 

Acked-by: Jerome Marchand 

> ---
>  mm/Kconfig | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/mm/Kconfig b/mm/Kconfig
> index c973f416cbe5..e79de2bd12cd 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -410,7 +410,7 @@ config NOMMU_INITIAL_TRIM_EXCESS
>  
>  config TRANSPARENT_HUGEPAGE
>   bool "Transparent Hugepage Support"
> - depends on HAVE_ARCH_TRANSPARENT_HUGEPAGE && BROKEN
> + depends on HAVE_ARCH_TRANSPARENT_HUGEPAGE
>   select COMPACTION
>   help
> Transparent Hugepages allows the kernel to use huge pages and
> 






Re: [PATCHv9 36/36] thp: update documentation

2015-07-31 Thread Jerome Marchand
On 07/20/2015 04:21 PM, Kirill A. Shutemov wrote:
> The patch updates Documentation/vm/transhuge.txt to reflect changes in
> THP design.
> 
> Signed-off-by: Kirill A. Shutemov 

Acked-by: Jerome Marchand 

> ---
>  Documentation/vm/transhuge.txt | 151 
> ++---
>  1 file changed, 96 insertions(+), 55 deletions(-)
> 






Re: [PATCHv9 34/36] thp: introduce deferred_split_huge_page()

2015-07-31 Thread Jerome Marchand
On 07/20/2015 04:21 PM, Kirill A. Shutemov wrote:
> Currently we don't split huge pages on partial unmap. That's not an ideal
> situation. It can lead to memory overhead.
> 
> Fortunately, we can detect partial unmap in page_remove_rmap(). But we
> cannot call split_huge_page() from there due to locking context.
> 
> It's also counterproductive to do it directly from the munmap() codepath: in
> many cases we will hit this from exit(2), and splitting the huge page
> just to free it up in small pages is not what we really want.
> 
> The patch introduces deferred_split_huge_page(), which puts the huge page
> into a queue for splitting. The splitting itself happens when we get
> memory pressure, via the shrinker interface. The page is dropped from the
> list on freeing, through the compound page destructor.
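
A sketch of the shrinker shape this describes, with illustrative names
rather than the patch's exact code (count_objects/scan_objects are the
stock shrinker callbacks):

#include <linux/shrinker.h>
#include <linux/spinlock.h>
#include <linux/list.h>

static LIST_HEAD(split_queue);          /* THPs waiting to be split */
static unsigned long split_queue_len;
static DEFINE_SPINLOCK(split_queue_lock);

/* Reclaim asks how much work is available... */
static unsigned long deferred_split_count(struct shrinker *shrink,
                                          struct shrink_control *sc)
{
        return READ_ONCE(split_queue_len);
}

/* ...then asks us to do some of it under memory pressure. */
static unsigned long deferred_split_scan(struct shrinker *shrink,
                                         struct shrink_control *sc)
{
        /* walk split_queue here, calling split_huge_page() on each
         * entry and returning how many pages were actually split */
        return 0;
}

static struct shrinker deferred_split_shrinker = {
        .count_objects = deferred_split_count,
        .scan_objects  = deferred_split_scan,
        .seeks         = DEFAULT_SEEKS,
};
/* registered once at init: register_shrinker(&deferred_split_shrinker); */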
> 
> Signed-off-by: Kirill A. Shutemov 
> Tested-by: Sasha Levin 
> Tested-by: Aneesh Kumar K.V 
> Acked-by: Vlastimil Babka 

Acked-by: Jerome Marchand 

> ---
>  include/linux/huge_mm.h |   4 ++
>  include/linux/mm.h  |   2 +
>  mm/huge_memory.c| 127 
> ++--
>  mm/migrate.c|   1 +
>  mm/page_alloc.c |   2 +-
>  mm/rmap.c   |   7 ++-
>  6 files changed, 138 insertions(+), 5 deletions(-)
>






Re: [PATCHv9 33/36] migrate_pages: try to split pages on qeueuing

2015-07-31 Thread Jerome Marchand
On 07/20/2015 04:21 PM, Kirill A. Shutemov wrote:
> We are not able to migrate THPs. It means it's not enough to split only
> the PMD on migration -- we need to split the compound page under it too.
> 
> Signed-off-by: Kirill A. Shutemov 
> Tested-by: Aneesh Kumar K.V 

Acked-by: Jerome Marchand 

> ---
>  mm/mempolicy.c | 37 +
>  1 file changed, 33 insertions(+), 4 deletions(-)
> 







Re: [PATCHv9 32/36] thp: reintroduce split_huge_page()

2015-07-31 Thread Jerome Marchand
On 07/20/2015 04:21 PM, Kirill A. Shutemov wrote:
> This patch adds an implementation of split_huge_page() for the new
> refcounting.
> 
> Unlike the previous implementation, the new split_huge_page() can fail if
> somebody holds a GUP pin on the page. It also means that a pin on a page
> would prevent it from being split under you. It makes the situation in
> many places much cleaner.
> 
> The basic scheme of split_huge_page():
> 
>   - Check that the sum of mapcounts of all subpages is equal to page_count()
>     plus one (the caller's pin). Fail with -EBUSY otherwise. This way we can
>     avoid useless PMD splits.
> 
>   - Freeze the page counters by splitting all PMDs and setting up migration
>     PTEs.
> 
>   - Re-check the sum of mapcounts against page_count(). The page's counts
>     are stable now. -EBUSY if the page is pinned.
> 
>   - Split the compound page.
> 
>   - Unfreeze the page by removing migration entries.
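
A condensed sketch of that control flow, using illustrative helper names
(pins_match_mapcounts() is hypothetical shorthand for the mapcount vs.
page_count() comparison above):

static int split_huge_page_sketch(struct page *head)
{
        if (!pins_match_mapcounts(head))    /* extra GUP pins present */
                return -EBUSY;              /* avoid a useless PMD split */

        freeze_page(head);                  /* split PMDs, install
                                             * migration PTEs */

        if (!pins_match_mapcounts(head)) {  /* raced with a new GUP pin */
                unfreeze_page(head);
                return -EBUSY;
        }

        __split_huge_page(head);            /* split the compound page */
        unfreeze_page(head);                /* remove migration entries */
        return 0;
}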
> 
> Signed-off-by: Kirill A. Shutemov 
> Tested-by: Sasha Levin 
> Tested-by: Aneesh Kumar K.V 

Acked-by: Jerome Marchand 

> ---
>  include/linux/huge_mm.h |   7 +-
>  include/linux/pagemap.h |  13 +-
>  mm/huge_memory.c| 318 
> 
>  mm/internal.h   |  26 +++-
>  mm/rmap.c   |  21 
>  5 files changed, 357 insertions(+), 28 deletions(-)
> 
> 






Re: [PATCHv9 31/36] thp, mm: split_huge_page(): caller need to lock page

2015-07-31 Thread Jerome Marchand
On 07/20/2015 04:21 PM, Kirill A. Shutemov wrote:
> We're going to use migration entries instead of compound_lock() to
> stabilize page refcounts. Setting up and removing migration entries
> requires the page to be locked.
> 
> Some of split_huge_page() callers already have the page locked. Let's
> require everybody to lock the page before calling split_huge_page().
> 
> Signed-off-by: Kirill A. Shutemov 
> Tested-by: Sasha Levin 
> Tested-by: Aneesh Kumar K.V 
> Acked-by: Vlastimil Babka 

Acked-by: Jerome Marchand 

> ---
>  mm/memory-failure.c | 10 --
>  mm/migrate.c|  8 ++--
>  2 files changed, 14 insertions(+), 4 deletions(-)
> 
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index ef33ccf37224..f32a607d1aa3 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -1143,15 +1143,18 @@ int memory_failure(unsigned long pfn, int trapno, int flags)
>   put_page(hpage);
>   return -EBUSY;
>   }
> + lock_page(hpage);
>   if (unlikely(split_huge_page(hpage))) {
>   pr_err("MCE: %#lx: thp split failed\n", pfn);
>   if (TestClearPageHWPoison(p))
>   atomic_long_sub(nr_pages, &num_poisoned_pages);
> + unlock_page(hpage);
>   put_page(p);
>   if (p != hpage)
>   put_page(hpage);
>   return -EBUSY;
>   }
> + unlock_page(hpage);
>   VM_BUG_ON_PAGE(!page_count(p), p);
>   hpage = compound_head(p);
>   }
> @@ -1714,10 +1717,13 @@ int soft_offline_page(struct page *page, int flags)
>   return -EBUSY;
>   }
>   if (!PageHuge(page) && PageTransHuge(hpage)) {
> - if (PageAnon(hpage) && unlikely(split_huge_page(hpage))) {
> + lock_page(page);
> + ret = split_huge_page(hpage);
> + unlock_page(page);
> + if (unlikely(ret)) {
>   pr_info("soft offline: %#lx: failed to split THP\n",
>   pfn);
> - return -EBUSY;
> + return ret;
>   }
>   }
>  
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 67970faf544d..a9dbfd356e9d 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -933,9 +933,13 @@ static ICE_noinline int unmap_and_move(new_page_t get_new_page,
>   goto out;
>   }
>  
> - if (unlikely(PageTransHuge(page)))
> - if (unlikely(split_huge_page(page)))
> + if (unlikely(PageTransHuge(page))) {
> + lock_page(page);
> + rc = split_huge_page(page);
> + unlock_page(page);
> + if (rc)
>   goto out;
> + }
>  
>   rc = __unmap_and_move(page, newpage, force, mode);
>  
> 






Re: [PATCHv9 30/36] thp: add option to setup migration entiries during PMD split

2015-07-31 Thread Jerome Marchand
On 07/20/2015 04:21 PM, Kirill A. Shutemov wrote:
> We are going to use migration PTE entires to stabilize page counts.
> If the page is mapped with PMDs we need to split the PMD and setup
> migration enties. It's reasonable to combine these operations to avoid
> double-scanning over the page table.

Entries? Three different typos for three occurrences of the same word.
You don't like it, do you?

> 
> Signed-off-by: Kirill A. Shutemov 
> Tested-by: Sasha Levin 
> Tested-by: Aneesh Kumar K.V 
> Acked-by: Vlastimil Babka 

Acked-by: Jerome Marchand 

> ---
>  mm/huge_memory.c | 23 +++
>  1 file changed, 15 insertions(+), 8 deletions(-)
> 






Re: [PATCHv9 28/36] mm, numa: skip PTE-mapped THP on numa fault

2015-07-31 Thread Jerome Marchand
On 07/20/2015 04:21 PM, Kirill A. Shutemov wrote:
> We're going to have THPs mapped with PTEs. It will confuse NUMA balancing.
> Let's skip them for now.

Fair enough.

Acked-by: Jerome Marchand 

> 
> Signed-off-by: Kirill A. Shutemov 
> Tested-by: Sasha Levin 
> Tested-by: Aneesh Kumar K.V 
> ---
>  mm/memory.c | 6 ++
>  1 file changed, 6 insertions(+)
> 
> diff --git a/mm/memory.c b/mm/memory.c
> index 074edab89b52..52f6fa02c099 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3186,6 +3186,12 @@ static int do_numa_page(struct mm_struct *mm, struct 
> vm_area_struct *vma,
>   return 0;
>   }
>  
> + /* TODO: handle PTE-mapped THP */
> + if (PageCompound(page)) {
> + pte_unmap_unlock(ptep, ptl);
> + return 0;
> + }
> +
>   /*
>* Avoid grouping on RO pages in general. RO pages shouldn't hurt as
>* much anyway since they can be in shared cache state. This misses
> 






Re: [PATCHv9 29/36] thp: implement split_huge_pmd()

2015-07-31 Thread Jerome Marchand
On 07/20/2015 04:21 PM, Kirill A. Shutemov wrote:
> The original split_huge_page() combined two operations: splitting PMDs into
> tables of PTEs and splitting the underlying compound page. This patch
> implements split_huge_pmd(), which splits the given PMD without splitting
> other PMDs this page may be mapped with, or the underlying compound page.
> 
> Without tail page refcounting, the implementation of split_huge_pmd() is
> pretty straightforward.

While it's significantly simpler than it used to be, straightforward is
still not the adjective which comes to my mind.

> 
> Signed-off-by: Kirill A. Shutemov 
> Tested-by: Sasha Levin 
> Tested-by: Aneesh Kumar K.V 

Acked-by: Jerome Marchand 

> ---
>  include/linux/huge_mm.h |  11 -
>  mm/huge_memory.c| 122 
> 
>  2 files changed, 132 insertions(+), 1 deletion(-)
> 







Re: [PATCHv9 27/36] mm: differentiate page_mapped() from page_mapcount() for compound pages

2015-07-31 Thread Jerome Marchand
On 07/20/2015 04:21 PM, Kirill A. Shutemov wrote:
> Let's define page_mapped() to be true for compound pages if any
> sub-page of the compound page is mapped (with a PMD or PTE).
> 
> On the other hand, page_mapcount() returns the mapcount of this particular
> small page.
> 
> This will make cases like page_get_anon_vma() behave correctly once we
> allow huge pages to be mapped with PTEs.
> 
> Most users outside core-mm should use page_mapcount() instead of
> page_mapped().
> 
> Signed-off-by: Kirill A. Shutemov 
> Tested-by: Sasha Levin 
> Tested-by: Aneesh Kumar K.V 

Acked-by: Jerome Marchand 

> ---
>  arch/arc/mm/cache_arc700.c |  4 ++--
>  arch/arm/mm/flush.c|  2 +-
>  arch/mips/mm/c-r4k.c   |  3 ++-
>  arch/mips/mm/cache.c   |  2 +-
>  arch/mips/mm/init.c|  6 +++---
>  arch/sh/mm/cache-sh4.c |  2 +-
>  arch/sh/mm/cache.c |  8 
>  arch/xtensa/mm/tlb.c   |  2 +-
>  fs/proc/page.c |  4 ++--
>  include/linux/mm.h | 15 +--
>  mm/filemap.c   |  2 +-
>  11 files changed, 31 insertions(+), 19 deletions(-)
> 






Re: [PATCHv9 26/36] mm: rework mapcount accounting to enable 4k mapping of THPs

2015-07-31 Thread Jerome Marchand
On 07/20/2015 04:20 PM, Kirill A. Shutemov wrote:
> We're going to allow mapping of individual 4k pages of a THP compound.
> It means we need to track the mapcount on a per-small-page basis.
> 
> The straightforward approach is to use ->_mapcount in all subpages to track
> how many times each subpage is mapped with PMDs or PTEs combined. But
> this is rather expensive: mapping or unmapping of a THP page with a PMD
> would require HPAGE_PMD_NR atomic operations instead of the single one we
> have now.
> 
> The idea is to store separately how many times the page was mapped as a
> whole -- compound_mapcount. This frees up ->_mapcount in subpages to
> track the PTE mapcount.
> 
> We use the same approach as with the compound page destructor and compound
> order to store compound_mapcount: use space in the first tail page,
> ->mapping this time.
> 
> Any time we map/unmap a whole compound page (THP or hugetlb) -- we
> increment/decrement compound_mapcount. When we map part of a compound page
> with a PTE we operate on ->_mapcount of the subpage.
> 
> page_mapcount() counts both: PTE and PMD mappings of the page.
> 
> Basically, we have the mapcount for a subpage spread over two counters.
> It makes it tricky to detect when the last mapcount for a page goes away.
> 
> We introduced PageDoubleMap() for this. When we split a THP PMD for the
> first time and there's another PMD mapping left, we offset ->_mapcount up
> in all subpages by one and set PG_double_map on the compound page.
> These additional references go away with the last compound_mapcount.

So this stays even if all PTE mappings go and the page is again mapped
only with a PMD. I'm not sure how often that happens or if it's an issue
worth caring about.
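
To make the two-counter scheme concrete, here is a userspace toy model of
the rules described above (my sketch, not the kernel's data layout; raw
counters start at -1 as ->_mapcount does):

#include <stdio.h>
#include <stdbool.h>

#define NR_SUBPAGES 8                   /* stand-in for HPAGE_PMD_NR */

struct thp {
        int compound_mapcount;          /* raw: PMD mappings - 1 */
        int pte_mapcount[NR_SUBPAGES];  /* raw: PTE mappings - 1 */
        bool double_map;                /* PG_double_map */
};

/* Mapcount of subpage i: PTE side plus PMD side, minus the offset that
 * was added to every subpage when PG_double_map was set. */
static int page_mapcount(const struct thp *t, int i)
{
        int ret = t->pte_mapcount[i] + 1 + t->compound_mapcount + 1;

        return t->double_map ? ret - 1 : ret;
}

int main(void)
{
        struct thp t = { .compound_mapcount = -1 };
        int i;

        for (i = 0; i < NR_SUBPAGES; i++)
                t.pte_mapcount[i] = -1;

        t.compound_mapcount += 2;       /* THP mapped by two PMDs */
        printf("two PMDs:    subpage 0 mapcount = %d\n", page_mapcount(&t, 0));

        /* First PMD split while another PMD mapping remains: offset every
         * subpage by one, set PG_double_map, and turn the split mapping
         * into per-subpage PTEs. */
        t.double_map = true;
        for (i = 0; i < NR_SUBPAGES; i++)
                t.pte_mapcount[i] += 2; /* +1 offset, +1 new PTE */
        t.compound_mapcount--;
        printf("after split: subpage 0 mapcount = %d\n", page_mapcount(&t, 0));
        return 0;
}

Both lines print 2: each subpage is still reachable through one PMD and one
PTE after the split, and the PG_double_map offset is what keeps the
arithmetic honest once the extra references appear.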

Acked-by: Jerome Marchand 

> 
> This approach provides a way to detect when the last mapcount goes away on
> a per-small-page basis without introducing new overhead for the most common
> cases.
> 
> Signed-off-by: Kirill A. Shutemov 
> Tested-by: Aneesh Kumar K.V 
> ---
>  include/linux/mm.h | 26 +++-
>  include/linux/mm_types.h   |  1 +
>  include/linux/page-flags.h | 37 +
>  include/linux/rmap.h   |  4 +-
>  mm/debug.c |  5 ++-
>  mm/huge_memory.c   |  2 +-
>  mm/hugetlb.c   |  4 +-
>  mm/memory.c|  2 +-
>  mm/migrate.c   |  2 +-
>  mm/page_alloc.c| 14 +--
>  mm/rmap.c  | 99 
> +++---
>  11 files changed, 161 insertions(+), 35 deletions(-)
> 






Re: [PATCHv9 25/36] mm, thp: remove infrastructure for handling splitting PMDs

2015-07-31 Thread Jerome Marchand
On 07/20/2015 04:20 PM, Kirill A. Shutemov wrote:
> With new refcounting we don't need to mark PMDs splitting. Let's drop code
> to handle this.
> 
> Signed-off-by: Kirill A. Shutemov 
> Tested-by: Sasha Levin 
> Tested-by: Aneesh Kumar K.V 
> Acked-by: Vlastimil Babka 
> ---
>  fs/proc/task_mmu.c|  8 +++---
>  include/asm-generic/pgtable.h |  9 ---
>  include/linux/huge_mm.h   | 21 +--
>  mm/gup.c  | 12 +
>  mm/huge_memory.c  | 60 
> ++-
>  mm/memcontrol.c   | 13 ++
>  mm/memory.c   | 18 ++---
>  mm/mincore.c  |  2 +-
>  mm/mremap.c   | 15 +--
>  mm/pgtable-generic.c  | 14 --
>  mm/rmap.c |  4 +--
>  11 files changed, 37 insertions(+), 139 deletions(-)
> 

snip

> @@ -1616,23 +1605,14 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t 
> *pmd,
>   * Note that if it returns 1, this routine returns without unlocking page
>   * table locks. So callers must unlock them.
>   */

The comment above should be updated. It otherwise looks good.

Acked-by: Jerome Marchand 

> -int __pmd_trans_huge_lock(pmd_t *pmd, struct vm_area_struct *vma,
> +bool __pmd_trans_huge_lock(pmd_t *pmd, struct vm_area_struct *vma,
>   spinlock_t **ptl)
>  {
>   *ptl = pmd_lock(vma->vm_mm, pmd);
> - if (likely(pmd_trans_huge(*pmd))) {
> - if (unlikely(pmd_trans_splitting(*pmd))) {
> - spin_unlock(*ptl);
> - wait_split_huge_page(vma->anon_vma, pmd);
> - return -1;
> - } else {
> - /* Thp mapped by 'pmd' is stable, so we can
> -  * handle it as it is. */
> - return 1;
> - }
> - }
> + if (likely(pmd_trans_huge(*pmd)))
> + return true;
>   spin_unlock(*ptl);
> - return 0;
> + return false;
>  }
>  
>  /*






Re: [PATCHv9 15/36] ksm: prepare to new THP semantics

2015-07-31 Thread Jerome Marchand
On 07/20/2015 04:20 PM, Kirill A. Shutemov wrote:
> We don't need special code to stabilize THP. If you've got a reference to
> any subpage of a THP, it will not be split under you.
> 
> The new split_huge_page() also accepts tail pages: no need for special code
> to get a reference to the head page.
> 
> Signed-off-by: Kirill A. Shutemov 
> Tested-by: Sasha Levin 
> Tested-by: Aneesh Kumar K.V 
> Acked-by: Vlastimil Babka 

Acked-by: Jerome Marchand 

> ---
>  mm/ksm.c | 57 ++---
>  1 file changed, 10 insertions(+), 47 deletions(-)
> 





Re: [PATCHv9 16/36] mm, thp: remove compound_lock

2015-07-31 Thread Jerome Marchand
On 07/20/2015 04:20 PM, Kirill A. Shutemov wrote:
> We are going to use migration entries to stabilize page counts. It means
> we don't need compound_lock() for that.
> 
> Signed-off-by: Kirill A. Shutemov 
> Tested-by: Sasha Levin 
> Tested-by: Aneesh Kumar K.V 
> Acked-by: Vlastimil Babka 

Acked-by: Jerome Marchand 

> ---
>  include/linux/mm.h | 35 ---
>  include/linux/page-flags.h | 12 +---
>  mm/debug.c |  3 ---
>  mm/memcontrol.c| 11 +++
>  4 files changed, 4 insertions(+), 57 deletions(-)
> 






Re: [PATCHv9 24/36] x86, thp: remove infrastructure for handling splitting PMDs

2015-07-31 Thread Jerome Marchand
On 07/20/2015 04:20 PM, Kirill A. Shutemov wrote:
> With new refcounting we don't need to mark PMDs splitting. Let's drop
> code to handle this.
> 
> Signed-off-by: Kirill A. Shutemov 
> Tested-by: Sasha Levin 

Acked-by: Jerome Marchand 

> ---
>  arch/x86/include/asm/pgtable.h   |  9 -
>  arch/x86/include/asm/pgtable_types.h |  2 --
>  arch/x86/mm/gup.c| 13 +
>  arch/x86/mm/pgtable.c| 14 --
>  4 files changed, 1 insertion(+), 37 deletions(-)
> 







Re: [PATCHv9 14/36] futex, thp: remove special case for THP in get_futex_key

2015-07-31 Thread Jerome Marchand
On 07/20/2015 04:20 PM, Kirill A. Shutemov wrote:
> With the new THP refcounting, we don't need tricks to stabilize the huge
> page. If we've got a reference to a tail page, it can't be split under us.
> 
> This patch effectively reverts a5b338f2b0b1.
> 
> Signed-off-by: Kirill A. Shutemov 
> Tested-by: Sasha Levin 
> Tested-by: Aneesh Kumar K.V 

Acked-by: Jerome Marchand 

> ---
>  kernel/futex.c | 61 
> --
>  1 file changed, 12 insertions(+), 49 deletions(-)







Re: [PATCHv9 13/36] mm: drop tail page refcounting

2015-07-31 Thread Jerome Marchand
On 07/20/2015 04:20 PM, Kirill A. Shutemov wrote:
> Tail page refcounting is utterly complicated and painful to support.
> 
> It uses ->_mapcount on tail pages to store how many times this page is
> pinned. get_page() bumps ->_mapcount on the tail page in addition to
> ->_count on the head. This information is required by split_huge_page() to
> be able to distribute pins from the head of the compound page to the tails
> during the split.
> 
> We will need ->_mapcount to account PTE mappings of subpages of the
> compound page. We eliminate the need for the current meaning of ->_mapcount
> in tail pages by forbidding the split entirely if the page is pinned.
> 
> The only user of tail page refcounting is THP which is marked BROKEN for
> now.
> 
> Let's drop all this mess. It makes get_page() and put_page() much
> simpler.
> 
> Signed-off-by: Kirill A. Shutemov 
> Tested-by: Sasha Levin 
> Tested-by: Aneesh Kumar K.V 
> Acked-by: Vlastimil Babka 

Acked-by: Jerome Marchand 

> ---
>  arch/mips/mm/gup.c|   4 -
>  arch/powerpc/mm/hugetlbpage.c |  13 +-
>  arch/s390/mm/gup.c|  13 +-
>  arch/sparc/mm/gup.c   |  14 +--
>  arch/x86/mm/gup.c |   4 -
>  include/linux/mm.h|  47 ++--
>  include/linux/mm_types.h  |  17 +--
>  mm/gup.c  |  34 +-
>  mm/huge_memory.c  |  41 +--
>  mm/hugetlb.c  |   2 +-
>  mm/internal.h |  44 ---
>  mm/swap.c | 273 
> +++---
>  12 files changed, 40 insertions(+), 466 deletions(-)
> 





