from:"Rabin Vincent"

Re: [PATCH] arm: kgdb: fix NUMREGBYTES so that gdb_regs[] is the correct size

2018-04-05 Thread Rabin Vincent

On Thu, Apr 05, 2018 at 04:09:16PM -0400, David Rivshin wrote:
> From: David Rivshin <drivs...@allworx.com>
> 
> NUMREGBYTES (which is used as the size for gdb_regs[]) is incorrectly based
> on DBG_MAX_REG_NUM instead of GDB_MAX_REGS. DBG_MAX_REG_NUM is the number
> of total registers, while GDB_MAX_REGS is the number of 'unsigned longs'
> it takes to serialize those registers. Since FP registers require 3
> 'unsigned longs' each, DBG_MAX_REG_NUM is smaller than GDB_MAX_REGS.
> 
> This causes GDB 8.0 give the following error on connect:
> "Truncated register 19 in remote 'g' packet"
> 
> This also causes the register serialization/deserialization logic to
> overflow gdb_regs[], overwriting whatever follows.
> 
> Fixes: 834b2964b7ab ("kgdb,arm: fix register dump")
> Cc: <sta...@vger.kernel.org> # 2.6.37+
> Signed-off-by: David Rivshin <drivs...@allworx.com>

Acked-by: Rabin Vincent <ra...@rab.in>

Re: [PATCH] arm: kgdb: fix NUMREGBYTES so that gdb_regs[] is the correct size

2018-04-05 Thread Rabin Vincent

On Thu, Apr 05, 2018 at 04:09:16PM -0400, David Rivshin wrote:
> From: David Rivshin 
> 
> NUMREGBYTES (which is used as the size for gdb_regs[]) is incorrectly based
> on DBG_MAX_REG_NUM instead of GDB_MAX_REGS. DBG_MAX_REG_NUM is the number
> of total registers, while GDB_MAX_REGS is the number of 'unsigned longs'
> it takes to serialize those registers. Since FP registers require 3
> 'unsigned longs' each, DBG_MAX_REG_NUM is smaller than GDB_MAX_REGS.
> 
> This causes GDB 8.0 give the following error on connect:
> "Truncated register 19 in remote 'g' packet"
> 
> This also causes the register serialization/deserialization logic to
> overflow gdb_regs[], overwriting whatever follows.
> 
> Fixes: 834b2964b7ab ("kgdb,arm: fix register dump")
> Cc:  # 2.6.37+
> Signed-off-by: David Rivshin 

Acked-by: Rabin Vincent

[PATCH for v4.9-stable] sched: fix softirq time accounting

2017-12-13 Thread Rabin Vincent

From: Rabin Vincent <rab...@axis.com>

softirq time accounting is broken on v4.9.x if ksoftirqd runs.

With
CONFIG_IRQ_TIME_ACCOUNTING=y
# CONFIG_VIRT_CPU_ACCOUNTING_GEN is not set

this test code:

struct tasklet_struct tasklet;

static void delay_tasklet(unsigned long data)
{
udelay(10);
tasklet_schedule();
}

tasklet_init(, delay_tasklet, 0);
tasklet_schedule();

results in:

$ while :; do grep cpu0 /proc/stat; done
cpu0 5 0 80 25 16 107 1 0 0 0
cpu0 5 0 80 25 16 107 0 0 0 0
cpu0 5 0 80 25 16 107 0 0 0 0
cpu0 5 0 80 25 16 107 0 0 0 0
cpu0 5 0 81 25 16 107 0 0 0 0
cpu0 5 0 81 25 16 107 1 0 0 0
cpu0 5 0 81 25 16 108 18446744073709551615 0 0 0
cpu0 5 0 81 25 16 108 18446744073709551615 0 0 0
cpu0 5 0 81 25 16 108 18446744073709551615 0 0 0
cpu0 5 0 81 25 16 108 0 0 0 0
cpu0 6 0 81 25 16 108 0 0 0 0
cpu0 6 0 81 25 16 108 0 0 0 0

As can be seen, the softirq numbers are totally bogus.

When ksoftirq is running, irqtime_account_process_tick() increments
cpustat[CPUSTAT_SOFTIRQ].  This causes the "nsecs_to_cputime64(irqtime)
- cpustat[CPUSTAT_SOFTIRQ]" calculation in irqtime_account_update() to
underflow the next time a softirq is handled leading to the above
values.

The underflow bug was added by 57430218317e5b280 ("sched/cputime: Count
actually elapsed irq & softirq time").

But ksoftirqd accounting was wrong even in earlier kernels.  In earlier
kernels, after a ksoftirq run, the kernel would simply stop accounting
softirq time spent outside of ksoftirqd until that (accumulated) time
exceeded the time for which ksofirqd previously had run.

Fix both the underflow and the wrong accounting by using a counter
specifically for the non-ksoftirqd softirq case.

This code has been fixed in current mainline by a499a5a14db
("sched/cputime: Increment kcpustat directly on irqtime account") [note
also the followup 25e2d8c1b9e327e ("sched/cputime: Fix ksoftirqd cputime
accounting regression")], but that patch is a part of the many changes
for eliminating of cputime_t so it does not seem suitable for backport.

Signed-off-by: Rabin Vincent <rab...@axis.com>
---
 include/linux/kernel_stat.h | 1 +
 kernel/sched/cputime.c  | 9 -
 2 files changed, 9 insertions(+), 1 deletion(-)

diff --git a/include/linux/kernel_stat.h b/include/linux/kernel_stat.h
index 44fda64..d0826f1 100644
--- a/include/linux/kernel_stat.h
+++ b/include/linux/kernel_stat.h
@@ -33,6 +33,7 @@ enum cpu_usage_stat {
 
 struct kernel_cpustat {
u64 cpustat[NR_STATS];
+   u64 softirq_no_ksoftirqd;
 };
 
 struct kernel_stat {
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 5ebee31..1b5a9e6 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -73,12 +73,19 @@ EXPORT_SYMBOL_GPL(irqtime_account_irq);
 static cputime_t irqtime_account_update(u64 irqtime, int idx, cputime_t 
maxtime)
 {
u64 *cpustat = kcpustat_this_cpu->cpustat;
+   u64 base = cpustat[idx];
cputime_t irq_cputime;
 
-   irq_cputime = nsecs_to_cputime64(irqtime) - cpustat[idx];
+   if (idx == CPUTIME_SOFTIRQ)
+   base = kcpustat_this_cpu->softirq_no_ksoftirqd;
+
+   irq_cputime = nsecs_to_cputime64(irqtime) - base;
irq_cputime = min(irq_cputime, maxtime);
cpustat[idx] += irq_cputime;
 
+   if (idx == CPUTIME_SOFTIRQ)
+   kcpustat_this_cpu->softirq_no_ksoftirqd += irq_cputime;
+
return irq_cputime;
 }
 
-- 
2.1.4

[PATCH for v4.9-stable] sched: fix softirq time accounting

2017-12-13 Thread Rabin Vincent

From: Rabin Vincent 

softirq time accounting is broken on v4.9.x if ksoftirqd runs.

With
CONFIG_IRQ_TIME_ACCOUNTING=y
# CONFIG_VIRT_CPU_ACCOUNTING_GEN is not set

this test code:

struct tasklet_struct tasklet;

static void delay_tasklet(unsigned long data)
{
udelay(10);
tasklet_schedule();
}

tasklet_init(, delay_tasklet, 0);
tasklet_schedule();

results in:

$ while :; do grep cpu0 /proc/stat; done
cpu0 5 0 80 25 16 107 1 0 0 0
cpu0 5 0 80 25 16 107 0 0 0 0
cpu0 5 0 80 25 16 107 0 0 0 0
cpu0 5 0 80 25 16 107 0 0 0 0
cpu0 5 0 81 25 16 107 0 0 0 0
cpu0 5 0 81 25 16 107 1 0 0 0
cpu0 5 0 81 25 16 108 18446744073709551615 0 0 0
cpu0 5 0 81 25 16 108 18446744073709551615 0 0 0
cpu0 5 0 81 25 16 108 18446744073709551615 0 0 0
cpu0 5 0 81 25 16 108 0 0 0 0
cpu0 6 0 81 25 16 108 0 0 0 0
cpu0 6 0 81 25 16 108 0 0 0 0

As can be seen, the softirq numbers are totally bogus.

When ksoftirq is running, irqtime_account_process_tick() increments
cpustat[CPUSTAT_SOFTIRQ].  This causes the "nsecs_to_cputime64(irqtime)
- cpustat[CPUSTAT_SOFTIRQ]" calculation in irqtime_account_update() to
underflow the next time a softirq is handled leading to the above
values.

The underflow bug was added by 57430218317e5b280 ("sched/cputime: Count
actually elapsed irq & softirq time").

But ksoftirqd accounting was wrong even in earlier kernels.  In earlier
kernels, after a ksoftirq run, the kernel would simply stop accounting
softirq time spent outside of ksoftirqd until that (accumulated) time
exceeded the time for which ksofirqd previously had run.

Fix both the underflow and the wrong accounting by using a counter
specifically for the non-ksoftirqd softirq case.

This code has been fixed in current mainline by a499a5a14db
("sched/cputime: Increment kcpustat directly on irqtime account") [note
also the followup 25e2d8c1b9e327e ("sched/cputime: Fix ksoftirqd cputime
accounting regression")], but that patch is a part of the many changes
for eliminating of cputime_t so it does not seem suitable for backport.

Signed-off-by: Rabin Vincent 
---
 include/linux/kernel_stat.h | 1 +
 kernel/sched/cputime.c  | 9 -
 2 files changed, 9 insertions(+), 1 deletion(-)

diff --git a/include/linux/kernel_stat.h b/include/linux/kernel_stat.h
index 44fda64..d0826f1 100644
--- a/include/linux/kernel_stat.h
+++ b/include/linux/kernel_stat.h
@@ -33,6 +33,7 @@ enum cpu_usage_stat {
 
 struct kernel_cpustat {
u64 cpustat[NR_STATS];
+   u64 softirq_no_ksoftirqd;
 };
 
 struct kernel_stat {
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 5ebee31..1b5a9e6 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -73,12 +73,19 @@ EXPORT_SYMBOL_GPL(irqtime_account_irq);
 static cputime_t irqtime_account_update(u64 irqtime, int idx, cputime_t 
maxtime)
 {
u64 *cpustat = kcpustat_this_cpu->cpustat;
+   u64 base = cpustat[idx];
cputime_t irq_cputime;
 
-   irq_cputime = nsecs_to_cputime64(irqtime) - cpustat[idx];
+   if (idx == CPUTIME_SOFTIRQ)
+   base = kcpustat_this_cpu->softirq_no_ksoftirqd;
+
+   irq_cputime = nsecs_to_cputime64(irqtime) - base;
irq_cputime = min(irq_cputime, maxtime);
cpustat[idx] += irq_cputime;
 
+   if (idx == CPUTIME_SOFTIRQ)
+   kcpustat_this_cpu->softirq_no_ksoftirqd += irq_cputime;
+
return irq_cputime;
 }
 
-- 
2.1.4

Re: [PATCH 1/2] arm64: mm: abort uaccess retries upon fatal signal

2017-11-13 Thread Rabin Vincent

On Tue, Aug 22, 2017 at 10:45:24AM +0100, Will Deacon wrote:
> On Mon, Aug 21, 2017 at 02:42:03PM +0100, Mark Rutland wrote:
> > On Tue, Jul 11, 2017 at 03:58:49PM +0100, Will Deacon wrote:
> > > On Tue, Jul 11, 2017 at 03:19:22PM +0100, Mark Rutland wrote:
> > > > When there's a fatal signal pending, arm64's do_page_fault()
> > > > implementation returns 0. The intent is that we'll return to the
> > > > faulting userspace instruction, delivering the signal on the way.
> > > > 
> > > > However, if we take a fatal signal during fixing up a uaccess, this
> > > > results in a return to the faulting kernel instruction, which will be
> > > > instantly retried, resulting in the same fault being taken forever. As
> > > > the task never reaches userspace, the signal is not delivered, and the
> > > > task is left unkillable. While the task is stuck in this state, it can
> > > > inhibit the forward progress of the system.
> > > > 
> > > > To avoid this, we must ensure that when a fatal signal is pending, we
> > > > apply any necessary fixup for a faulting kernel instruction. Thus we
> > > > will return to an error path, and it is up to that code to make forward
> > > > progress towards delivering the fatal signal.
> > > > 
> > > > Signed-off-by: Mark Rutland 
> > > > Reviewed-by: Steve Capper 
> > > > Tested-by: Steve Capper 
> > > > Cc: Catalin Marinas 
> > > > Cc: James Morse 
> > > > Cc: Laura Abbott 
> > > > Cc: Will Deacon 
> > > > Cc: sta...@vger.kernel.org
> > > > ---
> > > >  arch/arm64/mm/fault.c | 5 -
> > > >  1 file changed, 4 insertions(+), 1 deletion(-)
> > > > 
> > > > diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
> > > > index 37b95df..3952d5e 100644
> > > > --- a/arch/arm64/mm/fault.c
> > > > +++ b/arch/arm64/mm/fault.c
> > > > @@ -397,8 +397,11 @@ static int __kprobes do_page_fault(unsigned long 
> > > > addr, unsigned int esr,
> > > >  * signal first. We do not need to release the mmap_sem because 
> > > > it
> > > >  * would already be released in __lock_page_or_retry in 
> > > > mm/filemap.c.
> > > >  */
> > > > -   if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
> > > > +   if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current)) {
> > > > +   if (!user_mode(regs))
> > > > +   goto no_context;
> > > > return 0;
> > > > +   }
> > > 
> > > This will need rebasing at -rc1 (take a look at current HEAD).
> > > 
> > > Also, I think it introduces a weird corner case where we take a page fault
> > > when writing the signal frame to the user stack to deliver a SIGSEGV. If
> > > we end up with VM_FAULT_RETRY and somebody has sent a SIGKILL to the task,
> > > then we'll fail setup_sigframe and force an un-handleable SIGSEGV instead
> > > of SIGKILL.
> > > 
> > > The end result (task is killed) is the same, but the fatal signal is 
> > > wrong.
> > 
> > That doesn't seem to be the case, testing on v4.13-rc5.
> > 
> > I used sigaltstack() to use the userfaultfd region as signal stack,
> > registerd a SIGSEGV handler, and dereferenced NULL. The task locks up,
> > but when killed with a SIGINT or SIGKILL, the exit status reflects that
> > signal, rather than the SIGSEGV.
> > 
> > If I move the SIGINT handler onto the userfaultfd-monitored stack, then
> > delivering SIGINT hangs, but can be killed with SIGKILL, and the exit
> > status reflects that SIGKILL.
> > 
> > As you say, it does look like we'd try to set up a deferred SIGSEGV for
> > the failed signal delivery.
> > 
> > I haven't yet figured out exactly how that works; I'll keep digging.
> 
> The SEGV makes it all the way into do_group_exit, but then signal_group_exit
> is set and the exit_code is overridden with SIGKILL at the last minute (see
> complete_signal).

Unfortunately, this last minute is too late for print-fatal-signals.
With print-fatal-signals enabled, this patch leads to misleading
"potentially unexpected fatal signal 11" warnings if a process is
SIGKILL'd at the right time.

I've seen it without userfaultfd, but it's easiest reproduced by
patching Mark's original test code [1] with the following patch and
simply running "pkill -WINCH foo; pkill -KILL foo".  This results in:

 foo: potentially unexpected fatal signal 11.
 CPU: 1 PID: 1793 Comm: foo Not tainted 4.9.58-devel #3
 task: b3534780 task.stack: b4b7c000
 PC is at 0x76effb60
 LR is at 0x4227f4
 pc : [<76effb60>]lr : [<004227f4>]psr: 600b0010
 sp : 7eaf7bb4  ip :   fp : 
 r10: 0001  r9 : 0003  r8 : 76fcd000
 r7 : 001d  r6 : 76fd0cf0  r5 : 7eaf7c08  r4 : 
 r3 :   r2 :   r1 : 7eaf7a88  r0 : fffc
 Flags: nZCv  IRQs on  FIQs on  Mode USER_32  ISA ARM  Segment user
 Control: 10c5387d  Table: 3357404a  DAC: 0055
 CPU: 1 PID: 1793 Comm: foo Not tainted

Re: [PATCH 1/2] arm64: mm: abort uaccess retries upon fatal signal

2017-11-13 Thread Rabin Vincent

On Tue, Aug 22, 2017 at 10:45:24AM +0100, Will Deacon wrote:
> On Mon, Aug 21, 2017 at 02:42:03PM +0100, Mark Rutland wrote:
> > On Tue, Jul 11, 2017 at 03:58:49PM +0100, Will Deacon wrote:
> > > On Tue, Jul 11, 2017 at 03:19:22PM +0100, Mark Rutland wrote:
> > > > When there's a fatal signal pending, arm64's do_page_fault()
> > > > implementation returns 0. The intent is that we'll return to the
> > > > faulting userspace instruction, delivering the signal on the way.
> > > > 
> > > > However, if we take a fatal signal during fixing up a uaccess, this
> > > > results in a return to the faulting kernel instruction, which will be
> > > > instantly retried, resulting in the same fault being taken forever. As
> > > > the task never reaches userspace, the signal is not delivered, and the
> > > > task is left unkillable. While the task is stuck in this state, it can
> > > > inhibit the forward progress of the system.
> > > > 
> > > > To avoid this, we must ensure that when a fatal signal is pending, we
> > > > apply any necessary fixup for a faulting kernel instruction. Thus we
> > > > will return to an error path, and it is up to that code to make forward
> > > > progress towards delivering the fatal signal.
> > > > 
> > > > Signed-off-by: Mark Rutland 
> > > > Reviewed-by: Steve Capper 
> > > > Tested-by: Steve Capper 
> > > > Cc: Catalin Marinas 
> > > > Cc: James Morse 
> > > > Cc: Laura Abbott 
> > > > Cc: Will Deacon 
> > > > Cc: sta...@vger.kernel.org
> > > > ---
> > > >  arch/arm64/mm/fault.c | 5 -
> > > >  1 file changed, 4 insertions(+), 1 deletion(-)
> > > > 
> > > > diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
> > > > index 37b95df..3952d5e 100644
> > > > --- a/arch/arm64/mm/fault.c
> > > > +++ b/arch/arm64/mm/fault.c
> > > > @@ -397,8 +397,11 @@ static int __kprobes do_page_fault(unsigned long 
> > > > addr, unsigned int esr,
> > > >  * signal first. We do not need to release the mmap_sem because 
> > > > it
> > > >  * would already be released in __lock_page_or_retry in 
> > > > mm/filemap.c.
> > > >  */
> > > > -   if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
> > > > +   if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current)) {
> > > > +   if (!user_mode(regs))
> > > > +   goto no_context;
> > > > return 0;
> > > > +   }
> > > 
> > > This will need rebasing at -rc1 (take a look at current HEAD).
> > > 
> > > Also, I think it introduces a weird corner case where we take a page fault
> > > when writing the signal frame to the user stack to deliver a SIGSEGV. If
> > > we end up with VM_FAULT_RETRY and somebody has sent a SIGKILL to the task,
> > > then we'll fail setup_sigframe and force an un-handleable SIGSEGV instead
> > > of SIGKILL.
> > > 
> > > The end result (task is killed) is the same, but the fatal signal is 
> > > wrong.
> > 
> > That doesn't seem to be the case, testing on v4.13-rc5.
> > 
> > I used sigaltstack() to use the userfaultfd region as signal stack,
> > registerd a SIGSEGV handler, and dereferenced NULL. The task locks up,
> > but when killed with a SIGINT or SIGKILL, the exit status reflects that
> > signal, rather than the SIGSEGV.
> > 
> > If I move the SIGINT handler onto the userfaultfd-monitored stack, then
> > delivering SIGINT hangs, but can be killed with SIGKILL, and the exit
> > status reflects that SIGKILL.
> > 
> > As you say, it does look like we'd try to set up a deferred SIGSEGV for
> > the failed signal delivery.
> > 
> > I haven't yet figured out exactly how that works; I'll keep digging.
> 
> The SEGV makes it all the way into do_group_exit, but then signal_group_exit
> is set and the exit_code is overridden with SIGKILL at the last minute (see
> complete_signal).

Unfortunately, this last minute is too late for print-fatal-signals.
With print-fatal-signals enabled, this patch leads to misleading
"potentially unexpected fatal signal 11" warnings if a process is
SIGKILL'd at the right time.

I've seen it without userfaultfd, but it's easiest reproduced by
patching Mark's original test code [1] with the following patch and
simply running "pkill -WINCH foo; pkill -KILL foo".  This results in:

 foo: potentially unexpected fatal signal 11.
 CPU: 1 PID: 1793 Comm: foo Not tainted 4.9.58-devel #3
 task: b3534780 task.stack: b4b7c000
 PC is at 0x76effb60
 LR is at 0x4227f4
 pc : [<76effb60>]lr : [<004227f4>]psr: 600b0010
 sp : 7eaf7bb4  ip :   fp : 
 r10: 0001  r9 : 0003  r8 : 76fcd000
 r7 : 001d  r6 : 76fd0cf0  r5 : 7eaf7c08  r4 : 
 r3 :   r2 :   r1 : 7eaf7a88  r0 : fffc
 Flags: nZCv  IRQs on  FIQs on  Mode USER_32  ISA ARM  Segment user
 Control: 10c5387d  Table: 3357404a  DAC: 0055
 CPU: 1 PID: 1793 Comm: foo Not tainted 4.9.58-devel #3
 [<801113f0>] (unwind_backtrace) from [<8010cfb0>] (show_stack+0x18/0x1c)
 [<8010cfb0>] (show_stack) from [<8039725c>]

[PATCH] CIFS: fix circular locking dependency

2017-06-29 Thread Rabin Vincent

From: Rabin Vincent <rab...@axis.com>

When a CIFS filesystem is mounted with the forcemand option and the
following command is run on it, lockdep warns about a circular locking
dependency between CifsInodeInfo::lock_sem and the inode lock.

 while echo foo > hello; do :; done & while touch -c hello; do :; done

cifs_writev() takes the locks in the wrong order, but note that we can't
only flip the order around because it releases the inode lock before the
call to generic_write_sync() while it holds the lock_sem across that
call.

But, AFAICS, there is no need to hold the CifsInodeInfo::lock_sem across
the generic_write_sync() call either, so we can release both the locks
before generic_write_sync(), and change the order.

 ==
 WARNING: possible circular locking dependency detected
 4.12.0-rc7+ #9 Not tainted
 --
 touch/487 is trying to acquire lock:
  (>lock_sem){..}, at: cifsFileInfo_put+0x88f/0x16a0

 but task is already holding lock:
  (>s_type->i_mutex_key#11){+.+.+.}, at: utimes_common+0x3ad/0x870

 which lock already depends on the new lock.

 the existing dependency chain (in reverse order) is:

 -> #1 (>s_type->i_mutex_key#11){+.+.+.}:
__lock_acquire+0x1f74/0x38f0
lock_acquire+0x1cc/0x600
down_write+0x74/0x110
cifs_strict_writev+0x3cb/0x8c0
__vfs_write+0x4c1/0x930
vfs_write+0x14c/0x2d0
SyS_write+0xf7/0x240
entry_SYSCALL_64_fastpath+0x1f/0xbe

 -> #0 (>lock_sem){..}:
check_prevs_add+0xfa0/0x1d10
__lock_acquire+0x1f74/0x38f0
lock_acquire+0x1cc/0x600
down_write+0x74/0x110
cifsFileInfo_put+0x88f/0x16a0
cifs_setattr+0x992/0x1680
notify_change+0x61a/0xa80
utimes_common+0x3d4/0x870
do_utimes+0x1c1/0x220
SyS_utimensat+0x84/0x1a0
entry_SYSCALL_64_fastpath+0x1f/0xbe

 other info that might help us debug this:

  Possible unsafe locking scenario:

CPU0CPU1

   lock(>s_type->i_mutex_key#11);
lock(>lock_sem);
lock(>s_type->i_mutex_key#11);
   lock(>lock_sem);

  *** DEADLOCK ***

 2 locks held by touch/487:
  #0:  (sb_writers#10){.+.+.+}, at: mnt_want_write+0x41/0xb0
  #1:  (>s_type->i_mutex_key#11){+.+.+.}, at: utimes_common+0x3ad/0x870

 stack backtrace:
 CPU: 0 PID: 487 Comm: touch Not tainted 4.12.0-rc7+ #9
 Call Trace:
  dump_stack+0xdb/0x185
  print_circular_bug+0x45b/0x790
  __lock_acquire+0x1f74/0x38f0
  lock_acquire+0x1cc/0x600
  down_write+0x74/0x110
  cifsFileInfo_put+0x88f/0x16a0
  cifs_setattr+0x992/0x1680
  notify_change+0x61a/0xa80
  utimes_common+0x3d4/0x870
  do_utimes+0x1c1/0x220
  SyS_utimensat+0x84/0x1a0
  entry_SYSCALL_64_fastpath+0x1f/0xbe

Fixes: 19dfc1f5f2ef03a52 ("cifs: fix the race in cifs_writev()")
Signed-off-by: Rabin Vincent <rab...@axis.com>
---
 fs/cifs/file.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/cifs/file.c b/fs/cifs/file.c
index fcef706..d16fa55 100644
--- a/fs/cifs/file.c
+++ b/fs/cifs/file.c
@@ -2810,12 +2810,12 @@ cifs_writev(struct kiocb *iocb, struct iov_iter *from)
struct TCP_Server_Info *server = tlink_tcon(cfile->tlink)->ses->server;
ssize_t rc;
 
+   inode_lock(inode);
/*
 * We need to hold the sem to be sure nobody modifies lock list
 * with a brlock that prevents writing.
 */
down_read(>lock_sem);
-   inode_lock(inode);
 
rc = generic_write_checks(iocb, from);
if (rc <= 0)
@@ -2828,11 +2828,11 @@ cifs_writev(struct kiocb *iocb, struct iov_iter *from)
else
rc = -EACCES;
 out:
+   up_read(>lock_sem);
inode_unlock(inode);
 
if (rc > 0)
rc = generic_write_sync(iocb, rc);
-   up_read(>lock_sem);
return rc;
 }
 
-- 
2.1.4

[PATCH] CIFS: fix circular locking dependency

2017-06-29 Thread Rabin Vincent

From: Rabin Vincent 

When a CIFS filesystem is mounted with the forcemand option and the
following command is run on it, lockdep warns about a circular locking
dependency between CifsInodeInfo::lock_sem and the inode lock.

 while echo foo > hello; do :; done & while touch -c hello; do :; done

cifs_writev() takes the locks in the wrong order, but note that we can't
only flip the order around because it releases the inode lock before the
call to generic_write_sync() while it holds the lock_sem across that
call.

But, AFAICS, there is no need to hold the CifsInodeInfo::lock_sem across
the generic_write_sync() call either, so we can release both the locks
before generic_write_sync(), and change the order.

 ==
 WARNING: possible circular locking dependency detected
 4.12.0-rc7+ #9 Not tainted
 --
 touch/487 is trying to acquire lock:
  (>lock_sem){..}, at: cifsFileInfo_put+0x88f/0x16a0

 but task is already holding lock:
  (>s_type->i_mutex_key#11){+.+.+.}, at: utimes_common+0x3ad/0x870

 which lock already depends on the new lock.

 the existing dependency chain (in reverse order) is:

 -> #1 (>s_type->i_mutex_key#11){+.+.+.}:
__lock_acquire+0x1f74/0x38f0
lock_acquire+0x1cc/0x600
down_write+0x74/0x110
cifs_strict_writev+0x3cb/0x8c0
__vfs_write+0x4c1/0x930
vfs_write+0x14c/0x2d0
SyS_write+0xf7/0x240
entry_SYSCALL_64_fastpath+0x1f/0xbe

 -> #0 (>lock_sem){..}:
check_prevs_add+0xfa0/0x1d10
__lock_acquire+0x1f74/0x38f0
lock_acquire+0x1cc/0x600
down_write+0x74/0x110
cifsFileInfo_put+0x88f/0x16a0
cifs_setattr+0x992/0x1680
notify_change+0x61a/0xa80
utimes_common+0x3d4/0x870
do_utimes+0x1c1/0x220
SyS_utimensat+0x84/0x1a0
entry_SYSCALL_64_fastpath+0x1f/0xbe

 other info that might help us debug this:

  Possible unsafe locking scenario:

CPU0CPU1

   lock(>s_type->i_mutex_key#11);
lock(>lock_sem);
lock(>s_type->i_mutex_key#11);
   lock(>lock_sem);

  *** DEADLOCK ***

 2 locks held by touch/487:
  #0:  (sb_writers#10){.+.+.+}, at: mnt_want_write+0x41/0xb0
  #1:  (>s_type->i_mutex_key#11){+.+.+.}, at: utimes_common+0x3ad/0x870

 stack backtrace:
 CPU: 0 PID: 487 Comm: touch Not tainted 4.12.0-rc7+ #9
 Call Trace:
  dump_stack+0xdb/0x185
  print_circular_bug+0x45b/0x790
  __lock_acquire+0x1f74/0x38f0
  lock_acquire+0x1cc/0x600
  down_write+0x74/0x110
  cifsFileInfo_put+0x88f/0x16a0
  cifs_setattr+0x992/0x1680
  notify_change+0x61a/0xa80
  utimes_common+0x3d4/0x870
  do_utimes+0x1c1/0x220
  SyS_utimensat+0x84/0x1a0
  entry_SYSCALL_64_fastpath+0x1f/0xbe

Fixes: 19dfc1f5f2ef03a52 ("cifs: fix the race in cifs_writev()")
Signed-off-by: Rabin Vincent 
---
 fs/cifs/file.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/cifs/file.c b/fs/cifs/file.c
index fcef706..d16fa55 100644
--- a/fs/cifs/file.c
+++ b/fs/cifs/file.c
@@ -2810,12 +2810,12 @@ cifs_writev(struct kiocb *iocb, struct iov_iter *from)
struct TCP_Server_Info *server = tlink_tcon(cfile->tlink)->ses->server;
ssize_t rc;
 
+   inode_lock(inode);
/*
 * We need to hold the sem to be sure nobody modifies lock list
 * with a brlock that prevents writing.
 */
down_read(>lock_sem);
-   inode_lock(inode);
 
rc = generic_write_checks(iocb, from);
if (rc <= 0)
@@ -2828,11 +2828,11 @@ cifs_writev(struct kiocb *iocb, struct iov_iter *from)
else
rc = -EACCES;
 out:
+   up_read(>lock_sem);
inode_unlock(inode);
 
if (rc > 0)
rc = generic_write_sync(iocb, rc);
-   up_read(>lock_sem);
return rc;
 }
 
-- 
2.1.4

Re: [4.9.28] vmscan: shrink_slab: ext4_es_scan+0x0/0x150 negative objects to delete nr=-2147483624

2017-06-12 Thread Rabin Vincent

On Thu, May 18, 2017 at 07:21:49AM +0200, Marc Burkhardt wrote:
> tonight my dmesg was flooded with mesages like 
> 
> vmscan: shrink_slab: ext4_es_scan+0x0/0x150 negative objects to delete 
> nr=-2147483624
> 
> Is that an integer overflow happening in ext4?
> 
> It's the first time I see this message. Any help on how to debug/reprocude 
> this
> are appreciated. Please advice if you want me to investigate this.

I haven't attempted to debug nor reproduce it, but what I can tell you
is that it does not not have anything to with ext4.  I've seen similar
messages with a completely different slab, on 4.9.26:

 [367594.725081] vmscan: shrink_slab: super_cache_scan+0x0/0x19c negative 
objects to delete nr=-2147482285
 [367595.046073] vmscan: shrink_slab: super_cache_scan+0x0/0x19c negative 
objects to delete nr=-2147479427
 [367595.279228] vmscan: shrink_slab: super_cache_scan+0x0/0x19c negative 
objects to delete nr=-2147482317
 [367595.459529] vmscan: shrink_slab: super_cache_scan+0x0/0x19c negative 
objects to delete nr=-2147482353
 [367595.497191] vmscan: shrink_slab: super_cache_scan+0x0/0x19c negative 
objects to delete nr=-2147482386
 [367595.521578] vmscan: shrink_slab: super_cache_scan+0x0/0x19c negative 
objects to delete nr=-2147482413
 [367595.551109] vmscan: shrink_slab: super_cache_scan+0x0/0x19c negative 
objects to delete nr=-2147482501
 [367598.344400] vmscan: shrink_slab: super_cache_scan+0x0/0x19c negative 
objects to delete nr=-2147482458
 [367598.369103] vmscan: shrink_slab: super_cache_scan+0x0/0x19c negative 
objects to delete nr=-2147482493
 [367598.403148] vmscan: shrink_slab: super_cache_scan+0x0/0x19c negative 
objects to delete nr=-2147482521
 [367598.422815] vmscan: shrink_slab: super_cache_scan+0x0/0x19c negative 
objects to delete nr=-2147482611
 [367598.524128] vmscan: shrink_slab: super_cache_scan+0x0/0x19c negative 
objects to delete nr=-2147483238
 [367601.554775] vmscan: shrink_slab: super_cache_scan+0x0/0x19c negative 
objects to delete nr=-2147482245
 [367601.582922] vmscan: shrink_slab: super_cache_scan+0x0/0x19c negative 
objects to delete nr=-2147482279
 [367601.620175] vmscan: shrink_slab: super_cache_scan+0x0/0x19c negative 
objects to delete nr=-2147482307
 [367602.958946] vmscan: shrink_slab: super_cache_scan+0x0/0x19c negative 
objects to delete nr=-2147479516
 [367603.630417] vmscan: shrink_slab: super_cache_scan+0x0/0x19c negative 
objects to delete nr=-2147482412
 [367603.746885] vmscan: shrink_slab: super_cache_scan+0x0/0x19c negative 
objects to delete nr=-2147482512
 [367603.769490] vmscan: shrink_slab: super_cache_scan+0x0/0x19c negative 
objects to delete nr=-2147482217
 [367604.155461] vmscan: shrink_slab: super_cache_scan+0x0/0x19c negative 
objects to delete nr=-2147479940
 [367604.174624] vmscan: shrink_slab: super_cache_scan+0x0/0x19c negative 
objects to delete nr=-2147482635
 [367604.197573] vmscan: shrink_slab: super_cache_scan+0x0/0x19c negative 
objects to delete nr=-2147482595

I don't see any fixes/changes to mm/vmscan.c in newer 4.9-stable kernels
other than these ones which were already merged v4.9.14:

 $ git shortlog v4.9..v4.9.30 -- mm/vmscan.c
 Michal Hocko (3):
   mm, memcg: fix the active list aging for lowmem requests when memcg is 
enabled
   mm, vmscan: cleanup lru size claculations
   mm, vmscan: consider eligible zones in get_scan_count
 
 Shaohua Li (1):
   mm/vmscan.c: set correct defer count for shrinker

Perhaps one of the above people or someone else in linux-mm recognizes this.

Re: [4.9.28] vmscan: shrink_slab: ext4_es_scan+0x0/0x150 negative objects to delete nr=-2147483624

2017-06-12 Thread Rabin Vincent

On Thu, May 18, 2017 at 07:21:49AM +0200, Marc Burkhardt wrote:
> tonight my dmesg was flooded with mesages like 
> 
> vmscan: shrink_slab: ext4_es_scan+0x0/0x150 negative objects to delete 
> nr=-2147483624
> 
> Is that an integer overflow happening in ext4?
> 
> It's the first time I see this message. Any help on how to debug/reprocude 
> this
> are appreciated. Please advice if you want me to investigate this.

I haven't attempted to debug nor reproduce it, but what I can tell you
is that it does not not have anything to with ext4.  I've seen similar
messages with a completely different slab, on 4.9.26:

 [367594.725081] vmscan: shrink_slab: super_cache_scan+0x0/0x19c negative 
objects to delete nr=-2147482285
 [367595.046073] vmscan: shrink_slab: super_cache_scan+0x0/0x19c negative 
objects to delete nr=-2147479427
 [367595.279228] vmscan: shrink_slab: super_cache_scan+0x0/0x19c negative 
objects to delete nr=-2147482317
 [367595.459529] vmscan: shrink_slab: super_cache_scan+0x0/0x19c negative 
objects to delete nr=-2147482353
 [367595.497191] vmscan: shrink_slab: super_cache_scan+0x0/0x19c negative 
objects to delete nr=-2147482386
 [367595.521578] vmscan: shrink_slab: super_cache_scan+0x0/0x19c negative 
objects to delete nr=-2147482413
 [367595.551109] vmscan: shrink_slab: super_cache_scan+0x0/0x19c negative 
objects to delete nr=-2147482501
 [367598.344400] vmscan: shrink_slab: super_cache_scan+0x0/0x19c negative 
objects to delete nr=-2147482458
 [367598.369103] vmscan: shrink_slab: super_cache_scan+0x0/0x19c negative 
objects to delete nr=-2147482493
 [367598.403148] vmscan: shrink_slab: super_cache_scan+0x0/0x19c negative 
objects to delete nr=-2147482521
 [367598.422815] vmscan: shrink_slab: super_cache_scan+0x0/0x19c negative 
objects to delete nr=-2147482611
 [367598.524128] vmscan: shrink_slab: super_cache_scan+0x0/0x19c negative 
objects to delete nr=-2147483238
 [367601.554775] vmscan: shrink_slab: super_cache_scan+0x0/0x19c negative 
objects to delete nr=-2147482245
 [367601.582922] vmscan: shrink_slab: super_cache_scan+0x0/0x19c negative 
objects to delete nr=-2147482279
 [367601.620175] vmscan: shrink_slab: super_cache_scan+0x0/0x19c negative 
objects to delete nr=-2147482307
 [367602.958946] vmscan: shrink_slab: super_cache_scan+0x0/0x19c negative 
objects to delete nr=-2147479516
 [367603.630417] vmscan: shrink_slab: super_cache_scan+0x0/0x19c negative 
objects to delete nr=-2147482412
 [367603.746885] vmscan: shrink_slab: super_cache_scan+0x0/0x19c negative 
objects to delete nr=-2147482512
 [367603.769490] vmscan: shrink_slab: super_cache_scan+0x0/0x19c negative 
objects to delete nr=-2147482217
 [367604.155461] vmscan: shrink_slab: super_cache_scan+0x0/0x19c negative 
objects to delete nr=-2147479940
 [367604.174624] vmscan: shrink_slab: super_cache_scan+0x0/0x19c negative 
objects to delete nr=-2147482635
 [367604.197573] vmscan: shrink_slab: super_cache_scan+0x0/0x19c negative 
objects to delete nr=-2147482595

I don't see any fixes/changes to mm/vmscan.c in newer 4.9-stable kernels
other than these ones which were already merged v4.9.14:

 $ git shortlog v4.9..v4.9.30 -- mm/vmscan.c
 Michal Hocko (3):
   mm, memcg: fix the active list aging for lowmem requests when memcg is 
enabled
   mm, vmscan: cleanup lru size claculations
   mm, vmscan: consider eligible zones in get_scan_count
 
 Shaohua Li (1):
   mm/vmscan.c: set correct defer count for shrinker

Perhaps one of the above people or someone else in linux-mm recognizes this.

[PATCH v4] ubifs: allow userspace to map mounts to volumes

2017-05-31 Thread Rabin Vincent

From: Rabin Vincent <rab...@axis.com>

There currently appears to be no way for userspace to find out the
underlying volume number for a mounted ubifs file system, since ubifs
uses anonymous block devices.  The volume name is present in
/proc/mounts but UBI volumes can be renamed after the volume has been
mounted.

To remedy this, show the UBI number and UBI volume number as part of the
options visible under /proc/mounts.

Also, accept and ignore the ubi= vol= options if they are used mounting
(patch from Richard Weinberger).

 # mount -t ubifs ubi:baz x
 # mount
 ubi:baz on /root/x type ubifs (rw,relatime,ubi=0,vol=2)
 # ubirename /dev/ubi0 baz bazz
 # mount
 ubi:baz on /root/x type ubifs (rw,relatime,ubi=0,vol=2)
 # ubinfo -d 0 -n 2
 Volume ID:   2 (on ubi0)
 Type:dynamic
 Alignment:   1
 Size:67 LEBs (1063424 bytes, 1.0 MiB)
 State:   OK
 Name:bazz
 Character device major/minor: 254:3

Signed-off-by: Rabin Vincent <rab...@axis.com>
---
v4 - ignore options in mount

 fs/ubifs/super.c | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/fs/ubifs/super.c b/fs/ubifs/super.c
index cf4cc99..50cebbe 100644
--- a/fs/ubifs/super.c
+++ b/fs/ubifs/super.c
@@ -446,6 +446,8 @@ static int ubifs_show_options(struct seq_file *s, struct 
dentry *root)
   ubifs_compr_name(c->mount_opts.compr_type));
}
 
+   seq_printf(s, ",ubi=%d,vol=%d", c->vi.ubi_num, c->vi.vol_id);
+
return 0;
 }
 
@@ -931,6 +933,7 @@ enum {
Opt_chk_data_crc,
Opt_no_chk_data_crc,
Opt_override_compr,
+   Opt_ignore,
Opt_err,
 };
 
@@ -942,6 +945,8 @@ static const match_table_t tokens = {
{Opt_chk_data_crc, "chk_data_crc"},
{Opt_no_chk_data_crc, "no_chk_data_crc"},
{Opt_override_compr, "compr=%s"},
+   {Opt_ignore, "ubi=%s"},
+   {Opt_ignore, "vol=%s"},
{Opt_err, NULL},
 };
 
@@ -1042,6 +1047,8 @@ static int ubifs_parse_options(struct ubifs_info *c, char 
*options,
c->default_compr = c->mount_opts.compr_type;
break;
}
+   case Opt_ignore:
+   break;
default:
{
unsigned long flag;
-- 
2.1.4

[PATCH v4] ubifs: allow userspace to map mounts to volumes

2017-05-31 Thread Rabin Vincent

From: Rabin Vincent 

There currently appears to be no way for userspace to find out the
underlying volume number for a mounted ubifs file system, since ubifs
uses anonymous block devices.  The volume name is present in
/proc/mounts but UBI volumes can be renamed after the volume has been
mounted.

To remedy this, show the UBI number and UBI volume number as part of the
options visible under /proc/mounts.

Also, accept and ignore the ubi= vol= options if they are used mounting
(patch from Richard Weinberger).

 # mount -t ubifs ubi:baz x
 # mount
 ubi:baz on /root/x type ubifs (rw,relatime,ubi=0,vol=2)
 # ubirename /dev/ubi0 baz bazz
 # mount
 ubi:baz on /root/x type ubifs (rw,relatime,ubi=0,vol=2)
 # ubinfo -d 0 -n 2
 Volume ID:   2 (on ubi0)
 Type:dynamic
 Alignment:   1
 Size:67 LEBs (1063424 bytes, 1.0 MiB)
 State:   OK
 Name:bazz
 Character device major/minor: 254:3

Signed-off-by: Rabin Vincent 
---
v4 - ignore options in mount

 fs/ubifs/super.c | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/fs/ubifs/super.c b/fs/ubifs/super.c
index cf4cc99..50cebbe 100644
--- a/fs/ubifs/super.c
+++ b/fs/ubifs/super.c
@@ -446,6 +446,8 @@ static int ubifs_show_options(struct seq_file *s, struct 
dentry *root)
   ubifs_compr_name(c->mount_opts.compr_type));
}
 
+   seq_printf(s, ",ubi=%d,vol=%d", c->vi.ubi_num, c->vi.vol_id);
+
return 0;
 }
 
@@ -931,6 +933,7 @@ enum {
Opt_chk_data_crc,
Opt_no_chk_data_crc,
Opt_override_compr,
+   Opt_ignore,
Opt_err,
 };
 
@@ -942,6 +945,8 @@ static const match_table_t tokens = {
{Opt_chk_data_crc, "chk_data_crc"},
{Opt_no_chk_data_crc, "no_chk_data_crc"},
{Opt_override_compr, "compr=%s"},
+   {Opt_ignore, "ubi=%s"},
+   {Opt_ignore, "vol=%s"},
{Opt_err, NULL},
 };
 
@@ -1042,6 +1047,8 @@ static int ubifs_parse_options(struct ubifs_info *c, char 
*options,
c->default_compr = c->mount_opts.compr_type;
break;
}
+   case Opt_ignore:
+   break;
default:
{
unsigned long flag;
-- 
2.1.4

Re: Setting ->s_dev to a char device (Was: Re: [PATCH v2] ubifs: allow userspace to map mounts to volumes)

2017-05-29 Thread Rabin Vincent

On Mon, May 29, 2017 at 01:08:25PM +0100, Al Viro wrote:
> Userspace sure as hell does.  st_dev in stat(2) is a block device number;
> moreover, there might _be_ a block device with the same number at the same
> time - even mounted.  Why not make ->show_options() print the currently
> valid volume name, anyway?  That would seem to be the obvious approach...

The following patch works for me, if it's OK to show options that can't
actually be used when mounting:

8<---
>From 19797334f9c87a2e0c90fe2c93f29cce397b6230 Mon Sep 17 00:00:00 2001
From: Rabin Vincent <rab...@axis.com>
Date: Mon, 29 May 2017 16:08:44 +0200
Subject: [PATCH v3] ubifs: allow userspace to map mounts to volumes

There currently appears to be no way for userspace to find out the
underlying volume number for a mounted ubifs file system, since ubifs
uses anonymous block devices.  The volume name is present in
/proc/mounts but UBI volumes can be renamed after the volume has been
mounted.

To remedy this, show the UBI number and UBI volume number as part of the
options visible under /proc/mounts.

 # mount -t ubifs ubi:baz x
 # mount
 ubi:baz on /root/x type ubifs (rw,relatime,ubi=0,vol=2)
 # ubirename /dev/ubi0 baz bazz
 # mount
 ubi:baz on /root/x type ubifs (rw,relatime,ubi=0,vol=2)
 # ubinfo -d 0 -n 2
 Volume ID:   2 (on ubi0)
 Type:dynamic
 Alignment:   1
 Size:67 LEBs (1063424 bytes, 1.0 MiB)
 State:   OK
 Name:bazz
 Character device major/minor: 254:3

Signed-off-by: Rabin Vincent <rab...@axis.com>
---
 fs/ubifs/super.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/fs/ubifs/super.c b/fs/ubifs/super.c
index cf4cc99..4b54186 100644
--- a/fs/ubifs/super.c
+++ b/fs/ubifs/super.c
@@ -446,6 +446,8 @@ static int ubifs_show_options(struct seq_file *s, struct 
dentry *root)
   ubifs_compr_name(c->mount_opts.compr_type));
}
 
+   seq_printf(s, ",ubi=%d,vol=%d", c->vi.ubi_num, c->vi.vol_id);
+
return 0;
 }
 
-- 
2.1.4

Re: Setting ->s_dev to a char device (Was: Re: [PATCH v2] ubifs: allow userspace to map mounts to volumes)

2017-05-29 Thread Rabin Vincent

On Mon, May 29, 2017 at 01:08:25PM +0100, Al Viro wrote:
> Userspace sure as hell does.  st_dev in stat(2) is a block device number;
> moreover, there might _be_ a block device with the same number at the same
> time - even mounted.  Why not make ->show_options() print the currently
> valid volume name, anyway?  That would seem to be the obvious approach...

The following patch works for me, if it's OK to show options that can't
actually be used when mounting:

8<---
>From 19797334f9c87a2e0c90fe2c93f29cce397b6230 Mon Sep 17 00:00:00 2001
From: Rabin Vincent 
Date: Mon, 29 May 2017 16:08:44 +0200
Subject: [PATCH v3] ubifs: allow userspace to map mounts to volumes

There currently appears to be no way for userspace to find out the
underlying volume number for a mounted ubifs file system, since ubifs
uses anonymous block devices.  The volume name is present in
/proc/mounts but UBI volumes can be renamed after the volume has been
mounted.

To remedy this, show the UBI number and UBI volume number as part of the
options visible under /proc/mounts.

 # mount -t ubifs ubi:baz x
 # mount
 ubi:baz on /root/x type ubifs (rw,relatime,ubi=0,vol=2)
 # ubirename /dev/ubi0 baz bazz
 # mount
 ubi:baz on /root/x type ubifs (rw,relatime,ubi=0,vol=2)
 # ubinfo -d 0 -n 2
 Volume ID:   2 (on ubi0)
 Type:dynamic
 Alignment:   1
 Size:67 LEBs (1063424 bytes, 1.0 MiB)
 State:   OK
 Name:bazz
 Character device major/minor: 254:3

Signed-off-by: Rabin Vincent 
---
 fs/ubifs/super.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/fs/ubifs/super.c b/fs/ubifs/super.c
index cf4cc99..4b54186 100644
--- a/fs/ubifs/super.c
+++ b/fs/ubifs/super.c
@@ -446,6 +446,8 @@ static int ubifs_show_options(struct seq_file *s, struct 
dentry *root)
   ubifs_compr_name(c->mount_opts.compr_type));
}
 
+   seq_printf(s, ",ubi=%d,vol=%d", c->vi.ubi_num, c->vi.vol_id);
+
return 0;
 }
 
-- 
2.1.4

[PATCH v2] ubifs: allow userspace to map mounts to volumes

2017-05-29 Thread Rabin Vincent

From: Rabin Vincent <rab...@axis.com>

There currently appears to be no way for userspace to find out the
underlying volume number for a mounted ubifs file system, since ubifs
uses anonymous block devices.  The volume name is present in
/proc/mounts but UBI volumes can be renamed after the volume has been
mounted.

To remedy this, provide a directory in /sys/fs/ubifs named after the
underlying anonymous block device's number (obtainable by userspace via
stat(2)) and provide a link named "ubi" to the underlying UBI volume.

 # mount | head -1
 ubi0:rootfs on / type ubifs (rw,relatime)
 # ubirename /dev/ubi0 rootfs foo
 # mount | head -1
 ubi0:rootfs on / type ubifs (rw,relatime)
 # stat /
   File: /
   Size: 1520   Blocks: 0  IO Block: 4096   directory
 Device: dh/13d Inode: 1   Links: 18
 ...
 # ls -l /sys/fs/ubifs/
 drwxr-xr-x2 root root 0 May 23 09:57 0:13
 drwxr-xr-x2 root root 0 May 23 09:57 0:17
 # ls -l /sys/fs/ubifs/0\:13/
 lrwxrwxrwx1 root root 0 May 23 11:45 ubi
   -> ../../../devices/virtual/ubi/ubi0/ubi0_10

Signed-off-by: Rabin Vincent <rab...@axis.com>
---
v2: export symbol to fix module build

 Documentation/ABI/testing/sysfs-fs-ubifs |  6 +++
 drivers/mtd/ubi/kapi.c   | 13 +++
 fs/ubifs/Makefile|  2 +-
 fs/ubifs/super.c | 16 +++-
 fs/ubifs/sysfs.c | 66 
 fs/ubifs/sysfs.h | 11 ++
 fs/ubifs/ubifs.h |  7 
 include/linux/mtd/ubi.h  |  3 ++
 8 files changed, 122 insertions(+), 2 deletions(-)
 create mode 100644 Documentation/ABI/testing/sysfs-fs-ubifs
 create mode 100644 fs/ubifs/sysfs.c
 create mode 100644 fs/ubifs/sysfs.h

diff --git a/Documentation/ABI/testing/sysfs-fs-ubifs 
b/Documentation/ABI/testing/sysfs-fs-ubifs
new file mode 100644
index 000..1735859
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-fs-ubifs
@@ -0,0 +1,6 @@
+What:  /sys/fs/ubifs//ubi
+Date:  May 2017
+Description:
+   This symbolic link points to the file system's underlying UBI
+   volume.  The  is the major:minor numbers of the anonymous
+   block device backing the file system.
diff --git a/drivers/mtd/ubi/kapi.c b/drivers/mtd/ubi/kapi.c
index d4b2e87..395dae6 100644
--- a/drivers/mtd/ubi/kapi.c
+++ b/drivers/mtd/ubi/kapi.c
@@ -107,6 +107,19 @@ void ubi_get_volume_info(struct ubi_volume_desc *desc,
 EXPORT_SYMBOL_GPL(ubi_get_volume_info);
 
 /**
+ * ubi_volume_kobject - get kobject for a UBI volume.
+ * @desc: volume descriptor
+ *
+ * Retrieves a pointer to the struct kobject underlying the UBI volume.
+ * The caller must hold a reference to the UBI volume.
+ */
+struct kobject *ubi_volume_kobj(struct ubi_volume_desc *desc)
+{
+   return >vol->dev.kobj;
+}
+EXPORT_SYMBOL_GPL(ubi_volume_kobj);
+
+/**
  * ubi_open_volume - open UBI volume.
  * @ubi_num: UBI device number
  * @vol_id: volume ID
diff --git a/fs/ubifs/Makefile b/fs/ubifs/Makefile
index 6f3251c..7bf4689 100644
--- a/fs/ubifs/Makefile
+++ b/fs/ubifs/Makefile
@@ -4,5 +4,5 @@ ubifs-y += shrinker.o journal.o file.o dir.o super.o sb.o io.o
 ubifs-y += tnc.o master.o scan.o replay.o log.o commit.o gc.o orphan.o
 ubifs-y += budget.o find.o tnc_commit.o compress.o lpt.o lprops.o
 ubifs-y += recovery.o ioctl.o lpt_commit.o tnc_misc.o xattr.o debug.o
-ubifs-y += misc.o
+ubifs-y += sysfs.o misc.o
 ubifs-$(CONFIG_UBIFS_FS_ENCRYPTION) += crypto.o
diff --git a/fs/ubifs/super.c b/fs/ubifs/super.c
index cf4cc99..fdcfefe 100644
--- a/fs/ubifs/super.c
+++ b/fs/ubifs/super.c
@@ -1164,6 +1164,10 @@ static int mount_ubifs(struct ubifs_info *c)
if (err)
return err;
 
+   err = ubifs_sysfs_register(c);
+   if (err)
+   goto out_debugging;
+
err = check_volume_empty(c);
if (err)
goto out_free;
@@ -1496,6 +1500,8 @@ static int mount_ubifs(struct ubifs_info *c)
vfree(c->ileb_buf);
vfree(c->sbuf);
kfree(c->bottom_up_buf);
+   ubifs_sysfs_unregister(c);
+out_debugging:
ubifs_debugging_exit(c);
return err;
 }
@@ -1536,6 +1542,7 @@ static void ubifs_umount(struct ubifs_info *c)
vfree(c->sbuf);
kfree(c->bottom_up_buf);
ubifs_debugging_exit(c);
+   ubifs_sysfs_unregister(c);
 }
 
 /**
@@ -2271,14 +2278,20 @@ static int __init ubifs_init(void)
if (err)
goto out_compr;
 
+   err = ubifs_sysfs_init();
+   if (err)
+   goto out_dbg;
+
err = register_filesystem(_fs_type);
if (err) {
pr_err("UBIFS error (pid %d): cannot register file system, 
error %d",
   current->pid, err);
-   goto out_dbg;
+   goto out_sysfs;

[PATCH v2] ubifs: allow userspace to map mounts to volumes

2017-05-29 Thread Rabin Vincent

From: Rabin Vincent 

There currently appears to be no way for userspace to find out the
underlying volume number for a mounted ubifs file system, since ubifs
uses anonymous block devices.  The volume name is present in
/proc/mounts but UBI volumes can be renamed after the volume has been
mounted.

To remedy this, provide a directory in /sys/fs/ubifs named after the
underlying anonymous block device's number (obtainable by userspace via
stat(2)) and provide a link named "ubi" to the underlying UBI volume.

 # mount | head -1
 ubi0:rootfs on / type ubifs (rw,relatime)
 # ubirename /dev/ubi0 rootfs foo
 # mount | head -1
 ubi0:rootfs on / type ubifs (rw,relatime)
 # stat /
   File: /
   Size: 1520   Blocks: 0  IO Block: 4096   directory
 Device: dh/13d Inode: 1   Links: 18
 ...
 # ls -l /sys/fs/ubifs/
 drwxr-xr-x2 root root 0 May 23 09:57 0:13
 drwxr-xr-x2 root root 0 May 23 09:57 0:17
 # ls -l /sys/fs/ubifs/0\:13/
 lrwxrwxrwx1 root root 0 May 23 11:45 ubi
   -> ../../../devices/virtual/ubi/ubi0/ubi0_10

Signed-off-by: Rabin Vincent 
---
v2: export symbol to fix module build

 Documentation/ABI/testing/sysfs-fs-ubifs |  6 +++
 drivers/mtd/ubi/kapi.c   | 13 +++
 fs/ubifs/Makefile|  2 +-
 fs/ubifs/super.c | 16 +++-
 fs/ubifs/sysfs.c | 66 
 fs/ubifs/sysfs.h | 11 ++
 fs/ubifs/ubifs.h |  7 
 include/linux/mtd/ubi.h  |  3 ++
 8 files changed, 122 insertions(+), 2 deletions(-)
 create mode 100644 Documentation/ABI/testing/sysfs-fs-ubifs
 create mode 100644 fs/ubifs/sysfs.c
 create mode 100644 fs/ubifs/sysfs.h

diff --git a/Documentation/ABI/testing/sysfs-fs-ubifs 
b/Documentation/ABI/testing/sysfs-fs-ubifs
new file mode 100644
index 000..1735859
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-fs-ubifs
@@ -0,0 +1,6 @@
+What:  /sys/fs/ubifs//ubi
+Date:  May 2017
+Description:
+   This symbolic link points to the file system's underlying UBI
+   volume.  The  is the major:minor numbers of the anonymous
+   block device backing the file system.
diff --git a/drivers/mtd/ubi/kapi.c b/drivers/mtd/ubi/kapi.c
index d4b2e87..395dae6 100644
--- a/drivers/mtd/ubi/kapi.c
+++ b/drivers/mtd/ubi/kapi.c
@@ -107,6 +107,19 @@ void ubi_get_volume_info(struct ubi_volume_desc *desc,
 EXPORT_SYMBOL_GPL(ubi_get_volume_info);
 
 /**
+ * ubi_volume_kobject - get kobject for a UBI volume.
+ * @desc: volume descriptor
+ *
+ * Retrieves a pointer to the struct kobject underlying the UBI volume.
+ * The caller must hold a reference to the UBI volume.
+ */
+struct kobject *ubi_volume_kobj(struct ubi_volume_desc *desc)
+{
+   return >vol->dev.kobj;
+}
+EXPORT_SYMBOL_GPL(ubi_volume_kobj);
+
+/**
  * ubi_open_volume - open UBI volume.
  * @ubi_num: UBI device number
  * @vol_id: volume ID
diff --git a/fs/ubifs/Makefile b/fs/ubifs/Makefile
index 6f3251c..7bf4689 100644
--- a/fs/ubifs/Makefile
+++ b/fs/ubifs/Makefile
@@ -4,5 +4,5 @@ ubifs-y += shrinker.o journal.o file.o dir.o super.o sb.o io.o
 ubifs-y += tnc.o master.o scan.o replay.o log.o commit.o gc.o orphan.o
 ubifs-y += budget.o find.o tnc_commit.o compress.o lpt.o lprops.o
 ubifs-y += recovery.o ioctl.o lpt_commit.o tnc_misc.o xattr.o debug.o
-ubifs-y += misc.o
+ubifs-y += sysfs.o misc.o
 ubifs-$(CONFIG_UBIFS_FS_ENCRYPTION) += crypto.o
diff --git a/fs/ubifs/super.c b/fs/ubifs/super.c
index cf4cc99..fdcfefe 100644
--- a/fs/ubifs/super.c
+++ b/fs/ubifs/super.c
@@ -1164,6 +1164,10 @@ static int mount_ubifs(struct ubifs_info *c)
if (err)
return err;
 
+   err = ubifs_sysfs_register(c);
+   if (err)
+   goto out_debugging;
+
err = check_volume_empty(c);
if (err)
goto out_free;
@@ -1496,6 +1500,8 @@ static int mount_ubifs(struct ubifs_info *c)
vfree(c->ileb_buf);
vfree(c->sbuf);
kfree(c->bottom_up_buf);
+   ubifs_sysfs_unregister(c);
+out_debugging:
ubifs_debugging_exit(c);
return err;
 }
@@ -1536,6 +1542,7 @@ static void ubifs_umount(struct ubifs_info *c)
vfree(c->sbuf);
kfree(c->bottom_up_buf);
ubifs_debugging_exit(c);
+   ubifs_sysfs_unregister(c);
 }
 
 /**
@@ -2271,14 +2278,20 @@ static int __init ubifs_init(void)
if (err)
goto out_compr;
 
+   err = ubifs_sysfs_init();
+   if (err)
+   goto out_dbg;
+
err = register_filesystem(_fs_type);
if (err) {
pr_err("UBIFS error (pid %d): cannot register file system, 
error %d",
   current->pid, err);
-   goto out_dbg;
+   goto out_sysfs;
}
return 0;
 
+out_sysfs:
+   ubifs_s

[PATCH] ubifs: allow userspace to map mounts to volumes

2017-05-24 Thread Rabin Vincent

From: Rabin Vincent <rab...@axis.com>

There currently appears to be no way for userspace to find out the
underlying volume number for a mounted ubifs file system, since ubifs
uses anonymous block devices.  The volume name is present in
/proc/mounts but UBI volumes can be renamed after the volume has been
mounted.

To remedy this, provide a directory in /sys/fs/ubifs named after the
underlying anonymous block device's number (obtainable by userspace via
stat(2)) and provide a link named "ubi" to the underlying UBI volume.

 # mount | head -1
 ubi0:rootfs on / type ubifs (rw,relatime)
 # ubirename /dev/ubi0 rootfs foo
 # mount | head -1
 ubi0:rootfs on / type ubifs (rw,relatime)
 # stat /
   File: /
   Size: 1520   Blocks: 0  IO Block: 4096   directory
 Device: dh/13d Inode: 1   Links: 18
 ...
 # ls -l /sys/fs/ubifs/
 drwxr-xr-x2 root root 0 May 23 09:57 0:13
 drwxr-xr-x2 root root 0 May 23 09:57 0:17
 # ls -l /sys/fs/ubifs/0\:13/
 lrwxrwxrwx1 root root 0 May 23 11:45 ubi
   -> ../../../devices/virtual/ubi/ubi0/ubi0_10

Signed-off-by: Rabin Vincent <rab...@axis.com>
---
 Documentation/ABI/testing/sysfs-fs-ubifs |  6 +++
 drivers/mtd/ubi/kapi.c   | 12 ++
 fs/ubifs/Makefile|  2 +-
 fs/ubifs/super.c | 16 +++-
 fs/ubifs/sysfs.c | 66 
 fs/ubifs/sysfs.h | 11 ++
 fs/ubifs/ubifs.h |  7 
 include/linux/mtd/ubi.h  |  3 ++
 8 files changed, 121 insertions(+), 2 deletions(-)
 create mode 100644 Documentation/ABI/testing/sysfs-fs-ubifs
 create mode 100644 fs/ubifs/sysfs.c
 create mode 100644 fs/ubifs/sysfs.h

diff --git a/Documentation/ABI/testing/sysfs-fs-ubifs 
b/Documentation/ABI/testing/sysfs-fs-ubifs
new file mode 100644
index 000..1735859
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-fs-ubifs
@@ -0,0 +1,6 @@
+What:  /sys/fs/ubifs//ubi
+Date:  May 2017
+Description:
+   This symbolic link points to the file system's underlying UBI
+   volume.  The  is the major:minor numbers of the anonymous
+   block device backing the file system.
diff --git a/drivers/mtd/ubi/kapi.c b/drivers/mtd/ubi/kapi.c
index d4b2e87..aa766d9 100644
--- a/drivers/mtd/ubi/kapi.c
+++ b/drivers/mtd/ubi/kapi.c
@@ -107,6 +107,18 @@ void ubi_get_volume_info(struct ubi_volume_desc *desc,
 EXPORT_SYMBOL_GPL(ubi_get_volume_info);
 
 /**
+ * ubi_volume_kobject - get kobject for a UBI volume.
+ * @desc: volume descriptor
+ *
+ * Retrieves a pointer to the struct kobject underlying the UBI volume.
+ * The caller must hold a reference to the UBI volume.
+ */
+struct kobject *ubi_volume_kobj(struct ubi_volume_desc *desc)
+{
+   return >vol->dev.kobj;
+}
+
+/**
  * ubi_open_volume - open UBI volume.
  * @ubi_num: UBI device number
  * @vol_id: volume ID
diff --git a/fs/ubifs/Makefile b/fs/ubifs/Makefile
index 6f3251c..7bf4689 100644
--- a/fs/ubifs/Makefile
+++ b/fs/ubifs/Makefile
@@ -4,5 +4,5 @@ ubifs-y += shrinker.o journal.o file.o dir.o super.o sb.o io.o
 ubifs-y += tnc.o master.o scan.o replay.o log.o commit.o gc.o orphan.o
 ubifs-y += budget.o find.o tnc_commit.o compress.o lpt.o lprops.o
 ubifs-y += recovery.o ioctl.o lpt_commit.o tnc_misc.o xattr.o debug.o
-ubifs-y += misc.o
+ubifs-y += sysfs.o misc.o
 ubifs-$(CONFIG_UBIFS_FS_ENCRYPTION) += crypto.o
diff --git a/fs/ubifs/super.c b/fs/ubifs/super.c
index cf4cc99..fdcfefe 100644
--- a/fs/ubifs/super.c
+++ b/fs/ubifs/super.c
@@ -1164,6 +1164,10 @@ static int mount_ubifs(struct ubifs_info *c)
if (err)
return err;
 
+   err = ubifs_sysfs_register(c);
+   if (err)
+   goto out_debugging;
+
err = check_volume_empty(c);
if (err)
goto out_free;
@@ -1496,6 +1500,8 @@ static int mount_ubifs(struct ubifs_info *c)
vfree(c->ileb_buf);
vfree(c->sbuf);
kfree(c->bottom_up_buf);
+   ubifs_sysfs_unregister(c);
+out_debugging:
ubifs_debugging_exit(c);
return err;
 }
@@ -1536,6 +1542,7 @@ static void ubifs_umount(struct ubifs_info *c)
vfree(c->sbuf);
kfree(c->bottom_up_buf);
ubifs_debugging_exit(c);
+   ubifs_sysfs_unregister(c);
 }
 
 /**
@@ -2271,14 +2278,20 @@ static int __init ubifs_init(void)
if (err)
goto out_compr;
 
+   err = ubifs_sysfs_init();
+   if (err)
+   goto out_dbg;
+
err = register_filesystem(_fs_type);
if (err) {
pr_err("UBIFS error (pid %d): cannot register file system, 
error %d",
   current->pid, err);
-   goto out_dbg;
+   goto out_sysfs;
}
return 0;
 
+out_sysfs:
+   ubifs_sysfs_exit();
 out_dbg:
dbg_de

[PATCH] ubifs: allow userspace to map mounts to volumes

2017-05-24 Thread Rabin Vincent

From: Rabin Vincent 

There currently appears to be no way for userspace to find out the
underlying volume number for a mounted ubifs file system, since ubifs
uses anonymous block devices.  The volume name is present in
/proc/mounts but UBI volumes can be renamed after the volume has been
mounted.

To remedy this, provide a directory in /sys/fs/ubifs named after the
underlying anonymous block device's number (obtainable by userspace via
stat(2)) and provide a link named "ubi" to the underlying UBI volume.

 # mount | head -1
 ubi0:rootfs on / type ubifs (rw,relatime)
 # ubirename /dev/ubi0 rootfs foo
 # mount | head -1
 ubi0:rootfs on / type ubifs (rw,relatime)
 # stat /
   File: /
   Size: 1520   Blocks: 0  IO Block: 4096   directory
 Device: dh/13d Inode: 1   Links: 18
 ...
 # ls -l /sys/fs/ubifs/
 drwxr-xr-x2 root root 0 May 23 09:57 0:13
 drwxr-xr-x2 root root 0 May 23 09:57 0:17
 # ls -l /sys/fs/ubifs/0\:13/
 lrwxrwxrwx1 root root 0 May 23 11:45 ubi
   -> ../../../devices/virtual/ubi/ubi0/ubi0_10

Signed-off-by: Rabin Vincent 
---
 Documentation/ABI/testing/sysfs-fs-ubifs |  6 +++
 drivers/mtd/ubi/kapi.c   | 12 ++
 fs/ubifs/Makefile|  2 +-
 fs/ubifs/super.c | 16 +++-
 fs/ubifs/sysfs.c | 66 
 fs/ubifs/sysfs.h | 11 ++
 fs/ubifs/ubifs.h |  7 
 include/linux/mtd/ubi.h  |  3 ++
 8 files changed, 121 insertions(+), 2 deletions(-)
 create mode 100644 Documentation/ABI/testing/sysfs-fs-ubifs
 create mode 100644 fs/ubifs/sysfs.c
 create mode 100644 fs/ubifs/sysfs.h

diff --git a/Documentation/ABI/testing/sysfs-fs-ubifs 
b/Documentation/ABI/testing/sysfs-fs-ubifs
new file mode 100644
index 000..1735859
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-fs-ubifs
@@ -0,0 +1,6 @@
+What:  /sys/fs/ubifs//ubi
+Date:  May 2017
+Description:
+   This symbolic link points to the file system's underlying UBI
+   volume.  The  is the major:minor numbers of the anonymous
+   block device backing the file system.
diff --git a/drivers/mtd/ubi/kapi.c b/drivers/mtd/ubi/kapi.c
index d4b2e87..aa766d9 100644
--- a/drivers/mtd/ubi/kapi.c
+++ b/drivers/mtd/ubi/kapi.c
@@ -107,6 +107,18 @@ void ubi_get_volume_info(struct ubi_volume_desc *desc,
 EXPORT_SYMBOL_GPL(ubi_get_volume_info);
 
 /**
+ * ubi_volume_kobject - get kobject for a UBI volume.
+ * @desc: volume descriptor
+ *
+ * Retrieves a pointer to the struct kobject underlying the UBI volume.
+ * The caller must hold a reference to the UBI volume.
+ */
+struct kobject *ubi_volume_kobj(struct ubi_volume_desc *desc)
+{
+   return >vol->dev.kobj;
+}
+
+/**
  * ubi_open_volume - open UBI volume.
  * @ubi_num: UBI device number
  * @vol_id: volume ID
diff --git a/fs/ubifs/Makefile b/fs/ubifs/Makefile
index 6f3251c..7bf4689 100644
--- a/fs/ubifs/Makefile
+++ b/fs/ubifs/Makefile
@@ -4,5 +4,5 @@ ubifs-y += shrinker.o journal.o file.o dir.o super.o sb.o io.o
 ubifs-y += tnc.o master.o scan.o replay.o log.o commit.o gc.o orphan.o
 ubifs-y += budget.o find.o tnc_commit.o compress.o lpt.o lprops.o
 ubifs-y += recovery.o ioctl.o lpt_commit.o tnc_misc.o xattr.o debug.o
-ubifs-y += misc.o
+ubifs-y += sysfs.o misc.o
 ubifs-$(CONFIG_UBIFS_FS_ENCRYPTION) += crypto.o
diff --git a/fs/ubifs/super.c b/fs/ubifs/super.c
index cf4cc99..fdcfefe 100644
--- a/fs/ubifs/super.c
+++ b/fs/ubifs/super.c
@@ -1164,6 +1164,10 @@ static int mount_ubifs(struct ubifs_info *c)
if (err)
return err;
 
+   err = ubifs_sysfs_register(c);
+   if (err)
+   goto out_debugging;
+
err = check_volume_empty(c);
if (err)
goto out_free;
@@ -1496,6 +1500,8 @@ static int mount_ubifs(struct ubifs_info *c)
vfree(c->ileb_buf);
vfree(c->sbuf);
kfree(c->bottom_up_buf);
+   ubifs_sysfs_unregister(c);
+out_debugging:
ubifs_debugging_exit(c);
return err;
 }
@@ -1536,6 +1542,7 @@ static void ubifs_umount(struct ubifs_info *c)
vfree(c->sbuf);
kfree(c->bottom_up_buf);
ubifs_debugging_exit(c);
+   ubifs_sysfs_unregister(c);
 }
 
 /**
@@ -2271,14 +2278,20 @@ static int __init ubifs_init(void)
if (err)
goto out_compr;
 
+   err = ubifs_sysfs_init();
+   if (err)
+   goto out_dbg;
+
err = register_filesystem(_fs_type);
if (err) {
pr_err("UBIFS error (pid %d): cannot register file system, 
error %d",
   current->pid, err);
-   goto out_dbg;
+   goto out_sysfs;
}
return 0;
 
+out_sysfs:
+   ubifs_sysfs_exit();
 out_dbg:
dbg_debugfs_exit();
 out_compr:
@@ -2308,

[PATCH] CIFS: fix oplock break deadlocks

2017-05-03 Thread Rabin Vincent

From: Rabin Vincent <rab...@axis.com>

When the final cifsFileInfo_put() is called from cifsiod and an oplock
break work is queued, lockdep complains loudly:

 =
 [ INFO: possible recursive locking detected ]
 4.11.0+ #21 Not tainted
 -
 kworker/0:2/78 is trying to acquire lock:
  ("cifsiod"){.+}, at: flush_work+0x215/0x350

 but task is already holding lock:
  ("cifsiod"){.+}, at: process_one_work+0x255/0x8e0

 other info that might help us debug this:
  Possible unsafe locking scenario:

CPU0

   lock("cifsiod");
   lock("cifsiod");

  *** DEADLOCK ***

  May be due to missing lock nesting notation

 2 locks held by kworker/0:2/78:
  #0:  ("cifsiod"){.+}, at: process_one_work+0x255/0x8e0
  #1:  ((>work)){+.+...}, at: process_one_work+0x255/0x8e0

 stack backtrace:
 CPU: 0 PID: 78 Comm: kworker/0:2 Not tainted 4.11.0+ #21
 Workqueue: cifsiod cifs_writev_complete
 Call Trace:
  dump_stack+0x85/0xc2
  __lock_acquire+0x17dd/0x2260
  ? match_held_lock+0x20/0x2b0
  ? trace_hardirqs_off_caller+0x86/0x130
  ? mark_lock+0xa6/0x920
  lock_acquire+0xcc/0x260
  ? lock_acquire+0xcc/0x260
  ? flush_work+0x215/0x350
  flush_work+0x236/0x350
  ? flush_work+0x215/0x350
  ? destroy_worker+0x170/0x170
  __cancel_work_timer+0x17d/0x210
  ? ___preempt_schedule+0x16/0x18
  cancel_work_sync+0x10/0x20
  cifsFileInfo_put+0x338/0x7f0
  cifs_writedata_release+0x2a/0x40
  ? cifs_writedata_release+0x2a/0x40
  cifs_writev_complete+0x29d/0x850
  ? preempt_count_sub+0x18/0xd0
  process_one_work+0x304/0x8e0
  worker_thread+0x9b/0x6a0
  kthread+0x1b2/0x200
  ? process_one_work+0x8e0/0x8e0
  ? kthread_create_on_node+0x40/0x40
  ret_from_fork+0x31/0x40

This is a real warning.  Since the oplock is queued on the same
workqueue this can deadlock if there is only one worker thread active
for the workqueue (which will be the case during memory pressure when
the rescuer thread is handling it).

Furthermore, there is at least one other kind of hang possible due to
the oplock break handling if there is only worker.  (This can be
reproduced without introducing memory pressure by having passing 1 for
the max_active parameter of cifsiod.) cifs_oplock_break() can wait
indefintely in the filemap_fdatawait() while the cifs_writev_complete()
work is blocked:

 sysrq: SysRq : Show Blocked State
   taskPC stack   pid father
 kworker/0:1 D016  2 0x
 Workqueue: cifsiod cifs_oplock_break
 Call Trace:
  __schedule+0x562/0xf40
  ? mark_held_locks+0x4a/0xb0
  schedule+0x57/0xe0
  io_schedule+0x21/0x50
  wait_on_page_bit+0x143/0x190
  ? add_to_page_cache_lru+0x150/0x150
  __filemap_fdatawait_range+0x134/0x190
  ? do_writepages+0x51/0x70
  filemap_fdatawait_range+0x14/0x30
  filemap_fdatawait+0x3b/0x40
  cifs_oplock_break+0x651/0x710
  ? preempt_count_sub+0x18/0xd0
  process_one_work+0x304/0x8e0
  worker_thread+0x9b/0x6a0
  kthread+0x1b2/0x200
  ? process_one_work+0x8e0/0x8e0
  ? kthread_create_on_node+0x40/0x40
  ret_from_fork+0x31/0x40
 dd  D0   683171 0x
 Call Trace:
  __schedule+0x562/0xf40
  ? mark_held_locks+0x29/0xb0
  schedule+0x57/0xe0
  io_schedule+0x21/0x50
  wait_on_page_bit+0x143/0x190
  ? add_to_page_cache_lru+0x150/0x150
  __filemap_fdatawait_range+0x134/0x190
  ? do_writepages+0x51/0x70
  filemap_fdatawait_range+0x14/0x30
  filemap_fdatawait+0x3b/0x40
  filemap_write_and_wait+0x4e/0x70
  cifs_flush+0x6a/0xb0
  filp_close+0x52/0xa0
  __close_fd+0xdc/0x150
  SyS_close+0x33/0x60
  entry_SYSCALL_64_fastpath+0x1f/0xbe

 Showing all locks held in the system:
 2 locks held by kworker/0:1/16:
  #0:  ("cifsiod"){.+.+.+}, at: process_one_work+0x255/0x8e0
  #1:  ((>oplock_break)){+.+.+.}, at: process_one_work+0x255/0x8e0

 Showing busy workqueues and worker pools:
 workqueue cifsiod: flags=0xc
   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/1
 in-flight: 16:cifs_oplock_break
 delayed: cifs_writev_complete, cifs_echo_request
 pool 0: cpus=0 node=0 flags=0x0 nice=0 hung=0s workers=3 idle: 750 3

Fix these problems by creating a a new workqueue (with a rescuer) for
the oplock break work.

Signed-off-by: Rabin Vincent <rab...@axis.com>
---
 fs/cifs/cifsfs.c   | 15 +--
 fs/cifs/cifsglob.h |  1 +
 fs/cifs/misc.c |  2 +-
 fs/cifs/smb2misc.c |  5 +++--
 4 files changed, 18 insertions(+), 5 deletions(-)

diff --git a/fs/cifs/cifsfs.c b/fs/cifs/cifsfs.c
index 34fee9f..5c52c8f 100644
--- a/fs/cifs/cifsfs.c
+++ b/fs/cifs/cifsfs.c
@@ -87,6 +87,7 @@ extern mempool_t *cifs_req_poolp;
 extern mempool_t *cifs_mid_poolp;
 
 struct workqueue_struct*cifsiod_wq;
+struct workqueue_struct*cifsoplockd_wq;
 __u32 cifs_lock_secret;
 
 /*
@@ -1374,9 +1375,16 @@ init_cifs(void)
goto out_clean_proc;
}
 
+   cifsoplockd_wq = alloc_workqueue("ci

[PATCH] CIFS: fix oplock break deadlocks

2017-05-03 Thread Rabin Vincent

From: Rabin Vincent 

When the final cifsFileInfo_put() is called from cifsiod and an oplock
break work is queued, lockdep complains loudly:

 =
 [ INFO: possible recursive locking detected ]
 4.11.0+ #21 Not tainted
 -
 kworker/0:2/78 is trying to acquire lock:
  ("cifsiod"){.+}, at: flush_work+0x215/0x350

 but task is already holding lock:
  ("cifsiod"){.+}, at: process_one_work+0x255/0x8e0

 other info that might help us debug this:
  Possible unsafe locking scenario:

CPU0

   lock("cifsiod");
   lock("cifsiod");

  *** DEADLOCK ***

  May be due to missing lock nesting notation

 2 locks held by kworker/0:2/78:
  #0:  ("cifsiod"){.+}, at: process_one_work+0x255/0x8e0
  #1:  ((>work)){+.+...}, at: process_one_work+0x255/0x8e0

 stack backtrace:
 CPU: 0 PID: 78 Comm: kworker/0:2 Not tainted 4.11.0+ #21
 Workqueue: cifsiod cifs_writev_complete
 Call Trace:
  dump_stack+0x85/0xc2
  __lock_acquire+0x17dd/0x2260
  ? match_held_lock+0x20/0x2b0
  ? trace_hardirqs_off_caller+0x86/0x130
  ? mark_lock+0xa6/0x920
  lock_acquire+0xcc/0x260
  ? lock_acquire+0xcc/0x260
  ? flush_work+0x215/0x350
  flush_work+0x236/0x350
  ? flush_work+0x215/0x350
  ? destroy_worker+0x170/0x170
  __cancel_work_timer+0x17d/0x210
  ? ___preempt_schedule+0x16/0x18
  cancel_work_sync+0x10/0x20
  cifsFileInfo_put+0x338/0x7f0
  cifs_writedata_release+0x2a/0x40
  ? cifs_writedata_release+0x2a/0x40
  cifs_writev_complete+0x29d/0x850
  ? preempt_count_sub+0x18/0xd0
  process_one_work+0x304/0x8e0
  worker_thread+0x9b/0x6a0
  kthread+0x1b2/0x200
  ? process_one_work+0x8e0/0x8e0
  ? kthread_create_on_node+0x40/0x40
  ret_from_fork+0x31/0x40

This is a real warning.  Since the oplock is queued on the same
workqueue this can deadlock if there is only one worker thread active
for the workqueue (which will be the case during memory pressure when
the rescuer thread is handling it).

Furthermore, there is at least one other kind of hang possible due to
the oplock break handling if there is only worker.  (This can be
reproduced without introducing memory pressure by having passing 1 for
the max_active parameter of cifsiod.) cifs_oplock_break() can wait
indefintely in the filemap_fdatawait() while the cifs_writev_complete()
work is blocked:

 sysrq: SysRq : Show Blocked State
   taskPC stack   pid father
 kworker/0:1 D016  2 0x
 Workqueue: cifsiod cifs_oplock_break
 Call Trace:
  __schedule+0x562/0xf40
  ? mark_held_locks+0x4a/0xb0
  schedule+0x57/0xe0
  io_schedule+0x21/0x50
  wait_on_page_bit+0x143/0x190
  ? add_to_page_cache_lru+0x150/0x150
  __filemap_fdatawait_range+0x134/0x190
  ? do_writepages+0x51/0x70
  filemap_fdatawait_range+0x14/0x30
  filemap_fdatawait+0x3b/0x40
  cifs_oplock_break+0x651/0x710
  ? preempt_count_sub+0x18/0xd0
  process_one_work+0x304/0x8e0
  worker_thread+0x9b/0x6a0
  kthread+0x1b2/0x200
  ? process_one_work+0x8e0/0x8e0
  ? kthread_create_on_node+0x40/0x40
  ret_from_fork+0x31/0x40
 dd  D0   683171 0x
 Call Trace:
  __schedule+0x562/0xf40
  ? mark_held_locks+0x29/0xb0
  schedule+0x57/0xe0
  io_schedule+0x21/0x50
  wait_on_page_bit+0x143/0x190
  ? add_to_page_cache_lru+0x150/0x150
  __filemap_fdatawait_range+0x134/0x190
  ? do_writepages+0x51/0x70
  filemap_fdatawait_range+0x14/0x30
  filemap_fdatawait+0x3b/0x40
  filemap_write_and_wait+0x4e/0x70
  cifs_flush+0x6a/0xb0
  filp_close+0x52/0xa0
  __close_fd+0xdc/0x150
  SyS_close+0x33/0x60
  entry_SYSCALL_64_fastpath+0x1f/0xbe

 Showing all locks held in the system:
 2 locks held by kworker/0:1/16:
  #0:  ("cifsiod"){.+.+.+}, at: process_one_work+0x255/0x8e0
  #1:  ((>oplock_break)){+.+.+.}, at: process_one_work+0x255/0x8e0

 Showing busy workqueues and worker pools:
 workqueue cifsiod: flags=0xc
   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/1
 in-flight: 16:cifs_oplock_break
 delayed: cifs_writev_complete, cifs_echo_request
 pool 0: cpus=0 node=0 flags=0x0 nice=0 hung=0s workers=3 idle: 750 3

Fix these problems by creating a a new workqueue (with a rescuer) for
the oplock break work.

Signed-off-by: Rabin Vincent 
---
 fs/cifs/cifsfs.c   | 15 +--
 fs/cifs/cifsglob.h |  1 +
 fs/cifs/misc.c |  2 +-
 fs/cifs/smb2misc.c |  5 +++--
 4 files changed, 18 insertions(+), 5 deletions(-)

diff --git a/fs/cifs/cifsfs.c b/fs/cifs/cifsfs.c
index 34fee9f..5c52c8f 100644
--- a/fs/cifs/cifsfs.c
+++ b/fs/cifs/cifsfs.c
@@ -87,6 +87,7 @@ extern mempool_t *cifs_req_poolp;
 extern mempool_t *cifs_mid_poolp;
 
 struct workqueue_struct*cifsiod_wq;
+struct workqueue_struct*cifsoplockd_wq;
 __u32 cifs_lock_secret;
 
 /*
@@ -1374,9 +1375,16 @@ init_cifs(void)
goto out_clean_proc;
}
 
+   cifsoplockd_wq = alloc_workqueue("cifsoplockd",
+

[PATCH] CIFS: silence lockdep splat in cifs_relock_file()

2017-05-03 Thread Rabin Vincent

From: Rabin Vincent <rab...@axis.com>

cifs_relock_file() can perform a down_write() on the inode's lock_sem even
though it was already performed in cifs_strict_readv().  Lockdep complains
about this.  AFAICS, there is no problem here, and lockdep just needs to be
told that this nesting is OK.

 =
 [ INFO: possible recursive locking detected ]
 4.11.0+ #20 Not tainted
 -
 cat/701 is trying to acquire lock:
  (>lock_sem){.+}, at: cifs_reopen_file+0x7a7/0xc00

 but task is already holding lock:
  (>lock_sem){.+}, at: cifs_strict_readv+0x177/0x310

 other info that might help us debug this:
  Possible unsafe locking scenario:

CPU0

   lock(>lock_sem);
   lock(>lock_sem);

  *** DEADLOCK ***

  May be due to missing lock nesting notation

 1 lock held by cat/701:
  #0:  (>lock_sem){.+}, at: cifs_strict_readv+0x177/0x310

 stack backtrace:
 CPU: 0 PID: 701 Comm: cat Not tainted 4.11.0+ #20
 Call Trace:
  dump_stack+0x85/0xc2
  __lock_acquire+0x17dd/0x2260
  ? trace_hardirqs_on_thunk+0x1a/0x1c
  ? preempt_schedule_irq+0x6b/0x80
  lock_acquire+0xcc/0x260
  ? lock_acquire+0xcc/0x260
  ? cifs_reopen_file+0x7a7/0xc00
  down_read+0x2d/0x70
  ? cifs_reopen_file+0x7a7/0xc00
  cifs_reopen_file+0x7a7/0xc00
  ? printk+0x43/0x4b
  cifs_readpage_worker+0x327/0x8a0
  cifs_readpage+0x8c/0x2a0
  generic_file_read_iter+0x692/0xd00
  cifs_strict_readv+0x29f/0x310
  generic_file_splice_read+0x11c/0x1c0
  do_splice_to+0xa5/0xc0
  splice_direct_to_actor+0xfa/0x350
  ? generic_pipe_buf_nosteal+0x10/0x10
  do_splice_direct+0xb5/0xe0
  do_sendfile+0x278/0x3a0
  SyS_sendfile64+0xc4/0xe0
  entry_SYSCALL_64_fastpath+0x1f/0xbe

Signed-off-by: Rabin Vincent <rab...@axis.com>
---
 fs/cifs/file.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/cifs/file.c b/fs/cifs/file.c
index 21d4045..64b590b 100644
--- a/fs/cifs/file.c
+++ b/fs/cifs/file.c
@@ -582,7 +582,7 @@ cifs_relock_file(struct cifsFileInfo *cfile)
struct cifs_tcon *tcon = tlink_tcon(cfile->tlink);
int rc = 0;
 
-   down_read(>lock_sem);
+   down_read_nested(>lock_sem, SINGLE_DEPTH_NESTING);
if (cinode->can_cache_brlcks) {
/* can cache locks - no need to relock */
up_read(>lock_sem);
-- 
2.1.4

[PATCH] CIFS: silence lockdep splat in cifs_relock_file()

2017-05-03 Thread Rabin Vincent

From: Rabin Vincent 

cifs_relock_file() can perform a down_write() on the inode's lock_sem even
though it was already performed in cifs_strict_readv().  Lockdep complains
about this.  AFAICS, there is no problem here, and lockdep just needs to be
told that this nesting is OK.

 =
 [ INFO: possible recursive locking detected ]
 4.11.0+ #20 Not tainted
 -
 cat/701 is trying to acquire lock:
  (>lock_sem){.+}, at: cifs_reopen_file+0x7a7/0xc00

 but task is already holding lock:
  (>lock_sem){.+}, at: cifs_strict_readv+0x177/0x310

 other info that might help us debug this:
  Possible unsafe locking scenario:

CPU0

   lock(>lock_sem);
   lock(>lock_sem);

  *** DEADLOCK ***

  May be due to missing lock nesting notation

 1 lock held by cat/701:
  #0:  (>lock_sem){.+}, at: cifs_strict_readv+0x177/0x310

 stack backtrace:
 CPU: 0 PID: 701 Comm: cat Not tainted 4.11.0+ #20
 Call Trace:
  dump_stack+0x85/0xc2
  __lock_acquire+0x17dd/0x2260
  ? trace_hardirqs_on_thunk+0x1a/0x1c
  ? preempt_schedule_irq+0x6b/0x80
  lock_acquire+0xcc/0x260
  ? lock_acquire+0xcc/0x260
  ? cifs_reopen_file+0x7a7/0xc00
  down_read+0x2d/0x70
  ? cifs_reopen_file+0x7a7/0xc00
  cifs_reopen_file+0x7a7/0xc00
  ? printk+0x43/0x4b
  cifs_readpage_worker+0x327/0x8a0
  cifs_readpage+0x8c/0x2a0
  generic_file_read_iter+0x692/0xd00
  cifs_strict_readv+0x29f/0x310
  generic_file_splice_read+0x11c/0x1c0
  do_splice_to+0xa5/0xc0
  splice_direct_to_actor+0xfa/0x350
  ? generic_pipe_buf_nosteal+0x10/0x10
  do_splice_direct+0xb5/0xe0
  do_sendfile+0x278/0x3a0
  SyS_sendfile64+0xc4/0xe0
  entry_SYSCALL_64_fastpath+0x1f/0xbe

Signed-off-by: Rabin Vincent 
---
 fs/cifs/file.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/cifs/file.c b/fs/cifs/file.c
index 21d4045..64b590b 100644
--- a/fs/cifs/file.c
+++ b/fs/cifs/file.c
@@ -582,7 +582,7 @@ cifs_relock_file(struct cifsFileInfo *cfile)
struct cifs_tcon *tcon = tlink_tcon(cfile->tlink);
int rc = 0;
 
-   down_read(>lock_sem);
+   down_read_nested(>lock_sem, SINGLE_DEPTH_NESTING);
if (cinode->can_cache_brlcks) {
/* can cache locks - no need to relock */
up_read(>lock_sem);
-- 
2.1.4

[PATCH] Makefile: evaluate LDFLAGS_BUILD_ID only once

2017-04-30 Thread Rabin Vincent

Evaluate LDFLAGS_BUILD_ID (which involves invoking the compiler) only
once instead of over and over.

This provides a ~20% reduction in null build time with x86 allnoconfig:

$ make allnoconfig && make -j8
$ perf stat -r5 -e sched:sched_process_exec make -j8
-   2 119  sched:sched_process_exec
+   1 878  sched:sched_process_exec

-   1,238817018 seconds time elapsed
+   0,971020553 seconds time elapsed

Signed-off-by: Rabin Vincent <ra...@rab.in>
---
 Makefile | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/Makefile b/Makefile
index 779302695453..c78a266594c1 100644
--- a/Makefile
+++ b/Makefile
@@ -815,7 +815,7 @@ KBUILD_AFLAGS   += $(ARCH_AFLAGS)   $(KAFLAGS)
 KBUILD_CFLAGS   += $(ARCH_CFLAGS)   $(KCFLAGS)
 
 # Use --build-id when available.
-LDFLAGS_BUILD_ID = $(patsubst -Wl$(comma)%,%,\
+LDFLAGS_BUILD_ID := $(patsubst -Wl$(comma)%,%,\
  $(call cc-ldoption, -Wl$(comma)--build-id,))
 KBUILD_LDFLAGS_MODULE += $(LDFLAGS_BUILD_ID)
 LDFLAGS_vmlinux += $(LDFLAGS_BUILD_ID)
-- 
2.11.0

[PATCH] Makefile: evaluate LDFLAGS_BUILD_ID only once

2017-04-30 Thread Rabin Vincent

Evaluate LDFLAGS_BUILD_ID (which involves invoking the compiler) only
once instead of over and over.

This provides a ~20% reduction in null build time with x86 allnoconfig:

$ make allnoconfig && make -j8
$ perf stat -r5 -e sched:sched_process_exec make -j8
-   2 119  sched:sched_process_exec
+   1 878  sched:sched_process_exec

-   1,238817018 seconds time elapsed
+   0,971020553 seconds time elapsed

Signed-off-by: Rabin Vincent 
---
 Makefile | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/Makefile b/Makefile
index 779302695453..c78a266594c1 100644
--- a/Makefile
+++ b/Makefile
@@ -815,7 +815,7 @@ KBUILD_AFLAGS   += $(ARCH_AFLAGS)   $(KAFLAGS)
 KBUILD_CFLAGS   += $(ARCH_CFLAGS)   $(KCFLAGS)
 
 # Use --build-id when available.
-LDFLAGS_BUILD_ID = $(patsubst -Wl$(comma)%,%,\
+LDFLAGS_BUILD_ID := $(patsubst -Wl$(comma)%,%,\
  $(call cc-ldoption, -Wl$(comma)--build-id,))
 KBUILD_LDFLAGS_MODULE += $(LDFLAGS_BUILD_ID)
 LDFLAGS_vmlinux += $(LDFLAGS_BUILD_ID)
-- 
2.11.0

[PATCH] mm: prevent NR_ISOLATE_* stats from going negative

2017-04-20 Thread Rabin Vincent

From: Rabin Vincent <rab...@axis.com>

Commit 6afcf8ef0ca0 ("mm, compaction: fix NR_ISOLATED_* stats for pfn
based migration") moved the dec_node_page_state() call (along with the
page_is_file_cache() call) to after putback_lru_page().  But
page_is_file_cache() can change after putback_lru_page() is called, so
it should be called before putback_lru_page(), as it was before that
patch, to prevent NR_ISOLATE_* stats from going negative.

Without this fix, non-CONFIG_SMP kernels end up hanging in the
while(too_many_isolated()) { congestion_wait() } loop in
shrink_active_list() due to the negative stats.

 Mem-Info:
  active_anon:32567 inactive_anon:121 isolated_anon:1
  active_file:6066 inactive_file:6639 isolated_file:4294967295
^^
  unevictable:0 dirty:115 writeback:0 unstable:0
  slab_reclaimable:2086 slab_unreclaimable:3167
  mapped:3398 shmem:18366 pagetables:1145 bounce:0
  free:1798 free_pcp:13 free_cma:0

Fixes: 6afcf8ef0ca0 ("mm, compaction: fix NR_ISOLATED_* stats for pfn based 
migration")
Cc: Ming Ling <ming.l...@spreadtrum.com>
Cc: Michal Hocko <mho...@suse.com>
Cc: Minchan Kim <minc...@kernel.org>
Cc: Vlastimil Babka <vba...@suse.cz>
Cc: <sta...@vger.kernel.org>
Signed-off-by: Rabin Vincent <rab...@axis.com>
---
 mm/migrate.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index ed97c2c..738f1d5 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -184,9 +184,9 @@ void putback_movable_pages(struct list_head *l)
unlock_page(page);
put_page(page);
} else {
-   putback_lru_page(page);
dec_node_page_state(page, NR_ISOLATED_ANON +
page_is_file_cache(page));
+   putback_lru_page(page);
}
}
 }
-- 
2.7.0

[PATCH] mm: prevent NR_ISOLATE_* stats from going negative

2017-04-20 Thread Rabin Vincent

From: Rabin Vincent 

Commit 6afcf8ef0ca0 ("mm, compaction: fix NR_ISOLATED_* stats for pfn
based migration") moved the dec_node_page_state() call (along with the
page_is_file_cache() call) to after putback_lru_page().  But
page_is_file_cache() can change after putback_lru_page() is called, so
it should be called before putback_lru_page(), as it was before that
patch, to prevent NR_ISOLATE_* stats from going negative.

Without this fix, non-CONFIG_SMP kernels end up hanging in the
while(too_many_isolated()) { congestion_wait() } loop in
shrink_active_list() due to the negative stats.

 Mem-Info:
  active_anon:32567 inactive_anon:121 isolated_anon:1
  active_file:6066 inactive_file:6639 isolated_file:4294967295
^^
  unevictable:0 dirty:115 writeback:0 unstable:0
  slab_reclaimable:2086 slab_unreclaimable:3167
  mapped:3398 shmem:18366 pagetables:1145 bounce:0
  free:1798 free_pcp:13 free_cma:0

Fixes: 6afcf8ef0ca0 ("mm, compaction: fix NR_ISOLATED_* stats for pfn based 
migration")
Cc: Ming Ling 
Cc: Michal Hocko 
Cc: Minchan Kim 
Cc: Vlastimil Babka 
Cc: 
Signed-off-by: Rabin Vincent 
---
 mm/migrate.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index ed97c2c..738f1d5 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -184,9 +184,9 @@ void putback_movable_pages(struct list_head *l)
unlock_page(page);
put_page(page);
} else {
-   putback_lru_page(page);
dec_node_page_state(page, NR_ISOLATED_ANON +
page_is_file_cache(page));
+   putback_lru_page(page);
}
}
 }
-- 
2.7.0

[PATCH] printk: fix lost last line in kmsg dump

2017-04-19 Thread Rabin Vincent

From: Rabin Vincent <rab...@axis.com>

kmsg_dump_get_buffer() selects the youngest log messages which fit into
the provided buffer.  When that function determines the correct start
index, by looping and calling msg_print_text() with a NULL buffer, it
allows the youngest log messages to completely fill the provided buffer.

However, when doing the actual printing, an off-by-one error in
msg_print_text() leads to that function allowing the provided buffer to
only be filled to (size - 1).

So if the lengths of the selected youngest log messages happen to
completely fill up the provided buffer, the last log message is lost.

Note that msg_print_text() is also used from the syslog code but this
bug does trigger there since the buffers used in the syslog code are
never filled up completely (since they are only used to print individual
lines, and their size is always LOG_LINE_MAX + PREFIX_MAX, and
PREFIX_MAX is larger than the largest possible prefix).

Signed-off-by: Rabin Vincent <rab...@axis.com>
---
 kernel/printk/printk.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
index de08fc9..abac373 100644
--- a/kernel/printk/printk.c
+++ b/kernel/printk/printk.c
@@ -1260,7 +1260,7 @@ static size_t msg_print_text(const struct printk_log 
*msg, enum log_flags prev,
 
if (buf) {
if (print_prefix(msg, syslog, NULL) +
-   text_len + 1 >= size - len)
+   text_len + 1 > size - len)
break;
 
if (prefix)
-- 
2.7.0

[PATCH] printk: fix lost last line in kmsg dump

2017-04-19 Thread Rabin Vincent

From: Rabin Vincent 

kmsg_dump_get_buffer() selects the youngest log messages which fit into
the provided buffer.  When that function determines the correct start
index, by looping and calling msg_print_text() with a NULL buffer, it
allows the youngest log messages to completely fill the provided buffer.

However, when doing the actual printing, an off-by-one error in
msg_print_text() leads to that function allowing the provided buffer to
only be filled to (size - 1).

So if the lengths of the selected youngest log messages happen to
completely fill up the provided buffer, the last log message is lost.

Note that msg_print_text() is also used from the syslog code but this
bug does trigger there since the buffers used in the syslog code are
never filled up completely (since they are only used to print individual
lines, and their size is always LOG_LINE_MAX + PREFIX_MAX, and
PREFIX_MAX is larger than the largest possible prefix).

Signed-off-by: Rabin Vincent 
---
 kernel/printk/printk.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
index de08fc9..abac373 100644
--- a/kernel/printk/printk.c
+++ b/kernel/printk/printk.c
@@ -1260,7 +1260,7 @@ static size_t msg_print_text(const struct printk_log 
*msg, enum log_flags prev,
 
if (buf) {
if (print_prefix(msg, syslog, NULL) +
-   text_len + 1 >= size - len)
+   text_len + 1 > size - len)
break;
 
if (prefix)
-- 
2.7.0

[PATCH] MIPS: perf: fix deadlock

2017-04-05 Thread Rabin Vincent

From: Rabin Vincent <rab...@axis.com>

mipsxx_pmu_handle_shared_irq() calls irq_work_run() while holding the
pmuint_rwlock for read.  irq_work_run() can, via perf_pending_event(),
call try_to_wake_up() which can try to take rq->lock.

However, perf can also call perf_pmu_enable() (and thus take the
pmuint_rwlock for write) while holding the rq->lock, from
finish_task_switch() via perf_event_context_sched_in().

This leads to an ABBA deadlock:

 PID: 3855   TASK: 8f7ce288  CPU: 2   COMMAND: "process"
  #0 [89c39ac8] __delay at 803b5be4
  #1 [89c39ac8] do_raw_spin_lock at 8008fdcc
  #2 [89c39af8] try_to_wake_up at 8006e47c
  #3 [89c39b38] pollwake at 8018eab0
  #4 [89c39b68] __wake_up_common at 800879f4
  #5 [89c39b98] __wake_up at 800880e4
  #6 [89c39bc8] perf_event_wakeup at 8012109c
  #7 [89c39be8] perf_pending_event at 80121184
  #8 [89c39c08] irq_work_run_list at 801151f0
  #9 [89c39c38] irq_work_run at 80115274
 #10 [89c39c50] mipsxx_pmu_handle_shared_irq at 8002cc7c

 PID: 1481   TASK: 8eaac6a8  CPU: 3   COMMAND: "process"
  #0 [8de7f900] do_raw_write_lock at 800900e0
  #1 [8de7f918] perf_event_context_sched_in at 80122310
  #2 [8de7f938] __perf_event_task_sched_in at 80122608
  #3 [8de7f958] finish_task_switch at 8006b8a4
  #4 [8de7f998] __schedule at 805e4dc4
  #5 [8de7f9f8] schedule at 805e5558
  #6 [8de7fa10] schedule_hrtimeout_range_clock at 805e9984
  #7 [8de7fa70] poll_schedule_timeout at 8018e8f8
  #8 [8de7fa88] do_select at 8018f338
  #9 [8de7fd88] core_sys_select at 8018f5cc
 #10 [8de7fee0] sys_select at 8018f854
 #11 [8de7ff28] syscall_common at 80028fc8

The lock seems to be there to protect the hardware counters so there is
no need to hold it across irq_work_run().

Signed-off-by: Rabin Vincent <rab...@axis.com>
---
 arch/mips/kernel/perf_event_mipsxx.c | 9 +
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/arch/mips/kernel/perf_event_mipsxx.c 
b/arch/mips/kernel/perf_event_mipsxx.c
index 8c35b31..9452b02 100644
--- a/arch/mips/kernel/perf_event_mipsxx.c
+++ b/arch/mips/kernel/perf_event_mipsxx.c
@@ -1446,6 +1446,11 @@ static int mipsxx_pmu_handle_shared_irq(void)
HANDLE_COUNTER(0)
}
 
+#ifdef CONFIG_MIPS_PERF_SHARED_TC_COUNTERS
+   read_unlock(_rwlock);
+#endif
+   resume_local_counters();
+
/*
 * Do all the work for the pending perf events. We can do this
 * in here because the performance counter interrupt is a regular
@@ -1454,10 +1459,6 @@ static int mipsxx_pmu_handle_shared_irq(void)
if (handled == IRQ_HANDLED)
irq_work_run();
 
-#ifdef CONFIG_MIPS_PERF_SHARED_TC_COUNTERS
-   read_unlock(_rwlock);
-#endif
-   resume_local_counters();
return handled;
 }
 
-- 
2.7.0

[PATCH] MIPS: perf: fix deadlock

2017-04-05 Thread Rabin Vincent

From: Rabin Vincent 

mipsxx_pmu_handle_shared_irq() calls irq_work_run() while holding the
pmuint_rwlock for read.  irq_work_run() can, via perf_pending_event(),
call try_to_wake_up() which can try to take rq->lock.

However, perf can also call perf_pmu_enable() (and thus take the
pmuint_rwlock for write) while holding the rq->lock, from
finish_task_switch() via perf_event_context_sched_in().

This leads to an ABBA deadlock:

 PID: 3855   TASK: 8f7ce288  CPU: 2   COMMAND: "process"
  #0 [89c39ac8] __delay at 803b5be4
  #1 [89c39ac8] do_raw_spin_lock at 8008fdcc
  #2 [89c39af8] try_to_wake_up at 8006e47c
  #3 [89c39b38] pollwake at 8018eab0
  #4 [89c39b68] __wake_up_common at 800879f4
  #5 [89c39b98] __wake_up at 800880e4
  #6 [89c39bc8] perf_event_wakeup at 8012109c
  #7 [89c39be8] perf_pending_event at 80121184
  #8 [89c39c08] irq_work_run_list at 801151f0
  #9 [89c39c38] irq_work_run at 80115274
 #10 [89c39c50] mipsxx_pmu_handle_shared_irq at 8002cc7c

 PID: 1481   TASK: 8eaac6a8  CPU: 3   COMMAND: "process"
  #0 [8de7f900] do_raw_write_lock at 800900e0
  #1 [8de7f918] perf_event_context_sched_in at 80122310
  #2 [8de7f938] __perf_event_task_sched_in at 80122608
  #3 [8de7f958] finish_task_switch at 8006b8a4
  #4 [8de7f998] __schedule at 805e4dc4
  #5 [8de7f9f8] schedule at 805e5558
  #6 [8de7fa10] schedule_hrtimeout_range_clock at 805e9984
  #7 [8de7fa70] poll_schedule_timeout at 8018e8f8
  #8 [8de7fa88] do_select at 8018f338
  #9 [8de7fd88] core_sys_select at 8018f5cc
 #10 [8de7fee0] sys_select at 8018f854
 #11 [8de7ff28] syscall_common at 80028fc8

The lock seems to be there to protect the hardware counters so there is
no need to hold it across irq_work_run().

Signed-off-by: Rabin Vincent 
---
 arch/mips/kernel/perf_event_mipsxx.c | 9 +
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/arch/mips/kernel/perf_event_mipsxx.c 
b/arch/mips/kernel/perf_event_mipsxx.c
index 8c35b31..9452b02 100644
--- a/arch/mips/kernel/perf_event_mipsxx.c
+++ b/arch/mips/kernel/perf_event_mipsxx.c
@@ -1446,6 +1446,11 @@ static int mipsxx_pmu_handle_shared_irq(void)
HANDLE_COUNTER(0)
}
 
+#ifdef CONFIG_MIPS_PERF_SHARED_TC_COUNTERS
+   read_unlock(_rwlock);
+#endif
+   resume_local_counters();
+
/*
 * Do all the work for the pending perf events. We can do this
 * in here because the performance counter interrupt is a regular
@@ -1454,10 +1459,6 @@ static int mipsxx_pmu_handle_shared_irq(void)
if (handled == IRQ_HANDLED)
irq_work_run();
 
-#ifdef CONFIG_MIPS_PERF_SHARED_TC_COUNTERS
-   read_unlock(_rwlock);
-#endif
-   resume_local_counters();
return handled;
 }
 
-- 
2.7.0

Re: [PATCH] arm: ftrace: fix dynamic ftrace with DEBUG_RODATA and !FRAME_POINTER

2017-03-18 Thread Rabin Vincent

On Wed, Mar 15, 2017 at 06:07:39PM +, Abel Vesa wrote:
> The support for dynamic ftrace with CONFIG_DEBUG_RODATA involves
> overriding the weak arch_ftrace_update_code() with a variant which makes
> the kernel text writable around the patching.
> 
> This override was however added under the CONFIG_OLD_MCOUNT ifdef, and
> CONFIG_OLD_MCOUNT is only enabled if frame pointers are enabled.
> 
> This leads to non-functional dynamic ftrace (ftrace triggers a
> WARN_ON()) when CONFIG_DEBUG_RODATA is enabled and CONFIG_FRAME_POINTER
> is not.
> 
> Move the override out of that ifdef and into the CONFIG_DYNAMIC_FTRACE
> ifdef where it belongs.
> 
> Fixes: 80d6b0c2eed2a ("ARM: mm: allow text and rodata sections to be 
> read-only")
> 
> Suggested-by: Nicolai Stange <nicsta...@gmail.com>
> Suggested-by: Rabin Vincent <ra...@rab.in>
> Signed-off-by: Abel Vesa <abelv...@linux.com>

Acked-by: Rabin Vincent <ra...@rab.in>

Re: [PATCH] arm: ftrace: fix dynamic ftrace with DEBUG_RODATA and !FRAME_POINTER

2017-03-18 Thread Rabin Vincent

On Wed, Mar 15, 2017 at 06:07:39PM +, Abel Vesa wrote:
> The support for dynamic ftrace with CONFIG_DEBUG_RODATA involves
> overriding the weak arch_ftrace_update_code() with a variant which makes
> the kernel text writable around the patching.
> 
> This override was however added under the CONFIG_OLD_MCOUNT ifdef, and
> CONFIG_OLD_MCOUNT is only enabled if frame pointers are enabled.
> 
> This leads to non-functional dynamic ftrace (ftrace triggers a
> WARN_ON()) when CONFIG_DEBUG_RODATA is enabled and CONFIG_FRAME_POINTER
> is not.
> 
> Move the override out of that ifdef and into the CONFIG_DYNAMIC_FTRACE
> ifdef where it belongs.
> 
> Fixes: 80d6b0c2eed2a ("ARM: mm: allow text and rodata sections to be 
> read-only")
> 
> Suggested-by: Nicolai Stange 
> Suggested-by: Rabin Vincent 
> Signed-off-by: Abel Vesa 

Acked-by: Rabin Vincent

Re: [PATCHv5] arm: ftrace: Adds support for CONFIG_DYNAMIC_FTRACE_WITH_REGS

2017-03-04 Thread Rabin Vincent

On Sat, Mar 04, 2017 at 12:51:12AM +, Abel Vesa wrote:
> diff --git a/arch/arm/kernel/entry-ftrace.S b/arch/arm/kernel/entry-ftrace.S
> index c73c403..93f9abb 100644
> --- a/arch/arm/kernel/entry-ftrace.S
> +++ b/arch/arm/kernel/entry-ftrace.S
> @@ -92,12 +92,78 @@
>  2:   mcount_exit
>  .endm
>  
> +#ifdef CONFIG_DYNAMIC_FTRACE_WITH_REGS
> +
> +.macro __ftrace_regs_caller
> +
> + sub sp, sp, #8  @ space for CPSR and OLD_R0 (not used)
> +
> + add ip, sp, #12 @ move in IP the value of SP as it was
> + @ before the push {lr} of the mcount mechanism
> + stmdb   sp!, {ip,lr,pc}

This doesn't build with CONFIG_THUMB2_KERNEL:

entry-ftrace.S:285: Error: PC not allowed in register list -- `stmdb 
sp!,{ip,lr,pc}'

Saving PC in STMDB is prohibited in Thumb and deprecated in ARM.

> + stmdb   sp!, {r0-r11,lr}
> +
> + ldr r0, [sp, #S_LR] @ replace PC with LR
> + str r0, [sp, #S_PC] @ into pt_regs
> +
> + ldr r1, [sp, #PT_REGS_SIZE] @ replace new LR with
> + str r1, [sp, #S_LR] @ previous LR into pt_regs
> +
> + @ stack content at this point:
> + @ 0  4  48   52   5660   6468   72
> + @ R0 | R1 | ... | LR | SP + 4 | previous LR | LR | PSR | OLD_R0 | 
> previous LR |

So this code only adjust the SP by 72 bytes.  Since the stack is not
aligned to 8 bytes at entry to mcount, this code calls into C functions
with a misaligned stack.  This is a violation of the procedure call
standard.

Also, I believe this should have unwind annotations to indicate where it
saved the registers.  See mcount_enter for an example.

> +
> + mov r3, sp  @ struct pt_regs*
> + ldr r2, =function_trace_op
> + ldr r2, [r2]@ pointer to the current
> + @ function tracing op
> + ldr r1, [sp, #PT_REGS_SIZE] @ lr of instrumented func
> + mcount_adjust_addr  r0, lr  @ instrumented function
> +
> + .globl ftrace_regs_call
> +ftrace_regs_call:
> + bl  ftrace_stub
> +
> +#ifdef CONFIG_FUNCTION_GRAPH_TRACER
> + .globl ftrace_graph_regs_call
> +ftrace_graph_regs_call:
> + mov r0, r0
> +#endif
> + @ pop saved regs
> + ldmia   sp, {r0-r15}

This doesn't build either:

entry-ftrace.S:285: Error: LR and PC should not both be in register list -- 
`ldmia sp,{r0-r15}'

Restoring LR and PC together is prohibited in Thumb and deprecated in
ARM.  It's the same case with SP.

> +.endm
> +
> +#ifdef CONFIG_FUNCTION_GRAPH_TRACER
> +.macro __ftrace_graph_regs_caller
> +
> + sub r0, fp, #4  @ lr of instrumented routine (parent)
> +
> + @ called from __ftrace_regs_caller
> + ldr r1, [sp, #S_PC] @ instrumented routine (func)
> + mcount_adjust_addr  r1, r1
> +
> + mov r2, fp  @ frame pointer
> + bl  prepare_ftrace_return
> +
> + @ pop registers saved in ftrace_regs_caller
> + ldmia   sp, {r0-r15}

This doesn't get built on CONFIG_THUMB2_KERNEL since there's no suppport
for CONFIG_FUNCTION_GRAPH_TRACER there, but this is also using
deprecated operands.

> +.endm
> +#endif
> +#endif
...
> diff --git a/arch/arm/kernel/ftrace.c b/arch/arm/kernel/ftrace.c
> index 3f17594..f165265 100644
> --- a/arch/arm/kernel/ftrace.c
> +++ b/arch/arm/kernel/ftrace.c
> @@ -29,11 +29,6 @@
>  #endif
>  
>  #ifdef CONFIG_DYNAMIC_FTRACE
> -#ifdef CONFIG_OLD_MCOUNT
> -#define OLD_MCOUNT_ADDR  ((unsigned long) mcount)
> -#define OLD_FTRACE_ADDR ((unsigned long) ftrace_caller_old)
> -
> -#define  OLD_NOP 0xe1a0  /* mov r0, r0 */
>  
>  static int __ftrace_modify_code(void *data)
>  {
> @@ -51,6 +46,12 @@ void arch_ftrace_update_code(int command)
>   stop_machine(__ftrace_modify_code, , NULL);
>  }
>  
> +#ifdef CONFIG_OLD_MCOUNT
> +#define OLD_MCOUNT_ADDR  ((unsigned long) mcount)
> +#define OLD_FTRACE_ADDR ((unsigned long) ftrace_caller_old)
> +
> +#define  OLD_NOP 0xe1a0  /* mov r0, r0 */
> +
>  static unsigned long ftrace_nop_replace(struct dyn_ftrace *rec)

This chunk above fixes the problem which Nicolai pointed out earlier.  Since
it's a bug fix for the current dynamic ftrace code, it should really be
submitted as a separate patch.  Feel free to use some variant of this
commit message which I wrote when I ran into this yesterday:

 ARM: ftrace: fix dynamic ftrace with DEBUG_RODATA and !FRAME_POINTER
 
 The support for dynamic ftrace with CONFIG_DEBUG_RODATA involves
 overriding the weak arch_ftrace_update_code() with a variant which makes
 the kernel text writable around the patching.
 
 This override was however added under the CONFIG_OLD_MCOUNT ifdef, and
 CONFIG_OLD_MCOUNT is only enabled if frame pointers are enabled.
 
 This leads to non-functional dynamic ftrace (ftrace triggers a

Re: [PATCHv5] arm: ftrace: Adds support for CONFIG_DYNAMIC_FTRACE_WITH_REGS

2017-03-04 Thread Rabin Vincent

On Sat, Mar 04, 2017 at 12:51:12AM +, Abel Vesa wrote:
> diff --git a/arch/arm/kernel/entry-ftrace.S b/arch/arm/kernel/entry-ftrace.S
> index c73c403..93f9abb 100644
> --- a/arch/arm/kernel/entry-ftrace.S
> +++ b/arch/arm/kernel/entry-ftrace.S
> @@ -92,12 +92,78 @@
>  2:   mcount_exit
>  .endm
>  
> +#ifdef CONFIG_DYNAMIC_FTRACE_WITH_REGS
> +
> +.macro __ftrace_regs_caller
> +
> + sub sp, sp, #8  @ space for CPSR and OLD_R0 (not used)
> +
> + add ip, sp, #12 @ move in IP the value of SP as it was
> + @ before the push {lr} of the mcount mechanism
> + stmdb   sp!, {ip,lr,pc}

This doesn't build with CONFIG_THUMB2_KERNEL:

entry-ftrace.S:285: Error: PC not allowed in register list -- `stmdb 
sp!,{ip,lr,pc}'

Saving PC in STMDB is prohibited in Thumb and deprecated in ARM.

> + stmdb   sp!, {r0-r11,lr}
> +
> + ldr r0, [sp, #S_LR] @ replace PC with LR
> + str r0, [sp, #S_PC] @ into pt_regs
> +
> + ldr r1, [sp, #PT_REGS_SIZE] @ replace new LR with
> + str r1, [sp, #S_LR] @ previous LR into pt_regs
> +
> + @ stack content at this point:
> + @ 0  4  48   52   5660   6468   72
> + @ R0 | R1 | ... | LR | SP + 4 | previous LR | LR | PSR | OLD_R0 | 
> previous LR |

So this code only adjust the SP by 72 bytes.  Since the stack is not
aligned to 8 bytes at entry to mcount, this code calls into C functions
with a misaligned stack.  This is a violation of the procedure call
standard.

Also, I believe this should have unwind annotations to indicate where it
saved the registers.  See mcount_enter for an example.

> +
> + mov r3, sp  @ struct pt_regs*
> + ldr r2, =function_trace_op
> + ldr r2, [r2]@ pointer to the current
> + @ function tracing op
> + ldr r1, [sp, #PT_REGS_SIZE] @ lr of instrumented func
> + mcount_adjust_addr  r0, lr  @ instrumented function
> +
> + .globl ftrace_regs_call
> +ftrace_regs_call:
> + bl  ftrace_stub
> +
> +#ifdef CONFIG_FUNCTION_GRAPH_TRACER
> + .globl ftrace_graph_regs_call
> +ftrace_graph_regs_call:
> + mov r0, r0
> +#endif
> + @ pop saved regs
> + ldmia   sp, {r0-r15}

This doesn't build either:

entry-ftrace.S:285: Error: LR and PC should not both be in register list -- 
`ldmia sp,{r0-r15}'

Restoring LR and PC together is prohibited in Thumb and deprecated in
ARM.  It's the same case with SP.

> +.endm
> +
> +#ifdef CONFIG_FUNCTION_GRAPH_TRACER
> +.macro __ftrace_graph_regs_caller
> +
> + sub r0, fp, #4  @ lr of instrumented routine (parent)
> +
> + @ called from __ftrace_regs_caller
> + ldr r1, [sp, #S_PC] @ instrumented routine (func)
> + mcount_adjust_addr  r1, r1
> +
> + mov r2, fp  @ frame pointer
> + bl  prepare_ftrace_return
> +
> + @ pop registers saved in ftrace_regs_caller
> + ldmia   sp, {r0-r15}

This doesn't get built on CONFIG_THUMB2_KERNEL since there's no suppport
for CONFIG_FUNCTION_GRAPH_TRACER there, but this is also using
deprecated operands.

> +.endm
> +#endif
> +#endif
...
> diff --git a/arch/arm/kernel/ftrace.c b/arch/arm/kernel/ftrace.c
> index 3f17594..f165265 100644
> --- a/arch/arm/kernel/ftrace.c
> +++ b/arch/arm/kernel/ftrace.c
> @@ -29,11 +29,6 @@
>  #endif
>  
>  #ifdef CONFIG_DYNAMIC_FTRACE
> -#ifdef CONFIG_OLD_MCOUNT
> -#define OLD_MCOUNT_ADDR  ((unsigned long) mcount)
> -#define OLD_FTRACE_ADDR ((unsigned long) ftrace_caller_old)
> -
> -#define  OLD_NOP 0xe1a0  /* mov r0, r0 */
>  
>  static int __ftrace_modify_code(void *data)
>  {
> @@ -51,6 +46,12 @@ void arch_ftrace_update_code(int command)
>   stop_machine(__ftrace_modify_code, , NULL);
>  }
>  
> +#ifdef CONFIG_OLD_MCOUNT
> +#define OLD_MCOUNT_ADDR  ((unsigned long) mcount)
> +#define OLD_FTRACE_ADDR ((unsigned long) ftrace_caller_old)
> +
> +#define  OLD_NOP 0xe1a0  /* mov r0, r0 */
> +
>  static unsigned long ftrace_nop_replace(struct dyn_ftrace *rec)

This chunk above fixes the problem which Nicolai pointed out earlier.  Since
it's a bug fix for the current dynamic ftrace code, it should really be
submitted as a separate patch.  Feel free to use some variant of this
commit message which I wrote when I ran into this yesterday:

 ARM: ftrace: fix dynamic ftrace with DEBUG_RODATA and !FRAME_POINTER
 
 The support for dynamic ftrace with CONFIG_DEBUG_RODATA involves
 overriding the weak arch_ftrace_update_code() with a variant which makes
 the kernel text writable around the patching.
 
 This override was however added under the CONFIG_OLD_MCOUNT ifdef, and
 CONFIG_OLD_MCOUNT is only enabled if frame pointers are enabled.
 
 This leads to non-functional dynamic ftrace (ftrace triggers a

Re: DIfferent BogoMIPS value after bbaa06702719 ("clocksource/drivers/arm_global_timer: Register delay timer")

2017-03-02 Thread Rabin Vincent

On Thu, Mar 02, 2017 at 02:36:23PM +0100, Rafał Miłecki wrote:
> I just updated kernel on my SmartRG SR400ac (bcm4708-smartrg-sr400ac.dts) from
> 4.4 to 4.9 and noticed that this part of the log:
> [0.020820] Calibrating delay loop... 1594.16 BogoMIPS (lpj=7970816)
> (...)
> [0.190806] Brought up 2 CPUs
> [0.200022] SMP: Total of 2 processors activated (3188.32 BogoMIPS).
> 
> became:
> [0.027082] Calibrating delay loop (skipped), value calculated using timer 
> frequency.. 800.00 BogoMIPS (lpj=400)
> (...)
> [0.078858] Brought up 2 CPUs
> [0.088254] SMP: Total of 2 processors activated (1600.00 BogoMIPS).
> 
> This is caused by commit bbaa06702719 ("clocksource/drivers/arm_global_timer:
> Register delay timer").
> 
> Is this something expected? Or should I be worried about this? I'm aware it's
> just BogoMIPS, but I still prefer to ask :)

Yes, it's expected.

No, you shouldn't be worried.

It's even explained on Wikipedia:
https://en.wikipedia.org/wiki/BogoMips#Timer-based_delays

You can run the udelay test (CONFIG_TEST_UDELAY) if you want to make
sure that your delay loops are fine.

Re: DIfferent BogoMIPS value after bbaa06702719 ("clocksource/drivers/arm_global_timer: Register delay timer")

2017-03-02 Thread Rabin Vincent

On Thu, Mar 02, 2017 at 02:36:23PM +0100, Rafał Miłecki wrote:
> I just updated kernel on my SmartRG SR400ac (bcm4708-smartrg-sr400ac.dts) from
> 4.4 to 4.9 and noticed that this part of the log:
> [0.020820] Calibrating delay loop... 1594.16 BogoMIPS (lpj=7970816)
> (...)
> [0.190806] Brought up 2 CPUs
> [0.200022] SMP: Total of 2 processors activated (3188.32 BogoMIPS).
> 
> became:
> [0.027082] Calibrating delay loop (skipped), value calculated using timer 
> frequency.. 800.00 BogoMIPS (lpj=400)
> (...)
> [0.078858] Brought up 2 CPUs
> [0.088254] SMP: Total of 2 processors activated (1600.00 BogoMIPS).
> 
> This is caused by commit bbaa06702719 ("clocksource/drivers/arm_global_timer:
> Register delay timer").
> 
> Is this something expected? Or should I be worried about this? I'm aware it's
> just BogoMIPS, but I still prefer to ask :)

Yes, it's expected.

No, you shouldn't be worried.

It's even explained on Wikipedia:
https://en.wikipedia.org/wiki/BogoMips#Timer-based_delays

You can run the udelay test (CONFIG_TEST_UDELAY) if you want to make
sure that your delay loops are fine.

Re: [PATCH] printk: fix printk.devkmsg sysctl

2017-01-30 Thread Rabin Vincent

On Fri, Jan 27, 2017 at 07:19:30PM +0100, Borislav Petkov wrote:
> On Fri, Jan 27, 2017 at 04:42:30PM +0100, Rabin Vincent wrote:
> > proc_dostring() eats the '\n' and stops
> 
> Not a problem, see diff below.

Would it be possible for you to please submit it as a patch yourself so
that this gets fixed in the way you like?  Thank you.

Re: [PATCH] printk: fix printk.devkmsg sysctl

2017-01-30 Thread Rabin Vincent

On Fri, Jan 27, 2017 at 07:19:30PM +0100, Borislav Petkov wrote:
> On Fri, Jan 27, 2017 at 04:42:30PM +0100, Rabin Vincent wrote:
> > proc_dostring() eats the '\n' and stops
> 
> Not a problem, see diff below.

Would it be possible for you to please submit it as a patch yourself so
that this gets fixed in the way you like?  Thank you.

Re: [PATCH] printk: fix printk.devkmsg sysctl

2017-01-27 Thread Rabin Vincent

On Fri, Jan 27, 2017 at 04:01:41PM +0100, Borislav Petkov wrote:
> On Fri, Jan 27, 2017 at 02:11:46PM +0100, Rabin Vincent wrote:
> > @@ -177,7 +177,7 @@ int devkmsg_sysctl_set_loglvl(struct ctl_table *table, 
> > int write,
> >  * Do not accept an unknown string OR a known string with
> >  * trailing crap...
> >  */
> > -   if (err < 0 || (err + 1 != *lenp)) {
> 
> Grr, that's that damn '\n'
> 
> echo off > /proc/sys/kernel/printk_devkmsg
> 
> works, of course.
> 
> Ok, I don't want to relax the strncmp() above and would still like to
> return the exact length compared.
> 
> So please change the check above to allow the following inputs:
> 
>   
> 
> or
> 
>   \n
> 
> I.e., a trailing, *optional*, '\n' is allowed.

proc_dostring() eats the '\n' and stops after that so we never see it or
what comes after, so we need an strlen():

8<
>From ec7e02cdf5b6c9fb1492670928bb7ea4386ca87d Mon Sep 17 00:00:00 2001
From: Rabin Vincent <rab...@axis.com>
Date: Fri, 27 Jan 2017 14:03:07 +0100
Subject: [PATCHv2] printk: fix printk.devkmsg sysctl

The comment says that it doesn't want to accept trailing crap but the
code does allow it:

 # echo -n offX > /proc/sys/kernel/printk_devkmsg
 #

while at the same time it rejects legitimate uses:

 # echo -n off > /proc/sys/kernel/printk_devkmsg
 -sh: echo: write error: Invalid argument

Fix it.

Before this patch:

 # cat /proc/sys/kernel/printk_devkmsg
 ratelimit
 # echo off > /proc/sys/kernel/printk_devkmsg
 # sysctl -w kernel.printk_devkmsg=off
 sysctl: short write
 # echo -n off > /proc/sys/kernel/printk_devkmsg
 -sh: echo: write error: Invalid argument
 # echo -n offX > /proc/sys/kernel/printk_devkmsg
 #
 # printf "off\nX" >/proc/sys/kernel/printk_devkmsg
 -sh: printf: write error: Invalid argument

After this patch:

 # cat /proc/sys/kernel/printk_devkmsg
 ratelimit
 # echo off > /proc/sys/kernel/printk_devkmsg
 # sysctl -w kernel.printk_devkmsg=off
 kernel.printk_devkmsg = off
 # echo -n off > /proc/sys/kernel/printk_devkmsg
 # echo -n offX > /proc/sys/kernel/printk_devkmsg
 -sh: echo: write error: Invalid argument
 # printf "off\nX" >/proc/sys/kernel/printk_devkmsg
 -sh: printf: write error: Invalid argument

Fixes: 750afe7babd117d ("printk: add kernel parameter to control writes to 
/dev/kmsg")
Signed-off-by: Rabin Vincent <rab...@axis.com>
---
 kernel/printk/printk.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
index 8b26964..935ed71 100644
--- a/kernel/printk/printk.c
+++ b/kernel/printk/printk.c
@@ -177,7 +177,8 @@ int devkmsg_sysctl_set_loglvl(struct ctl_table *table, int 
write,
 * Do not accept an unknown string OR a known string with
 * trailing crap...
 */
-   if (err < 0 || (err + 1 != *lenp)) {
+   if (err < 0 || (err != *lenp && err + 1 != *lenp) ||
+   err != strlen(devkmsg_log_str)) {
 
/* ... and restore old setting. */
devkmsg_log = old;
-- 
2.1.4

Re: [PATCH] printk: fix printk.devkmsg sysctl

2017-01-27 Thread Rabin Vincent

On Fri, Jan 27, 2017 at 04:01:41PM +0100, Borislav Petkov wrote:
> On Fri, Jan 27, 2017 at 02:11:46PM +0100, Rabin Vincent wrote:
> > @@ -177,7 +177,7 @@ int devkmsg_sysctl_set_loglvl(struct ctl_table *table, 
> > int write,
> >  * Do not accept an unknown string OR a known string with
> >  * trailing crap...
> >  */
> > -   if (err < 0 || (err + 1 != *lenp)) {
> 
> Grr, that's that damn '\n'
> 
> echo off > /proc/sys/kernel/printk_devkmsg
> 
> works, of course.
> 
> Ok, I don't want to relax the strncmp() above and would still like to
> return the exact length compared.
> 
> So please change the check above to allow the following inputs:
> 
>   
> 
> or
> 
>   \n
> 
> I.e., a trailing, *optional*, '\n' is allowed.

proc_dostring() eats the '\n' and stops after that so we never see it or
what comes after, so we need an strlen():

8<
>From ec7e02cdf5b6c9fb1492670928bb7ea4386ca87d Mon Sep 17 00:00:00 2001
From: Rabin Vincent 
Date: Fri, 27 Jan 2017 14:03:07 +0100
Subject: [PATCHv2] printk: fix printk.devkmsg sysctl

The comment says that it doesn't want to accept trailing crap but the
code does allow it:

 # echo -n offX > /proc/sys/kernel/printk_devkmsg
 #

while at the same time it rejects legitimate uses:

 # echo -n off > /proc/sys/kernel/printk_devkmsg
 -sh: echo: write error: Invalid argument

Fix it.

Before this patch:

 # cat /proc/sys/kernel/printk_devkmsg
 ratelimit
 # echo off > /proc/sys/kernel/printk_devkmsg
 # sysctl -w kernel.printk_devkmsg=off
 sysctl: short write
 # echo -n off > /proc/sys/kernel/printk_devkmsg
 -sh: echo: write error: Invalid argument
 # echo -n offX > /proc/sys/kernel/printk_devkmsg
 #
 # printf "off\nX" >/proc/sys/kernel/printk_devkmsg
 -sh: printf: write error: Invalid argument

After this patch:

 # cat /proc/sys/kernel/printk_devkmsg
 ratelimit
 # echo off > /proc/sys/kernel/printk_devkmsg
 # sysctl -w kernel.printk_devkmsg=off
 kernel.printk_devkmsg = off
 # echo -n off > /proc/sys/kernel/printk_devkmsg
 # echo -n offX > /proc/sys/kernel/printk_devkmsg
 -sh: echo: write error: Invalid argument
 # printf "off\nX" >/proc/sys/kernel/printk_devkmsg
 -sh: printf: write error: Invalid argument

Fixes: 750afe7babd117d ("printk: add kernel parameter to control writes to 
/dev/kmsg")
Signed-off-by: Rabin Vincent 
---
 kernel/printk/printk.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
index 8b26964..935ed71 100644
--- a/kernel/printk/printk.c
+++ b/kernel/printk/printk.c
@@ -177,7 +177,8 @@ int devkmsg_sysctl_set_loglvl(struct ctl_table *table, int 
write,
 * Do not accept an unknown string OR a known string with
 * trailing crap...
 */
-   if (err < 0 || (err + 1 != *lenp)) {
+   if (err < 0 || (err != *lenp && err + 1 != *lenp) ||
+   err != strlen(devkmsg_log_str)) {
 
/* ... and restore old setting. */
devkmsg_log = old;
-- 
2.1.4

[PATCH] printk: fix printk.devkmsg sysctl

2017-01-27 Thread Rabin Vincent

From: Rabin Vincent <rab...@axis.com>

The comment says that it doesn't want to accept trailing crap but that's
just what it allows:

 # echo -n offX > /proc/sys/kernel/printk_devkmsg
 #

while at the same time it rejects legitimate uses:

 # echo -n off > /proc/sys/kernel/printk_devkmsg
 -sh: echo: write error: Invalid argument

Fix it.

Before this patch:

 # cat /proc/sys/kernel/printk_devkmsg
 ratelimit
 # echo off > /proc/sys/kernel/printk_devkmsg
 # sysctl -w kernel.printk_devkmsg=off
 sysctl: short write
 # echo -n off > /proc/sys/kernel/printk_devkmsg
 -sh: echo: write error: Invalid argument
 # echo -n offX > /proc/sys/kernel/printk_devkmsg
 #
 # printf "off\nX" >/proc/sys/kernel/printk_devkmsg
 -sh: printf: write error: Invalid argument

After this patch:

 # cat /proc/sys/kernel/printk_devkmsg
 ratelimit
 # echo off > /proc/sys/kernel/printk_devkmsg
 # sysctl -w kernel.printk_devkmsg=off
 kernel.printk_devkmsg = off
 # echo -n off > /proc/sys/kernel/printk_devkmsg
 # echo -n offX > /proc/sys/kernel/printk_devkmsg
 -sh: echo: write error: Invalid argument
 # printf "off\nX" >/proc/sys/kernel/printk_devkmsg
 #

As a side effect, the "off\nX" case is not rejected anymore, but that is how
the other sysctl files behave:

 # cat /proc/sys/kernel/printk_ratelimit
 0
 # printf "5\nX" >/proc/sys/kernel/printk_ratelimit
 # cat /proc/sys/kernel/printk_ratelimit
 5

Fixes: 750afe7babd117d ("printk: add kernel parameter to control writes to 
/dev/kmsg")
Signed-off-by: Rabin Vincent <rab...@axis.com>
---
 kernel/printk/printk.c | 14 +++---
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
index 8b26964..032b9e0 100644
--- a/kernel/printk/printk.c
+++ b/kernel/printk/printk.c
@@ -107,15 +107,15 @@ static int __control_devkmsg(char *str)
if (!str)
return -EINVAL;
 
-   if (!strncmp(str, "on", 2)) {
+   if (!strcmp(str, "on")) {
devkmsg_log = DEVKMSG_LOG_MASK_ON;
-   return 2;
-   } else if (!strncmp(str, "off", 3)) {
+   return 0;
+   } else if (!strcmp(str, "off")) {
devkmsg_log = DEVKMSG_LOG_MASK_OFF;
-   return 3;
-   } else if (!strncmp(str, "ratelimit", 9)) {
+   return 0;
+   } else if (!strcmp(str, "ratelimit")) {
devkmsg_log = DEVKMSG_LOG_MASK_DEFAULT;
-   return 9;
+   return 0;
}
return -EINVAL;
 }
@@ -177,7 +177,7 @@ int devkmsg_sysctl_set_loglvl(struct ctl_table *table, int 
write,
 * Do not accept an unknown string OR a known string with
 * trailing crap...
 */
-   if (err < 0 || (err + 1 != *lenp)) {
+   if (err < 0) {
 
/* ... and restore old setting. */
devkmsg_log = old;
-- 
2.1.4

[PATCH] printk: fix printk.devkmsg sysctl

2017-01-27 Thread Rabin Vincent

From: Rabin Vincent 

The comment says that it doesn't want to accept trailing crap but that's
just what it allows:

 # echo -n offX > /proc/sys/kernel/printk_devkmsg
 #

while at the same time it rejects legitimate uses:

 # echo -n off > /proc/sys/kernel/printk_devkmsg
 -sh: echo: write error: Invalid argument

Fix it.

Before this patch:

 # cat /proc/sys/kernel/printk_devkmsg
 ratelimit
 # echo off > /proc/sys/kernel/printk_devkmsg
 # sysctl -w kernel.printk_devkmsg=off
 sysctl: short write
 # echo -n off > /proc/sys/kernel/printk_devkmsg
 -sh: echo: write error: Invalid argument
 # echo -n offX > /proc/sys/kernel/printk_devkmsg
 #
 # printf "off\nX" >/proc/sys/kernel/printk_devkmsg
 -sh: printf: write error: Invalid argument

After this patch:

 # cat /proc/sys/kernel/printk_devkmsg
 ratelimit
 # echo off > /proc/sys/kernel/printk_devkmsg
 # sysctl -w kernel.printk_devkmsg=off
 kernel.printk_devkmsg = off
 # echo -n off > /proc/sys/kernel/printk_devkmsg
 # echo -n offX > /proc/sys/kernel/printk_devkmsg
 -sh: echo: write error: Invalid argument
 # printf "off\nX" >/proc/sys/kernel/printk_devkmsg
 #

As a side effect, the "off\nX" case is not rejected anymore, but that is how
the other sysctl files behave:

 # cat /proc/sys/kernel/printk_ratelimit
 0
 # printf "5\nX" >/proc/sys/kernel/printk_ratelimit
 # cat /proc/sys/kernel/printk_ratelimit
 5

Fixes: 750afe7babd117d ("printk: add kernel parameter to control writes to 
/dev/kmsg")
Signed-off-by: Rabin Vincent 
---
 kernel/printk/printk.c | 14 +++---
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
index 8b26964..032b9e0 100644
--- a/kernel/printk/printk.c
+++ b/kernel/printk/printk.c
@@ -107,15 +107,15 @@ static int __control_devkmsg(char *str)
if (!str)
return -EINVAL;
 
-   if (!strncmp(str, "on", 2)) {
+   if (!strcmp(str, "on")) {
devkmsg_log = DEVKMSG_LOG_MASK_ON;
-   return 2;
-   } else if (!strncmp(str, "off", 3)) {
+   return 0;
+   } else if (!strcmp(str, "off")) {
devkmsg_log = DEVKMSG_LOG_MASK_OFF;
-   return 3;
-   } else if (!strncmp(str, "ratelimit", 9)) {
+   return 0;
+   } else if (!strcmp(str, "ratelimit")) {
devkmsg_log = DEVKMSG_LOG_MASK_DEFAULT;
-   return 9;
+   return 0;
}
return -EINVAL;
 }
@@ -177,7 +177,7 @@ int devkmsg_sysctl_set_loglvl(struct ctl_table *table, int 
write,
 * Do not accept an unknown string OR a known string with
 * trailing crap...
 */
-   if (err < 0 || (err + 1 != *lenp)) {
+   if (err < 0) {
 
/* ... and restore old setting. */
devkmsg_log = old;
-- 
2.1.4

Re: Synopsys Ethernet QoS Driver

2016-11-19 Thread Rabin Vincent

On Fri, Nov 18, 2016 at 02:20:27PM +, Joao Pinto wrote:
> For now we are interesting in improving the synopsys QoS driver under
> /nect/ethernet/synopsys. For now the driver structure consists of a single 
> file
> called dwc_eth_qos.c, containing synopsys ethernet qos common ops and platform
> related stuff.
> 
> Our strategy would be:
> 
> a) Implement a platform glue driver (dwc_eth_qos_pltfm.c)
> b) Implement a pci glue driver (dwc_eth_qos_pci.c)
> c) Implement a "core driver" (dwc_eth_qos.c) that would only have Ethernet QoS
> related stuff to be reused by the platform / pci drivers
> d) Add a set of features to the "core driver" that we have available 
> internally

Note that there are actually two drivers in mainline for this hardware:

 drivers/net/ethernet/synopsis/
 drivers/net/ethernet/stmicro/stmmac/

(See http://lists.openwall.net/netdev/2016/02/29/127)

The former only supports 4.x of the hardware.

The later supports 4.x and 3.x and already has a platform glue driver
with support for several platforms, a PCI glue driver, and a core driver
with several features not present in the former (for example: TX/RX
interrupt coalescing, EEE, PTP).

Have you evaluated both drivers?  Why have you decided to work on the
former rather than the latter?

Re: Synopsys Ethernet QoS Driver

2016-11-19 Thread Rabin Vincent

On Fri, Nov 18, 2016 at 02:20:27PM +, Joao Pinto wrote:
> For now we are interesting in improving the synopsys QoS driver under
> /nect/ethernet/synopsys. For now the driver structure consists of a single 
> file
> called dwc_eth_qos.c, containing synopsys ethernet qos common ops and platform
> related stuff.
> 
> Our strategy would be:
> 
> a) Implement a platform glue driver (dwc_eth_qos_pltfm.c)
> b) Implement a pci glue driver (dwc_eth_qos_pci.c)
> c) Implement a "core driver" (dwc_eth_qos.c) that would only have Ethernet QoS
> related stuff to be reused by the platform / pci drivers
> d) Add a set of features to the "core driver" that we have available 
> internally

Note that there are actually two drivers in mainline for this hardware:

 drivers/net/ethernet/synopsis/
 drivers/net/ethernet/stmicro/stmmac/

(See http://lists.openwall.net/netdev/2016/02/29/127)

The former only supports 4.x of the hardware.

The later supports 4.x and 3.x and already has a platform glue driver
with support for several platforms, a PCI glue driver, and a core driver
with several features not present in the former (for example: TX/RX
interrupt coalescing, EEE, PTP).

Have you evaluated both drivers?  Why have you decided to work on the
former rather than the latter?

Re: [PATCH] ARM: ftrace: fix syscall name matching

2016-11-15 Thread Rabin Vincent

On Mon, Nov 14, 2016 at 10:40:08AM -0500, Steven Rostedt wrote:
> On Mon, 14 Nov 2016 13:40:17 +
> Russell King - ARM Linux <li...@armlinux.org.uk> wrote:
> > On Mon, Nov 14, 2016 at 02:03:45PM +0100, Rabin Vincent wrote:
> > > +static inline bool arch_syscall_match_sym_name(const char *sym,
> > > +const char *name)
> > > +{
> > > + /* Skip sys_ */
> > > + sym += 4;
> > > + name += 4;  
> > 
> > Is this really safe?  What guarantees that we can wind forward four
> > bytes here?  If it's always safe, it needs a better comment than just
> > two words.
> 
> I believe it is, but a comment would do well.

I ended up just getting rid of the skip and comparing the whole name
instead.  I've sent a v2.

Re: [PATCH] ARM: ftrace: fix syscall name matching

2016-11-15 Thread Rabin Vincent

On Mon, Nov 14, 2016 at 10:40:08AM -0500, Steven Rostedt wrote:
> On Mon, 14 Nov 2016 13:40:17 +
> Russell King - ARM Linux  wrote:
> > On Mon, Nov 14, 2016 at 02:03:45PM +0100, Rabin Vincent wrote:
> > > +static inline bool arch_syscall_match_sym_name(const char *sym,
> > > +const char *name)
> > > +{
> > > + /* Skip sys_ */
> > > + sym += 4;
> > > + name += 4;  
> > 
> > Is this really safe?  What guarantees that we can wind forward four
> > bytes here?  If it's always safe, it needs a better comment than just
> > two words.
> 
> I believe it is, but a comment would do well.

I ended up just getting rid of the skip and comparing the whole name
instead.  I've sent a v2.

[tip:perf/core] perf callchain: Fixup help/config for no-unwinding

2016-11-15 Thread tip-bot for Rabin Vincent

Commit-ID:  c56cb33b56c13493eeb95612f80e4dd6e35cd109
Gitweb: http://git.kernel.org/tip/c56cb33b56c13493eeb95612f80e4dd6e35cd109
Author: Rabin Vincent <rab...@axis.com>
AuthorDate: Wed, 10 Aug 2016 15:52:28 +0200
Committer:  Arnaldo Carvalho de Melo <a...@redhat.com>
CommitDate: Mon, 7 Nov 2016 22:13:47 -0300

perf callchain: Fixup help/config for no-unwinding

Since 841e3558b2d ("perf callchain: Recording 'dwarf' callchains do not
need DWARF unwinding support"), --call-graph dwarf is allowed in 'perf
record' even without unwind support.  A couple of other places don't
reflect this yet though: the help text should list dwarf as a valid
record mode and the dump_size config should be respected too.

Signed-off-by: Rabin Vincent <rab...@axis.com>
Cc: He Kuang <heku...@huawei.com>
Fixes: 841e3558b2de ("perf callchain: Recording 'dwarf' callchains do not need 
DWARF unwinding support")
Link: 
http://lkml.kernel.org/r/1470837148-7642-1-git-send-email-rabin.vinc...@axis.com
Signed-off-by: Arnaldo Carvalho de Melo <a...@redhat.com>
---
 tools/perf/util/callchain.c | 2 --
 tools/perf/util/callchain.h | 4 
 2 files changed, 6 deletions(-)

diff --git a/tools/perf/util/callchain.c b/tools/perf/util/callchain.c
index 07fd30b..ae58b49 100644
--- a/tools/perf/util/callchain.c
+++ b/tools/perf/util/callchain.c
@@ -193,7 +193,6 @@ int perf_callchain_config(const char *var, const char 
*value)
 
if (!strcmp(var, "record-mode"))
return parse_callchain_record_opt(value, _param);
-#ifdef HAVE_DWARF_UNWIND_SUPPORT
if (!strcmp(var, "dump-size")) {
unsigned long size = 0;
int ret;
@@ -203,7 +202,6 @@ int perf_callchain_config(const char *var, const char 
*value)
 
return ret;
}
-#endif
if (!strcmp(var, "print-type"))
return parse_callchain_mode(value);
if (!strcmp(var, "order"))
diff --git a/tools/perf/util/callchain.h b/tools/perf/util/callchain.h
index 13e7554..47cfd10 100644
--- a/tools/perf/util/callchain.h
+++ b/tools/perf/util/callchain.h
@@ -11,11 +11,7 @@
 
 #define CALLCHAIN_HELP "setup and enables call-graph (stack 
chain/backtrace):\n\n"
 
-#ifdef HAVE_DWARF_UNWIND_SUPPORT
 # define RECORD_MODE_HELP  HELP_PAD "record_mode:\tcall graph recording mode 
(fp|dwarf|lbr)\n"
-#else
-# define RECORD_MODE_HELP  HELP_PAD "record_mode:\tcall graph recording mode 
(fp|lbr)\n"
-#endif
 
 #define RECORD_SIZE_HELP   \
HELP_PAD "record_size:\tif record_mode is 'dwarf', max size of stack 
recording ()\n" \

[tip:perf/core] perf callchain: Fixup help/config for no-unwinding

2016-11-15 Thread tip-bot for Rabin Vincent

Commit-ID:  c56cb33b56c13493eeb95612f80e4dd6e35cd109
Gitweb: http://git.kernel.org/tip/c56cb33b56c13493eeb95612f80e4dd6e35cd109
Author: Rabin Vincent 
AuthorDate: Wed, 10 Aug 2016 15:52:28 +0200
Committer:  Arnaldo Carvalho de Melo 
CommitDate: Mon, 7 Nov 2016 22:13:47 -0300

perf callchain: Fixup help/config for no-unwinding

Since 841e3558b2d ("perf callchain: Recording 'dwarf' callchains do not
need DWARF unwinding support"), --call-graph dwarf is allowed in 'perf
record' even without unwind support.  A couple of other places don't
reflect this yet though: the help text should list dwarf as a valid
record mode and the dump_size config should be respected too.

Signed-off-by: Rabin Vincent 
Cc: He Kuang 
Fixes: 841e3558b2de ("perf callchain: Recording 'dwarf' callchains do not need 
DWARF unwinding support")
Link: 
http://lkml.kernel.org/r/1470837148-7642-1-git-send-email-rabin.vinc...@axis.com
Signed-off-by: Arnaldo Carvalho de Melo 
---
 tools/perf/util/callchain.c | 2 --
 tools/perf/util/callchain.h | 4 
 2 files changed, 6 deletions(-)

diff --git a/tools/perf/util/callchain.c b/tools/perf/util/callchain.c
index 07fd30b..ae58b49 100644
--- a/tools/perf/util/callchain.c
+++ b/tools/perf/util/callchain.c
@@ -193,7 +193,6 @@ int perf_callchain_config(const char *var, const char 
*value)
 
if (!strcmp(var, "record-mode"))
return parse_callchain_record_opt(value, _param);
-#ifdef HAVE_DWARF_UNWIND_SUPPORT
if (!strcmp(var, "dump-size")) {
unsigned long size = 0;
int ret;
@@ -203,7 +202,6 @@ int perf_callchain_config(const char *var, const char 
*value)
 
return ret;
}
-#endif
if (!strcmp(var, "print-type"))
return parse_callchain_mode(value);
if (!strcmp(var, "order"))
diff --git a/tools/perf/util/callchain.h b/tools/perf/util/callchain.h
index 13e7554..47cfd10 100644
--- a/tools/perf/util/callchain.h
+++ b/tools/perf/util/callchain.h
@@ -11,11 +11,7 @@
 
 #define CALLCHAIN_HELP "setup and enables call-graph (stack 
chain/backtrace):\n\n"
 
-#ifdef HAVE_DWARF_UNWIND_SUPPORT
 # define RECORD_MODE_HELP  HELP_PAD "record_mode:\tcall graph recording mode 
(fp|dwarf|lbr)\n"
-#else
-# define RECORD_MODE_HELP  HELP_PAD "record_mode:\tcall graph recording mode 
(fp|lbr)\n"
-#endif
 
 #define RECORD_SIZE_HELP   \
HELP_PAD "record_size:\tif record_mode is 'dwarf', max size of stack 
recording ()\n" \

[PATCHv2] ARM: ftrace: fix syscall name matching

2016-11-15 Thread Rabin Vincent

From: Rabin Vincent <rab...@axis.com>

ARM has a few system calls (most notably mmap) for which the names of
the functions which are referenced in the syscall table do not match the
names of the syscall tracepoints.  As a consequence of this, these
tracepoints are not made available.  Implement
arch_syscall_match_sym_name to fix this and allow tracing even these
system calls.

Signed-off-by: Rabin Vincent <rab...@axis.com>
---
v2: get rid of unsafe-looking pointer adjustment

 arch/arm/include/asm/ftrace.h | 18 ++
 1 file changed, 18 insertions(+)

diff --git a/arch/arm/include/asm/ftrace.h b/arch/arm/include/asm/ftrace.h
index bfe2a2f..22b7311 100644
--- a/arch/arm/include/asm/ftrace.h
+++ b/arch/arm/include/asm/ftrace.h
@@ -54,6 +54,24 @@ static inline void *return_address(unsigned int level)
 
 #define ftrace_return_address(n) return_address(n)
 
+#define ARCH_HAS_SYSCALL_MATCH_SYM_NAME
+
+static inline bool arch_syscall_match_sym_name(const char *sym,
+  const char *name)
+{
+   if (!strcmp(sym, "sys_mmap2"))
+   sym = "sys_mmap_pgoff";
+   else if (!strcmp(sym, "sys_statfs64_wrapper"))
+   sym = "sys_statfs64";
+   else if (!strcmp(sym, "sys_fstatfs64_wrapper"))
+   sym = "sys_fstatfs64";
+   else if (!strcmp(sym, "sys_arm_fadvise64_64"))
+   sym = "sys_fadvise64_64";
+
+   /* Ignore case since sym may start with "SyS" instead of "sys" */
+   return !strcasecmp(sym, name);
+}
+
 #endif /* ifndef __ASSEMBLY__ */
 
 #endif /* _ASM_ARM_FTRACE */
-- 
2.1.4

[PATCHv2] ARM: ftrace: fix syscall name matching

2016-11-15 Thread Rabin Vincent

From: Rabin Vincent 

ARM has a few system calls (most notably mmap) for which the names of
the functions which are referenced in the syscall table do not match the
names of the syscall tracepoints.  As a consequence of this, these
tracepoints are not made available.  Implement
arch_syscall_match_sym_name to fix this and allow tracing even these
system calls.

Signed-off-by: Rabin Vincent 
---
v2: get rid of unsafe-looking pointer adjustment

 arch/arm/include/asm/ftrace.h | 18 ++
 1 file changed, 18 insertions(+)

diff --git a/arch/arm/include/asm/ftrace.h b/arch/arm/include/asm/ftrace.h
index bfe2a2f..22b7311 100644
--- a/arch/arm/include/asm/ftrace.h
+++ b/arch/arm/include/asm/ftrace.h
@@ -54,6 +54,24 @@ static inline void *return_address(unsigned int level)
 
 #define ftrace_return_address(n) return_address(n)
 
+#define ARCH_HAS_SYSCALL_MATCH_SYM_NAME
+
+static inline bool arch_syscall_match_sym_name(const char *sym,
+  const char *name)
+{
+   if (!strcmp(sym, "sys_mmap2"))
+   sym = "sys_mmap_pgoff";
+   else if (!strcmp(sym, "sys_statfs64_wrapper"))
+   sym = "sys_statfs64";
+   else if (!strcmp(sym, "sys_fstatfs64_wrapper"))
+   sym = "sys_fstatfs64";
+   else if (!strcmp(sym, "sys_arm_fadvise64_64"))
+   sym = "sys_fadvise64_64";
+
+   /* Ignore case since sym may start with "SyS" instead of "sys" */
+   return !strcasecmp(sym, name);
+}
+
 #endif /* ifndef __ASSEMBLY__ */
 
 #endif /* _ASM_ARM_FTRACE */
-- 
2.1.4

[PATCH] ARM: ftrace: fix syscall name matching

2016-11-14 Thread Rabin Vincent

From: Rabin Vincent <rab...@axis.com>

ARM has a few system calls (most notably mmap) for which the names of
the functions which are referenced in the syscall table do not match the
names of the syscall tracepoints.  As a consequence of this, these
tracepoints are not made available.  Implement
arch_syscall_match_sym_name to fix this and allow tracing even these
system calls.

Signed-off-by: Rabin Vincent <rab...@axis.com>
---
 arch/arm/include/asm/ftrace.h | 21 +
 1 file changed, 21 insertions(+)

diff --git a/arch/arm/include/asm/ftrace.h b/arch/arm/include/asm/ftrace.h
index bfe2a2f..8467909 100644
--- a/arch/arm/include/asm/ftrace.h
+++ b/arch/arm/include/asm/ftrace.h
@@ -54,6 +54,27 @@ static inline void *return_address(unsigned int level)
 
 #define ftrace_return_address(n) return_address(n)
 
+#define ARCH_HAS_SYSCALL_MATCH_SYM_NAME
+
+static inline bool arch_syscall_match_sym_name(const char *sym,
+  const char *name)
+{
+   /* Skip sys_ */
+   sym += 4;
+   name += 4;
+
+   if (!strcmp(sym, "mmap2"))
+   sym = "mmap_pgoff";
+   else if (!strcmp(sym, "statfs64_wrapper"))
+   sym = "statfs64";
+   else if (!strcmp(sym, "fstatfs64_wrapper"))
+   sym = "fstatfs64";
+   else if (!strcmp(sym, "arm_fadvise64_64"))
+   sym = "fadvise64_64";
+
+   return !strcmp(sym, name);
+}
+
 #endif /* ifndef __ASSEMBLY__ */
 
 #endif /* _ASM_ARM_FTRACE */
-- 
2.1.4

[PATCH] ARM: ftrace: fix syscall name matching

2016-11-14 Thread Rabin Vincent

From: Rabin Vincent 

ARM has a few system calls (most notably mmap) for which the names of
the functions which are referenced in the syscall table do not match the
names of the syscall tracepoints.  As a consequence of this, these
tracepoints are not made available.  Implement
arch_syscall_match_sym_name to fix this and allow tracing even these
system calls.

Signed-off-by: Rabin Vincent 
---
 arch/arm/include/asm/ftrace.h | 21 +
 1 file changed, 21 insertions(+)

diff --git a/arch/arm/include/asm/ftrace.h b/arch/arm/include/asm/ftrace.h
index bfe2a2f..8467909 100644
--- a/arch/arm/include/asm/ftrace.h
+++ b/arch/arm/include/asm/ftrace.h
@@ -54,6 +54,27 @@ static inline void *return_address(unsigned int level)
 
 #define ftrace_return_address(n) return_address(n)
 
+#define ARCH_HAS_SYSCALL_MATCH_SYM_NAME
+
+static inline bool arch_syscall_match_sym_name(const char *sym,
+  const char *name)
+{
+   /* Skip sys_ */
+   sym += 4;
+   name += 4;
+
+   if (!strcmp(sym, "mmap2"))
+   sym = "mmap_pgoff";
+   else if (!strcmp(sym, "statfs64_wrapper"))
+   sym = "statfs64";
+   else if (!strcmp(sym, "fstatfs64_wrapper"))
+   sym = "fstatfs64";
+   else if (!strcmp(sym, "arm_fadvise64_64"))
+   sym = "fadvise64_64";
+
+   return !strcmp(sym, name);
+}
+
 #endif /* ifndef __ASSEMBLY__ */
 
 #endif /* _ASM_ARM_FTRACE */
-- 
2.1.4

[PATCH] dm crypt: fix crash on exit

2016-09-21 Thread Rabin Vincent

From: Rabin Vincent <rab...@axis.com>

As the documentation for kthread_stop() says, "if threadfn() may call
do_exit() itself, the caller must ensure task_struct can't go away".
dm-crypt does not ensure this and therefore crashes when crypt_dtr()
calls kthread_stop().  The crash is trivially reproducible by adding a
delay before the call to kthread_stop() and just opening and closing a
dm-crypt device.

 general protection fault:  [#1] PREEMPT SMP
 CPU: 0 PID: 533 Comm: cryptsetup Not tainted 4.8.0-rc7+ #7
 task: 88003bd0df40 task.stack: 8800375b4000
 RIP: 0010: kthread_stop+0x52/0x300
 Call Trace:
  crypt_dtr+0x77/0x120
  dm_table_destroy+0x6f/0x120
  __dm_destroy+0x130/0x250
  dm_destroy+0x13/0x20
  dev_remove+0xe6/0x120
  ? dev_suspend+0x250/0x250
  ctl_ioctl+0x1fc/0x530
  ? __lock_acquire+0x24f/0x1b10
  dm_ctl_ioctl+0x13/0x20
  do_vfs_ioctl+0x91/0x6a0
  ? fput+0xe/0x10
  ? entry_SYSCALL_64_fastpath+0x5/0xbd
  ? trace_hardirqs_on_caller+0x151/0x1e0
  SyS_ioctl+0x41/0x70
  entry_SYSCALL_64_fastpath+0x1f/0xbd

This problem was introduced by bcbd94ff481e ("dm crypt: fix a possible
hang due to race condition on exit").

Looking at the description of that patch (excerpted below), it seems
like the problem it addresses can be solved by just using
set_current_state instead of __set_current_state, since we obviously
need the memory barrier.

| dm crypt: fix a possible hang due to race condition on exit
|
| A kernel thread executes __set_current_state(TASK_INTERRUPTIBLE),
| __add_wait_queue, spin_unlock_irq and then tests kthread_should_stop().
| It is possible that the processor reorders memory accesses so that
| kthread_should_stop() is executed before __set_current_state().  If
| such reordering happens, there is a possible race on thread
| termination: [...]

So this patch just reverts the aforementioned patch and changes the
__set_current_state(TASK_INTERRUPTIBLE) to set_current_state(...).  This
fixes the crash and should also fix the potential hang.

Fixes: bcbd94ff481e ("dm crypt: fix a possible hang due to race condition on 
exit")
Cc: Mikulas Patocka <mpato...@redhat.com>
Cc: sta...@vger.kernel.org # v4.0+
Signed-off-by: Rabin Vincent <rab...@axis.com>
---
 drivers/md/dm-crypt.c | 24 ++--
 1 file changed, 10 insertions(+), 14 deletions(-)

diff --git a/drivers/md/dm-crypt.c b/drivers/md/dm-crypt.c
index 8742957..6fc8923 100644
--- a/drivers/md/dm-crypt.c
+++ b/drivers/md/dm-crypt.c
@@ -113,8 +113,7 @@ struct iv_tcw_private {
  * and encrypts / decrypts at the same time.
  */
 enum flags { DM_CRYPT_SUSPENDED, DM_CRYPT_KEY_VALID,
-DM_CRYPT_SAME_CPU, DM_CRYPT_NO_OFFLOAD,
-DM_CRYPT_EXIT_THREAD};
+DM_CRYPT_SAME_CPU, DM_CRYPT_NO_OFFLOAD };
 
 /*
  * The fields in here must be read only after initialization.
@@ -1207,18 +1206,20 @@ continue_locked:
if (!RB_EMPTY_ROOT(>write_tree))
goto pop_from_list;
 
-   if (unlikely(test_bit(DM_CRYPT_EXIT_THREAD, >flags))) {
-   spin_unlock_irq(>write_thread_wait.lock);
-   break;
-   }
-
-   __set_current_state(TASK_INTERRUPTIBLE);
+   set_current_state(TASK_INTERRUPTIBLE);
__add_wait_queue(>write_thread_wait, );
 
spin_unlock_irq(>write_thread_wait.lock);
 
+   if (unlikely(kthread_should_stop())) {
+   set_task_state(current, TASK_RUNNING);
+   remove_wait_queue(>write_thread_wait, );
+   break;
+   }
+
schedule();
 
+   set_task_state(current, TASK_RUNNING);
spin_lock_irq(>write_thread_wait.lock);
__remove_wait_queue(>write_thread_wait, );
goto continue_locked;
@@ -1533,13 +1534,8 @@ static void crypt_dtr(struct dm_target *ti)
if (!cc)
return;
 
-   if (cc->write_thread) {
-   spin_lock_irq(>write_thread_wait.lock);
-   set_bit(DM_CRYPT_EXIT_THREAD, >flags);
-   wake_up_locked(>write_thread_wait);
-   spin_unlock_irq(>write_thread_wait.lock);
+   if (cc->write_thread)
kthread_stop(cc->write_thread);
-   }
 
if (cc->io_queue)
destroy_workqueue(cc->io_queue);
-- 
2.1.4

[PATCH] dm crypt: fix crash on exit

2016-09-21 Thread Rabin Vincent

From: Rabin Vincent 

As the documentation for kthread_stop() says, "if threadfn() may call
do_exit() itself, the caller must ensure task_struct can't go away".
dm-crypt does not ensure this and therefore crashes when crypt_dtr()
calls kthread_stop().  The crash is trivially reproducible by adding a
delay before the call to kthread_stop() and just opening and closing a
dm-crypt device.

 general protection fault:  [#1] PREEMPT SMP
 CPU: 0 PID: 533 Comm: cryptsetup Not tainted 4.8.0-rc7+ #7
 task: 88003bd0df40 task.stack: 8800375b4000
 RIP: 0010: kthread_stop+0x52/0x300
 Call Trace:
  crypt_dtr+0x77/0x120
  dm_table_destroy+0x6f/0x120
  __dm_destroy+0x130/0x250
  dm_destroy+0x13/0x20
  dev_remove+0xe6/0x120
  ? dev_suspend+0x250/0x250
  ctl_ioctl+0x1fc/0x530
  ? __lock_acquire+0x24f/0x1b10
  dm_ctl_ioctl+0x13/0x20
  do_vfs_ioctl+0x91/0x6a0
  ? fput+0xe/0x10
  ? entry_SYSCALL_64_fastpath+0x5/0xbd
  ? trace_hardirqs_on_caller+0x151/0x1e0
  SyS_ioctl+0x41/0x70
  entry_SYSCALL_64_fastpath+0x1f/0xbd

This problem was introduced by bcbd94ff481e ("dm crypt: fix a possible
hang due to race condition on exit").

Looking at the description of that patch (excerpted below), it seems
like the problem it addresses can be solved by just using
set_current_state instead of __set_current_state, since we obviously
need the memory barrier.

| dm crypt: fix a possible hang due to race condition on exit
|
| A kernel thread executes __set_current_state(TASK_INTERRUPTIBLE),
| __add_wait_queue, spin_unlock_irq and then tests kthread_should_stop().
| It is possible that the processor reorders memory accesses so that
| kthread_should_stop() is executed before __set_current_state().  If
| such reordering happens, there is a possible race on thread
| termination: [...]

So this patch just reverts the aforementioned patch and changes the
__set_current_state(TASK_INTERRUPTIBLE) to set_current_state(...).  This
fixes the crash and should also fix the potential hang.

Fixes: bcbd94ff481e ("dm crypt: fix a possible hang due to race condition on 
exit")
Cc: Mikulas Patocka 
Cc: sta...@vger.kernel.org # v4.0+
Signed-off-by: Rabin Vincent 
---
 drivers/md/dm-crypt.c | 24 ++--
 1 file changed, 10 insertions(+), 14 deletions(-)

diff --git a/drivers/md/dm-crypt.c b/drivers/md/dm-crypt.c
index 8742957..6fc8923 100644
--- a/drivers/md/dm-crypt.c
+++ b/drivers/md/dm-crypt.c
@@ -113,8 +113,7 @@ struct iv_tcw_private {
  * and encrypts / decrypts at the same time.
  */
 enum flags { DM_CRYPT_SUSPENDED, DM_CRYPT_KEY_VALID,
-DM_CRYPT_SAME_CPU, DM_CRYPT_NO_OFFLOAD,
-DM_CRYPT_EXIT_THREAD};
+DM_CRYPT_SAME_CPU, DM_CRYPT_NO_OFFLOAD };
 
 /*
  * The fields in here must be read only after initialization.
@@ -1207,18 +1206,20 @@ continue_locked:
if (!RB_EMPTY_ROOT(>write_tree))
goto pop_from_list;
 
-   if (unlikely(test_bit(DM_CRYPT_EXIT_THREAD, >flags))) {
-   spin_unlock_irq(>write_thread_wait.lock);
-   break;
-   }
-
-   __set_current_state(TASK_INTERRUPTIBLE);
+   set_current_state(TASK_INTERRUPTIBLE);
__add_wait_queue(>write_thread_wait, );
 
spin_unlock_irq(>write_thread_wait.lock);
 
+   if (unlikely(kthread_should_stop())) {
+   set_task_state(current, TASK_RUNNING);
+   remove_wait_queue(>write_thread_wait, );
+   break;
+   }
+
schedule();
 
+   set_task_state(current, TASK_RUNNING);
spin_lock_irq(>write_thread_wait.lock);
__remove_wait_queue(>write_thread_wait, );
goto continue_locked;
@@ -1533,13 +1534,8 @@ static void crypt_dtr(struct dm_target *ti)
if (!cc)
return;
 
-   if (cc->write_thread) {
-   spin_lock_irq(>write_thread_wait.lock);
-   set_bit(DM_CRYPT_EXIT_THREAD, >flags);
-   wake_up_locked(>write_thread_wait);
-   spin_unlock_irq(>write_thread_wait.lock);
+   if (cc->write_thread)
kthread_stop(cc->write_thread);
-   }
 
if (cc->io_queue)
destroy_workqueue(cc->io_queue);
-- 
2.1.4

Re: [PATCH v3] scripts: add script for translating stack dump function

2016-09-19 Thread Rabin Vincent

On Mon, Sep 19, 2016 at 12:15:42PM -0700, Linus Torvalds wrote:
> Hmm. Would you mind if I change the
> 
> addr2line -fpie $objfile $hexaddr | sed "s;$dir_prefix;;"
> 
> into
> 
> addr2line -fpie $objfile $hexaddr |
> sed "s; at $dir_prefix\(\./\)*; at ;"
> 
> instead? There's two changes there: matching the " at " part (to just
> make the match stricter) but also matching any following "./" thing
> (which shows up for our include tree files, at least for me).

Note that addr2line has localized strings, so the regex with the " at "
won't match for everyone unless you invoke addr2line with LANG=C.

$ ../linux/scripts/faddr2line vmlinux free_reserved_area+61
free_reserved_area+61/0xe4:
__write_once_size på /home/rabinv/dev/linux/include/linux/compiler.h:248
(inline:ad av)set_page_count på 
/home/rabinv/dev/linux/include/linux/page_ref.h:76
(inline:ad av)init_page_count på 
/home/rabinv/dev/linux/include/linux/page_ref.h:87
(inline:ad av)__free_reserved_page på 
/home/rabinv/dev/linux/include/linux/mm.h:1818
(inline:ad av)free_reserved_page på 
/home/rabinv/dev/linux/include/linux/mm.h:1824
(inline:ad av)free_reserved_area på /home/rabinv/dev/linux/mm/page_alloc.c:6476

$ LANG=C ../linux/scripts/faddr2line vmlinux free_reserved_area+61
free_reserved_area+61/0xe4:
__write_once_size at include/linux/compiler.h:248
 (inlined by) set_page_count at include/linux/page_ref.h:76
 (inlined by) init_page_count at include/linux/page_ref.h:87
 (inlined by) __free_reserved_page at include/linux/mm.h:1818
 (inlined by) free_reserved_page at include/linux/mm.h:1824
 (inlined by) free_reserved_area at mm/page_alloc.c:6476

Re: [PATCH v3] scripts: add script for translating stack dump function

2016-09-19 Thread Rabin Vincent

On Mon, Sep 19, 2016 at 12:15:42PM -0700, Linus Torvalds wrote:
> Hmm. Would you mind if I change the
> 
> addr2line -fpie $objfile $hexaddr | sed "s;$dir_prefix;;"
> 
> into
> 
> addr2line -fpie $objfile $hexaddr |
> sed "s; at $dir_prefix\(\./\)*; at ;"
> 
> instead? There's two changes there: matching the " at " part (to just
> make the match stricter) but also matching any following "./" thing
> (which shows up for our include tree files, at least for me).

Note that addr2line has localized strings, so the regex with the " at "
won't match for everyone unless you invoke addr2line with LANG=C.

$ ../linux/scripts/faddr2line vmlinux free_reserved_area+61
free_reserved_area+61/0xe4:
__write_once_size på /home/rabinv/dev/linux/include/linux/compiler.h:248
(inline:ad av)set_page_count på 
/home/rabinv/dev/linux/include/linux/page_ref.h:76
(inline:ad av)init_page_count på 
/home/rabinv/dev/linux/include/linux/page_ref.h:87
(inline:ad av)__free_reserved_page på 
/home/rabinv/dev/linux/include/linux/mm.h:1818
(inline:ad av)free_reserved_page på 
/home/rabinv/dev/linux/include/linux/mm.h:1824
(inline:ad av)free_reserved_area på /home/rabinv/dev/linux/mm/page_alloc.c:6476

$ LANG=C ../linux/scripts/faddr2line vmlinux free_reserved_area+61
free_reserved_area+61/0xe4:
__write_once_size at include/linux/compiler.h:248
 (inlined by) set_page_count at include/linux/page_ref.h:76
 (inlined by) init_page_count at include/linux/page_ref.h:87
 (inlined by) __free_reserved_page at include/linux/mm.h:1818
 (inlined by) free_reserved_page at include/linux/mm.h:1824
 (inlined by) free_reserved_area at mm/page_alloc.c:6476

Re: [PATCH v2] scripts: add script for translating stack dump function offsets

2016-09-17 Thread Rabin Vincent

On Fri, Sep 16, 2016 at 04:26:56PM -0500, Josh Poimboeuf wrote:
> + addr2line -ie $objfile $hexaddr

Could you pass in -f and -p too to addr2line?

Before:

 $ scripts/faddr2line ~/dev/kvm2/vmlinux free_reserved_area+0x90
 /home/rabin/dev/linux/include/linux/compiler.h:222
 /home/rabin/dev/linux/include/linux/page-flags.h:149
 /home/rabin/dev/linux/include/linux/page-flags.h:154
 /home/rabin/dev/linux/include/linux/page-flags.h:287
 /home/rabin/dev/linux/include/linux/mm.h:1778
 /home/rabin/dev/linux/include/linux/mm.h:1785
 /home/rabin/dev/linux/mm/page_alloc.c:6599

After:

 $ scripts/faddr2line ~/dev/kvm2/vmlinux free_reserved_area+0x90
 __read_once_size at /home/rabin/dev/linux/include/linux/compiler.h:222
  (inlined by) PageTail at /home/rabin/dev/linux/include/linux/page-flags.h:149
  (inlined by) PageCompound at 
/home/rabin/dev/linux/include/linux/page-flags.h:154
  (inlined by) ClearPageReserved at 
/home/rabin/dev/linux/include/linux/page-flags.h:287
  (inlined by) __free_reserved_page at 
/home/rabin/dev/linux/include/linux/mm.h:1778
  (inlined by) free_reserved_page at 
/home/rabin/dev/linux/include/linux/mm.h:1785
  (inlined by) free_reserved_area at /home/rabin/dev/linux/mm/page_alloc.c:6599

Re: [PATCH v2] scripts: add script for translating stack dump function offsets

2016-09-17 Thread Rabin Vincent

On Fri, Sep 16, 2016 at 04:26:56PM -0500, Josh Poimboeuf wrote:
> + addr2line -ie $objfile $hexaddr

Could you pass in -f and -p too to addr2line?

Before:

 $ scripts/faddr2line ~/dev/kvm2/vmlinux free_reserved_area+0x90
 /home/rabin/dev/linux/include/linux/compiler.h:222
 /home/rabin/dev/linux/include/linux/page-flags.h:149
 /home/rabin/dev/linux/include/linux/page-flags.h:154
 /home/rabin/dev/linux/include/linux/page-flags.h:287
 /home/rabin/dev/linux/include/linux/mm.h:1778
 /home/rabin/dev/linux/include/linux/mm.h:1785
 /home/rabin/dev/linux/mm/page_alloc.c:6599

After:

 $ scripts/faddr2line ~/dev/kvm2/vmlinux free_reserved_area+0x90
 __read_once_size at /home/rabin/dev/linux/include/linux/compiler.h:222
  (inlined by) PageTail at /home/rabin/dev/linux/include/linux/page-flags.h:149
  (inlined by) PageCompound at 
/home/rabin/dev/linux/include/linux/page-flags.h:154
  (inlined by) ClearPageReserved at 
/home/rabin/dev/linux/include/linux/page-flags.h:287
  (inlined by) __free_reserved_page at 
/home/rabin/dev/linux/include/linux/mm.h:1778
  (inlined by) free_reserved_page at 
/home/rabin/dev/linux/include/linux/mm.h:1785
  (inlined by) free_reserved_area at /home/rabin/dev/linux/mm/page_alloc.c:6599

Re: [PATCH] pstore/core: drop cmpxchg based updates

2016-09-08 Thread Rabin Vincent

On Wed, Aug 24, 2016 at 03:09:35PM +0200, Sebastian Andrzej Siewior wrote:
> I have here a FPGA behind PCIe which exports SRAM which I use for
> pstore. Now it seems that the FPGA no longer supports cmpxchg based
> updates and writes back 0xff…ff and returns the same.  This leads to
> crash during crash rendering pstore useless.
> Since I doubt that there is much benefit from using cmpxchg() here, I am
> dropping this atomic access and use the spinlock based version.
> 
> Cc: Anton Vorontsov <an...@enomsg.org>
> Cc: Colin Cross <ccr...@android.com>
> Cc: Kees Cook <keesc...@chromium.org>
> Cc: Tony Luck <tony.l...@intel.com>
> Signed-off-by: Sebastian Andrzej Siewior <bige...@linutronix.de>

Tested-by: Rabin Vincent <rab...@axis.com>

This patch is needed for pstore to work on (most?) ARMv7 chips.  See
this thread for details:
https://lkml.kernel.org/g/CABXOdTfT7xMfiBvRuUS1hsVs=q5q2wy1x1z8ocyyjnfckm0...@mail.gmail.com

Re: [PATCH] pstore/core: drop cmpxchg based updates

2016-09-08 Thread Rabin Vincent

On Wed, Aug 24, 2016 at 03:09:35PM +0200, Sebastian Andrzej Siewior wrote:
> I have here a FPGA behind PCIe which exports SRAM which I use for
> pstore. Now it seems that the FPGA no longer supports cmpxchg based
> updates and writes back 0xff…ff and returns the same.  This leads to
> crash during crash rendering pstore useless.
> Since I doubt that there is much benefit from using cmpxchg() here, I am
> dropping this atomic access and use the spinlock based version.
> 
> Cc: Anton Vorontsov 
> Cc: Colin Cross 
> Cc: Kees Cook 
> Cc: Tony Luck 
> Signed-off-by: Sebastian Andrzej Siewior 

Tested-by: Rabin Vincent 

This patch is needed for pstore to work on (most?) ARMv7 chips.  See
this thread for details:
https://lkml.kernel.org/g/CABXOdTfT7xMfiBvRuUS1hsVs=q5q2wy1x1z8ocyyjnfckm0...@mail.gmail.com

Re: [PATCH] bdev: fix NULL pointer dereference in sync()/close() race

2016-08-27 Thread Rabin Vincent

On Sat, Aug 27, 2016 at 09:07:28AM +0200, Vegard Nossum wrote:
> I got this with syzkaller:
> 
> general protection fault:  [#1] PREEMPT SMP KASAN
> Dumping ftrace buffer:
>(ftrace buffer empty)
> CPU: 0 PID: 11941 Comm: syz-executor Not tainted 4.8.0-rc2+ #169
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
> rel-1.9.3-0-ge2fc41e-prebuilt.qemu-project.org 04/01/2014
> task: 880110762cc0 task.stack: 88010229
> RIP: 0010:[]  [] 
> blk_get_backing_dev_info+0x4a/0x70
> RSP: 0018:880102297cd0  EFLAGS: 00010202
> RAX: dc00 RBX:  RCX: c9bb4000
> RDX: 0097 RSI:  RDI: 04b8
> RBP: 880102297cd8 R08:  R09: 0001
> R10:  R11: 0001 R12: 88011a010a90
> R13: 88011a594568 R14: 88011a010890 R15: 7fff
> FS:  7f2445174700() GS:88011aa0() 
> knlGS:
> CS:  0010 DS:  ES:  CR0: 80050033
> CR2: 200047c8 CR3: 000107eb5000 CR4: 06f0
> DR0: 001e DR1: 001e DR2: 
> DR3:  DR6: 0ff0 DR7: 0600
> Stack:
>  110020452f9e 880102297db8 81508daa 
>  41b58ab3 844e89e1 81508b30 ed0020452001
>  7fff   7fff
> Call Trace:
>  [] __filemap_fdatawrite_range+0x27a/0x2e0
>  [] ? filemap_check_errors+0xe0/0xe0
>  [] ? preempt_schedule+0x27/0x30
>  [] ? ___preempt_schedule+0x16/0x18
>  [] filemap_fdatawrite+0x26/0x30
>  [] fdatawrite_one_bdev+0x50/0x70
>  [] iterate_bdevs+0x194/0x210
>  [] ? fdatawait_one_bdev+0x70/0x70
>  [] ? sync_filesystem+0x240/0x240
>  [] sys_sync+0xce/0x160
>  [] ? sync_filesystem+0x240/0x240
>  [] ? exit_to_usermode_loop+0x190/0x190
>  [] ? __context_tracking_exit.part.4+0x3a/0x1e0
>  [] do_syscall_64+0x1c4/0x4e0
>  [] entry_SYSCALL64_slow_path+0x25/0x25
> Code: 89 fa 48 c1 ea 03 80 3c 02 00 75 35 48 8b 9b e0 00 00 00 48 b8 00 
> 00 00 00 00 fc ff df 48 8d bb b8 04 00 00 48 89 fa 48 c1 ea 03 <80> 3c 02 00 
> 75 17 48 8b 83 b8 04 00 00 5b 5d 48 05 10 02 00 00
> RIP  [] blk_get_backing_dev_info+0x4a/0x70
>  RSP 
> 
> The problem is that sync() calls down to blk_get_backing_dev_info()
> without necessarily having the block device opened (if it races with
> another process doing close()):
> 
> /**
>  * blk_get_backing_dev_info - get the address of a queue's 
> backing_dev_info
>  * @bdev:   device
>  *
>  * Locates the passed device's request queue and returns the address of 
> its
>  * backing_dev_info.  This function can only be called if @bdev is opened 
>  <
>  * and the return value is never NULL.
>  */
> struct backing_dev_info *blk_get_backing_dev_info(struct block_device 
> *bdev)
> {
> struct request_queue *q = bdev_get_queue(bdev);
> 
> return >backing_dev_info;
> }
> 
> bdev_get_queue() crashes on dereferencing bdev->bd_disk->queue because
> ->bd_disk was set to NULL when close() reached __blkdev_put().
> 
> Taking bdev->bd_mutex and checking bdev->bd_disk actually seems to be a
> reliable test of whether it's safe to call filemap_fdatawrite() for the
> block device inode and completely fixes the crash for me.
> 
> What I don't like about this patch is that it simply skips block devices
> which we don't have any open file descriptors for. That seems wrong to
> me because sync() should do writeback on (and wait for) _all_ devices,
> not just the ones that we happen to have an open file descriptor for.
> Imagine if we opened a device, wrote a lot of data to it, closed it,
> called sync(), and sync() returns. Now we should be guaranteed the data
> was written, but I'm not sure we are in this case. But maybe I've
> misunderstood how the writeback mechanism works.
> 
> Another ugly thing is that we're now holding a new mutex over a
> potentially big chunk of code (everything that happens inside
> filemap_fdatawrite()). The only thing I can say is that it seems to
> work here.
> 
> I have a reproducer that reliably triggers the problem in ~2 seconds so
> I can easily test other proposed fixes if there are any. I would also be
> happy to submit a new patch with some guidance on the Right Way to fix
> this.
> 
> Cc: sta...@vger.kernel.org
> Signed-off-by: Vegard Nossum 

Don't know what's the right fix, but I posted a slightly different one
for the same crash some months ago:
https://patchwork.kernel.org/patch/8556941/

Re: [PATCH] bdev: fix NULL pointer dereference in sync()/close() race

2016-08-27 Thread Rabin Vincent

On Sat, Aug 27, 2016 at 09:07:28AM +0200, Vegard Nossum wrote:
> I got this with syzkaller:
> 
> general protection fault:  [#1] PREEMPT SMP KASAN
> Dumping ftrace buffer:
>(ftrace buffer empty)
> CPU: 0 PID: 11941 Comm: syz-executor Not tainted 4.8.0-rc2+ #169
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
> rel-1.9.3-0-ge2fc41e-prebuilt.qemu-project.org 04/01/2014
> task: 880110762cc0 task.stack: 88010229
> RIP: 0010:[]  [] 
> blk_get_backing_dev_info+0x4a/0x70
> RSP: 0018:880102297cd0  EFLAGS: 00010202
> RAX: dc00 RBX:  RCX: c9bb4000
> RDX: 0097 RSI:  RDI: 04b8
> RBP: 880102297cd8 R08:  R09: 0001
> R10:  R11: 0001 R12: 88011a010a90
> R13: 88011a594568 R14: 88011a010890 R15: 7fff
> FS:  7f2445174700() GS:88011aa0() 
> knlGS:
> CS:  0010 DS:  ES:  CR0: 80050033
> CR2: 200047c8 CR3: 000107eb5000 CR4: 06f0
> DR0: 001e DR1: 001e DR2: 
> DR3:  DR6: 0ff0 DR7: 0600
> Stack:
>  110020452f9e 880102297db8 81508daa 
>  41b58ab3 844e89e1 81508b30 ed0020452001
>  7fff   7fff
> Call Trace:
>  [] __filemap_fdatawrite_range+0x27a/0x2e0
>  [] ? filemap_check_errors+0xe0/0xe0
>  [] ? preempt_schedule+0x27/0x30
>  [] ? ___preempt_schedule+0x16/0x18
>  [] filemap_fdatawrite+0x26/0x30
>  [] fdatawrite_one_bdev+0x50/0x70
>  [] iterate_bdevs+0x194/0x210
>  [] ? fdatawait_one_bdev+0x70/0x70
>  [] ? sync_filesystem+0x240/0x240
>  [] sys_sync+0xce/0x160
>  [] ? sync_filesystem+0x240/0x240
>  [] ? exit_to_usermode_loop+0x190/0x190
>  [] ? __context_tracking_exit.part.4+0x3a/0x1e0
>  [] do_syscall_64+0x1c4/0x4e0
>  [] entry_SYSCALL64_slow_path+0x25/0x25
> Code: 89 fa 48 c1 ea 03 80 3c 02 00 75 35 48 8b 9b e0 00 00 00 48 b8 00 
> 00 00 00 00 fc ff df 48 8d bb b8 04 00 00 48 89 fa 48 c1 ea 03 <80> 3c 02 00 
> 75 17 48 8b 83 b8 04 00 00 5b 5d 48 05 10 02 00 00
> RIP  [] blk_get_backing_dev_info+0x4a/0x70
>  RSP 
> 
> The problem is that sync() calls down to blk_get_backing_dev_info()
> without necessarily having the block device opened (if it races with
> another process doing close()):
> 
> /**
>  * blk_get_backing_dev_info - get the address of a queue's 
> backing_dev_info
>  * @bdev:   device
>  *
>  * Locates the passed device's request queue and returns the address of 
> its
>  * backing_dev_info.  This function can only be called if @bdev is opened 
>  <
>  * and the return value is never NULL.
>  */
> struct backing_dev_info *blk_get_backing_dev_info(struct block_device 
> *bdev)
> {
> struct request_queue *q = bdev_get_queue(bdev);
> 
> return >backing_dev_info;
> }
> 
> bdev_get_queue() crashes on dereferencing bdev->bd_disk->queue because
> ->bd_disk was set to NULL when close() reached __blkdev_put().
> 
> Taking bdev->bd_mutex and checking bdev->bd_disk actually seems to be a
> reliable test of whether it's safe to call filemap_fdatawrite() for the
> block device inode and completely fixes the crash for me.
> 
> What I don't like about this patch is that it simply skips block devices
> which we don't have any open file descriptors for. That seems wrong to
> me because sync() should do writeback on (and wait for) _all_ devices,
> not just the ones that we happen to have an open file descriptor for.
> Imagine if we opened a device, wrote a lot of data to it, closed it,
> called sync(), and sync() returns. Now we should be guaranteed the data
> was written, but I'm not sure we are in this case. But maybe I've
> misunderstood how the writeback mechanism works.
> 
> Another ugly thing is that we're now holding a new mutex over a
> potentially big chunk of code (everything that happens inside
> filemap_fdatawrite()). The only thing I can say is that it seems to
> work here.
> 
> I have a reproducer that reliably triggers the problem in ~2 seconds so
> I can easily test other proposed fixes if there are any. I would also be
> happy to submit a new patch with some guidance on the Right Way to fix
> this.
> 
> Cc: sta...@vger.kernel.org
> Signed-off-by: Vegard Nossum 

Don't know what's the right fix, but I posted a slightly different one
for the same crash some months ago:
https://patchwork.kernel.org/patch/8556941/

[PATCH] perf callchain: fixup help/config for no-unwinding

2016-08-10 Thread Rabin Vincent

From: Rabin Vincent <rab...@axis.com>

Since 841e3558b2d ("perf callchain: Recording 'dwarf' callchains do not need
DWARF unwinding support"), --call-graph dwarf is allowed in perf record
even without unwind support.  A couple of other places don't reflect
this yet though: the help text should list dwarf as a valid record mode
and the dump_size config should be respected too.

Signed-off-by: Rabin Vincent <rab...@axis.com>
---
 tools/perf/util/callchain.c | 2 --
 tools/perf/util/callchain.h | 4 
 2 files changed, 6 deletions(-)

diff --git a/tools/perf/util/callchain.c b/tools/perf/util/callchain.c
index 07fd30b..ae58b49 100644
--- a/tools/perf/util/callchain.c
+++ b/tools/perf/util/callchain.c
@@ -193,7 +193,6 @@ int perf_callchain_config(const char *var, const char 
*value)
 
if (!strcmp(var, "record-mode"))
return parse_callchain_record_opt(value, _param);
-#ifdef HAVE_DWARF_UNWIND_SUPPORT
if (!strcmp(var, "dump-size")) {
unsigned long size = 0;
int ret;
@@ -203,7 +202,6 @@ int perf_callchain_config(const char *var, const char 
*value)
 
return ret;
}
-#endif
if (!strcmp(var, "print-type"))
return parse_callchain_mode(value);
if (!strcmp(var, "order"))
diff --git a/tools/perf/util/callchain.h b/tools/perf/util/callchain.h
index 13e7554..47cfd10 100644
--- a/tools/perf/util/callchain.h
+++ b/tools/perf/util/callchain.h
@@ -11,11 +11,7 @@
 
 #define CALLCHAIN_HELP "setup and enables call-graph (stack 
chain/backtrace):\n\n"
 
-#ifdef HAVE_DWARF_UNWIND_SUPPORT
 # define RECORD_MODE_HELP  HELP_PAD "record_mode:\tcall graph recording mode 
(fp|dwarf|lbr)\n"
-#else
-# define RECORD_MODE_HELP  HELP_PAD "record_mode:\tcall graph recording mode 
(fp|lbr)\n"
-#endif
 
 #define RECORD_SIZE_HELP   \
HELP_PAD "record_size:\tif record_mode is 'dwarf', max size of stack 
recording ()\n" \
-- 
2.1.4

[PATCH] perf callchain: fixup help/config for no-unwinding

2016-08-10 Thread Rabin Vincent

From: Rabin Vincent 

Since 841e3558b2d ("perf callchain: Recording 'dwarf' callchains do not need
DWARF unwinding support"), --call-graph dwarf is allowed in perf record
even without unwind support.  A couple of other places don't reflect
this yet though: the help text should list dwarf as a valid record mode
and the dump_size config should be respected too.

Signed-off-by: Rabin Vincent 
---
 tools/perf/util/callchain.c | 2 --
 tools/perf/util/callchain.h | 4 
 2 files changed, 6 deletions(-)

diff --git a/tools/perf/util/callchain.c b/tools/perf/util/callchain.c
index 07fd30b..ae58b49 100644
--- a/tools/perf/util/callchain.c
+++ b/tools/perf/util/callchain.c
@@ -193,7 +193,6 @@ int perf_callchain_config(const char *var, const char 
*value)
 
if (!strcmp(var, "record-mode"))
return parse_callchain_record_opt(value, _param);
-#ifdef HAVE_DWARF_UNWIND_SUPPORT
if (!strcmp(var, "dump-size")) {
unsigned long size = 0;
int ret;
@@ -203,7 +202,6 @@ int perf_callchain_config(const char *var, const char 
*value)
 
return ret;
}
-#endif
if (!strcmp(var, "print-type"))
return parse_callchain_mode(value);
if (!strcmp(var, "order"))
diff --git a/tools/perf/util/callchain.h b/tools/perf/util/callchain.h
index 13e7554..47cfd10 100644
--- a/tools/perf/util/callchain.h
+++ b/tools/perf/util/callchain.h
@@ -11,11 +11,7 @@
 
 #define CALLCHAIN_HELP "setup and enables call-graph (stack 
chain/backtrace):\n\n"
 
-#ifdef HAVE_DWARF_UNWIND_SUPPORT
 # define RECORD_MODE_HELP  HELP_PAD "record_mode:\tcall graph recording mode 
(fp|dwarf|lbr)\n"
-#else
-# define RECORD_MODE_HELP  HELP_PAD "record_mode:\tcall graph recording mode 
(fp|lbr)\n"
-#endif
 
 #define RECORD_SIZE_HELP   \
HELP_PAD "record_size:\tif record_mode is 'dwarf', max size of stack 
recording ()\n" \
-- 
2.1.4

[PATCH] perf unwind: check symsrc ELF for .debug_frame

2016-08-10 Thread Rabin Vincent

From: Rabin Vincent <rab...@axis.com>

When using split debug info, the file without debug info may not have a
.debug_frame section, so we need to check the symsrc ELF also, since
that's the file we actually read the unwind information from.

Signed-off-by: Rabin Vincent <rab...@axis.com>
---
 tools/perf/util/unwind-libunwind-local.c | 14 ++
 1 file changed, 14 insertions(+)

diff --git a/tools/perf/util/unwind-libunwind-local.c 
b/tools/perf/util/unwind-libunwind-local.c
index 97c0f8f..d492192 100644
--- a/tools/perf/util/unwind-libunwind-local.c
+++ b/tools/perf/util/unwind-libunwind-local.c
@@ -308,6 +308,20 @@ static int read_unwind_spec_debug_frame(struct dso *dso,
dso__data_put_fd(dso);
}
 
+   /*
+* With split debug info, the file without debug info may not have a
+* .debug_frame, so check the symsrc too.
+*/
+   if (ofs == 0 && dso->symsrc_filename) {
+   fd = open(dso->symsrc_filename, O_RDONLY);
+   if (fd < 0)
+   return -EINVAL;
+
+   ofs = elf_section_offset(fd, ".debug_frame");
+   dso->data.debug_frame_offset = ofs;
+   close(fd);
+   }
+
*offset = ofs;
if (*offset)
return 0;
-- 
2.1.4

[PATCH] perf unwind: check symsrc ELF for .debug_frame

2016-08-10 Thread Rabin Vincent

From: Rabin Vincent 

When using split debug info, the file without debug info may not have a
.debug_frame section, so we need to check the symsrc ELF also, since
that's the file we actually read the unwind information from.

Signed-off-by: Rabin Vincent 
---
 tools/perf/util/unwind-libunwind-local.c | 14 ++
 1 file changed, 14 insertions(+)

diff --git a/tools/perf/util/unwind-libunwind-local.c 
b/tools/perf/util/unwind-libunwind-local.c
index 97c0f8f..d492192 100644
--- a/tools/perf/util/unwind-libunwind-local.c
+++ b/tools/perf/util/unwind-libunwind-local.c
@@ -308,6 +308,20 @@ static int read_unwind_spec_debug_frame(struct dso *dso,
dso__data_put_fd(dso);
}
 
+   /*
+* With split debug info, the file without debug info may not have a
+* .debug_frame, so check the symsrc too.
+*/
+   if (ofs == 0 && dso->symsrc_filename) {
+   fd = open(dso->symsrc_filename, O_RDONLY);
+   if (fd < 0)
+   return -EINVAL;
+
+   ofs = elf_section_offset(fd, ".debug_frame");
+   dso->data.debug_frame_offset = ofs;
+   close(fd);
+   }
+
*offset = ofs;
if (*offset)
return 0;
-- 
2.1.4

Re: debug tip after earlycon is closed?

2016-08-08 Thread Rabin Vincent

On Thu, Jul 28, 2016 at 01:20:22PM +0100, Russell King - ARM Linux wrote:
> To me, what this means is that the DT parsing of linux,stdout is
> broken - while it may look nice from a design point of view, the
> design is wrong and fails to take account of non-UART consoles in
> the system.

Note that the brokeness you describe also leads to stdout-path being
unable to put the console on any serial port but the first, as reported
here:
http://lists.infradead.org/pipermail/linux-arm-kernel/2015-February/323811.html

It always puts the console on the first serial port if CONFIG_VT_CONSOLE
is not enabled, even if the stdout-path property requests a different
serial port.

Re: debug tip after earlycon is closed?

2016-08-08 Thread Rabin Vincent

On Thu, Jul 28, 2016 at 01:20:22PM +0100, Russell King - ARM Linux wrote:
> To me, what this means is that the DT parsing of linux,stdout is
> broken - while it may look nice from a design point of view, the
> design is wrong and fails to take account of non-UART consoles in
> the system.

Note that the brokeness you describe also leads to stdout-path being
unable to put the console on any serial port but the first, as reported
here:
http://lists.infradead.org/pipermail/linux-arm-kernel/2015-February/323811.html

It always puts the console on the first serial port if CONFIG_VT_CONSOLE
is not enabled, even if the stdout-path property requests a different
serial port.

Re: [QUESTION] Is there a better way to get ftrace dump on guest?

2016-06-28 Thread Rabin Vincent

On Tue, Jun 28, 2016 at 03:33:18PM +0900, Namhyung Kim wrote:
> On Tue, Jun 28, 2016 at 3:25 PM, Namhyung Kim  wrote:
> > I'm running some guest machines for kernel development.  For debugging
> > purpose, I use lots of trace_printk() since it's faster than normal
> > printk().  When kernel crash happens the trace buffer is printed on
> > console (I set ftrace_dump_on_oops) but it takes too much time.  I
> > don't want to reduce the size of ring buffer as I want to collect the
> > debug info as much as possible.  And I also want to see trace from all
> > cpu so 'ftrace_dump_on_oop = 2' is not an option.
> >
> > I know the kexec/kdump (and the crash tool) can dump and analyze the
> > trace buffer later.  But it's cumbersome to do it everytime and more
> > importantly, I don't want to spend the memory for the crashkernel.

Assuming you're using QEMU:

QEMU has a dump-guest-memory command which can be used to dump the
guest's entire memory to an ELF which can be loaded by the crash utility
to extract the trace buffer.  This doesn't require kexec/kdump or any
other support from the guest kernel.

It's apparently even possible to run QEMU with the guest memory in a
file and load that to crash directly, although this is not something
I've had a chance to try out myself:

https://github.com/crash-utility/crash/commit/89ed9d0a7f7da4578294a492c1ad857244ce7352

Re: [QUESTION] Is there a better way to get ftrace dump on guest?

2016-06-28 Thread Rabin Vincent

On Tue, Jun 28, 2016 at 03:33:18PM +0900, Namhyung Kim wrote:
> On Tue, Jun 28, 2016 at 3:25 PM, Namhyung Kim  wrote:
> > I'm running some guest machines for kernel development.  For debugging
> > purpose, I use lots of trace_printk() since it's faster than normal
> > printk().  When kernel crash happens the trace buffer is printed on
> > console (I set ftrace_dump_on_oops) but it takes too much time.  I
> > don't want to reduce the size of ring buffer as I want to collect the
> > debug info as much as possible.  And I also want to see trace from all
> > cpu so 'ftrace_dump_on_oop = 2' is not an option.
> >
> > I know the kexec/kdump (and the crash tool) can dump and analyze the
> > trace buffer later.  But it's cumbersome to do it everytime and more
> > importantly, I don't want to spend the memory for the crashkernel.

Assuming you're using QEMU:

QEMU has a dump-guest-memory command which can be used to dump the
guest's entire memory to an ELF which can be loaded by the crash utility
to extract the trace buffer.  This doesn't require kexec/kdump or any
other support from the guest kernel.

It's apparently even possible to run QEMU with the guest memory in a
file and load that to crash directly, although this is not something
I've had a chance to try out myself:

https://github.com/crash-utility/crash/commit/89ed9d0a7f7da4578294a492c1ad857244ce7352

Faster incremental builds with Ninja

2016-06-27 Thread Rabin Vincent

Ninja is a small build system with a focus on speed.  More details at
https://ninja-build.org/.

I made an experimental Ninja build file generator for the kernel.  The
purpose was to see how much we could decrease compile times (especially
to detect errors) for single file changes.  Results on my machine and
x86 defconfig:

| Change| make -j8 | make -j8 objectname | ninja  |
| --|  | --- | -- |
| No changes|  2.254s  |   0.731s| 0.065s |
| One change, compile error |  1.225s  |   0.765s| 0.077s |
| One change, full link |  5.915s  |   NA| 4.482s |

The link time unsuprisingly dominates when performing small changes, but
as can be seen the time until the start of compilation (and thus the
time until any compile errors are detected) is several times smaller
with ninja.

These numbers were measured with the benchmark.sh included in the
repository.

  https://github.com/rabinv/kninja

I'm not posting the code as patches here since it's basically a hack.
It parses make's --print-data-base output and requires a full build to
have been performed with make before it can generate the ninja build
files.  Configuration changes and various files with special generation
rules are not handled.

Perhaps it is useful for "personal use" or as inspiration for
optimizations to Kbuild.

Faster incremental builds with Ninja

2016-06-27 Thread Rabin Vincent

Ninja is a small build system with a focus on speed.  More details at
https://ninja-build.org/.

I made an experimental Ninja build file generator for the kernel.  The
purpose was to see how much we could decrease compile times (especially
to detect errors) for single file changes.  Results on my machine and
x86 defconfig:

| Change| make -j8 | make -j8 objectname | ninja  |
| --|  | --- | -- |
| No changes|  2.254s  |   0.731s| 0.065s |
| One change, compile error |  1.225s  |   0.765s| 0.077s |
| One change, full link |  5.915s  |   NA| 4.482s |

The link time unsuprisingly dominates when performing small changes, but
as can be seen the time until the start of compilation (and thus the
time until any compile errors are detected) is several times smaller
with ninja.

These numbers were measured with the benchmark.sh included in the
repository.

  https://github.com/rabinv/kninja

I'm not posting the code as patches here since it's basically a hack.
It parses make's --print-data-base output and requires a full build to
have been performed with make before it can generate the ninja build
files.  Configuration changes and various files with special generation
rules are not handled.

Perhaps it is useful for "personal use" or as inspiration for
optimizations to Kbuild.

[PATCH 2/2] kbuild: add shell cache

2016-06-27 Thread Rabin Vincent

Running make results in over 40 invocations of the compiler just during
processing of the Makefile, before any actual rules are run.

To reduce this overhead, cache the results of $(shell) calls to the
compiler.

On my machine, this reduces make's processing time by over 96%:

$ make kernelversion && perf stat -r5 make kernelversion
Before: 0,252219929 seconds time elapsed ( +-  0,39% )
 After: 0,008607464 seconds time elapsed ( +-  0,50% )

Signed-off-by: Rabin Vincent <ra...@rab.in>
---
 Makefile   | 15 ---
 scripts/Kbuild.include | 42 +++---
 2 files changed, 43 insertions(+), 14 deletions(-)

diff --git a/Makefile b/Makefile
index 91baa282f3fa..90237e706bf8 100644
--- a/Makefile
+++ b/Makefile
@@ -304,11 +304,6 @@ HOSTCXX  = g++
 HOSTCFLAGS   = -Wall -Wmissing-prototypes -Wstrict-prototypes -O2 
-fomit-frame-pointer -std=gnu89
 HOSTCXXFLAGS = -O2
 
-ifeq ($(shell $(HOSTCC) -v 2>&1 | grep -c "clang version"), 1)
-HOSTCFLAGS  += -Wno-unused-value -Wno-unused-parameter \
-   -Wno-missing-field-initializers -fno-delete-null-pointer-checks
-endif
-
 # Decide whether to build built-in, modular, or both.
 # Normally, just do built-in.
 
@@ -343,6 +338,11 @@ export KBUILD_CHECKSRC KBUILD_SRC KBUILD_EXTMOD
 scripts/Kbuild.include: ;
 include scripts/Kbuild.include
 
+ifeq ($(call cached-shell,$(HOSTCC) -v 2>&1 | grep -c "clang version"), 1)
+HOSTCFLAGS  += -Wno-unused-value -Wno-unused-parameter \
+   -Wno-missing-field-initializers -fno-delete-null-pointer-checks
+endif
+
 # Make variables (CC, etc...)
 AS = $(CROSS_COMPILE)as
 LD = $(CROSS_COMPILE)ld
@@ -767,7 +767,7 @@ KBUILD_CFLAGS += $(call cc-option, 
-fno-inline-functions-called-once)
 endif
 
 # arch Makefile may override CC so keep this after arch Makefile is included
-NOSTDINC_FLAGS += -nostdinc -isystem $(shell $(CC) -print-file-name=include)
+NOSTDINC_FLAGS += -nostdinc -isystem $(call cached-shell,$(CC) 
-print-file-name=include)
 CHECKFLAGS += $(NOSTDINC_FLAGS)
 
 # warn about C99 declaration after statement
@@ -798,7 +798,7 @@ KBUILD_CFLAGS   += $(call 
cc-option,-Werror=incompatible-pointer-types)
 KBUILD_ARFLAGS := $(call ar-option,D)
 
 # check for 'asm goto'
-ifeq ($(shell $(CONFIG_SHELL) $(srctree)/scripts/gcc-goto.sh $(CC)), y)
+ifeq ($(call cached-shell,$(CONFIG_SHELL) $(srctree)/scripts/gcc-goto.sh 
$(CC)), y)
KBUILD_CFLAGS += -DCC_HAVE_ASM_GOTO
KBUILD_AFLAGS += -DCC_HAVE_ASM_GOTO
 endif
@@ -1237,6 +1237,7 @@ endif # CONFIG_MODULES
 
 # Directories & files removed with 'make clean'
 CLEAN_DIRS  += $(MODVERDIR)
+CLEAN_FILES += $(KBUILD_SHELLCACHE)
 
 # Directories & files removed with 'make mrproper'
 MRPROPER_DIRS  += include/config usr/include include/generated  \
diff --git a/scripts/Kbuild.include b/scripts/Kbuild.include
index 0f82314621f2..e90e43fa3c81 100644
--- a/scripts/Kbuild.include
+++ b/scripts/Kbuild.include
@@ -7,6 +7,7 @@ quote   := "
 squote  := '
 empty   :=
 space   := $(empty) $(empty)
+tab := $(empty)$(empty)
 space_escape := _-_SPACE_-_
 
 ###
@@ -83,11 +84,35 @@ cc-cross-prefix =  \
 # output directory for tests below
 TMPOUT := $(if $(KBUILD_EXTMOD),$(firstword $(KBUILD_EXTMOD))/)
 
+KBUILD_SHELLCACHE := .shellcache
+
+ifeq ($(KBUILD_SHELLCACHE),)
+cached-shell = $(shell $(1))
+else
+-include $(KBUILD_SHELLCACHE)
+
+sanitize = $(subst $(space),_, \
+ $(subst $(tab),_, \
+ $(subst \n,_, \
+ $(subst :,_,  \
+ $(subst \#,_, \
+ $(subst =,_,  \
+ $(subst $$,_, \
+ $(subst ',_,  \
+ $(subst ",_,  \
+ $(1))
+
+cached-shell = $(if $(cmd$(call sanitize,$(1))),$(subst CACHED,,$(cmd$(call 
sanitize,$(1,$(shell \
+   res='$(shell $(1))';\
+   echo "$$res";   \
+   echo 'cmd$(call sanitize,$(1)) = CACHED'"$$res" >> 
$(KBUILD_SHELLCACHE)))
+endif
+
 # try-run
 # Usage: option = $(call try-run, $(CC)...-o "$$TMP",option-ok,otherwise)
 # Exit code chooses option. "$$TMP" is can be used as temporary file and
 # is automatically cleaned up.
-try-run = $(shell set -e;  \
+try-run = $(call cached-shell,set -e;  \
TMP="$(TMPOUT)..tmp";   \
TMPO="$(TMPOUT)..o";\
if ($(1)) >/dev/null 2>&1;  \
@@ -131,18 +156,19 @@ cc-disable-warning = $(call try-run,\
 
 # cc-name
 # Expands to either gcc or clang
-cc-name = $(shell $(CC) -v 2>&1 | grep -q "clang version" && echo clang || 
echo gcc)
+cc-name = $(call cached-shell,$(CC) -v 2>&1 | grep -q "clang version" && echo 
clang || echo gcc)
 
 # cc-version
-cc-version =

[PATCH 1/2] Makefile: evaluate LDFLAGS_BUILD_ID only once

2016-06-27 Thread Rabin Vincent

Evaluate LDFLAGS_BUILD_ID (which involves invoking the compiler) only
once instead of over and over.

This provides a 20% reduction in null build time with x86 allnoconfig on
my machine:

$ make -j8 && perf stat -r5 make -j8
Before: 1,086617617 seconds time elapsed ( +-  0,37% )
 After: 0,864487173 seconds time elapsed ( +-  0,35% )

Signed-off-by: Rabin Vincent <ra...@rab.in>
---
 Makefile | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/Makefile b/Makefile
index 6471f20ca400..91baa282f3fa 100644
--- a/Makefile
+++ b/Makefile
@@ -814,7 +814,7 @@ KBUILD_AFLAGS   += $(ARCH_AFLAGS)   $(KAFLAGS)
 KBUILD_CFLAGS   += $(ARCH_CFLAGS)   $(KCFLAGS)
 
 # Use --build-id when available.
-LDFLAGS_BUILD_ID = $(patsubst -Wl$(comma)%,%,\
+LDFLAGS_BUILD_ID := $(patsubst -Wl$(comma)%,%,\
  $(call cc-ldoption, -Wl$(comma)--build-id,))
 KBUILD_LDFLAGS_MODULE += $(LDFLAGS_BUILD_ID)
 LDFLAGS_vmlinux += $(LDFLAGS_BUILD_ID)
-- 
2.8.1

[PATCH 2/2] kbuild: add shell cache

2016-06-27 Thread Rabin Vincent

Running make results in over 40 invocations of the compiler just during
processing of the Makefile, before any actual rules are run.

To reduce this overhead, cache the results of $(shell) calls to the
compiler.

On my machine, this reduces make's processing time by over 96%:

$ make kernelversion && perf stat -r5 make kernelversion
Before: 0,252219929 seconds time elapsed ( +-  0,39% )
 After: 0,008607464 seconds time elapsed ( +-  0,50% )

Signed-off-by: Rabin Vincent 
---
 Makefile   | 15 ---
 scripts/Kbuild.include | 42 +++---
 2 files changed, 43 insertions(+), 14 deletions(-)

diff --git a/Makefile b/Makefile
index 91baa282f3fa..90237e706bf8 100644
--- a/Makefile
+++ b/Makefile
@@ -304,11 +304,6 @@ HOSTCXX  = g++
 HOSTCFLAGS   = -Wall -Wmissing-prototypes -Wstrict-prototypes -O2 
-fomit-frame-pointer -std=gnu89
 HOSTCXXFLAGS = -O2
 
-ifeq ($(shell $(HOSTCC) -v 2>&1 | grep -c "clang version"), 1)
-HOSTCFLAGS  += -Wno-unused-value -Wno-unused-parameter \
-   -Wno-missing-field-initializers -fno-delete-null-pointer-checks
-endif
-
 # Decide whether to build built-in, modular, or both.
 # Normally, just do built-in.
 
@@ -343,6 +338,11 @@ export KBUILD_CHECKSRC KBUILD_SRC KBUILD_EXTMOD
 scripts/Kbuild.include: ;
 include scripts/Kbuild.include
 
+ifeq ($(call cached-shell,$(HOSTCC) -v 2>&1 | grep -c "clang version"), 1)
+HOSTCFLAGS  += -Wno-unused-value -Wno-unused-parameter \
+   -Wno-missing-field-initializers -fno-delete-null-pointer-checks
+endif
+
 # Make variables (CC, etc...)
 AS = $(CROSS_COMPILE)as
 LD = $(CROSS_COMPILE)ld
@@ -767,7 +767,7 @@ KBUILD_CFLAGS += $(call cc-option, 
-fno-inline-functions-called-once)
 endif
 
 # arch Makefile may override CC so keep this after arch Makefile is included
-NOSTDINC_FLAGS += -nostdinc -isystem $(shell $(CC) -print-file-name=include)
+NOSTDINC_FLAGS += -nostdinc -isystem $(call cached-shell,$(CC) 
-print-file-name=include)
 CHECKFLAGS += $(NOSTDINC_FLAGS)
 
 # warn about C99 declaration after statement
@@ -798,7 +798,7 @@ KBUILD_CFLAGS   += $(call 
cc-option,-Werror=incompatible-pointer-types)
 KBUILD_ARFLAGS := $(call ar-option,D)
 
 # check for 'asm goto'
-ifeq ($(shell $(CONFIG_SHELL) $(srctree)/scripts/gcc-goto.sh $(CC)), y)
+ifeq ($(call cached-shell,$(CONFIG_SHELL) $(srctree)/scripts/gcc-goto.sh 
$(CC)), y)
KBUILD_CFLAGS += -DCC_HAVE_ASM_GOTO
KBUILD_AFLAGS += -DCC_HAVE_ASM_GOTO
 endif
@@ -1237,6 +1237,7 @@ endif # CONFIG_MODULES
 
 # Directories & files removed with 'make clean'
 CLEAN_DIRS  += $(MODVERDIR)
+CLEAN_FILES += $(KBUILD_SHELLCACHE)
 
 # Directories & files removed with 'make mrproper'
 MRPROPER_DIRS  += include/config usr/include include/generated  \
diff --git a/scripts/Kbuild.include b/scripts/Kbuild.include
index 0f82314621f2..e90e43fa3c81 100644
--- a/scripts/Kbuild.include
+++ b/scripts/Kbuild.include
@@ -7,6 +7,7 @@ quote   := "
 squote  := '
 empty   :=
 space   := $(empty) $(empty)
+tab := $(empty)$(empty)
 space_escape := _-_SPACE_-_
 
 ###
@@ -83,11 +84,35 @@ cc-cross-prefix =  \
 # output directory for tests below
 TMPOUT := $(if $(KBUILD_EXTMOD),$(firstword $(KBUILD_EXTMOD))/)
 
+KBUILD_SHELLCACHE := .shellcache
+
+ifeq ($(KBUILD_SHELLCACHE),)
+cached-shell = $(shell $(1))
+else
+-include $(KBUILD_SHELLCACHE)
+
+sanitize = $(subst $(space),_, \
+ $(subst $(tab),_, \
+ $(subst \n,_, \
+ $(subst :,_,  \
+ $(subst \#,_, \
+ $(subst =,_,  \
+ $(subst $$,_, \
+ $(subst ',_,  \
+ $(subst ",_,  \
+ $(1))
+
+cached-shell = $(if $(cmd$(call sanitize,$(1))),$(subst CACHED,,$(cmd$(call 
sanitize,$(1,$(shell \
+   res='$(shell $(1))';\
+   echo "$$res";   \
+   echo 'cmd$(call sanitize,$(1)) = CACHED'"$$res" >> 
$(KBUILD_SHELLCACHE)))
+endif
+
 # try-run
 # Usage: option = $(call try-run, $(CC)...-o "$$TMP",option-ok,otherwise)
 # Exit code chooses option. "$$TMP" is can be used as temporary file and
 # is automatically cleaned up.
-try-run = $(shell set -e;  \
+try-run = $(call cached-shell,set -e;  \
TMP="$(TMPOUT)..tmp";   \
TMPO="$(TMPOUT)..o";\
if ($(1)) >/dev/null 2>&1;  \
@@ -131,18 +156,19 @@ cc-disable-warning = $(call try-run,\
 
 # cc-name
 # Expands to either gcc or clang
-cc-name = $(shell $(CC) -v 2>&1 | grep -q "clang version" && echo clang || 
echo gcc)
+cc-name = $(call cached-shell,$(CC) -v 2>&1 | grep -q "clang version" && echo 
clang || echo gcc)
 
 # cc-version
-cc-version = $(shell $(CONFIG

[PATCH 1/2] Makefile: evaluate LDFLAGS_BUILD_ID only once

2016-06-27 Thread Rabin Vincent

Evaluate LDFLAGS_BUILD_ID (which involves invoking the compiler) only
once instead of over and over.

This provides a 20% reduction in null build time with x86 allnoconfig on
my machine:

$ make -j8 && perf stat -r5 make -j8
Before: 1,086617617 seconds time elapsed ( +-  0,37% )
 After: 0,864487173 seconds time elapsed ( +-  0,35% )

Signed-off-by: Rabin Vincent 
---
 Makefile | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/Makefile b/Makefile
index 6471f20ca400..91baa282f3fa 100644
--- a/Makefile
+++ b/Makefile
@@ -814,7 +814,7 @@ KBUILD_AFLAGS   += $(ARCH_AFLAGS)   $(KAFLAGS)
 KBUILD_CFLAGS   += $(ARCH_CFLAGS)   $(KCFLAGS)
 
 # Use --build-id when available.
-LDFLAGS_BUILD_ID = $(patsubst -Wl$(comma)%,%,\
+LDFLAGS_BUILD_ID := $(patsubst -Wl$(comma)%,%,\
  $(call cc-ldoption, -Wl$(comma)--build-id,))
 KBUILD_LDFLAGS_MODULE += $(LDFLAGS_BUILD_ID)
 LDFLAGS_vmlinux += $(LDFLAGS_BUILD_ID)
-- 
2.8.1

Re: [PATCH] tracing: add support for tracing MMIO helpers

2016-04-29 Thread Rabin Vincent

On Tue, Apr 26, 2016 at 09:14:47PM +0200, Arnd Bergmann wrote:
> On Tuesday 26 April 2016 21:04:42 Rabin Vincent wrote:
> > The support is added via  and as such is only
> > available on the archs which use that header.
> 
> Why that limitation? I think only few architectures use it. Maybe
> at least enable it for x86 as well?

Seems to work to on x86 (and presumably other archs as well, not tested
yet) to insert the include into linux/io.h instead of asm-generic/io.h.

> 
> > +/* This part must be outside protection */
> > +#include 
> > diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
> > index e45db6b0d878..feca834436c5 100644
> > --- a/kernel/trace/Kconfig
> > +++ b/kernel/trace/Kconfig
> > @@ -372,6 +372,22 @@ config STACK_TRACER
> >  
> >   Say N if unsure.
> >  
> > +config TRACE_MMIO_HELPERS
> > +   bool "Support for tracing MMIO helpers"
> > +   select GENERIC_TRACER
> 
> How about putting a whitelist of architectures here that are including
> the necessary definitions? Or better, a HAVE_TRACE_MMIO_HELPERS
> symbol that gets selected by architectures and that this depends on?

If it works with linux/io.h we won't need to restrict it to specific
archs, otherwise I'll add a symbol as you suggest.

> > +#define DEFINE_MMIO_RW_TRACE(c, type)  
> > \
> > +type read ## c ## _trace(const volatile void __iomem *addr,
> > \
> > +   const char *addrexp, bool relaxed,  \
> > +   unsigned long caller)   \
> > +{  \
> > +   type value; \
> > +   \
> > +   if (relaxed)\
> > +   value = read ## c ## _relaxed_notrace(addr);\
> > +   else\
> > +   value = read ## c ## _notrace(addr);\
> > +   \
> > +   trace_mmio_read(addr, addrexp, value,   \
> > +   sizeof(value), relaxed, caller);\
> > +   \
> > +   return value;   \
> > +}  \
> > +   \
> 
> We have a number of other accessors that are commonly used:
> 
> {ioread,iowrite}{8,16,32,64}{,_be}

I'll make the code handle these.

> {in,out}{b,w,l,q}

These are port-mapped IO accesors, aren't they?  They don't even take
pointer arguments so mmio:* tracepoints don't seem appropriate for them?

> {hi_lo_,lo_hi_}{read,write}q

There's only a single definition of these and that is in terms of
writel()/readl() so they should be coverd by the readl/writel case.

Re: [PATCH] tracing: add support for tracing MMIO helpers

2016-04-29 Thread Rabin Vincent

On Tue, Apr 26, 2016 at 09:14:47PM +0200, Arnd Bergmann wrote:
> On Tuesday 26 April 2016 21:04:42 Rabin Vincent wrote:
> > The support is added via  and as such is only
> > available on the archs which use that header.
> 
> Why that limitation? I think only few architectures use it. Maybe
> at least enable it for x86 as well?

Seems to work to on x86 (and presumably other archs as well, not tested
yet) to insert the include into linux/io.h instead of asm-generic/io.h.

> 
> > +/* This part must be outside protection */
> > +#include 
> > diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
> > index e45db6b0d878..feca834436c5 100644
> > --- a/kernel/trace/Kconfig
> > +++ b/kernel/trace/Kconfig
> > @@ -372,6 +372,22 @@ config STACK_TRACER
> >  
> >   Say N if unsure.
> >  
> > +config TRACE_MMIO_HELPERS
> > +   bool "Support for tracing MMIO helpers"
> > +   select GENERIC_TRACER
> 
> How about putting a whitelist of architectures here that are including
> the necessary definitions? Or better, a HAVE_TRACE_MMIO_HELPERS
> symbol that gets selected by architectures and that this depends on?

If it works with linux/io.h we won't need to restrict it to specific
archs, otherwise I'll add a symbol as you suggest.

> > +#define DEFINE_MMIO_RW_TRACE(c, type)  
> > \
> > +type read ## c ## _trace(const volatile void __iomem *addr,
> > \
> > +   const char *addrexp, bool relaxed,  \
> > +   unsigned long caller)   \
> > +{  \
> > +   type value; \
> > +   \
> > +   if (relaxed)\
> > +   value = read ## c ## _relaxed_notrace(addr);\
> > +   else\
> > +   value = read ## c ## _notrace(addr);\
> > +   \
> > +   trace_mmio_read(addr, addrexp, value,   \
> > +   sizeof(value), relaxed, caller);\
> > +   \
> > +   return value;   \
> > +}  \
> > +   \
> 
> We have a number of other accessors that are commonly used:
> 
> {ioread,iowrite}{8,16,32,64}{,_be}

I'll make the code handle these.

> {in,out}{b,w,l,q}

These are port-mapped IO accesors, aren't they?  They don't even take
pointer arguments so mmio:* tracepoints don't seem appropriate for them?

> {hi_lo_,lo_hi_}{read,write}q

There's only a single definition of these and that is in terms of
writel()/readl() so they should be coverd by the readl/writel case.

[PATCH] tracing: add support for tracing MMIO helpers

2016-04-26 Thread Rabin Vincent

Add support for tracing the MMIO helpers readl()/writel() (and their
variants), for use while developing and debugging device drivers.

The address and the value are saved along with the C expressions used in
the driver when calling these MMIO access functions, leading to a trace
of the driver's interactions with the hardware's registers:

 mmcqd/0-659 [000] d.H3 95.600911: mmio_write: mmci_request_end: 0xa08d200c 
[host->base + MMCICOMMAND] <= 0x [0]
 mmcqd/0-659 [000] d.H3 95.600945: mmio_read: mmci_irq: 0xa08d2034 [host->base 
+ MMCISTATUS] => 0x0400
 mmcqd/0-659 [000] d.H3 95.600953: mmio_read: mmci_irq: 0xa08d203c [host->base 
+ MMCIMASK0] => 0x03ff
 mmcqd/0-659 [000] d.H3 95.600961: mmio_write: mmci_irq: 0xa08d2038 [host->base 
+ MMCICLEAR] <= 0x [status]
 mmcqd/0-659 [000] d..2 95.601476: mmio_write: mmci_start_data: 0xa08d2024 
[base + MMCIDATATIMER] <= 0x00124f80 [timeout]
 mmcqd/0-659 [000] d..2 95.601498: mmio_write: mmci_start_data: 0xa08d2028 
[base + MMCIDATALENGTH] <= 0x3e00 [host->size]
 mmcqd/0-659 [000] d..2 95.601512: mmio_write: mmci_write_datactrlreg: 
0xa08d202c [host->base + MMCIDATACTRL] <= 0x0093 [datactrl]
 mmcqd/0-659 [000] d..2 95.601522: mmio_read: mmci_start_data: 0xa08d203c [base 
+ MMCIMASK0] => 0x03ff
 mmcqd/0-659 [000] d..2 95.601531: mmio_write: mmci_start_data: 0xa08d203c 
[base + MMCIMASK0] <= 0x02ff [readl(base + MMCIMASK0) & ~MCI_DATAENDMASK]
 mmcqd/0-659 [000] d..2 95.601540: mmio_write: mmci_set_mask1: 0xa08d2040 [base 
+ MMCIMASK1] <= 0x8000 [mask]
 mmcqd/0-659 [000] d..2 95.601550: mmio_read: mmci_start_command: 0xa08d200c 
[base + MMCICOMMAND] => 0x
 mmcqd/0-659 [000] d..2 95.601559: mmio_write: mmci_start_command: 0xa08d2008 
[base + MMCIARGUMENT] <= 0x7c00 [cmd->arg]
 mmcqd/0-659 [000] d..2 95.601568: mmio_write: mmci_start_command: 0xa08d200c 
[base + MMCICOMMAND] <= 0x0452 [c]
 mmcqd/0-659 [000] d.h3 95.601688: mmio_read: mmci_irq: 0xa08d2034 [host->base 
+ MMCISTATUS] => 0x0022a440
 mmcqd/0-659 [000] d.h3 95.601708: mmio_read: mmci_irq: 0xa08d203c [host->base 
+ MMCIMASK0] => 0x02ff
 mmcqd/0-659 [000] d.h3 95.601718: mmio_write: mmci_irq: 0xa08d2038 [host->base 
+ MMCICLEAR] <= 0x0040 [status]

We do not globally replace the MMIO helpers.  Instead, tracing must be
explicitly enabled in the required driver source files by #defining
TRACE_MMIO_HELPERS at the top of the file.

The support is added via  and as such is only
available on the archs which use that header.

Signed-off-by: Rabin Vincent <ra...@rab.in>
---
 include/asm-generic/io.h   |  4 ++
 include/linux/trace_mmio_helpers.h | 98 ++
 include/trace/events/mmio.h| 84 
 kernel/trace/Kconfig   | 16 +++
 kernel/trace/Makefile  |  1 +
 kernel/trace/trace_mmio_helpers.c  | 45 +
 6 files changed, 248 insertions(+)
 create mode 100644 include/linux/trace_mmio_helpers.h
 create mode 100644 include/trace/events/mmio.h
 create mode 100644 kernel/trace/trace_mmio_helpers.c

diff --git a/include/asm-generic/io.h b/include/asm-generic/io.h
index eed3bbe88c8a..676ed9d4cddf 100644
--- a/include/asm-generic/io.h
+++ b/include/asm-generic/io.h
@@ -699,6 +699,10 @@ static inline void iowrite32_rep(volatile void __iomem 
*addr,
 #endif
 #endif /* CONFIG_GENERIC_IOMAP */
 
+#ifdef TRACE_MMIO_HELPERS
+#include 
+#endif
+
 #ifdef __KERNEL__
 
 #include 
diff --git a/include/linux/trace_mmio_helpers.h 
b/include/linux/trace_mmio_helpers.h
new file mode 100644
index ..463f2e38adad
--- /dev/null
+++ b/include/linux/trace_mmio_helpers.h
@@ -0,0 +1,98 @@
+#ifndef _LINUX_TRACE_MMIO_HELPERS_H
+#define _LINUX_TRACE_MMIO_HELPERS_H
+
+#ifdef CONFIG_TRACE_MMIO_HELPERS
+
+#define DECLARE_MMIO_RW_TRACE(c, type) \
+static inline type \
+read ## c ## _notrace(const volatile void __iomem *addr)   \
+{  \
+   return read ## c(addr); \
+}  \
+   \
+static inline type \
+read ## c ## _relaxed_notrace(const volatile void __iomem *addr)   \
+{  \
+   return read ## c ## _relaxed(addr); \
+}  \
+   \
+static inline void

[PATCH] tracing: add support for tracing MMIO helpers

2016-04-26 Thread Rabin Vincent

Add support for tracing the MMIO helpers readl()/writel() (and their
variants), for use while developing and debugging device drivers.

The address and the value are saved along with the C expressions used in
the driver when calling these MMIO access functions, leading to a trace
of the driver's interactions with the hardware's registers:

 mmcqd/0-659 [000] d.H3 95.600911: mmio_write: mmci_request_end: 0xa08d200c 
[host->base + MMCICOMMAND] <= 0x [0]
 mmcqd/0-659 [000] d.H3 95.600945: mmio_read: mmci_irq: 0xa08d2034 [host->base 
+ MMCISTATUS] => 0x0400
 mmcqd/0-659 [000] d.H3 95.600953: mmio_read: mmci_irq: 0xa08d203c [host->base 
+ MMCIMASK0] => 0x03ff
 mmcqd/0-659 [000] d.H3 95.600961: mmio_write: mmci_irq: 0xa08d2038 [host->base 
+ MMCICLEAR] <= 0x [status]
 mmcqd/0-659 [000] d..2 95.601476: mmio_write: mmci_start_data: 0xa08d2024 
[base + MMCIDATATIMER] <= 0x00124f80 [timeout]
 mmcqd/0-659 [000] d..2 95.601498: mmio_write: mmci_start_data: 0xa08d2028 
[base + MMCIDATALENGTH] <= 0x3e00 [host->size]
 mmcqd/0-659 [000] d..2 95.601512: mmio_write: mmci_write_datactrlreg: 
0xa08d202c [host->base + MMCIDATACTRL] <= 0x0093 [datactrl]
 mmcqd/0-659 [000] d..2 95.601522: mmio_read: mmci_start_data: 0xa08d203c [base 
+ MMCIMASK0] => 0x03ff
 mmcqd/0-659 [000] d..2 95.601531: mmio_write: mmci_start_data: 0xa08d203c 
[base + MMCIMASK0] <= 0x02ff [readl(base + MMCIMASK0) & ~MCI_DATAENDMASK]
 mmcqd/0-659 [000] d..2 95.601540: mmio_write: mmci_set_mask1: 0xa08d2040 [base 
+ MMCIMASK1] <= 0x8000 [mask]
 mmcqd/0-659 [000] d..2 95.601550: mmio_read: mmci_start_command: 0xa08d200c 
[base + MMCICOMMAND] => 0x
 mmcqd/0-659 [000] d..2 95.601559: mmio_write: mmci_start_command: 0xa08d2008 
[base + MMCIARGUMENT] <= 0x7c00 [cmd->arg]
 mmcqd/0-659 [000] d..2 95.601568: mmio_write: mmci_start_command: 0xa08d200c 
[base + MMCICOMMAND] <= 0x0452 [c]
 mmcqd/0-659 [000] d.h3 95.601688: mmio_read: mmci_irq: 0xa08d2034 [host->base 
+ MMCISTATUS] => 0x0022a440
 mmcqd/0-659 [000] d.h3 95.601708: mmio_read: mmci_irq: 0xa08d203c [host->base 
+ MMCIMASK0] => 0x02ff
 mmcqd/0-659 [000] d.h3 95.601718: mmio_write: mmci_irq: 0xa08d2038 [host->base 
+ MMCICLEAR] <= 0x0040 [status]

We do not globally replace the MMIO helpers.  Instead, tracing must be
explicitly enabled in the required driver source files by #defining
TRACE_MMIO_HELPERS at the top of the file.

The support is added via  and as such is only
available on the archs which use that header.

Signed-off-by: Rabin Vincent 
---
 include/asm-generic/io.h   |  4 ++
 include/linux/trace_mmio_helpers.h | 98 ++
 include/trace/events/mmio.h| 84 
 kernel/trace/Kconfig   | 16 +++
 kernel/trace/Makefile  |  1 +
 kernel/trace/trace_mmio_helpers.c  | 45 +
 6 files changed, 248 insertions(+)
 create mode 100644 include/linux/trace_mmio_helpers.h
 create mode 100644 include/trace/events/mmio.h
 create mode 100644 kernel/trace/trace_mmio_helpers.c

diff --git a/include/asm-generic/io.h b/include/asm-generic/io.h
index eed3bbe88c8a..676ed9d4cddf 100644
--- a/include/asm-generic/io.h
+++ b/include/asm-generic/io.h
@@ -699,6 +699,10 @@ static inline void iowrite32_rep(volatile void __iomem 
*addr,
 #endif
 #endif /* CONFIG_GENERIC_IOMAP */
 
+#ifdef TRACE_MMIO_HELPERS
+#include 
+#endif
+
 #ifdef __KERNEL__
 
 #include 
diff --git a/include/linux/trace_mmio_helpers.h 
b/include/linux/trace_mmio_helpers.h
new file mode 100644
index ..463f2e38adad
--- /dev/null
+++ b/include/linux/trace_mmio_helpers.h
@@ -0,0 +1,98 @@
+#ifndef _LINUX_TRACE_MMIO_HELPERS_H
+#define _LINUX_TRACE_MMIO_HELPERS_H
+
+#ifdef CONFIG_TRACE_MMIO_HELPERS
+
+#define DECLARE_MMIO_RW_TRACE(c, type) \
+static inline type \
+read ## c ## _notrace(const volatile void __iomem *addr)   \
+{  \
+   return read ## c(addr); \
+}  \
+   \
+static inline type \
+read ## c ## _relaxed_notrace(const volatile void __iomem *addr)   \
+{  \
+   return read ## c ## _relaxed(addr); \
+}  \
+   \
+static inline void

[tip:sched/core] sched/debug: Don't dump sched debug info in SysRq-W

2016-04-13 Thread tip-bot for Rabin Vincent

Commit-ID:  fb90a6e93c0684ab2629a42462400603aa829b9c
Gitweb: http://git.kernel.org/tip/fb90a6e93c0684ab2629a42462400603aa829b9c
Author: Rabin Vincent <rab...@axis.com>
AuthorDate: Mon, 4 Apr 2016 15:42:02 +0200
Committer:  Ingo Molnar <mi...@kernel.org>
CommitDate: Wed, 13 Apr 2016 11:23:21 +0200

sched/debug: Don't dump sched debug info in SysRq-W

sysrq_sched_debug_show() can dump a lot of information.  Don't print out
all that if we're just trying to get a list of blocked tasks (SysRq-W).
The information is still accessible with SysRq-T.

Signed-off-by: Rabin Vincent <rab...@axis.com>
Cc: Linus Torvalds <torva...@linux-foundation.org>
Cc: Peter Zijlstra <pet...@infradead.org>
Cc: Steven Rostedt <rost...@goodmis.org>
Cc: Thomas Gleixner <t...@linutronix.de>
Link: 
http://lkml.kernel.org/r/1459777322-30902-1-git-send-email-rabin.vinc...@axis.com
Signed-off-by: Ingo Molnar <mi...@kernel.org>
---
 kernel/sched/core.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 1159423..06efbb9 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5050,7 +5050,8 @@ void show_state_filter(unsigned long state_filter)
touch_all_softlockup_watchdogs();
 
 #ifdef CONFIG_SCHED_DEBUG
-   sysrq_sched_debug_show();
+   if (!state_filter)
+   sysrq_sched_debug_show();
 #endif
rcu_read_unlock();
/*

[tip:sched/core] sched/debug: Don't dump sched debug info in SysRq-W

2016-04-13 Thread tip-bot for Rabin Vincent

Commit-ID:  fb90a6e93c0684ab2629a42462400603aa829b9c
Gitweb: http://git.kernel.org/tip/fb90a6e93c0684ab2629a42462400603aa829b9c
Author: Rabin Vincent 
AuthorDate: Mon, 4 Apr 2016 15:42:02 +0200
Committer:  Ingo Molnar 
CommitDate: Wed, 13 Apr 2016 11:23:21 +0200

sched/debug: Don't dump sched debug info in SysRq-W

sysrq_sched_debug_show() can dump a lot of information.  Don't print out
all that if we're just trying to get a list of blocked tasks (SysRq-W).
The information is still accessible with SysRq-T.

Signed-off-by: Rabin Vincent 
Cc: Linus Torvalds 
Cc: Peter Zijlstra 
Cc: Steven Rostedt 
Cc: Thomas Gleixner 
Link: 
http://lkml.kernel.org/r/1459777322-30902-1-git-send-email-rabin.vinc...@axis.com
Signed-off-by: Ingo Molnar 
---
 kernel/sched/core.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 1159423..06efbb9 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5050,7 +5050,8 @@ void show_state_filter(unsigned long state_filter)
touch_all_softlockup_watchdogs();
 
 #ifdef CONFIG_SCHED_DEBUG
-   sysrq_sched_debug_show();
+   if (!state_filter)
+   sysrq_sched_debug_show();
 #endif
rcu_read_unlock();
/*

[PATCH] sched: don't dump sched debug info in SysRq-W

2016-04-04 Thread Rabin Vincent

From: Rabin Vincent <rab...@axis.com>

sysrq_sched_debug_show() can dump a lot of information.  Don't print out
all that if we're just trying to get a list of blocked tasks (SysRq-W).
The information is still accessible with SysRq-T.

Signed-off-by: Rabin Vincent <rab...@axis.com>
---
 kernel/sched/core.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 8b489fc..e330d6c 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4998,7 +4998,8 @@ void show_state_filter(unsigned long state_filter)
touch_all_softlockup_watchdogs();
 
 #ifdef CONFIG_SCHED_DEBUG
-   sysrq_sched_debug_show();
+   if (!state_filter)
+   sysrq_sched_debug_show();
 #endif
rcu_read_unlock();
/*
-- 
2.7.0

[PATCH] sched: don't dump sched debug info in SysRq-W

2016-04-04 Thread Rabin Vincent

From: Rabin Vincent 

sysrq_sched_debug_show() can dump a lot of information.  Don't print out
all that if we're just trying to get a list of blocked tasks (SysRq-W).
The information is still accessible with SysRq-T.

Signed-off-by: Rabin Vincent 
---
 kernel/sched/core.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 8b489fc..e330d6c 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4998,7 +4998,8 @@ void show_state_filter(unsigned long state_filter)
touch_all_softlockup_watchdogs();
 
 #ifdef CONFIG_SCHED_DEBUG
-   sysrq_sched_debug_show();
+   if (!state_filter)
+   sysrq_sched_debug_show();
 #endif
rcu_read_unlock();
/*
-- 
2.7.0

Re: [PATCH] clockevents/drivers/arm_global_timer: Fix erratum 740657 workaround

2016-04-04 Thread Rabin Vincent

On Tue, Mar 29, 2016 at 05:13:05PM +0100, Marc Zyngier wrote:
> - What if my timer is not connected to a controller that implements this
> API? Something that is not a GIC, for example?

As you know irq_set_irqchip_state() does nothing in that case, and the erratum
is thus presumably not worked around in such a system.

By the way, while the original document from ARM is not public, this
erratum can be seen here (among other places):
http://www.nxp.com/files/32bit/doc/errata/IMX6DQCE.pdf#page=13
https://www.altera.com/content/dam/altera-www/global/en_US/pdfs/literature/es/es-01042.pdf#page=26

> - How does it work when the GIC (with EOImode==1) performs a priority
> drop (by writing to the EOI register) before calling into the timer
> handler, and finishing the handling with a write to DIR?

I guess it doesn't.

> - What are the comparative costs of taking a spurious (but nonetheless
> harmless) interrupt vs poking the distributor (which is by no mean a
> cheap operation)?

The time taken by irq_set_irqchip_state() appears to be about 15%
shorter than the time gic_handle_irq() and its called functions take to
handle the spurious interrrupt.  The overhead of going in and out of
interrupt context for the spurious interrupt is not included in the
latter time.

Re: [PATCH] clockevents/drivers/arm_global_timer: Fix erratum 740657 workaround

2016-04-04 Thread Rabin Vincent

On Tue, Mar 29, 2016 at 05:13:05PM +0100, Marc Zyngier wrote:
> - What if my timer is not connected to a controller that implements this
> API? Something that is not a GIC, for example?

As you know irq_set_irqchip_state() does nothing in that case, and the erratum
is thus presumably not worked around in such a system.

By the way, while the original document from ARM is not public, this
erratum can be seen here (among other places):
http://www.nxp.com/files/32bit/doc/errata/IMX6DQCE.pdf#page=13
https://www.altera.com/content/dam/altera-www/global/en_US/pdfs/literature/es/es-01042.pdf#page=26

> - How does it work when the GIC (with EOImode==1) performs a priority
> drop (by writing to the EOI register) before calling into the timer
> handler, and finishing the handling with a write to DIR?

I guess it doesn't.

> - What are the comparative costs of taking a spurious (but nonetheless
> harmless) interrupt vs poking the distributor (which is by no mean a
> cheap operation)?

The time taken by irq_set_irqchip_state() appears to be about 15%
shorter than the time gic_handle_irq() and its called functions take to
handle the spurious interrrupt.  The overhead of going in and out of
interrupt context for the spurious interrupt is not included in the
latter time.

Re: [PATCH] clockevents/drivers/arm_global_timer: Fix erratum 740657 workaround

2016-04-04 Thread Rabin Vincent

On Tue, Mar 29, 2016 at 04:21:03PM +0200, Daniel Lezcano wrote:
> On 03/29/2016 10:08 AM, Rabin Vincent wrote:
> >  -0 [001] d.h2 99.850527: irq_handler_entry: irq=16 name=gt
> >  -0 [001] d.h2 99.850538: irq_handler_exit: irq=16 ret=handled
> >  -0 [001] d.H2 99.850540: irq_handler_entry: irq=16 name=gt
> >  -0 [001] d.H2 99.850542: irq_handler_exit: irq=16 ret=unhandled
> >  -0 [001] d.h2 99.987832: irq_handler_entry: irq=16 name=gt
> >  -0 [001] dnh2 99.987845: irq_handler_exit: irq=16 ret=handled
> >  -0 [001] dnh2 99.987848: irq_handler_entry: irq=16 name=gt
> >  -0 [001] dnh2 99.987850: irq_handler_exit: irq=16 ret=unhandled
> 
> what platform are you using to test that ?

ARTPEC-6.

Re: [PATCH] clockevents/drivers/arm_global_timer: Fix erratum 740657 workaround

2016-04-04 Thread Rabin Vincent

On Tue, Mar 29, 2016 at 04:21:03PM +0200, Daniel Lezcano wrote:
> On 03/29/2016 10:08 AM, Rabin Vincent wrote:
> >  -0 [001] d.h2 99.850527: irq_handler_entry: irq=16 name=gt
> >  -0 [001] d.h2 99.850538: irq_handler_exit: irq=16 ret=handled
> >  -0 [001] d.H2 99.850540: irq_handler_entry: irq=16 name=gt
> >  -0 [001] d.H2 99.850542: irq_handler_exit: irq=16 ret=unhandled
> >  -0 [001] d.h2 99.987832: irq_handler_entry: irq=16 name=gt
> >  -0 [001] dnh2 99.987845: irq_handler_exit: irq=16 ret=handled
> >  -0 [001] dnh2 99.987848: irq_handler_entry: irq=16 name=gt
> >  -0 [001] dnh2 99.987850: irq_handler_exit: irq=16 ret=unhandled
> 
> what platform are you using to test that ?

ARTPEC-6.

[PATCH] clockevents/drivers/arm_global_timer: Fix erratum 740657 workaround

2016-03-29 Thread Rabin Vincent

From: Rabin Vincent <rab...@axis.com>

According to the errata document for the Cortex A9 MPCore, the correct
code sequence in the interrupt handler to workaround erratum 740657
"Global Timer can send two interrupts for the same event" is:

 (1) Read the ICCIAR (Interrupt Acknowledge) register
 (2) Modify the comparator value, to set it to a higher value
 (3) Clear the Global Timer flag
 (4) Clear the Pending Status information for Interrupt 27 (Global
 Timer interrupt) in the Distributor of the Interrupt Controller.
 (5) Write the ICCEOIR (End of Interrupt) register

(1) and (5) are done by the GIC driver and (2) and (3) are done by the
Global Timer driver.  However, nobody does (4) and thus the workaround
is inactive and the timer triggers many spurious interrupts:

 -0 [001] d.h2 99.850527: irq_handler_entry: irq=16 name=gt
 -0 [001] d.h2 99.850538: irq_handler_exit: irq=16 ret=handled
 -0 [001] d.H2 99.850540: irq_handler_entry: irq=16 name=gt
 -0 [001] d.H2 99.850542: irq_handler_exit: irq=16 ret=unhandled
 -0 [001] d.h2 99.987832: irq_handler_entry: irq=16 name=gt
 -0 [001] dnh2 99.987845: irq_handler_exit: irq=16 ret=handled
 -0 [001] dnh2 99.987848: irq_handler_entry: irq=16 name=gt
 -0 [001] dnh2 99.987850: irq_handler_exit: irq=16 ret=unhandled

Make the Global Timer driver perform step (4) via the GIC driver with
the help of the irq_set_irqchip_state() function, to prevent the
spurious interrupts.

Signed-off-by: Rabin Vincent <rab...@axis.com>
---
 drivers/clocksource/arm_global_timer.c | 21 +
 1 file changed, 13 insertions(+), 8 deletions(-)

diff --git a/drivers/clocksource/arm_global_timer.c 
b/drivers/clocksource/arm_global_timer.c
index 9df0d16..b9d0f86 100644
--- a/drivers/clocksource/arm_global_timer.c
+++ b/drivers/clocksource/arm_global_timer.c
@@ -140,26 +140,31 @@ static int gt_clockevent_set_next_event(unsigned long evt,
 static irqreturn_t gt_clockevent_interrupt(int irq, void *dev_id)
 {
struct clock_event_device *evt = dev_id;
+   bool workaround = clockevent_state_oneshot(evt);
 
if (!(readl_relaxed(gt_base + GT_INT_STATUS) &
GT_INT_STATUS_EVENT_FLAG))
return IRQ_NONE;
 
/**
-* ERRATA 740657( Global Timer can send 2 interrupts for
+* ERRATA 740657 (Global Timer can send 2 interrupts for
 * the same event in single-shot mode)
 * Workaround:
-*  Either disable single-shot mode.
-*  Or
-*  Modify the Interrupt Handler to avoid the
-*  offending sequence. This is achieved by clearing
-*  the Global Timer flag _after_ having incremented
-*  the Comparator register value to a higher value.
+* - Read the ICCIAR (Interrupt Acknowledge) register
+* - Modify the comparator value, to set it to a higher value
+* - Clear the Global Timer flag
+* - Clear the Pending Status information for Interrupt 27 (Global
+*   Timer interrupt) in the Distributor of the Interrupt Controller.
+* - Write the ICCEOIR (End of Interrupt) register
 */
-   if (clockevent_state_oneshot(evt))
+   if (workaround)
gt_compare_set(ULONG_MAX, 0);
 
writel_relaxed(GT_INT_STATUS_EVENT_FLAG, gt_base + GT_INT_STATUS);
+
+   if (workaround)
+   irq_set_irqchip_state(irq, IRQCHIP_STATE_PENDING, false);
+
evt->event_handler(evt);
 
return IRQ_HANDLED;
-- 
2.7.0

[PATCH] clockevents/drivers/arm_global_timer: Fix erratum 740657 workaround

2016-03-29 Thread Rabin Vincent

From: Rabin Vincent 

According to the errata document for the Cortex A9 MPCore, the correct
code sequence in the interrupt handler to workaround erratum 740657
"Global Timer can send two interrupts for the same event" is:

 (1) Read the ICCIAR (Interrupt Acknowledge) register
 (2) Modify the comparator value, to set it to a higher value
 (3) Clear the Global Timer flag
 (4) Clear the Pending Status information for Interrupt 27 (Global
 Timer interrupt) in the Distributor of the Interrupt Controller.
 (5) Write the ICCEOIR (End of Interrupt) register

(1) and (5) are done by the GIC driver and (2) and (3) are done by the
Global Timer driver.  However, nobody does (4) and thus the workaround
is inactive and the timer triggers many spurious interrupts:

 -0 [001] d.h2 99.850527: irq_handler_entry: irq=16 name=gt
 -0 [001] d.h2 99.850538: irq_handler_exit: irq=16 ret=handled
 -0 [001] d.H2 99.850540: irq_handler_entry: irq=16 name=gt
 -0 [001] d.H2 99.850542: irq_handler_exit: irq=16 ret=unhandled
 -0 [001] d.h2 99.987832: irq_handler_entry: irq=16 name=gt
 -0 [001] dnh2 99.987845: irq_handler_exit: irq=16 ret=handled
 -0 [001] dnh2 99.987848: irq_handler_entry: irq=16 name=gt
 -0 [001] dnh2 99.987850: irq_handler_exit: irq=16 ret=unhandled

Make the Global Timer driver perform step (4) via the GIC driver with
the help of the irq_set_irqchip_state() function, to prevent the
spurious interrupts.

Signed-off-by: Rabin Vincent 
---
 drivers/clocksource/arm_global_timer.c | 21 +
 1 file changed, 13 insertions(+), 8 deletions(-)

diff --git a/drivers/clocksource/arm_global_timer.c 
b/drivers/clocksource/arm_global_timer.c
index 9df0d16..b9d0f86 100644
--- a/drivers/clocksource/arm_global_timer.c
+++ b/drivers/clocksource/arm_global_timer.c
@@ -140,26 +140,31 @@ static int gt_clockevent_set_next_event(unsigned long evt,
 static irqreturn_t gt_clockevent_interrupt(int irq, void *dev_id)
 {
struct clock_event_device *evt = dev_id;
+   bool workaround = clockevent_state_oneshot(evt);
 
if (!(readl_relaxed(gt_base + GT_INT_STATUS) &
GT_INT_STATUS_EVENT_FLAG))
return IRQ_NONE;
 
/**
-* ERRATA 740657( Global Timer can send 2 interrupts for
+* ERRATA 740657 (Global Timer can send 2 interrupts for
 * the same event in single-shot mode)
 * Workaround:
-*  Either disable single-shot mode.
-*  Or
-*  Modify the Interrupt Handler to avoid the
-*  offending sequence. This is achieved by clearing
-*  the Global Timer flag _after_ having incremented
-*  the Comparator register value to a higher value.
+* - Read the ICCIAR (Interrupt Acknowledge) register
+* - Modify the comparator value, to set it to a higher value
+* - Clear the Global Timer flag
+* - Clear the Pending Status information for Interrupt 27 (Global
+*   Timer interrupt) in the Distributor of the Interrupt Controller.
+* - Write the ICCEOIR (End of Interrupt) register
 */
-   if (clockevent_state_oneshot(evt))
+   if (workaround)
gt_compare_set(ULONG_MAX, 0);
 
writel_relaxed(GT_INT_STATUS_EVENT_FLAG, gt_base + GT_INT_STATUS);
+
+   if (workaround)
+   irq_set_irqchip_state(irq, IRQCHIP_STATE_PENDING, false);
+
evt->event_handler(evt);
 
return IRQ_HANDLED;
-- 
2.7.0

Re: [PATCH v2] ARM: DMA: Fix kzalloc flags in __dma_alloc

2016-03-19 Thread Rabin Vincent

On Fri, Mar 18, 2016 at 06:28:49PM +0900, Alexandre Courbot wrote:
> Commit 19e6e5e5392b ("ARM: 8547/1: dma-mapping: store buffer
> information") allocates a structure meant for internal buffer management
> with the GFP flags of the buffer itself. This can trigger the following
> safeguard in the slab/slub allocator:
> 
>   if (unlikely(flags & GFP_SLAB_BUG_MASK)) {
>   pr_emerg("gfp: %u\n", flags & GFP_SLAB_BUG_MASK);
>   BUG();
>   }
> 
> Fix this by filtering the flags that make the slab allocator unhappy.
> 
> Signed-off-by: Alexandre Courbot <acour...@nvidia.com>
> Cc: Rabin Vincent <ra...@rab.in>
> ---
> Changes since v1:
> - Filter flags that may cause problem instead of forcing GFP_KERNEL
>   (and risk sleeping in atomic context), as suggested by Rabin.

Acked-by: Rabin Vincent <ra...@rab.in>

Thanks.

Re: [PATCH v2] ARM: DMA: Fix kzalloc flags in __dma_alloc

2016-03-19 Thread Rabin Vincent

On Fri, Mar 18, 2016 at 06:28:49PM +0900, Alexandre Courbot wrote:
> Commit 19e6e5e5392b ("ARM: 8547/1: dma-mapping: store buffer
> information") allocates a structure meant for internal buffer management
> with the GFP flags of the buffer itself. This can trigger the following
> safeguard in the slab/slub allocator:
> 
>   if (unlikely(flags & GFP_SLAB_BUG_MASK)) {
>   pr_emerg("gfp: %u\n", flags & GFP_SLAB_BUG_MASK);
>   BUG();
>   }
> 
> Fix this by filtering the flags that make the slab allocator unhappy.
> 
> Signed-off-by: Alexandre Courbot 
> Cc: Rabin Vincent 
> ---
> Changes since v1:
> - Filter flags that may cause problem instead of forcing GFP_KERNEL
>   (and risk sleeping in atomic context), as suggested by Rabin.

Acked-by: Rabin Vincent 

Thanks.

Re: [PATCH] ARM: DMA: Fix kzalloc flags in __dma_alloc

2016-03-19 Thread Rabin Vincent

On Fri, Mar 18, 2016 at 11:12:26AM +0900, Alexandre Courbot wrote:
> Commit 19e6e5e5392b ("ARM: 8547/1: dma-mapping: store buffer
> information") allocates a structure meant for internal buffer management
> with the GFP flags of the buffer itself. This can trigger the following
> safeguard in the slab/slub allocator:
> 
>   if (unlikely(flags & GFP_SLAB_BUG_MASK)) {
>   pr_emerg("gfp: %u\n", flags & GFP_SLAB_BUG_MASK);
>   BUG();
>   }
> 
> Fix this by allocating the structure with GFP_KERNEL, as it is meant to
> be used by the kernel and not for DMA.

We can't use GFP_KERNEL here.  The caller may have passed in gfp flags
which indicate that we can't sleep, and we need to respect that.  What we can
do is mask out the region specifiers in the gfp flags that we pass to
kzalloc().  This is what the other architectures do in their dma
allocation functions:

arch/mips/cavium-octeon/dma-octeon.c:   gfp &= ~(__GFP_DMA | __GFP_DMA32 | 
__GFP_HIGHMEM);
arch/mips/loongson64/common/dma-swiotlb.c:  gfp &= ~(__GFP_DMA | 
__GFP_DMA32 | __GFP_HIGHMEM);
arch/mips/mm/dma-default.c: gfp &= ~(__GFP_DMA | __GFP_DMA32 | 
__GFP_HIGHMEM);
arch/mips/netlogic/common/nlm-dma.c:gfp &= ~(__GFP_DMA | __GFP_DMA32 | 
__GFP_HIGHMEM);
arch/x86/kernel/pci-dma.c:  *gfp &= ~(__GFP_DMA | __GFP_HIGHMEM | 
__GFP_DMA32);

Re: [PATCH] ARM: DMA: Fix kzalloc flags in __dma_alloc

2016-03-19 Thread Rabin Vincent

On Fri, Mar 18, 2016 at 11:12:26AM +0900, Alexandre Courbot wrote:
> Commit 19e6e5e5392b ("ARM: 8547/1: dma-mapping: store buffer
> information") allocates a structure meant for internal buffer management
> with the GFP flags of the buffer itself. This can trigger the following
> safeguard in the slab/slub allocator:
> 
>   if (unlikely(flags & GFP_SLAB_BUG_MASK)) {
>   pr_emerg("gfp: %u\n", flags & GFP_SLAB_BUG_MASK);
>   BUG();
>   }
> 
> Fix this by allocating the structure with GFP_KERNEL, as it is meant to
> be used by the kernel and not for DMA.

We can't use GFP_KERNEL here.  The caller may have passed in gfp flags
which indicate that we can't sleep, and we need to respect that.  What we can
do is mask out the region specifiers in the gfp flags that we pass to
kzalloc().  This is what the other architectures do in their dma
allocation functions:

arch/mips/cavium-octeon/dma-octeon.c:   gfp &= ~(__GFP_DMA | __GFP_DMA32 | 
__GFP_HIGHMEM);
arch/mips/loongson64/common/dma-swiotlb.c:  gfp &= ~(__GFP_DMA | 
__GFP_DMA32 | __GFP_HIGHMEM);
arch/mips/mm/dma-default.c: gfp &= ~(__GFP_DMA | __GFP_DMA32 | 
__GFP_HIGHMEM);
arch/mips/netlogic/common/nlm-dma.c:gfp &= ~(__GFP_DMA | __GFP_DMA32 | 
__GFP_HIGHMEM);
arch/x86/kernel/pci-dma.c:  *gfp &= ~(__GFP_DMA | __GFP_HIGHMEM | 
__GFP_DMA32);

Re: [PATCH] block: protect iterate_bdevs() against concurrent close

2016-03-14 Thread Rabin Vincent

(fixed Jens' address)

On Thu, Mar 10, 2016 at 06:37:27PM +0100, Jan Kara wrote:
> On Thu 10-03-16 13:26:03, Rabin Vincent wrote:
> > If a block device is closed while iterate_bdevs() is handling it, the
> > following NULL pointer dereference occurs because bdev->b_disk is NULL
> > in bdev_get_queue(), which is called from blk_get_backing_dev_info() (in
> > turn called by the mapping_cap_writeback_dirty() call in
> > __filemap_fdatawrite_range()):
> 
> Thanks for spotting the problem. The patch will fix the problem you
> found.  But what prevents e.g. flusher thread from trying to writeback
> the block device inode while that gets invalidated at the same moment?

Don't the sync_block_dev() / bdev_write_inode() calls in __blkdev_put()
prevent this?

Re: [PATCH] block: protect iterate_bdevs() against concurrent close

2016-03-14 Thread Rabin Vincent

(fixed Jens' address)

On Thu, Mar 10, 2016 at 06:37:27PM +0100, Jan Kara wrote:
> On Thu 10-03-16 13:26:03, Rabin Vincent wrote:
> > If a block device is closed while iterate_bdevs() is handling it, the
> > following NULL pointer dereference occurs because bdev->b_disk is NULL
> > in bdev_get_queue(), which is called from blk_get_backing_dev_info() (in
> > turn called by the mapping_cap_writeback_dirty() call in
> > __filemap_fdatawrite_range()):
> 
> Thanks for spotting the problem. The patch will fix the problem you
> found.  But what prevents e.g. flusher thread from trying to writeback
> the block device inode while that gets invalidated at the same moment?

Don't the sync_block_dev() / bdev_write_inode() calls in __blkdev_put()
prevent this?

[PATCH] splice: handle zero nr_pages in splice_to_pipe()

2016-03-10 Thread Rabin Vincent

Running the following command:

 busybox cat /sys/kernel/debug/tracing/trace_pipe > /dev/null

with any tracing enabled pretty very quickly leads to various NULL
pointer dereferences and VM BUG_ON()s, such as these:

 BUG: unable to handle kernel NULL pointer dereference at 0020
 IP: [] generic_pipe_buf_release+0xc/0x40
 Call Trace:
  [] splice_direct_to_actor+0x143/0x1e0
  [] ? generic_pipe_buf_nosteal+0x10/0x10
  [] do_splice_direct+0x8f/0xb0
  [] do_sendfile+0x199/0x380
  [] SyS_sendfile64+0x90/0xa0
  [] entry_SYSCALL_64_fastpath+0x12/0x6d

 page dumped because: VM_BUG_ON_PAGE(atomic_read(>_count) == 0)
 kernel BUG at include/linux/mm.h:367!
 invalid opcode:  [#1] PREEMPT SMP DEBUG_PAGEALLOC
 RIP: [] generic_pipe_buf_release+0x3c/0x40
 Call Trace:
  [] splice_direct_to_actor+0x143/0x1e0
  [] ? generic_pipe_buf_nosteal+0x10/0x10
  [] do_splice_direct+0x8f/0xb0
  [] do_sendfile+0x199/0x380
  [] SyS_sendfile64+0x90/0xa0
  [] tracesys_phase2+0x84/0x89

(busybox's cat uses sendfile(2), unlike the coreutils version)

This is because tracing_splice_read_pipe() can call splice_to_pipe()
with spd->nr_pages == 0.  spd_pages underflows in splice_to_pipe() and
we fill the page pointers and the other fields of the pipe_buffers with
garbage.

All other callers of splice_to_pipe() avoid calling it when nr_pages ==
0, and we could make tracing_splice_read_pipe() do that too, but it
seems reasonable to have splice_to_page() handle this condition
gracefully.

Cc: sta...@vger.kernel.org
Signed-off-by: Rabin Vincent <ra...@rab.in>
---
 fs/splice.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/fs/splice.c b/fs/splice.c
index 82bc0d64fc38..19e0b103d253 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -185,6 +185,9 @@ ssize_t splice_to_pipe(struct pipe_inode_info *pipe,
unsigned int spd_pages = spd->nr_pages;
int ret, do_wakeup, page_nr;
 
+   if (!spd_pages)
+   return 0;
+
ret = 0;
do_wakeup = 0;
page_nr = 0;
-- 
2.7.0

[PATCH] splice: handle zero nr_pages in splice_to_pipe()

2016-03-10 Thread Rabin Vincent

Running the following command:

 busybox cat /sys/kernel/debug/tracing/trace_pipe > /dev/null

with any tracing enabled pretty very quickly leads to various NULL
pointer dereferences and VM BUG_ON()s, such as these:

 BUG: unable to handle kernel NULL pointer dereference at 0020
 IP: [] generic_pipe_buf_release+0xc/0x40
 Call Trace:
  [] splice_direct_to_actor+0x143/0x1e0
  [] ? generic_pipe_buf_nosteal+0x10/0x10
  [] do_splice_direct+0x8f/0xb0
  [] do_sendfile+0x199/0x380
  [] SyS_sendfile64+0x90/0xa0
  [] entry_SYSCALL_64_fastpath+0x12/0x6d

 page dumped because: VM_BUG_ON_PAGE(atomic_read(>_count) == 0)
 kernel BUG at include/linux/mm.h:367!
 invalid opcode:  [#1] PREEMPT SMP DEBUG_PAGEALLOC
 RIP: [] generic_pipe_buf_release+0x3c/0x40
 Call Trace:
  [] splice_direct_to_actor+0x143/0x1e0
  [] ? generic_pipe_buf_nosteal+0x10/0x10
  [] do_splice_direct+0x8f/0xb0
  [] do_sendfile+0x199/0x380
  [] SyS_sendfile64+0x90/0xa0
  [] tracesys_phase2+0x84/0x89

(busybox's cat uses sendfile(2), unlike the coreutils version)

This is because tracing_splice_read_pipe() can call splice_to_pipe()
with spd->nr_pages == 0.  spd_pages underflows in splice_to_pipe() and
we fill the page pointers and the other fields of the pipe_buffers with
garbage.

All other callers of splice_to_pipe() avoid calling it when nr_pages ==
0, and we could make tracing_splice_read_pipe() do that too, but it
seems reasonable to have splice_to_page() handle this condition
gracefully.

Cc: sta...@vger.kernel.org
Signed-off-by: Rabin Vincent 
---
 fs/splice.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/fs/splice.c b/fs/splice.c
index 82bc0d64fc38..19e0b103d253 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -185,6 +185,9 @@ ssize_t splice_to_pipe(struct pipe_inode_info *pipe,
unsigned int spd_pages = spd->nr_pages;
int ret, do_wakeup, page_nr;
 
+   if (!spd_pages)
+   return 0;
+
ret = 0;
do_wakeup = 0;
page_nr = 0;
-- 
2.7.0

[PATCH] block: protect iterate_bdevs() against concurrent close

2016-03-10 Thread Rabin Vincent

From: Rabin Vincent <rab...@axis.com>

If a block device is closed while iterate_bdevs() is handling it, the
following NULL pointer dereference occurs because bdev->b_disk is NULL
in bdev_get_queue(), which is called from blk_get_backing_dev_info() (in
turn called by the mapping_cap_writeback_dirty() call in
__filemap_fdatawrite_range()):

 BUG: unable to handle kernel NULL pointer dereference at 0508
 IP: [] blk_get_backing_dev_info+0x10/0x20
 PGD 9e62067 PUD 9ee8067 PMD 0
 Oops:  [#1] PREEMPT SMP DEBUG_PAGEALLOC
 Modules linked in:
 CPU: 1 PID: 2422 Comm: sync Not tainted 4.5.0-rc7+ #400
 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996)
 task: 880009f4d700 ti: 880009f5c000 task.ti: 880009f5c000
 RIP: 0010:[]  [] 
blk_get_backing_dev_info+0x10/0x20
 RSP: 0018:880009f5fe68  EFLAGS: 00010246
 RAX:  RBX: 88000ec17a38 RCX: 81a4e940
 RDX: 7fff RSI:  RDI: 88000ec176c0
 RBP: 880009f5fe68 R08:  R09: 
 R10: 0001 R11:  R12: 88000ec17860
 R13: 811b25c0 R14: 88000ec178e0 R15: 88000ec17a38
 FS:  7faee505d700() GS:88000fb0() knlGS:
 CS:  0010 DS:  ES:  CR0: 8005003b
 CR2: 0508 CR3: 09e8a000 CR4: 06e0
 Stack:
  880009f5feb8 8112e7f5  7fff
    7fff 0001
  88000ec178e0 88000ec17860 880009f5fec8 8112e81f
 Call Trace:
  [] __filemap_fdatawrite_range+0x85/0x90
  [] filemap_fdatawrite+0x1f/0x30
  [] fdatawrite_one_bdev+0x16/0x20
  [] iterate_bdevs+0xf2/0x130
  [] sys_sync+0x63/0x90
  [] entry_SYSCALL_64_fastpath+0x12/0x76
 Code: 0f 1f 44 00 00 48 8b 87 f0 00 00 00 55 48 89 e5 <48> 8b 80 08 05 00 00 5d
 RIP  [] blk_get_backing_dev_info+0x10/0x20
  RSP 
 CR2: 0508
 ---[ end trace 2487336ceb3de62d ]---

The crash is easily reproducible by running the following command, if an
msleep(100) is inserted before the call to func() in iterate_devs():

 while :; do head -c1 /dev/nullb0; done > /dev/null & while :; do sync; done

Fix it by holding the bd_mutex across the func() call and only calling
func() if the bdev is opened.

Cc: sta...@vger.kernel.org
Signed-off-by: Rabin Vincent <rab...@axis.com>
---
 fs/block_dev.c | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 826b164..78c9f2a 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -1884,6 +1884,7 @@ void iterate_bdevs(void (*func)(struct block_device *, 
void *), void *arg)
spin_lock(_superblock->s_inode_list_lock);
list_for_each_entry(inode, _superblock->s_inodes, i_sb_list) {
struct address_space *mapping = inode->i_mapping;
+   struct block_device *bdev;
 
spin_lock(>i_lock);
if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW) ||
@@ -1904,8 +1905,12 @@ void iterate_bdevs(void (*func)(struct block_device *, 
void *), void *arg)
 */
iput(old_inode);
old_inode = inode;
+   bdev = I_BDEV(inode);
 
-   func(I_BDEV(inode), arg);
+   mutex_lock(>bd_mutex);
+   if (bdev->bd_openers)
+   func(bdev, arg);
+   mutex_unlock(>bd_mutex);
 
spin_lock(_superblock->s_inode_list_lock);
}
-- 
2.7.0

[PATCH] block: protect iterate_bdevs() against concurrent close

2016-03-10 Thread Rabin Vincent

From: Rabin Vincent 

If a block device is closed while iterate_bdevs() is handling it, the
following NULL pointer dereference occurs because bdev->b_disk is NULL
in bdev_get_queue(), which is called from blk_get_backing_dev_info() (in
turn called by the mapping_cap_writeback_dirty() call in
__filemap_fdatawrite_range()):

 BUG: unable to handle kernel NULL pointer dereference at 0508
 IP: [] blk_get_backing_dev_info+0x10/0x20
 PGD 9e62067 PUD 9ee8067 PMD 0
 Oops:  [#1] PREEMPT SMP DEBUG_PAGEALLOC
 Modules linked in:
 CPU: 1 PID: 2422 Comm: sync Not tainted 4.5.0-rc7+ #400
 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996)
 task: 880009f4d700 ti: 880009f5c000 task.ti: 880009f5c000
 RIP: 0010:[]  [] 
blk_get_backing_dev_info+0x10/0x20
 RSP: 0018:880009f5fe68  EFLAGS: 00010246
 RAX:  RBX: 88000ec17a38 RCX: 81a4e940
 RDX: 7fff RSI:  RDI: 88000ec176c0
 RBP: 880009f5fe68 R08:  R09: 
 R10: 0001 R11:  R12: 88000ec17860
 R13: 811b25c0 R14: 88000ec178e0 R15: 88000ec17a38
 FS:  7faee505d700() GS:88000fb0() knlGS:
 CS:  0010 DS:  ES:  CR0: 8005003b
 CR2: 0508 CR3: 09e8a000 CR4: 06e0
 Stack:
  880009f5feb8 8112e7f5  7fff
    7fff 0001
  88000ec178e0 88000ec17860 880009f5fec8 8112e81f
 Call Trace:
  [] __filemap_fdatawrite_range+0x85/0x90
  [] filemap_fdatawrite+0x1f/0x30
  [] fdatawrite_one_bdev+0x16/0x20
  [] iterate_bdevs+0xf2/0x130
  [] sys_sync+0x63/0x90
  [] entry_SYSCALL_64_fastpath+0x12/0x76
 Code: 0f 1f 44 00 00 48 8b 87 f0 00 00 00 55 48 89 e5 <48> 8b 80 08 05 00 00 5d
 RIP  [] blk_get_backing_dev_info+0x10/0x20
  RSP 
 CR2: 0508
 ---[ end trace 2487336ceb3de62d ]---

The crash is easily reproducible by running the following command, if an
msleep(100) is inserted before the call to func() in iterate_devs():

 while :; do head -c1 /dev/nullb0; done > /dev/null & while :; do sync; done

Fix it by holding the bd_mutex across the func() call and only calling
func() if the bdev is opened.

Cc: sta...@vger.kernel.org
Signed-off-by: Rabin Vincent 
---
 fs/block_dev.c | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 826b164..78c9f2a 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -1884,6 +1884,7 @@ void iterate_bdevs(void (*func)(struct block_device *, 
void *), void *arg)
spin_lock(_superblock->s_inode_list_lock);
list_for_each_entry(inode, _superblock->s_inodes, i_sb_list) {
struct address_space *mapping = inode->i_mapping;
+   struct block_device *bdev;
 
spin_lock(>i_lock);
if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW) ||
@@ -1904,8 +1905,12 @@ void iterate_bdevs(void (*func)(struct block_device *, 
void *), void *arg)
 */
iput(old_inode);
old_inode = inode;
+   bdev = I_BDEV(inode);
 
-   func(I_BDEV(inode), arg);
+   mutex_lock(>bd_mutex);
+   if (bdev->bd_openers)
+   func(bdev, arg);
+   mutex_unlock(>bd_mutex);
 
spin_lock(_superblock->s_inode_list_lock);
}
-- 
2.7.0

1 2 3 4 5 6 >

1 - 100 of 538 matches

Mail list logo