Re: [PATCH v6 5/6] arm64: add SIGSYS siginfo for compat task
On 08/27/2014 02:55 AM, Will Deacon wrote:
On Thu, Aug 21, 2014 at 09:56:44AM +0100, AKASHI Takahiro wrote:

SIGSYS is primarily used in secure computing to notify the tracer. This patch
allows a signal handler in a compat task to get correct information, with
SA_SIGINFO specified, when this signal is delivered.

Signed-off-by: AKASHI Takahiro
---
 arch/arm64/include/asm/compat.h | 7 +++++++
 arch/arm64/kernel/signal32.c    | 8 ++++++++
 2 files changed, 15 insertions(+)

diff --git a/arch/arm64/include/asm/compat.h b/arch/arm64/include/asm/compat.h
index 253e33b..c877915 100644
--- a/arch/arm64/include/asm/compat.h
+++ b/arch/arm64/include/asm/compat.h
@@ -205,6 +205,13 @@ typedef struct compat_siginfo {
 			compat_long_t _band;	/* POLL_IN, POLL_OUT, POLL_MSG */
 			int _fd;
 		} _sigpoll;
+
+		/* SIGSYS */
+		struct {
+			compat_uptr_t _call_addr; /* calling user insn */
+			int _syscall;	/* triggering system call number */
+			unsigned int _arch;	/* AUDIT_ARCH_* of syscall */
+		} _sigsys;
 	} _sifields;
 } compat_siginfo_t;

diff --git a/arch/arm64/kernel/signal32.c b/arch/arm64/kernel/signal32.c
index 1b9ad02..aa550d6 100644
--- a/arch/arm64/kernel/signal32.c
+++ b/arch/arm64/kernel/signal32.c
@@ -186,6 +186,14 @@ int copy_siginfo_to_user32(compat_siginfo_t __user *to, const siginfo_t *from)
 		err |= __put_user(from->si_uid, &to->si_uid);
 		err |= __put_user((compat_uptr_t)(unsigned long)from->si_ptr,
 				  &to->si_ptr);
 		break;
+#ifdef __ARCH_SIGSYS
+	case __SI_SYS:
+		err |= __put_user((compat_uptr_t)(unsigned long)
+				  from->si_call_addr, &to->si_call_addr);
+		err |= __put_user(from->si_syscall, &to->si_syscall);
+		err |= __put_user(from->si_arch, &to->si_arch);
+		break;
+#endif

I think you should drop this #ifdef. We care about whether arch/arm/ defines
__ARCH_SIGSYS, not whether arm64 defines it (they both happen to define it
anyway).

Thanks.
Done.

-Takahiro AKASHI

Will

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
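For readers following along: the fields being copied here (si_call_addr, si_syscall, si_arch) are exactly what a userspace SIGSYS handler consumes. A minimal sketch, with helper names invented for illustration, that installs a seccomp filter trapping getpriority(2) with SECCOMP_RET_TRAP and reads si_syscall from the handler:

```c
#define _GNU_SOURCE
#include <signal.h>
#include <stddef.h>
#include <sys/prctl.h>
#include <sys/resource.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <linux/filter.h>
#include <linux/seccomp.h>

/* set by the handler so the caller can see which syscall was trapped */
static volatile sig_atomic_t trapped_syscall = -1;

static void sigsys_handler(int sig, siginfo_t *info, void *ctx)
{
	(void)sig; (void)ctx;
	/* si_syscall (and si_arch) are the fields the patch copies
	 * into the compat siginfo for the __SI_SYS case */
	trapped_syscall = info->si_syscall;
}

/* Install a filter that traps getpriority(2) only, trigger it, and
 * return the syscall number reported via SIGSYS (demo helper). */
static int trap_getpriority_and_report(void)
{
	struct sigaction sa = { 0 };
	sa.sa_sigaction = sigsys_handler;
	sa.sa_flags = SA_SIGINFO;
	sigaction(SIGSYS, &sa, NULL);

	struct sock_filter filter[] = {
		BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
			 offsetof(struct seccomp_data, nr)),
		BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_getpriority, 0, 1),
		BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_TRAP),
		BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
	};
	struct sock_fprog prog = {
		.len = sizeof(filter) / sizeof(filter[0]),
		.filter = filter,
	};

	if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) ||
	    prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog))
		return -1;

	getpriority(PRIO_PROCESS, 0);	/* trapped: delivers SIGSYS */
	return trapped_syscall;
}
```

Without the patch, a compat (AArch32) task would see garbage in these fields; with it, the handler above works the same as on a native task.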
Re: [PATCH v6 4/6] arm64: add seccomp syscall for compat task
On 08/27/2014 02:53 AM, Will Deacon wrote:
On Thu, Aug 21, 2014 at 09:56:43AM +0100, AKASHI Takahiro wrote:

This patch allows a compat task to issue the seccomp() system call.

Signed-off-by: AKASHI Takahiro
---
 arch/arm64/include/asm/unistd.h   | 2 +-
 arch/arm64/include/asm/unistd32.h | 3 +++
 2 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/include/asm/unistd.h b/arch/arm64/include/asm/unistd.h
index 4bc95d2..cf6ee31 100644
--- a/arch/arm64/include/asm/unistd.h
+++ b/arch/arm64/include/asm/unistd.h
@@ -41,7 +41,7 @@
 #define __ARM_NR_compat_cacheflush	(__ARM_NR_COMPAT_BASE+2)
 #define __ARM_NR_compat_set_tls	(__ARM_NR_COMPAT_BASE+5)

-#define __NR_compat_syscalls		383
+#define __NR_compat_syscalls		384
 #endif

 #define __ARCH_WANT_SYS_CLONE

diff --git a/arch/arm64/include/asm/unistd32.h b/arch/arm64/include/asm/unistd32.h
index e242600..2922c40 100644
--- a/arch/arm64/include/asm/unistd32.h
+++ b/arch/arm64/include/asm/unistd32.h
@@ -787,3 +787,6 @@ __SYSCALL(__NR_sched_setattr, sys_sched_setattr)
 __SYSCALL(__NR_sched_getattr, sys_sched_getattr)
 #define __NR_renameat2 382
 __SYSCALL(__NR_renameat2, sys_renameat2)
+#define __NR_seccomp 383
+__SYSCALL(__NR_seccomp, sys_seccomp)

This will need rebasing onto -rc2, as we've hooked up two new compat
syscalls recently.

Thanks for the heads-up. Fixed it.

-Takahiro AKASHI

Will
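A quick way to verify that a seccomp(2) entry is actually wired up, usable from either a native or a compat binary, is to probe it from userspace. A sketch (the 0xffff command is just a bogus operation, chosen here for illustration): the kernel's dispatcher rejects an unknown operation with EINVAL, while a kernel without the syscall table entry fails with ENOSYS.

```c
#include <errno.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Returns 1 if the seccomp(2) entry point exists, 0 otherwise. */
static int have_seccomp_syscall(void)
{
	errno = 0;
	syscall(__NR_seccomp, 0xffff /* bogus op */, 0, (void *)0);
	return errno == EINVAL;	/* ENOSYS would mean: not wired up */
}
```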
Re: [PATCH v6 2/6] arm64: ptrace: allow tracer to skip a system call
On 08/27/2014 02:51 AM, Will Deacon wrote:
On Fri, Aug 22, 2014 at 01:35:17AM +0100, AKASHI Takahiro wrote:
On 08/22/2014 02:08 AM, Kees Cook wrote:
On Thu, Aug 21, 2014 at 3:56 AM, AKASHI Takahiro wrote:

diff --git a/arch/arm64/kernel/ptrace.c b/arch/arm64/kernel/ptrace.c
index 8876049..c54dbcc 100644
--- a/arch/arm64/kernel/ptrace.c
+++ b/arch/arm64/kernel/ptrace.c
@@ -1121,9 +1121,29 @@ static void tracehook_report_syscall(struct pt_regs *regs,

 asmlinkage int syscall_trace_enter(struct pt_regs *regs)
 {
+	unsigned int saved_syscallno = regs->syscallno;
+
 	if (test_thread_flag(TIF_SYSCALL_TRACE))
 		tracehook_report_syscall(regs, PTRACE_SYSCALL_ENTER);

+	if (IS_SKIP_SYSCALL(regs->syscallno)) {
+		/*
+		 * RESTRICTION: we can't modify a return value of user
+		 * issued syscall(-1) here. In order to ease this flavor,
+		 * we need to treat whatever value in x0 as a return value,
+		 * but this might result in a bogus value being returned.
+		 */
+		/*
+		 * NOTE: syscallno may also be set to -1 if fatal signal is
+		 * detected in tracehook_report_syscall_entry(), but since
+		 * a value set to x0 here is not used in this case, we may
+		 * neglect the case.
+		 */
+		if (!test_thread_flag(TIF_SYSCALL_TRACE) ||
+		    (IS_SKIP_SYSCALL(saved_syscallno)))
+			regs->regs[0] = -ENOSYS;
+	}
+

I don't have a runtime environment yet for arm64, so I can't test this
directly myself; I'm just trying to eyeball this. :)

Once the seccomp logic is added here, I don't think using -2 as a special
value will work. Doesn't this mean an Oops is possible when the user issues
a "-2" syscall? As in, if TIF_SYSCALL_WORK is set, and the user passed -2
as the syscall, audit will be called only on entry, and then skipped on
exit?

Oops, you're absolutely right. I didn't think of this case.
syscall_trace_enter() should not return a syscallno directly, but always
return -1 if syscallno < 0 (except when secure_computing() returns -1).
This also implies that tracehook_report_syscall() should also have a
return value.
Will, is this fine with you?

Well, the first thing that jumps out at me is why this is being done
completely differently for arm64 and arm. I thought adding the new ptrace
requests would reconcile the differences?

I'm not sure which portion of my code you regard as "completely different",
but:

1) Setting x0 to -ENOSYS is necessary because, otherwise, a user-issued
   syscall(-1) will return a bogus value when audit tracing is on.
   Please note that, on arm:

                   not traced    traced
                   ----------    ------
   syscall(-1)     aborted       OOPs (BUG_ON)
   syscall(-3000)  aborted       aborted
   syscall(1000)   ENOSYS        ENOSYS

   So, anyhow, it's a bit difficult and meaningless to mimic these invalid
   cases.

2) Branching to a new label, syscall_trace_return_skip (see entry.S), after
   syscall_trace_enter() is necessary in order to avoid an OOPS in
   audit_syscall_enter(), as we discussed.

Did I make it clear?

-Takahiro AKASHI

Will
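The skip-and-fake-return dance being debated here is easiest to see from the tracer's side. A sketch of the equivalent on x86-64, where the tracer rewrites orig_rax instead of using PTRACE_SET_SYSCALL but the control flow (rewrite at syscall-entry stop, fake the result at syscall-exit stop) is the same; the function name and the 12345/42 values are invented for the demo:

```c
#include <signal.h>
#include <sys/ptrace.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/user.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork a child, skip its getppid(2), fake a return value of 12345,
 * and report the child's exit code (42 on success). */
static int skip_getppid_demo(void)
{
	pid_t pid = fork();
	if (pid == 0) {
		ptrace(PTRACE_TRACEME, 0, NULL, NULL);
		raise(SIGSTOP);
		long r = syscall(SYS_getppid);	/* will be skipped */
		_exit(r == 12345 ? 42 : 1);
	}

	int status, pending = 0;
	waitpid(pid, &status, 0);		/* child's SIGSTOP */
	ptrace(PTRACE_SYSCALL, pid, NULL, NULL);

	while (waitpid(pid, &status, 0) == pid && !WIFEXITED(status)) {
		struct user_regs_struct regs;
		ptrace(PTRACE_GETREGS, pid, NULL, &regs);

		if (pending) {
			/* syscall-exit stop of the skipped call */
			regs.rax = 12345;
			ptrace(PTRACE_SETREGS, pid, NULL, &regs);
			pending = 0;
		} else if (regs.orig_rax == SYS_getppid) {
			/* syscall-entry stop: nr -1 tells the kernel
			 * not to execute the syscall at all */
			regs.orig_rax = (unsigned long long)-1;
			ptrace(PTRACE_SETREGS, pid, NULL, &regs);
			pending = 1;
		}
		ptrace(PTRACE_SYSCALL, pid, NULL, NULL);
	}
	return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

This is exactly the situation the thread discusses: without the tracee-side -ENOSYS default (or a tracer that fakes the result at the exit stop), the skipped syscall returns whatever was left in the return register.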
Re: [alsa-devel] [PATCH 1/2] regmap: cache: Fix regcache_sync_block for non-autoincrementing devices
On 08/26/2014 05:21 PM, Takashi Iwai wrote:
At Tue, 26 Aug 2014 17:03:12 +0300, Jarkko Nikula wrote:

Commit 75a5f89f635c ("regmap: cache: Write consecutive registers in a
single block write") expected that autoincrementing writes are supported
if the hardware has a register format which can support raw writes. This
is not necessarily true, and thus for instance an rbtree sync can fail
when there is a need to sync multiple consecutive registers but the block
write to the device fails due to unsupported autoincrementing writes. Fix
this by splitting the raw block sync into a series of single register
writes for devices that don't support autoincrementing writes.

Wouldn't it suffice to correct regmap_can_raw_write() to return false if
map->use_single_rw is set?

I don't know. I was thinking that also, but was unsure about it, since the
regcache_sync_block_raw() and regcache_sync_block_single() code paths use
different regmap write functions. regcache_sync_block_raw() ends up calling
_regmap_raw_write(), which takes care of the page select operation when
needed, and regcache_sync_block_single() uses _regmap_write(), which
doesn't. Which makes me think: should regcache_sync_block_single() also use
_regmap_raw_write() in order to take care of page selects?

--
Jarkko
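For reference, the knob being discussed is declared per driver in its regmap_config. A hypothetical config for a device whose register address does not auto-increment might look like this (sketch only; the foo_* name and register sizes are invented, not from the patch):

```c
static const struct regmap_config foo_regmap_config = {
	.reg_bits	= 8,
	.val_bits	= 8,
	.max_register	= 0x7f,
	/* the device cannot auto-increment the register address, so
	 * raw block writes must be split into single register writes */
	.use_single_rw	= true,
};
```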
[Bugfix] x86, irq: Fix bug in setting IOAPIC pin attributes
Commit 15a3c7cc9154321fc3 "x86, irq: Introduce two helper functions to
support irqdomain map operation" breaks LPSS ACPI enumerated devices.

On startup, the IOAPIC driver preallocates IRQ descriptors and programs
IOAPIC pins with default level and polarity attributes for all legacy
IRQs. Later, legacy IRQ users may fail to set IOAPIC pin attributes if the
requested attributes conflict with the default IOAPIC pin attributes. So
change mp_irqdomain_map() to allow the first legacy IRQ user to reprogram
an IOAPIC pin with different attributes.

Reported-by: Mika Westerberg
Signed-off-by: Jiang Liu
---
Hi Mika,
	We have a plan to kill the function mp_set_gsi_attr() later, so I
have slightly modified your changes. Could you please help to test it
again?
Regards!
Gerry
---
 arch/x86/kernel/apic/io_apic.c | 15 ++++++++++++++-
 1 file changed, 14 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/apic/io_apic.c b/arch/x86/kernel/apic/io_apic.c
index 29290f554e79..40a4aa3f4061 100644
--- a/arch/x86/kernel/apic/io_apic.c
+++ b/arch/x86/kernel/apic/io_apic.c
@@ -1070,6 +1070,11 @@ static int mp_map_pin_to_irq(u32 gsi, int idx, int ioapic, int pin,
 	}

 	if (flags & IOAPIC_MAP_ALLOC) {
+		/* special handling for legacy IRQs */
+		if (irq < nr_legacy_irqs() && info->count == 1 &&
+		    mp_irqdomain_map(domain, irq, pin) != 0)
+			irq = -1;
+
 		if (irq > 0)
 			info->count++;
 		else if (info->count == 0)
@@ -3896,7 +3901,15 @@ int mp_irqdomain_map(struct irq_domain *domain, unsigned int virq,
 			info->polarity = 1;
 		}
 		info->node = NUMA_NO_NODE;
-		info->set = 1;
+
+		/*
+		 * setup_IO_APIC_irqs() programs all legacy IRQs with default
+		 * trigger and polarity attributes. Don't set the flag for that
+		 * case so the first legacy IRQ user could reprogram the pin
+		 * with real trigger and polarity attributes.
+		 */
+		if (virq >= nr_legacy_irqs() || info->count)
+			info->set = 1;
 	}
 	set_io_apic_irq_attr(&attr, ioapic, hwirq, info->trigger,
 			     info->polarity);
--
1.7.10.4
Re: [kernel.org PATCH] Li Zefan is now the 3.4 stable maintainer
On Tue, Aug 26, 2014 at 04:08:58PM -0700, Greg KH wrote:
> Li has agreed to continue to support the 3.4 stable kernel tree until
> September 2016.

Great! Welcome to Li in this strange world of very long term
maintenance :-)

Willy
Re: [PATCH v6 1/6] arm64: ptrace: add PTRACE_SET_SYSCALL
Kees,

On 08/27/2014 02:46 AM, Will Deacon wrote:
On Fri, Aug 22, 2014 at 01:19:13AM +0100, AKASHI Takahiro wrote:
On 08/22/2014 01:47 AM, Kees Cook wrote:
On Thu, Aug 21, 2014 at 3:56 AM, AKASHI Takahiro wrote:

To allow a tracer to change/skip a system call by rewriting the syscall
number, there are several approaches:

(1) modify the x8 register with ptrace(PTRACE_SETREGSET), and handle this
    case later on in syscall_trace_enter(), or
(2) support ptrace(PTRACE_SET_SYSCALL) as on arm

Given the fact that user_pt_regs doesn't expose 'syscallno' to the tracer,
and that secure_computing() expects a changed syscall number to be
visible, especially in the case of -1, before this function returns in
syscall_trace_enter(), we'd better take (2).

Signed-off-by: AKASHI Takahiro

Thanks, I like having this on both arm and arm64.

Yeah, having this simplified the code of syscall_trace_enter() a bit, but
it also imposes some restrictions on arm64, too.

> I wonder if other archs should add this option too.

Do you think so? I assumed that SET_SYSCALL is to be avoided if possible.
I also think that SET_SYSCALL should take an extra argument for a return
value just in case of -1 (or should we have SKIP_SYSCALL?).

I think we should propose this as a new request in the generic ptrace
code. We can have an architecture hook for actually setting the syscall,
and allow architectures to define their own implementation of the request
so they can be moved over one by one.

What do you think about this request?

-Takahiro AKASHI

Will
[PATCH] ARM: probes: return directly when emulate not set
When kprobes decodes an instruction, the original code calls the
instruction-specific decoder if 'emulate' is set to false. However,
instructions with DECODE_TYPE_EMULATE in fact don't have their own
decoders; what is in the action table is in fact a handler. For example:

	/* LDRD (immediate)	000x x1x0 1101 */
	/* STRD (immediate)	000x x1x0 */
	DECODE_EMULATEX	(0x0e5000d0, 0x004000d0, PROBES_LDRSTRD,
						 REGS(NOPCWB, NOPCX, 0, 0, 0)),

and

	const union decode_action kprobes_arm_actions[NUM_PROBES_ARM_ACTIONS] = {
		...
		[PROBES_LDRSTRD] = {.handler = emulate_ldrdstrd},
		...

In this situation, the original code calls 'emulate_ldrdstrd' as a
decoder, which is obviously incorrect. This patch makes it return
INSN_GOOD directly when 'emulate' is not true.

Signed-off-by: Wang Nan
Cc: "David A. Long"
Cc: Russell King
Cc: Jon Medhurst
Cc: Taras Kondratiuk
Cc: Ben Dooks
---
 arch/arm/kernel/probes.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/arch/arm/kernel/probes.c b/arch/arm/kernel/probes.c
index a8ab540..1c77b8d 100644
--- a/arch/arm/kernel/probes.c
+++ b/arch/arm/kernel/probes.c
@@ -436,8 +436,7 @@ probes_decode_insn(probes_opcode_t insn, struct arch_probes_insn *asi,
 			struct decode_emulate *d = (struct decode_emulate *)h;

 			if (!emulate)
-				return actions[d->handler.action].decoder(insn,
-					asi, h);
+				return INSN_GOOD;

 			asi->insn_handler = actions[d->handler.action].handler;
 			set_emulated_insn(insn, asi, thumb);
--
1.8.4
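The table mix-up is easier to see in a standalone model: DECODE_EMULATE slots in the action table hold emulation *handlers*, not decoders, so calling through the decoder member for such a slot invokes a function with the wrong signature. A toy C version of the union-of-slots layout and the fixed probe path (all names are simplified stand-ins for the kernel's):

```c
#include <stddef.h>

typedef int (*decoder_fn)(unsigned int insn);
typedef void (*handler_fn)(unsigned int insn);

/* one slot holds either a decoder or a handler, never both */
union decode_action {
	decoder_fn decoder;	/* valid for custom-decode entries */
	handler_fn handler;	/* valid for DECODE_EMULATE entries */
};

enum { INSN_REJECTED, INSN_GOOD };

static void emulate_ldrdstrd(unsigned int insn) { (void)insn; }

static const union decode_action actions[] = {
	{ .handler = emulate_ldrdstrd },	/* PROBES_LDRSTRD-like slot */
};

/* Mirrors the fixed logic: when not emulating, return INSN_GOOD
 * directly instead of invoking the handler as if it were a decoder. */
static int decode_emulate(unsigned int insn, int emulate, handler_fn *out)
{
	(void)insn;
	if (!emulate)
		return INSN_GOOD;
	*out = actions[0].handler;
	return INSN_GOOD;
}
```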
Re: [PATCH v7 3/8] cpufreq: kirkwood: Remove use of the clk provider API
On Tue, Aug 26, 2014 at 5:35 PM, Andrew Lunn wrote:
>> One final thought I have had is that it might be a good idea to have a
>> mux clock which represents the clock signal that feeds into the cpu. It
>> seems that a mux is exactly what is going on here with cpuclk rate and
>> ddrclk rate.
>
> I did think of this when i implemented the cpufreq driver. What makes
> it hard is that this bit is mixed in the same 32 bit register as the
> gate clocks. It would mean two clock providers sharing the same
> register, sharing a spinlock, etc. And the gating provider is shared
> with all mvebu platforms, dove, kirkwood, orion5x, and four different
> armada SoCs. So i'm pushing complexity into this shared code, which
> none of the others need. Does the standard mux clock do what is
> needed? Or would i have to write a new clock provider?

Well, I think that the mux-clock type should suffice. Both the standard
gate and mux can have a spinlock passed in at registration time, so it
could be a shared spinlock. The standard mux clock expects a bitfield in a
register, similar to the single-bit approach taken by the gate clock. So I
think it could do just fine.

>> I even wonder if it is even appropriate to model this transition with a
>> clock enable operation? Maybe it is only a multiplex operation, or
>> perhaps a combination of enabling the powersave clock and changing the
>> parent input to the cpu?
>>
>> My idea is instead of relying on a cpufreq driver to parse the state of
>> your clocks and understand the multiplexing, you can use the clock
>> framework for that. In fact that might help you get one step closer to
>> using the cpufreq-cpu0.c/cpufreq-generic.c implementation.
>
> So you want the whole disabling of interrupt delivery to the cpu,
> flipping the mux, wait for interrupt and re-enabling of interrupt
> delivery to the cpu inside the clock provider? That is way past a
> simple mux clock.

No way! I said "one step closer" for a reason. The interrupt stuff is
clearly out of scope.

Regards,
Mike

> Andrew
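The arrangement Mike describes, two providers touching one register serialized by one shared lock, fits in a few lines. A toy userspace sketch (register layout, bit positions, and names all invented for illustration) of a mux bitfield and a gate bit guarded by the same lock:

```c
#include <stdatomic.h>

#define MUX_SHIFT 16
#define MUX_WIDTH 1
#define GATE_BIT  3

static unsigned int clk_reg;			/* the shared control register */
static atomic_flag clk_lock = ATOMIC_FLAG_INIT;	/* the shared "spinlock" */

static void clk_lock_acquire(void)
{
	while (atomic_flag_test_and_set(&clk_lock))
		;	/* spin */
}

static void clk_lock_release(void)
{
	atomic_flag_clear(&clk_lock);
}

/* mux provider: read-modify-write of its bitfield under the lock */
static void mux_set_parent(unsigned int index)
{
	unsigned int mask = ((1u << MUX_WIDTH) - 1) << MUX_SHIFT;

	clk_lock_acquire();
	clk_reg = (clk_reg & ~mask) | ((index << MUX_SHIFT) & mask);
	clk_lock_release();
}

static unsigned int mux_get_parent(void)
{
	return (clk_reg >> MUX_SHIFT) & ((1u << MUX_WIDTH) - 1);
}

/* gate provider: same register, same lock, different bit */
static void gate_enable(void)
{
	clk_lock_acquire();
	clk_reg |= 1u << GATE_BIT;
	clk_lock_release();
}
```

Because both providers take the same lock, neither read-modify-write can clobber the other's bits, which is the whole point of passing a shared spinlock at registration time.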
Re: [PATCH v5 3/4] zram: zram memory size limitation
On Wed, Aug 27, 2014 at 11:51:32AM +0900, Minchan Kim wrote:
> Hey Joonsoo,
>
> On Wed, Aug 27, 2014 at 10:26:11AM +0900, Joonsoo Kim wrote:
> > Hello, Minchan and David.
> >
> > On Tue, Aug 26, 2014 at 08:22:29AM -0400, David Horner wrote:
> > > On Tue, Aug 26, 2014 at 3:55 AM, Minchan Kim wrote:
> > > > Hey Joonsoo,
> > > >
> > > > On Tue, Aug 26, 2014 at 04:37:30PM +0900, Joonsoo Kim wrote:
> > > >> On Mon, Aug 25, 2014 at 09:05:55AM +0900, Minchan Kim wrote:
> > > >> > @@ -513,6 +540,14 @@ static int zram_bvec_write(struct zram *zram, struct bio_vec *bvec, u32 index,
> > > >> >  		ret = -ENOMEM;
> > > >> >  		goto out;
> > > >> >  	}
> > > >> > +
> > > >> > +	if (zram->limit_pages &&
> > > >> > +	    zs_get_total_pages(meta->mem_pool) > zram->limit_pages) {
> > > >> > +		zs_free(meta->mem_pool, handle);
> > > >> > +		ret = -ENOMEM;
> > > >> > +		goto out;
> > > >> > +	}
> > > >> > +
> > > >> >  	cmem = zs_map_object(meta->mem_pool, handle, ZS_MM_WO);
> > > >>
> > > >> Hello,
> > > >>
> > > >> I didn't follow up the previous discussion, so I could be wrong.
> > > >> Why should this enforcement be here?
> > > >>
> > > >> I think that this has two problems.
> > > >> 1) alloc/free happens unnecessarily if we have used memory over
> > > >> the limitation.
> > > >
> > > > True, but firstly, I implemented the logic in zsmalloc, not zram,
> > > > but as I described in the cover letter, it's not a requirement of
> > > > zsmalloc but of zram, so it should be in there. If every user wants
> > > > it in future, then we could move the function into zsmalloc.
> > > > That's what we concluded in the previous discussion.
>
> > Hmm...
> > The problem is that we can't avoid this unnecessary overhead in this
> > implementation. If we can implement this feature in zram efficiently,
> > it's okay. But I think that the current form isn't.
>
> If we can add it in zsmalloc, it would be cleaner and more efficient
> for zram, but as I said, at the moment, I didn't want to put zram's
> requirement into zsmalloc because, to me, it's weird to enforce a max
> limit in an allocator. It's the client's role, I think.

AFAIK, many kinds of pools, such as thread pools or memory pools, have
their own limit. It's not weird to me.

> If the current implementation is expensive and rather hard to follow,
> it would be one reason to move the feature into zsmalloc, but
> I don't think it makes critical trouble in the zram usecase.
> See below.
>
> But I am still open and will wait for others' opinions.
> If other guys think zsmalloc is the better place, I am willing to move
> it into zsmalloc.
>
> > > > Another idea is we could call zs_get_total_pages right before
> > > > zs_malloc, but the problem is we cannot know how many pages are
> > > > allocated by zsmalloc in advance.
> > > > IOW, zram should be blind to zsmalloc's internals.
> > >
> > > We did however suggest that we could check beforehand to see if
> > > max was already exceeded as an optimization.
> > > (possibly with a guess on usage but at least using the minimum of 1 page)
> > > In the contested case, the max may already be exceeded transiently and
> > > therefore we know this one _could_ fail (it could also pass, but odds
> > > aren't good).
> > > As Minchan mentions this was discussed before - but not in great detail.
> > > Testing should be done to determine possible benefit. And as he also
> > > mentions, the better place for it may be in zsmalloc, but that
> > > requires an ABI change.
> >
> > Why do we hesitate to change the zsmalloc API? It is an in-kernel API
> > and there are just two users now, zswap and zram. We can change it
> > easily. I think that we just need the following simple API change in
> > zsmalloc.c:
> >
> > zs_zpool_create(gfp_t gfp, struct zpool_ops *zpool_op)
> > =>
> > zs_zpool_create(unsigned long limit, gfp_t gfp, struct zpool_ops *zpool_op)
> >
> > It's a pool allocator, so there is no obstacle for us to limit maximum
> > memory usage in zsmalloc. It's a natural idea to limit memory usage
> > for a pool allocator.
>
> > > Certainly a detailed suggestion could happen on this thread and I'm
> > > also interested in your thoughts, but this patchset should be able
> > > to go in as is.
> > > Memory exhaustion avoidance probably trumps the possible thrashing
> > > at threshold.
> > >
> > > > About the alloc/free cost once it is over the limit,
> > > > I don't think it's important to consider.
> > > > Do you have any scenario in your mind to consider alloc/free cost
> > > > when the limit is over?
> > > >
> > > >> 2) Even if this request doesn't do a new allocation, it could fail
> > > >> due to someone else's allocation. There is a time gap between
> > > >> allocation and free, so a legitimate user who wants to use
> > > >> preallocated zsmalloc memory could also see this condition as true
> > > >> and then fail.
> > > >
> > > > Yeb, we already discussed that. :)
> > > > Such a false positive shouldn't be a severe problem if we
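The check-after-alloc policy being debated is small enough to model directly. A toy userspace sketch of the zram-side approach from the patch, allocate first and back the allocation out if the pool's total exceeded the cap (all toy_* names are invented; PAGE_SIZE_TOY stands in for PAGE_SIZE):

```c
#include <stdlib.h>

#define PAGE_SIZE_TOY 4096

struct toy_pool {
	unsigned long pages_used;
	unsigned long limit_pages;	/* 0 means no limit */
};

static void *toy_alloc(struct toy_pool *pool, size_t bytes)
{
	unsigned long pages = (bytes + PAGE_SIZE_TOY - 1) / PAGE_SIZE_TOY;
	void *obj = malloc(bytes);

	if (!obj)
		return NULL;
	pool->pages_used += pages;

	/* mirrors zram_bvec_write(): enforce the limit after the fact,
	 * paying an alloc/free pair whenever the cap is exceeded */
	if (pool->limit_pages && pool->pages_used > pool->limit_pages) {
		pool->pages_used -= pages;
		free(obj);
		return NULL;
	}
	return obj;
}
```

Joonsoo's counter-proposal amounts to moving this if-block inside the allocator itself (a limit parameter at pool creation), which avoids the wasted alloc/free once the pool is already at its cap.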
Re: [kernel.org PATCH] Li Zefan is now the 3.4 stable maintainer
On Tue, Aug 26, 2014 at 04:08:58PM -0700, Greg KH wrote:
> Li has agreed to continue to support the 3.4 stable kernel tree until
> September 2016. Update the releases.html page on kernel.org to reflect
> this.
>
Li, it would be great if you can send me information about your -stable
queue, ie how you maintain it and where it is located. This will enable me
to continue testing the stable queue for the 3.4 kernel.

Thanks,
Guenter

> Signed-off-by: Greg Kroah-Hartman
>
> diff --git a/content/releases.rst b/content/releases.rst
> index 4a3327f4ca9e..c71a33f34f1b 100644
> --- a/content/releases.rst
> +++ b/content/releases.rst
> @@ -43,7 +43,7 @@ Longterm
>  3.14    Greg Kroah-Hartman  2014-03-30  Aug, 2016
>  3.12    Jiri Slaby          2013-11-03  2016
>  3.10    Greg Kroah-Hartman  2013-06-30  Sep, 2015
> -3.4     Greg Kroah-Hartman  2012-05-20  Oct, 2014
> +3.4     Li Zefan            2012-05-20  Sep, 2016
>  3.2     Ben Hutchings       2012-01-04  2016
>  2.6.32  Willy Tarreau       2009-12-03  Mid-2015
> ==
Re: [PATCH 1/1] add selftest for virtio-net
On 08/27/2014 09:45 AM, Hengjinxiao wrote:
> Selftest is an important part of a network driver; this patch adds
> selftest for virtio-net, including a loopback test, a negotiate test,
> and a reset test. The loopback test checks whether virtio-net can send
> and receive packets normally. The negotiate test executes feature
> negotiation between the virtio-net driver in the guest OS and the
> virtio-net device in the host OS. The reset test resets virtio-net.

Thanks for the patch. The feature negotiation part brings some complexity
and needs more thought. And this could be extended for CVE regression in
the future. And you probably also need to send a patch for the virtio spec
to specify the loopback mode. See comments inline.

> Signed-off-by: Hengjinxiao
>
> ---
>  drivers/net/virtio_net.c        | 233 +++-
>  include/uapi/linux/virtio_net.h |   9 ++
>  2 files changed, 241 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index 59caa06..f83f6e4 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -28,6 +28,7 @@
>  #include
>  #include
>  #include
> +#include
>
>  static int napi_weight = NAPI_POLL_WEIGHT;
>  module_param(napi_weight, int, 0444);
> @@ -51,6 +52,23 @@ module_param(gso, bool, 0444);
>  #define MERGEABLE_BUFFER_ALIGN max(L1_CACHE_BYTES, 256)
>
>  #define VIRTNET_DRIVER_VERSION "1.0.0"
> +#define __VIRTNET_TESTING 0
> +

Why is this macro needed?

> +enum {
> +	VIRTNET_LOOPBACK_TEST,
> +	VIRTNET_FEATURE_NEG_TEST,
> +	VIRTNET_RESET_TEST,
> +};
> +
> +static const struct {
> +	const char string[ETH_GSTRING_LEN];
> +} virtnet_gstrings_test[] = {
> +	[VIRTNET_LOOPBACK_TEST] = { "loopback test (offline)" },
> +	[VIRTNET_FEATURE_NEG_TEST] = { "negotiate test (offline)" },
> +	[VIRTNET_RESET_TEST] = { "reset test (offline)" },
> +};
> +
> +#define VIRTNET_NUM_TEST ARRAY_SIZE(virtnet_gstrings_test)
>
>  struct virtnet_stats {
>  	struct u64_stats_sync tx_syncp;
> @@ -104,6 +122,8 @@ struct virtnet_info {
>  	struct send_queue *sq;
>  	struct receive_queue *rq;
>  	unsigned int status;
> +	unsigned long flags;
> +	atomic_t lb_count;
>
>  	/* Max # of queue pairs supported by the device */
>  	u16 max_queue_pairs;
> @@ -436,6 +456,19 @@ err_buf:
>  	return NULL;
>  }
>
> +void virtnet_check_lb_frame(struct virtnet_info *vi,
> +			    struct sk_buff *skb)
> +{
> +	unsigned int frame_size = skb->len;
> +
> +	if (*(skb->data + 3) == 0xFF) {
> +		if ((*(skb->data + frame_size / 2 + 10) == 0xBE) &&
> +		    (*(skb->data + frame_size / 2 + 12) == 0xAF)) {
> +			atomic_dec(&vi->lb_count);
> +		}
> +	}
> +}
> +
>  static void receive_buf(struct receive_queue *rq, void *buf, unsigned int len)
>  {
>  	struct virtnet_info *vi = rq->vq->vdev->priv;
> @@ -485,7 +518,12 @@ static void receive_buf(struct receive_queue *rq, void *buf, unsigned int len)
>  	} else if (hdr->hdr.flags & VIRTIO_NET_HDR_F_DATA_VALID) {
>  		skb->ip_summed = CHECKSUM_UNNECESSARY;
>  	}
> -
> +	/* loopback self test for ethtool */
> +	if (test_bit(__VIRTNET_TESTING, &vi->flags)) {
> +		virtnet_check_lb_frame(vi, skb);
> +		dev_kfree_skb_any(skb);
> +		return;
> +	}

I'm not sure it's a good choice to add such a check in the fast path. We
may need a test-specific rx interrupt handler (and disable NAPI) for this.

>  	skb->protocol = eth_type_trans(skb, dev);
>  	pr_debug("Receiving skb proto 0x%04x len %i type %i\n",
>  		 ntohs(skb->protocol), skb->len, skb->pkt_type);
> @@ -813,6 +851,9 @@ static int virtnet_open(struct net_device *dev)
>  {
>  	struct virtnet_info *vi = netdev_priv(dev);
>  	int i;
> +	/* disallow open during test */
> +	if (test_bit(__VIRTNET_TESTING, &vi->flags))
> +		return -EBUSY;
>
>  	for (i = 0; i < vi->max_queue_pairs; i++) {
>  		if (i < vi->curr_queue_pairs)
> @@ -1363,12 +1404,158 @@ static void virtnet_get_channels(struct net_device *dev,
>  	channels->other_count = 0;
>  }
>
> +static int virtnet_reset(struct virtnet_info *vi);
> +
> +static void virtnet_create_lb_frame(struct sk_buff *skb,
> +				    unsigned int frame_size)
> +{
> +	memset(skb->data, 0xFF, frame_size);
> +	frame_size &= ~1;
> +	memset(&skb->data[frame_size / 2], 0xAA, frame_size / 2 - 1);
> +	memset(&skb->data[frame_size / 2 + 10], 0xBE, 1);
> +	memset(&skb->data[frame_size / 2 + 12], 0xAF, 1);
> +}
> +
> +static int virtnet_start_loopback(struct virtnet_info *vi)
> +{
> +	if (!virtnet_send_command(vi, VIRTIO_NET_CTRL_LOOPBACK,
> +				  VIRTIO_NET_CTRL_LOOPBACK_SET, NULL, NULL)) {
> +		dev_warn(&vi->dev->dev, "Failed to set loopback.\n");
> +		return -EINVAL;
> +	}
> +	return 0;
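The loopback frame pattern in the patch (0xFF header byte, 0xAA body, 0xBE/0xAF markers at fixed offsets past the midpoint) is self-contained and easy to lift out. A standalone copy of the create/check pair, simplified to plain buffers instead of sk_buffs:

```c
#include <string.h>

static void create_lb_frame(unsigned char *data, unsigned int frame_size)
{
	memset(data, 0xFF, frame_size);
	frame_size &= ~1u;
	memset(&data[frame_size / 2], 0xAA, frame_size / 2 - 1);
	data[frame_size / 2 + 10] = 0xBE;	/* marker bytes the rx side */
	data[frame_size / 2 + 12] = 0xAF;	/* looks for */
}

/* returns nonzero when the buffer carries a valid loopback frame */
static int check_lb_frame(const unsigned char *data, unsigned int frame_size)
{
	return data[3] == 0xFF &&
	       data[frame_size / 2 + 10] == 0xBE &&
	       data[frame_size / 2 + 12] == 0xAF;
}
```

The driver counts down lb_count each time check succeeds; when it reaches zero, every injected frame made the round trip through the device.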
Re: [PATCH] acpi: fan.c: printk replacement
On Tue, Aug 26, 2014 at 11:22:12PM +0200, Rafael J. Wysocki wrote:
> On Tuesday, August 26, 2014 01:59:02 PM Joe Perches wrote:
> > On Tue, 2014-08-26 at 23:02 +0200, Rafael J. Wysocki wrote:
> > > On Tuesday, August 26, 2014 09:00:39 PM Sudip Mukherjee wrote:
> > > > On Tue, Aug 26, 2014 at 12:45:20AM +0200, Rafael J. Wysocki wrote:
> > > > > On Friday, August 22, 2014 05:33:21 PM Sudip Mukherjee wrote:
> > > > > > printk replaced with corresponding dev_err and dev_info
> > > > > > fixed one broken user-visible string
> > > > > > multiline comment edited for correct commenting style
> > > > > > asm/uaccess.h replaced with linux/uaccess.h
> > > > > >
> > > > > > Signed-off-by: Sudip Mukherjee
> > > > > > ---
> > > > > >  drivers/acpi/fan.c | 18 +-
> > > > > >  1 file changed, 9 insertions(+), 9 deletions(-)
> > > > > >
> > > > > > diff --git a/drivers/acpi/fan.c b/drivers/acpi/fan.c
> > > > > > index 8acf53e..7900d55 100644
> > > > > > --- a/drivers/acpi/fan.c
> > > > > > +++ b/drivers/acpi/fan.c
> > > > > > @@ -27,7 +27,7 @@
> > > > > >  #include
> > > > > >  #include
> > > > > >  #include
> > > > > > -#include
> > > > > > +#include
> > > > > >  #include
> > > > > >  #include
> > > > > >
> > > > > > @@ -127,8 +127,9 @@ static const struct thermal_cooling_device_ops fan_cooling_ops = {
> > > > > >  };
> > > > > >
> > > > > >  /* --------------------------------------------------------------
> > > > > > -   Driver Interface
> > > > > > -   -------------------------------------------------------------- */
> > > > > > + * Driver Interface
> > > > > > + * --------------------------------------------------------------
> > > > > > +*/
> > > > > >
> > > > > >  static int acpi_fan_add(struct acpi_device *device)
> > > > > >  {
> > > > > > @@ -143,7 +144,7 @@ static int acpi_fan_add(struct acpi_device *device)
> > > > > >
> > > > > >  	result = acpi_bus_update_power(device->handle, NULL);
> > > > > >  	if (result) {
> > > > > > -		printk(KERN_ERR PREFIX "Setting initial power state\n");
> > > > > > +		dev_err(&device->dev, PREFIX "Setting initial power state\n");
> > > > >
> > > > > While at it, please define a proper pr_fmt() for this file and get
> > > > > rid of PREFIX too.
> > > > >
> > > > > Otherwise I don't see a compelling reason to apply this.
> > > >
> > > > Hi,
> > > > Since in the patch I am not using any pr_*, I am unable to
> > > > understand why you are asking for a proper pr_fmt().
> > >
> > > Never mind, I was confused somehow, not exactly sure why. Sorry about
> > > that.
> > >
> > > > I can get rid of the PREFIX. Then should I use pr_* in the patch
> > > > instead of dev_*? My understanding was that dev_* is preferred
> > > > over pr_*. Waiting for your suggestion on this.
> > >
> > > Well, that really depends on the particular case. It really is
> > > better to use dev_err() here, but then PREFIX with it is not really
> > > useful, so please just drop PREFIX from the new messages.
> >
> > PREFIX is "ACPI: " so I think the idea is to be able to grep for that.
>
> I'm not sure how useful that is in this particular case. You can grep
> for "power state" instead just fine ...
>
> Rafael

Then there is one more printk which prints whether the fan state is on or
off:

	dev_info(&device->dev, PREFIX "%s [%s] (%s)\n",
		 acpi_device_name(device), acpi_device_bid(device),
		 !device->power.state ? "on" : "off");

So if we drop the PREFIX and someone wants to grep for this fan on/off,
then how to do that? After removing PREFIX, in dmesg I am getting it as:

[    2.056204] fan PNP0C0B:00: Fan [FAN0] (off)
[    2.056225] fan PNP0C0B:01: Fan [FAN1] (off)
[    2.056245] fan PNP0C0B:02: Fan [FAN2] (off)
[    2.056263] fan PNP0C0B:03: Fan [FAN3] (off)
[    2.056283] fan PNP0C0B:04: Fan [FAN4] (off)

thanks
sudip
Re: [PATCH RFC v7 net-next 00/28] BPF syscall
On Tue, Aug 26, 2014 at 9:49 PM, Andy Lutomirski wrote: > On Tue, Aug 26, 2014 at 9:35 PM, Alexei Starovoitov wrote: >> On Tue, Aug 26, 2014 at 8:56 PM, Andy Lutomirski wrote: >>> On Aug 26, 2014 7:29 PM, "Alexei Starovoitov" wrote: Hi Ingo, David, posting whole thing again as RFC to get feedback on syscall only. If syscall bpf(int cmd, union bpf_attr *attr, unsigned int size) is ok, I'll split them into small chunks as requested and will repost without RFC. >>> >>> IMO it's much easier to review a syscall if we just look at a >>> specification of what it does. The code is, in some sense, secondary. >> >> 'specification of what it does'... hmm, you mean beyond what's >> there in commit logs and in Documentation/networking/filter.txt ? >> Aren't samples at the end give an idea on 'what it does'? >> I'm happy to add 'specification', I just don't understand yet what >> it suppose to talk about beyond what's already written. >> I understand that the patches are missing explanation on 'why' >> the syscall is being added, but I don't think it's what you're asking... > > I mean a hopefully short document that defines what the syscall does. > It should be precise enough that one could, in principle, implement > the syscall just by reading the document and that one could use the > syscall just by reading the document. > > Given that there's a whole instruction set to go with it, it may end > up being moderately complicated or saying things like "see this other > thing for a description of the instruction set" and "there are some > extensible sets of functions you can call with it". I'm still lost. Here is the quote from Documentation/networking/filter.txt " 'maps' is a generic storage of different types for sharing data between kernel and userspace. 
The maps are accessed from user space via BPF syscall, which has commands: - create a map with given type and attributes map_fd = bpf(BPF_MAP_CREATE, union bpf_attr *attr, u32 size) using attr->map_type, attr->key_size, attr->value_size, attr->max_entries returns process-local file descriptor or negative error - lookup key in a given map err = bpf(BPF_MAP_LOOKUP_ELEM, union bpf_attr *attr, u32 size) using attr->map_fd, attr->key, attr->value returns zero and stores found elem into value or negative error - create or update key/value pair in a given map err = bpf(BPF_MAP_UPDATE_ELEM, union bpf_attr *attr, u32 size) using attr->map_fd, attr->key, attr->value returns zero or negative error - find and delete element by key in a given map err = bpf(BPF_MAP_DELETE_ELEM, union bpf_attr *attr, u32 size) using attr->map_fd, attr->key - to delete map: close(fd) Exiting process will delete maps automatically userspace programs uses this API to create/populate/read maps that eBPF programs are concurrently updating. " and more in commit log: " - load eBPF program fd = bpf(BPF_PROG_LOAD, union bpf_attr *attr, u32 size) where 'attr' is struct { enum bpf_prog_type prog_type; __u32 insn_cnt; struct bpf_insn __user *insns; const char __user *license; }; insns - array of eBPF instructions license - must be GPL compatible to call helper functions marked gpl_only - unload eBPF program close(fd) " Isn't it short and describes what it does? Do you want me to describe what eBPF program can do?
Re: [PATCH RFC v7 net-next 00/28] BPF syscall
On Tue, Aug 26, 2014 at 9:35 PM, Alexei Starovoitov wrote: > On Tue, Aug 26, 2014 at 8:56 PM, Andy Lutomirski wrote: >> On Aug 26, 2014 7:29 PM, "Alexei Starovoitov" wrote: >>> >>> Hi Ingo, David, >>> >>> posting whole thing again as RFC to get feedback on syscall only. >>> If syscall bpf(int cmd, union bpf_attr *attr, unsigned int size) is ok, >>> I'll split them into small chunks as requested and will repost without RFC. >> >> IMO it's much easier to review a syscall if we just look at a >> specification of what it does. The code is, in some sense, secondary. > > 'specification of what it does'... hmm, you mean beyond what's > there in commit logs and in Documentation/networking/filter.txt ? > Aren't samples at the end give an idea on 'what it does'? > I'm happy to add 'specification', I just don't understand yet what > it suppose to talk about beyond what's already written. > I understand that the patches are missing explanation on 'why' > the syscall is being added, but I don't think it's what you're asking... I mean a hopefully short document that defines what the syscall does. It should be precise enough that one could, in principle, implement the syscall just by reading the document and that one could use the syscall just by reading the document. Given that there's a whole instruction set to go with it, it may end up being moderately complicated or saying things like "see this other thing for a description of the instruction set" and "there are some extensible sets of functions you can call with it". --Andy -- Andy Lutomirski AMA Capital Management, LLC
Re: [PATCH tip/core/rcu 1/2] rcu: Parallelize and economize NOCB kthread wakeups
On (Sat) 23 Aug 2014 [03:43:38], Pranith Kumar wrote: > On Fri, Aug 22, 2014 at 5:53 PM, Paul E. McKenney > wrote: > > > > Hmmm... Please try replacing the synchronize_rcu() in > > __sysrq_swap_key_ops() with (say) schedule_timeout_interruptible(HZ / 10). > > I bet that gets rid of the hang. (And also introduces a low-probability > > bug, but should be OK for testing.) > > > > The other thing to try is to revert your patch that turned my event > > traces into printk()s, then put an ftrace_dump(DUMP_ALL); just after > > the synchronize_rcu() -- that might make it so that the ftrace data > > actually gets dumped out. > > > > I was able to reproduce this error on my Ubuntu 14.04 machine. I think > I found the root cause of the problem after several kvm runs. > > The problem is that earlier we were waiting on nocb_head and now we > are waiting on nocb_leader_wake. > > So there are a lot of nocb callbacks which are enqueued before the > nocb thread is spawned. This sets up nocb_head to be non-null, because > of which the nocb kthread used to wake up immediately after sleeping. > > Now that we have switched to nocb_leader_wake, this is not being set > when there are pending callbacks, unless the callbacks overflow the > qhimark. The pending callbacks were around 7000 when the boot hangs. > > So setting the qhimark using the boot parameter rcutree.qhimark=5000 > is one way to allow us to boot past the point by forcefully waking up > the nocb kthread. I am not sure this is fool-proof. > > Another option to start the nocb kthreads with nocb_leader_wake set, > so that it can handle any pending callbacks. The following patch also > allows us to boot properly. > > Phew! 
Let me know if this makes any sense :) > > diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h > index 00dc411..4c397aa 100644 > --- a/kernel/rcu/tree_plugin.h > +++ b/kernel/rcu/tree_plugin.h > @@ -2386,6 +2386,9 @@ static int rcu_nocb_kthread(void *arg) > struct rcu_head **tail; > struct rcu_data *rdp = arg; > > + if (rdp->nocb_leader == rdp) > + rdp->nocb_leader_wake = true; > + > /* Each pass through this loop invokes one batch of callbacks */ > for (;;) { > /* Wait for callbacks. */ Yes, this patch helps my case as well. Thanks! Amit
[PATCH v10 03/21] Fix XIP fault vs truncate race
Pagecache faults recheck i_size after taking the page lock to ensure that the fault didn't race against a truncate. We don't have a page to lock in the XIP case, so use the i_mmap_mutex instead. It is locked in the truncate path in unmap_mapping_range() after updating i_size. So while we hold it in the fault path, we are guaranteed that either i_size has already been updated in the truncate path, or that the truncate will subsequently call zap_page_range_single() and so remove the mapping we have just inserted. There is a window of time in which i_size has been reduced and the thread has a mapping to a page which will be removed from the file, but this is harmless as the page will not be allocated to a different purpose before the thread's access to it is revoked. Signed-off-by: Matthew Wilcox Reviewed-by: Jan Kara Acked-by: Kirill A. Shutemov --- mm/filemap_xip.c | 24 ++-- 1 file changed, 22 insertions(+), 2 deletions(-) diff --git a/mm/filemap_xip.c b/mm/filemap_xip.c index d8d9fe3..c8d23e9 100644 --- a/mm/filemap_xip.c +++ b/mm/filemap_xip.c @@ -260,8 +260,17 @@ again: __xip_unmap(mapping, vmf->pgoff); found: + /* We must recheck i_size under i_mmap_mutex */ + mutex_lock(>i_mmap_mutex); + size = (i_size_read(inode) + PAGE_CACHE_SIZE - 1) >> + PAGE_CACHE_SHIFT; + if (unlikely(vmf->pgoff >= size)) { + mutex_unlock(>i_mmap_mutex); + return VM_FAULT_SIGBUS; + } err = vm_insert_mixed(vma, (unsigned long)vmf->virtual_address, xip_pfn); + mutex_unlock(>i_mmap_mutex); if (err == -ENOMEM) return VM_FAULT_OOM; /* @@ -285,16 +294,27 @@ found: } if (error != -ENODATA) goto out; + + /* We must recheck i_size under i_mmap_mutex */ + mutex_lock(>i_mmap_mutex); + size = (i_size_read(inode) + PAGE_CACHE_SIZE - 1) >> + PAGE_CACHE_SHIFT; + if (unlikely(vmf->pgoff >= size)) { + ret = VM_FAULT_SIGBUS; + goto unlock; + } /* not shared and writable, use xip_sparse_page() */ page = xip_sparse_page(); if (!page) - goto out; + goto unlock; err = vm_insert_page(vma, (unsigned 
long)vmf->virtual_address, page); if (err == -ENOMEM) - goto out; + goto unlock; ret = VM_FAULT_NOPAGE; +unlock: + mutex_unlock(&mapping->i_mmap_mutex); out: write_seqcount_end(&xip_sparse_seq); mutex_unlock(&xip_sparse_mutex); -- 2.0.0
[PATCH v10 17/21] ext2: Remove ext2_aops_xip
We shouldn't need a special address_space_operations any more Signed-off-by: Matthew Wilcox --- fs/ext2/ext2.h | 1 - fs/ext2/inode.c | 7 +-- fs/ext2/namei.c | 4 ++-- 3 files changed, 3 insertions(+), 9 deletions(-) diff --git a/fs/ext2/ext2.h b/fs/ext2/ext2.h index b30c3bd..b8b1c11 100644 --- a/fs/ext2/ext2.h +++ b/fs/ext2/ext2.h @@ -793,7 +793,6 @@ extern const struct file_operations ext2_xip_file_operations; /* inode.c */ extern const struct address_space_operations ext2_aops; -extern const struct address_space_operations ext2_aops_xip; extern const struct address_space_operations ext2_nobh_aops; /* namei.c */ diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c index 154cbcf..034fd42 100644 --- a/fs/ext2/inode.c +++ b/fs/ext2/inode.c @@ -891,11 +891,6 @@ const struct address_space_operations ext2_aops = { .error_remove_page = generic_error_remove_page, }; -const struct address_space_operations ext2_aops_xip = { - .bmap = ext2_bmap, - .direct_IO = ext2_direct_IO, -}; - const struct address_space_operations ext2_nobh_aops = { .readpage = ext2_readpage, .readpages = ext2_readpages, @@ -1394,7 +1389,7 @@ struct inode *ext2_iget (struct super_block *sb, unsigned long ino) if (S_ISREG(inode->i_mode)) { inode->i_op = _file_inode_operations; if (test_opt(inode->i_sb, XIP)) { - inode->i_mapping->a_ops = _aops_xip; + inode->i_mapping->a_ops = _aops; inode->i_fop = _xip_file_operations; } else if (test_opt(inode->i_sb, NOBH)) { inode->i_mapping->a_ops = _nobh_aops; diff --git a/fs/ext2/namei.c b/fs/ext2/namei.c index 7ca803f..0db888c 100644 --- a/fs/ext2/namei.c +++ b/fs/ext2/namei.c @@ -105,7 +105,7 @@ static int ext2_create (struct inode * dir, struct dentry * dentry, umode_t mode inode->i_op = _file_inode_operations; if (test_opt(inode->i_sb, XIP)) { - inode->i_mapping->a_ops = _aops_xip; + inode->i_mapping->a_ops = _aops; inode->i_fop = _xip_file_operations; } else if (test_opt(inode->i_sb, NOBH)) { inode->i_mapping->a_ops = _nobh_aops; @@ -126,7 +126,7 @@ static int 
ext2_tmpfile(struct inode *dir, struct dentry *dentry, umode_t mode) inode->i_op = &ext2_file_inode_operations; if (test_opt(inode->i_sb, XIP)) { - inode->i_mapping->a_ops = &ext2_aops_xip; + inode->i_mapping->a_ops = &ext2_aops; inode->i_fop = &ext2_xip_file_operations; } else if (test_opt(inode->i_sb, NOBH)) { inode->i_mapping->a_ops = &ext2_nobh_aops; -- 2.0.0
[PATCH v10 10/21] Replace xip_truncate_page with dax_truncate_page
It takes a get_block parameter just like nobh_truncate_page() and block_truncate_page() Signed-off-by: Matthew Wilcox --- fs/dax.c | 44 fs/ext2/inode.c| 2 +- include/linux/fs.h | 4 ++-- mm/filemap_xip.c | 40 4 files changed, 47 insertions(+), 43 deletions(-) diff --git a/fs/dax.c b/fs/dax.c index f134078..d54f7d3 100644 --- a/fs/dax.c +++ b/fs/dax.c @@ -443,3 +443,47 @@ int dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf, return result; } EXPORT_SYMBOL_GPL(dax_fault); + +/** + * dax_truncate_page - handle a partial page being truncated in a DAX file + * @inode: The file being truncated + * @from: The file offset that is being truncated to + * @get_block: The filesystem method used to translate file offsets to blocks + * + * Similar to block_truncate_page(), this function can be called by a + * filesystem when it is truncating an DAX file to handle the partial page. + * + * We work in terms of PAGE_CACHE_SIZE here for commonality with + * block_truncate_page(), but we could go down to PAGE_SIZE if the filesystem + * took care of disposing of the unnecessary blocks. Even if the filesystem + * block size is smaller than PAGE_SIZE, we have to zero the rest of the page + * since the file might be mmaped. + */ +int dax_truncate_page(struct inode *inode, loff_t from, get_block_t get_block) +{ + struct buffer_head bh; + pgoff_t index = from >> PAGE_CACHE_SHIFT; + unsigned offset = from & (PAGE_CACHE_SIZE-1); + unsigned length = PAGE_CACHE_ALIGN(from) - from; + int err; + + /* Block boundary? 
Nothing to do */ + if (!length) + return 0; + + memset(, 0, sizeof(bh)); + bh.b_size = PAGE_CACHE_SIZE; + err = get_block(inode, index, , 0); + if (err < 0) + return err; + if (buffer_written()) { + void *addr; + err = dax_get_addr(, , inode->i_blkbits); + if (err < 0) + return err; + memset(addr + offset, 0, length); + } + + return 0; +} +EXPORT_SYMBOL_GPL(dax_truncate_page); diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c index 52978b8..5ac0a34 100644 --- a/fs/ext2/inode.c +++ b/fs/ext2/inode.c @@ -1210,7 +1210,7 @@ static int ext2_setsize(struct inode *inode, loff_t newsize) inode_dio_wait(inode); if (IS_DAX(inode)) - error = xip_truncate_page(inode->i_mapping, newsize); + error = dax_truncate_page(inode, newsize, ext2_get_block); else if (test_opt(inode->i_sb, NOBH)) error = nobh_truncate_page(inode->i_mapping, newsize, ext2_get_block); diff --git a/include/linux/fs.h b/include/linux/fs.h index 338f04b..eee848d 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -2492,7 +2492,7 @@ extern int nonseekable_open(struct inode * inode, struct file * filp); #ifdef CONFIG_FS_XIP int dax_clear_blocks(struct inode *, sector_t block, long size); -extern int xip_truncate_page(struct address_space *mapping, loff_t from); +int dax_truncate_page(struct inode *, loff_t from, get_block_t); ssize_t dax_do_io(int rw, struct kiocb *, struct inode *, struct iov_iter *, loff_t, get_block_t, dio_iodone_t, int flags); int dax_fault(struct vm_area_struct *, struct vm_fault *, get_block_t); @@ -2503,7 +2503,7 @@ static inline int dax_clear_blocks(struct inode *i, sector_t blk, long sz) return 0; } -static inline int xip_truncate_page(struct address_space *mapping, loff_t from) +static inline int dax_truncate_page(struct inode *i, loff_t frm, get_block_t gb) { return 0; } diff --git a/mm/filemap_xip.c b/mm/filemap_xip.c index 9dd45f3..6316578 100644 --- a/mm/filemap_xip.c +++ b/mm/filemap_xip.c @@ -21,43 +21,3 @@ #include #include -/* - * truncate a page used for execute in place 
- * functionality is analog to block_truncate_page but does use get_xip_mem - * to get the page instead of page cache - */ -int -xip_truncate_page(struct address_space *mapping, loff_t from) -{ - pgoff_t index = from >> PAGE_CACHE_SHIFT; - unsigned offset = from & (PAGE_CACHE_SIZE-1); - unsigned blocksize; - unsigned length; - void *xip_mem; - unsigned long xip_pfn; - int err; - - BUG_ON(!mapping->a_ops->get_xip_mem); - - blocksize = 1 << mapping->host->i_blkbits; - length = offset & (blocksize - 1); - - /* Block boundary? Nothing to do */ - if (!length) - return 0; - - length = blocksize - length; - - err = mapping->a_ops->get_xip_mem(mapping, index, 0, - &xip_mem, &xip_pfn); - if (unlikely(err)) { - if (err == -ENODATA) - /* Hole? No need to truncate */ - return 0; - else -
[PATCH v10 02/21] Change direct_access calling convention
In order to support accesses to larger chunks of memory, pass in a 'size' parameter (counted in bytes), and return the amount available at that address. Add a new helper function, bdev_direct_access(), to handle common functionality including partition handling, checking the length requested is positive, checking for the sector being page-aligned, and checking the length of the request does not pass the end of the partition. Signed-off-by: Matthew Wilcox Reviewed-by: Jan Kara Reviewed-by: Boaz Harrosh --- Documentation/filesystems/xip.txt | 15 +-- arch/powerpc/sysdev/axonram.c | 17 - drivers/block/brd.c | 12 +--- drivers/s390/block/dcssblk.c | 21 +--- fs/block_dev.c| 40 +++ fs/ext2/xip.c | 31 +- include/linux/blkdev.h| 6 -- 7 files changed, 84 insertions(+), 58 deletions(-) diff --git a/Documentation/filesystems/xip.txt b/Documentation/filesystems/xip.txt index 0466ee5..b774729 100644 --- a/Documentation/filesystems/xip.txt +++ b/Documentation/filesystems/xip.txt @@ -28,12 +28,15 @@ Implementation Execute-in-place is implemented in three steps: block device operation, address space operation, and file operations. -A block device operation named direct_access is used to retrieve a -reference (pointer) to a block on-disk. The reference is supposed to be -cpu-addressable, physical address and remain valid until the release operation -is performed. A struct block_device reference is used to address the device, -and a sector_t argument is used to identify the individual block. As an -alternative, memory technology devices can be used for this. +A block device operation named direct_access is used to translate the +block device sector number to a page frame number (pfn) that identifies +the physical page for the memory. It also returns a kernel virtual +address that can be used to access the memory. + +The direct_access method takes a 'size' parameter that indicates the +number of bytes being requested. 
The function should return the number +of bytes that can be contiguously accessed at that offset. It may also +return a negative errno if an error occurs. The block device operation is optional, these block devices support it as of today: diff --git a/arch/powerpc/sysdev/axonram.c b/arch/powerpc/sysdev/axonram.c index 830edc8..8709b9f 100644 --- a/arch/powerpc/sysdev/axonram.c +++ b/arch/powerpc/sysdev/axonram.c @@ -139,26 +139,17 @@ axon_ram_make_request(struct request_queue *queue, struct bio *bio) * axon_ram_direct_access - direct_access() method for block device * @device, @sector, @data: see block_device_operations method */ -static int +static long axon_ram_direct_access(struct block_device *device, sector_t sector, - void **kaddr, unsigned long *pfn) + void **kaddr, unsigned long *pfn, long size) { struct axon_ram_bank *bank = device->bd_disk->private_data; - loff_t offset; - - offset = sector; - if (device->bd_part != NULL) - offset += device->bd_part->start_sect; - offset <<= AXON_RAM_SECTOR_SHIFT; - if (offset >= bank->size) { - dev_err(>device->dev, "Access outside of address space\n"); - return -ERANGE; - } + loff_t offset = (loff_t)sector << AXON_RAM_SECTOR_SHIFT; *kaddr = (void *)(bank->ph_addr + offset); *pfn = virt_to_phys(*kaddr) >> PAGE_SHIFT; - return 0; + return bank->size - offset; } static const struct block_device_operations axon_ram_devops = { diff --git a/drivers/block/brd.c b/drivers/block/brd.c index c7d138e..fee10bf 100644 --- a/drivers/block/brd.c +++ b/drivers/block/brd.c @@ -370,25 +370,23 @@ static int brd_rw_page(struct block_device *bdev, sector_t sector, } #ifdef CONFIG_BLK_DEV_XIP -static int brd_direct_access(struct block_device *bdev, sector_t sector, - void **kaddr, unsigned long *pfn) +static long brd_direct_access(struct block_device *bdev, sector_t sector, + void **kaddr, unsigned long *pfn, long size) { struct brd_device *brd = bdev->bd_disk->private_data; struct page *page; if (!brd) return -ENODEV; - if (sector & 
(PAGE_SECTORS-1)) - return -EINVAL; - if (sector + PAGE_SECTORS > get_capacity(bdev->bd_disk)) - return -ERANGE; page = brd_insert_page(brd, sector); if (!page) return -ENOSPC; *kaddr = page_address(page); *pfn = page_to_pfn(page); - return 0; + /* If size > PAGE_SIZE, we could look to see if the next page in the +* file happens to be mapped to the next page of physical RAM */ + return PAGE_SIZE; } #endif diff --git a/drivers/s390/block/dcssblk.c
[PATCH v10 07/21] Replace XIP read and write with DAX I/O
Use the generic AIO infrastructure instead of custom read and write methods. In addition to giving us support for AIO, this adds the missing locking between read() and truncate(). Signed-off-by: Matthew Wilcox Reviewed-by: Ross Zwisler Reviewed-by: Jan Kara --- MAINTAINERS| 6 ++ fs/Makefile| 1 + fs/dax.c | 195 fs/ext2/file.c | 6 +- fs/ext2/inode.c| 8 +- include/linux/fs.h | 18 - mm/filemap.c | 6 +- mm/filemap_xip.c | 234 - 8 files changed, 229 insertions(+), 245 deletions(-) create mode 100644 fs/dax.c diff --git a/MAINTAINERS b/MAINTAINERS index 1ff06de..3f29153 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -2929,6 +2929,12 @@ L: linux-...@vger.kernel.org S: Maintained F: drivers/i2c/busses/i2c-diolan-u2c.c +DIRECT ACCESS (DAX) +M: Matthew Wilcox +L: linux-fsde...@vger.kernel.org +S: Supported +F: fs/dax.c + DIRECTORY NOTIFICATION (DNOTIFY) M: Eric Paris S: Maintained diff --git a/fs/Makefile b/fs/Makefile index 90c8852..0325ec3 100644 --- a/fs/Makefile +++ b/fs/Makefile @@ -28,6 +28,7 @@ obj-$(CONFIG_SIGNALFD)+= signalfd.o obj-$(CONFIG_TIMERFD) += timerfd.o obj-$(CONFIG_EVENTFD) += eventfd.o obj-$(CONFIG_AIO) += aio.o +obj-$(CONFIG_FS_XIP) += dax.o obj-$(CONFIG_FILE_LOCKING) += locks.o obj-$(CONFIG_COMPAT) += compat.o compat_ioctl.o obj-$(CONFIG_BINFMT_AOUT) += binfmt_aout.o diff --git a/fs/dax.c b/fs/dax.c new file mode 100644 index 000..108c68e --- /dev/null +++ b/fs/dax.c @@ -0,0 +1,195 @@ +/* + * fs/dax.c - Direct Access filesystem code + * Copyright (c) 2013-2014 Intel Corporation + * Author: Matthew Wilcox + * Author: Ross Zwisler + * + * This program is free software; you can redistribute it and/or modify it + * under the terms and conditions of the GNU General Public License, + * version 2, as published by the Free Software Foundation. + * + * This program is distributed in the hope it will be useful, but WITHOUT + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or + * FITNESS FOR A PARTICULAR PURPOSE. 
See the GNU General Public License for + * more details. + */ + +#include +#include +#include +#include +#include +#include +#include + +static long dax_get_addr(struct buffer_head *bh, void **addr, unsigned blkbits) +{ + unsigned long pfn; + sector_t sector = bh->b_blocknr << (blkbits - 9); + return bdev_direct_access(bh->b_bdev, sector, addr, , bh->b_size); +} + +static void dax_new_buf(void *addr, unsigned size, unsigned first, loff_t pos, + loff_t end) +{ + loff_t final = end - pos + first; /* The final byte of the buffer */ + + if (first > 0) + memset(addr, 0, first); + if (final < size) + memset(addr + final, 0, size - final); +} + +static bool buffer_written(struct buffer_head *bh) +{ + return buffer_mapped(bh) && !buffer_unwritten(bh); +} + +/* + * When ext4 encounters a hole, it returns without modifying the buffer_head + * which means that we can't trust b_size. To cope with this, we set b_state + * to 0 before calling get_block and, if any bit is set, we know we can trust + * b_size. Unfortunate, really, since ext4 knows precisely how long a hole is + * and would save us time calling get_block repeatedly. 
+ */ +static bool buffer_size_valid(struct buffer_head *bh) +{ + return bh->b_state != 0; +} + +static ssize_t dax_io(int rw, struct inode *inode, struct iov_iter *iter, + loff_t start, loff_t end, get_block_t get_block, + struct buffer_head *bh) +{ + ssize_t retval = 0; + loff_t pos = start; + loff_t max = start; + loff_t bh_max = start; + void *addr; + bool hole = false; + + if (rw != WRITE) + end = min(end, i_size_read(inode)); + + while (pos < end) { + unsigned len; + if (pos == max) { + unsigned blkbits = inode->i_blkbits; + sector_t block = pos >> blkbits; + unsigned first = pos - (block << blkbits); + long size; + + if (pos == bh_max) { + bh->b_size = PAGE_ALIGN(end - pos); + bh->b_state = 0; + retval = get_block(inode, block, bh, + rw == WRITE); + if (retval) + break; + if (!buffer_size_valid(bh)) + bh->b_size = 1 << blkbits; + bh_max = pos - first + bh->b_size; + } else { +
[PATCH v10 20/21] ext4: Add DAX functionality
From: Ross Zwisler This is a port of the DAX functionality found in the current version of ext2. Signed-off-by: Ross Zwisler Reviewed-by: Andreas Dilger [heavily tweaked] Signed-off-by: Matthew Wilcox --- Documentation/filesystems/dax.txt | 1 + Documentation/filesystems/ext4.txt | 2 ++ fs/ext4/ext4.h | 6 + fs/ext4/file.c | 49 ++-- fs/ext4/indirect.c | 18 ++ fs/ext4/inode.c| 51 -- fs/ext4/namei.c| 10 ++-- fs/ext4/super.c| 39 - 8 files changed, 148 insertions(+), 28 deletions(-) diff --git a/Documentation/filesystems/dax.txt b/Documentation/filesystems/dax.txt index ebcd97f..be376d9 100644 --- a/Documentation/filesystems/dax.txt +++ b/Documentation/filesystems/dax.txt @@ -73,6 +73,7 @@ or a write()) work correctly. These filesystems may be used for inspiration: - ext2: the second extended filesystem, see Documentation/filesystems/ext2.txt +- ext4: the fourth extended filesystem, see Documentation/filesystems/ext4.txt Shortcomings diff --git a/Documentation/filesystems/ext4.txt b/Documentation/filesystems/ext4.txt index 919a329..9c511c4 100644 --- a/Documentation/filesystems/ext4.txt +++ b/Documentation/filesystems/ext4.txt @@ -386,6 +386,8 @@ max_dir_size_kb=n This limits the size of directories so that any i_version Enable 64-bit inode version support. This option is off by default. 
+daxUse direct access if possible + Data Mode = There are 3 different data modes: diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h index 5b19760..c065a3e 100644 --- a/fs/ext4/ext4.h +++ b/fs/ext4/ext4.h @@ -969,6 +969,11 @@ struct ext4_inode_info { #define EXT4_MOUNT_ERRORS_MASK 0x00070 #define EXT4_MOUNT_MINIX_DF0x00080 /* Mimics the Minix statfs */ #define EXT4_MOUNT_NOLOAD 0x00100 /* Don't use existing journal*/ +#ifdef CONFIG_FS_DAX +#define EXT4_MOUNT_DAX 0x00200 /* Execute in place */ +#else +#define EXT4_MOUNT_DAX 0 +#endif #define EXT4_MOUNT_DATA_FLAGS 0x00C00 /* Mode for data writes: */ #define EXT4_MOUNT_JOURNAL_DATA0x00400 /* Write data to journal */ #define EXT4_MOUNT_ORDERED_DATA0x00800 /* Flush data before commit */ @@ -2558,6 +2563,7 @@ extern const struct file_operations ext4_dir_operations; /* file.c */ extern const struct inode_operations ext4_file_inode_operations; extern const struct file_operations ext4_file_operations; +extern const struct file_operations ext4_dax_file_operations; extern loff_t ext4_llseek(struct file *file, loff_t offset, int origin); /* inline.c */ diff --git a/fs/ext4/file.c b/fs/ext4/file.c index aca7b24..9c7bde5 100644 --- a/fs/ext4/file.c +++ b/fs/ext4/file.c @@ -95,7 +95,7 @@ ext4_file_write_iter(struct kiocb *iocb, struct iov_iter *from) struct inode *inode = file_inode(iocb->ki_filp); struct mutex *aio_mutex = NULL; struct blk_plug plug; - int o_direct = file->f_flags & O_DIRECT; + int o_direct = io_is_direct(file); int overwrite = 0; size_t length = iov_iter_count(from); ssize_t ret; @@ -191,6 +191,27 @@ errout: return ret; } +#ifdef CONFIG_FS_DAX +static int ext4_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf) +{ + return dax_fault(vma, vmf, ext4_get_block); + /* Is this the right get_block? 
*/ +} + +static int ext4_dax_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf) +{ + return dax_mkwrite(vma, vmf, ext4_get_block); +} + +static const struct vm_operations_struct ext4_dax_vm_ops = { + .fault = ext4_dax_fault, + .page_mkwrite = ext4_dax_mkwrite, + .remap_pages= generic_file_remap_pages, +}; +#else +#define ext4_dax_vm_opsext4_file_vm_ops +#endif + static const struct vm_operations_struct ext4_file_vm_ops = { .fault = filemap_fault, .map_pages = filemap_map_pages, @@ -201,7 +222,12 @@ static const struct vm_operations_struct ext4_file_vm_ops = { static int ext4_file_mmap(struct file *file, struct vm_area_struct *vma) { file_accessed(file); - vma->vm_ops = _file_vm_ops; + if (IS_DAX(file_inode(file))) { + vma->vm_ops = _dax_vm_ops; + vma->vm_flags |= VM_MIXEDMAP; + } else { + vma->vm_ops = _file_vm_ops; + } return 0; } @@ -600,6 +626,25 @@ const struct file_operations ext4_file_operations = { .fallocate = ext4_fallocate, }; +#ifdef CONFIG_FS_DAX +const struct file_operations ext4_dax_file_operations = { + .llseek = ext4_llseek, + .read = new_sync_read, + .write = new_sync_write,
[PATCH v10 21/21] brd: Rename XIP to DAX
From: Matthew Wilcox

Since this relates to FS_XIP, not KERNEL_XIP, it should be called DAX
instead of XIP.

Signed-off-by: Matthew Wilcox
---
 drivers/block/Kconfig | 13 +++--
 drivers/block/brd.c   | 14 +++---
 fs/Kconfig            |  4 ++--
 3 files changed, 16 insertions(+), 15 deletions(-)

diff --git a/drivers/block/Kconfig b/drivers/block/Kconfig
index 014a1cf..1b8094d 100644
--- a/drivers/block/Kconfig
+++ b/drivers/block/Kconfig
@@ -393,14 +393,15 @@ config BLK_DEV_RAM_SIZE
 	  The default value is 4096 kilobytes. Only change this if you know
 	  what you are doing.
 
-config BLK_DEV_XIP
-	bool "Support XIP filesystems on RAM block device"
-	depends on BLK_DEV_RAM
+config BLK_DEV_RAM_DAX
+	bool "Support Direct Access (DAX) to RAM block devices"
+	depends on BLK_DEV_RAM && FS_DAX
 	default n
 	help
-	  Support XIP filesystems (such as ext2 with XIP support on) on
-	  top of block ram device. This will slightly enlarge the kernel, and
-	  will prevent RAM block device backing store memory from being
+	  Support filesystems using DAX to access RAM block devices.  This
+	  avoids double-buffering data in the page cache before copying it
+	  to the block device.  Answering Y will slightly enlarge the kernel,
+	  and will prevent RAM block device backing store memory from being
 	  allocated from highmem (only a problem for highmem systems).
 
 config CDROM_PKTCDVD
diff --git a/drivers/block/brd.c b/drivers/block/brd.c
index fee10bf..344681a 100644
--- a/drivers/block/brd.c
+++ b/drivers/block/brd.c
@@ -97,13 +97,13 @@ static struct page *brd_insert_page(struct brd_device *brd, sector_t sector)
 	 * Must use NOIO because we don't want to recurse back into the
 	 * block or filesystem layers from page reclaim.
 	 *
-	 * Cannot support XIP and highmem, because our ->direct_access
-	 * routine for XIP must return memory that is always addressable.
-	 * If XIP was reworked to use pfns and kmap throughout, this
+	 * Cannot support DAX and highmem, because our ->direct_access
+	 * routine for DAX must return memory that is always addressable.
+	 * If DAX was reworked to use pfns and kmap throughout, this
 	 * restriction might be able to be lifted.
 	 */
 	gfp_flags = GFP_NOIO | __GFP_ZERO;
-#ifndef CONFIG_BLK_DEV_XIP
+#ifndef CONFIG_BLK_DEV_RAM_DAX
 	gfp_flags |= __GFP_HIGHMEM;
 #endif
 	page = alloc_page(gfp_flags);
@@ -369,7 +369,7 @@ static int brd_rw_page(struct block_device *bdev, sector_t sector,
 	return err;
 }
 
-#ifdef CONFIG_BLK_DEV_XIP
+#ifdef CONFIG_BLK_DEV_RAM_DAX
 static long brd_direct_access(struct block_device *bdev, sector_t sector,
 			void **kaddr, unsigned long *pfn, long size)
 {
@@ -388,6 +388,8 @@ static long brd_direct_access(struct block_device *bdev, sector_t sector,
 	 * file happens to be mapped to the next page of physical RAM */
 	return PAGE_SIZE;
 }
+#else
+#define brd_direct_access NULL
 #endif
 
 static int brd_ioctl(struct block_device *bdev, fmode_t mode,
@@ -428,9 +430,7 @@ static const struct block_device_operations brd_fops = {
 	.owner =		THIS_MODULE,
 	.rw_page =		brd_rw_page,
 	.ioctl =		brd_ioctl,
-#ifdef CONFIG_BLK_DEV_XIP
 	.direct_access =	brd_direct_access,
-#endif
 };
 
 /*
diff --git a/fs/Kconfig b/fs/Kconfig
index a9eb53d..117900f 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -34,7 +34,7 @@ source "fs/btrfs/Kconfig"
 source "fs/nilfs2/Kconfig"
 
 config FS_DAX
-	bool "Direct Access support"
+	bool "Direct Access (DAX) support"
 	depends on MMU
 	help
 	  Direct Access (DAX) can be used on memory-backed block devices.
@@ -45,7 +45,7 @@ config FS_DAX
 
 	  If you do not have a block device that is capable of using this,
 	  or if unsure, say N.  Saying Y will increase the size of the kernel
-	  by about 2kB.
+	  by about 5kB.
endif # BLOCK
-- 
2.0.0
[PATCH v10 16/21] Remove CONFIG_EXT2_FS_XIP and rename CONFIG_FS_XIP to CONFIG_FS_DAX
The fewer Kconfig options we have the better. Use the generic CONFIG_FS_DAX to enable XIP support in ext2 as well as in the core. Signed-off-by: Matthew Wilcox --- fs/Kconfig | 21 ++--- fs/Makefile| 2 +- fs/ext2/Kconfig| 11 --- fs/ext2/ext2.h | 2 +- fs/ext2/file.c | 4 ++-- fs/ext2/super.c| 4 ++-- include/linux/fs.h | 4 ++-- 7 files changed, 22 insertions(+), 26 deletions(-) diff --git a/fs/Kconfig b/fs/Kconfig index 312393f..a9eb53d 100644 --- a/fs/Kconfig +++ b/fs/Kconfig @@ -13,13 +13,6 @@ if BLOCK source "fs/ext2/Kconfig" source "fs/ext3/Kconfig" source "fs/ext4/Kconfig" - -config FS_XIP -# execute in place - bool - depends on EXT2_FS_XIP - default y - source "fs/jbd/Kconfig" source "fs/jbd2/Kconfig" @@ -40,6 +33,20 @@ source "fs/ocfs2/Kconfig" source "fs/btrfs/Kconfig" source "fs/nilfs2/Kconfig" +config FS_DAX + bool "Direct Access support" + depends on MMU + help + Direct Access (DAX) can be used on memory-backed block devices. + If the block device supports DAX and the filesystem supports DAX, + then you can avoid using the pagecache to buffer I/Os. Turning + on this option will compile in support for DAX; you will need to + mount the filesystem using the -o xip option. + + If you do not have a block device that is capable of using this, + or if unsure, say N. Saying Y will increase the size of the kernel + by about 2kB. 
+ endif # BLOCK # Posix ACL utility routines diff --git a/fs/Makefile b/fs/Makefile index 0325ec3..df4a4cf 100644 --- a/fs/Makefile +++ b/fs/Makefile @@ -28,7 +28,7 @@ obj-$(CONFIG_SIGNALFD)+= signalfd.o obj-$(CONFIG_TIMERFD) += timerfd.o obj-$(CONFIG_EVENTFD) += eventfd.o obj-$(CONFIG_AIO) += aio.o -obj-$(CONFIG_FS_XIP) += dax.o +obj-$(CONFIG_FS_DAX) += dax.o obj-$(CONFIG_FILE_LOCKING) += locks.o obj-$(CONFIG_COMPAT) += compat.o compat_ioctl.o obj-$(CONFIG_BINFMT_AOUT) += binfmt_aout.o diff --git a/fs/ext2/Kconfig b/fs/ext2/Kconfig index 14a6780..c634874e 100644 --- a/fs/ext2/Kconfig +++ b/fs/ext2/Kconfig @@ -42,14 +42,3 @@ config EXT2_FS_SECURITY If you are not using a security module that requires using extended attributes for file security labels, say N. - -config EXT2_FS_XIP - bool "Ext2 execute in place support" - depends on EXT2_FS && MMU - help - Execute in place can be used on memory-backed block devices. If you - enable this option, you can select to mount block devices which are - capable of this feature without using the page cache. - - If you do not use a block device that is capable of using this, - or if unsure, say N. 
diff --git a/fs/ext2/ext2.h b/fs/ext2/ext2.h index 5ecf570..b30c3bd 100644 --- a/fs/ext2/ext2.h +++ b/fs/ext2/ext2.h @@ -380,7 +380,7 @@ struct ext2_inode { #define EXT2_MOUNT_NO_UID320x000200 /* Disable 32-bit UIDs */ #define EXT2_MOUNT_XATTR_USER 0x004000 /* Extended user attributes */ #define EXT2_MOUNT_POSIX_ACL 0x008000 /* POSIX Access Control Lists */ -#ifdef CONFIG_FS_XIP +#ifdef CONFIG_FS_DAX #define EXT2_MOUNT_XIP 0x01 /* Execute in place */ #else #define EXT2_MOUNT_XIP 0 diff --git a/fs/ext2/file.c b/fs/ext2/file.c index da8dc64..46b333d 100644 --- a/fs/ext2/file.c +++ b/fs/ext2/file.c @@ -25,7 +25,7 @@ #include "xattr.h" #include "acl.h" -#ifdef CONFIG_EXT2_FS_XIP +#ifdef CONFIG_FS_DAX static int ext2_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf) { return dax_fault(vma, vmf, ext2_get_block); @@ -109,7 +109,7 @@ const struct file_operations ext2_file_operations = { .splice_write = iter_file_splice_write, }; -#ifdef CONFIG_EXT2_FS_XIP +#ifdef CONFIG_FS_DAX const struct file_operations ext2_xip_file_operations = { .llseek = generic_file_llseek, .read = new_sync_read, diff --git a/fs/ext2/super.c b/fs/ext2/super.c index 0393c6d..feb53d8 100644 --- a/fs/ext2/super.c +++ b/fs/ext2/super.c @@ -287,7 +287,7 @@ static int ext2_show_options(struct seq_file *seq, struct dentry *root) seq_puts(seq, ",grpquota"); #endif -#if defined(CONFIG_EXT2_FS_XIP) +#ifdef CONFIG_FS_DAX if (sbi->s_mount_opt & EXT2_MOUNT_XIP) seq_puts(seq, ",xip"); #endif @@ -549,7 +549,7 @@ static int parse_options(char *options, struct super_block *sb) break; #endif case Opt_xip: -#ifdef CONFIG_EXT2_FS_XIP +#ifdef CONFIG_FS_DAX set_opt (sbi->s_mount_opt, XIP); #else ext2_msg(sb, KERN_INFO, "xip option not supported"); diff --git a/include/linux/fs.h b/include/linux/fs.h index d73db11..e6b48cc 100644 --- a/include/linux/fs.h +++
[PATCH v10 01/21] axonram: Fix bug in direct_access
The 'pfn' returned by axonram was completely bogus, and has been since
2008.

Signed-off-by: Matthew Wilcox
Reviewed-by: Jan Kara
---
 arch/powerpc/sysdev/axonram.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/sysdev/axonram.c b/arch/powerpc/sysdev/axonram.c
index 47b6b9f..830edc8 100644
--- a/arch/powerpc/sysdev/axonram.c
+++ b/arch/powerpc/sysdev/axonram.c
@@ -156,7 +156,7 @@ axon_ram_direct_access(struct block_device *device, sector_t sector,
 	}
 
 	*kaddr = (void *)(bank->ph_addr + offset);
-	*pfn = virt_to_phys(kaddr) >> PAGE_SHIFT;
+	*pfn = virt_to_phys(*kaddr) >> PAGE_SHIFT;
 
 	return 0;
 }
-- 
2.0.0
[PATCH v10 12/21] Remove get_xip_mem
All callers of get_xip_mem() are now gone. Remove checks for it, initialisers of it, documentation of it and the only implementation of it. Also remove mm/filemap_xip.c as it is now empty. Signed-off-by: Matthew Wilcox --- Documentation/filesystems/Locking | 3 --- fs/exofs/inode.c | 1 - fs/ext2/inode.c | 1 - fs/ext2/xip.c | 45 --- fs/ext2/xip.h | 3 --- fs/open.c | 5 + include/linux/fs.h| 2 -- mm/Makefile | 1 - mm/fadvise.c | 6 -- mm/filemap_xip.c | 23 mm/madvise.c | 2 +- 11 files changed, 6 insertions(+), 86 deletions(-) delete mode 100644 mm/filemap_xip.c diff --git a/Documentation/filesystems/Locking b/Documentation/filesystems/Locking index f1997e9..226ccc3 100644 --- a/Documentation/filesystems/Locking +++ b/Documentation/filesystems/Locking @@ -197,8 +197,6 @@ prototypes: int (*releasepage) (struct page *, int); void (*freepage)(struct page *); int (*direct_IO)(int, struct kiocb *, struct iov_iter *iter, loff_t offset); - int (*get_xip_mem)(struct address_space *, pgoff_t, int, void **, - unsigned long *); int (*migratepage)(struct address_space *, struct page *, struct page *); int (*launder_page)(struct page *); int (*is_partially_uptodate)(struct page *, unsigned long, unsigned long); @@ -223,7 +221,6 @@ invalidatepage: yes releasepage: yes freepage: yes direct_IO: -get_xip_mem: maybe migratepage: yes (both) launder_page: yes is_partially_uptodate: yes diff --git a/fs/exofs/inode.c b/fs/exofs/inode.c index 3f9cafd..c408a53 100644 --- a/fs/exofs/inode.c +++ b/fs/exofs/inode.c @@ -985,7 +985,6 @@ const struct address_space_operations exofs_aops = { .direct_IO = exofs_direct_IO, /* With these NULL has special meaning or default is not exported */ - .get_xip_mem= NULL, .migratepage= NULL, .launder_page = NULL, .is_partially_uptodate = NULL, diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c index 5ac0a34..59d6c7d 100644 --- a/fs/ext2/inode.c +++ b/fs/ext2/inode.c @@ -894,7 +894,6 @@ const struct address_space_operations ext2_aops = { const struct 
address_space_operations ext2_aops_xip = { .bmap = ext2_bmap, - .get_xip_mem= ext2_get_xip_mem, .direct_IO = ext2_direct_IO, }; diff --git a/fs/ext2/xip.c b/fs/ext2/xip.c index 8cfca3a..132d4da 100644 --- a/fs/ext2/xip.c +++ b/fs/ext2/xip.c @@ -13,35 +13,6 @@ #include "ext2.h" #include "xip.h" -static inline long __inode_direct_access(struct inode *inode, sector_t block, - void **kaddr, unsigned long *pfn, long size) -{ - struct block_device *bdev = inode->i_sb->s_bdev; - sector_t sector = block * (PAGE_SIZE / 512); - return bdev_direct_access(bdev, sector, kaddr, pfn, size); -} - -static inline int -__ext2_get_block(struct inode *inode, pgoff_t pgoff, int create, - sector_t *result) -{ - struct buffer_head tmp; - int rc; - - memset(, 0, sizeof(struct buffer_head)); - tmp.b_size = 1 << inode->i_blkbits; - rc = ext2_get_block(inode, pgoff, , create); - *result = tmp.b_blocknr; - - /* did we get a sparse block (hole in the file)? */ - if (!tmp.b_blocknr && !rc) { - BUG_ON(create); - rc = -ENODATA; - } - - return rc; -} - void ext2_xip_verify_sb(struct super_block *sb) { struct ext2_sb_info *sbi = EXT2_SB(sb); @@ -54,19 +25,3 @@ void ext2_xip_verify_sb(struct super_block *sb) "not supported by bdev"); } } - -int ext2_get_xip_mem(struct address_space *mapping, pgoff_t pgoff, int create, - void **kmem, unsigned long *pfn) -{ - long rc; - sector_t block; - - /* first, retrieve the sector number */ - rc = __ext2_get_block(mapping->host, pgoff, create, ); - if (rc) - return rc; - - /* retrieve address of the target data */ - rc = __inode_direct_access(mapping->host, block, kmem, pfn, PAGE_SIZE); - return (rc < 0) ? 
rc : 0; -} diff --git a/fs/ext2/xip.h b/fs/ext2/xip.h index b2592f2..e7b9f0a 100644 --- a/fs/ext2/xip.h +++ b/fs/ext2/xip.h @@ -12,10 +12,7 @@ static inline int ext2_use_xip (struct super_block *sb) struct ext2_sb_info *sbi = EXT2_SB(sb); return (sbi->s_mount_opt & EXT2_MOUNT_XIP); } -int ext2_get_xip_mem(struct address_space *, pgoff_t, int, - void **, unsigned long *); #else #define ext2_xip_verify_sb(sb) do { }
[PATCH v10 15/21] ext2: Remove xip.c and xip.h
These files are now empty, so delete them Signed-off-by: Matthew Wilcox --- fs/ext2/Makefile | 1 - fs/ext2/inode.c | 1 - fs/ext2/namei.c | 1 - fs/ext2/super.c | 1 - fs/ext2/xip.c| 15 --- fs/ext2/xip.h| 16 6 files changed, 35 deletions(-) delete mode 100644 fs/ext2/xip.c delete mode 100644 fs/ext2/xip.h diff --git a/fs/ext2/Makefile b/fs/ext2/Makefile index f42af45..445b0e9 100644 --- a/fs/ext2/Makefile +++ b/fs/ext2/Makefile @@ -10,4 +10,3 @@ ext2-y := balloc.o dir.o file.o ialloc.o inode.o \ ext2-$(CONFIG_EXT2_FS_XATTR)+= xattr.o xattr_user.o xattr_trusted.o ext2-$(CONFIG_EXT2_FS_POSIX_ACL) += acl.o ext2-$(CONFIG_EXT2_FS_SECURITY) += xattr_security.o -ext2-$(CONFIG_EXT2_FS_XIP) += xip.o diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c index cba3833..154cbcf 100644 --- a/fs/ext2/inode.c +++ b/fs/ext2/inode.c @@ -34,7 +34,6 @@ #include #include "ext2.h" #include "acl.h" -#include "xip.h" #include "xattr.h" static int __ext2_write_inode(struct inode *inode, int do_sync); diff --git a/fs/ext2/namei.c b/fs/ext2/namei.c index 846c356..7ca803f 100644 --- a/fs/ext2/namei.c +++ b/fs/ext2/namei.c @@ -35,7 +35,6 @@ #include "ext2.h" #include "xattr.h" #include "acl.h" -#include "xip.h" static inline int ext2_add_nondir(struct dentry *dentry, struct inode *inode) { diff --git a/fs/ext2/super.c b/fs/ext2/super.c index d862031..0393c6d 100644 --- a/fs/ext2/super.c +++ b/fs/ext2/super.c @@ -35,7 +35,6 @@ #include "ext2.h" #include "xattr.h" #include "acl.h" -#include "xip.h" static void ext2_sync_super(struct super_block *sb, struct ext2_super_block *es, int wait); diff --git a/fs/ext2/xip.c b/fs/ext2/xip.c deleted file mode 100644 index 66ca113..000 --- a/fs/ext2/xip.c +++ /dev/null @@ -1,15 +0,0 @@ -/* - * linux/fs/ext2/xip.c - * - * Copyright (C) 2005 IBM Corporation - * Author: Carsten Otte (co...@de.ibm.com) - */ - -#include -#include -#include -#include -#include -#include "ext2.h" -#include "xip.h" - diff --git a/fs/ext2/xip.h b/fs/ext2/xip.h deleted file mode 100644 index 
87eeb04..000
--- a/fs/ext2/xip.h
+++ /dev/null
@@ -1,16 +0,0 @@
-/*
- * linux/fs/ext2/xip.h
- *
- * Copyright (C) 2005 IBM Corporation
- * Author: Carsten Otte (co...@de.ibm.com)
- */
-
-#ifdef CONFIG_EXT2_FS_XIP
-static inline int ext2_use_xip (struct super_block *sb)
-{
-	struct ext2_sb_info *sbi = EXT2_SB(sb);
-	return (sbi->s_mount_opt & EXT2_MOUNT_XIP);
-}
-#else
-#define ext2_use_xip(sb)	0
-#endif
-- 
2.0.0
[PATCH v10 18/21] Get rid of most mentions of XIP in ext2
To help people transition, accept the 'xip' mount option (and report it in /proc/mounts), but print a message encouraging people to switch over to the 'dax' option. --- fs/ext2/ext2.h | 13 +++-- fs/ext2/file.c | 2 +- fs/ext2/inode.c | 6 +++--- fs/ext2/namei.c | 8 fs/ext2/super.c | 25 - 5 files changed, 31 insertions(+), 23 deletions(-) diff --git a/fs/ext2/ext2.h b/fs/ext2/ext2.h index b8b1c11..46133a0 100644 --- a/fs/ext2/ext2.h +++ b/fs/ext2/ext2.h @@ -380,14 +380,15 @@ struct ext2_inode { #define EXT2_MOUNT_NO_UID320x000200 /* Disable 32-bit UIDs */ #define EXT2_MOUNT_XATTR_USER 0x004000 /* Extended user attributes */ #define EXT2_MOUNT_POSIX_ACL 0x008000 /* POSIX Access Control Lists */ -#ifdef CONFIG_FS_DAX -#define EXT2_MOUNT_XIP 0x01 /* Execute in place */ -#else -#define EXT2_MOUNT_XIP 0 -#endif +#define EXT2_MOUNT_XIP 0x01 /* Obsolete, use DAX */ #define EXT2_MOUNT_USRQUOTA0x02 /* user quota */ #define EXT2_MOUNT_GRPQUOTA0x04 /* group quota */ #define EXT2_MOUNT_RESERVATION 0x08 /* Preallocation */ +#ifdef CONFIG_FS_DAX +#define EXT2_MOUNT_DAX 0x10 /* Direct Access */ +#else +#define EXT2_MOUNT_DAX 0 +#endif #define clear_opt(o, opt) o &= ~EXT2_MOUNT_##opt @@ -789,7 +790,7 @@ extern int ext2_fsync(struct file *file, loff_t start, loff_t end, int datasync); extern const struct inode_operations ext2_file_inode_operations; extern const struct file_operations ext2_file_operations; -extern const struct file_operations ext2_xip_file_operations; +extern const struct file_operations ext2_dax_file_operations; /* inode.c */ extern const struct address_space_operations ext2_aops; diff --git a/fs/ext2/file.c b/fs/ext2/file.c index 46b333d..5b8cab5 100644 --- a/fs/ext2/file.c +++ b/fs/ext2/file.c @@ -110,7 +110,7 @@ const struct file_operations ext2_file_operations = { }; #ifdef CONFIG_FS_DAX -const struct file_operations ext2_xip_file_operations = { +const struct file_operations ext2_dax_file_operations = { .llseek = generic_file_llseek, .read = new_sync_read, .write 
= new_sync_write, diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c index 034fd42..6434bc0 100644 --- a/fs/ext2/inode.c +++ b/fs/ext2/inode.c @@ -1286,7 +1286,7 @@ void ext2_set_inode_flags(struct inode *inode) inode->i_flags |= S_NOATIME; if (flags & EXT2_DIRSYNC_FL) inode->i_flags |= S_DIRSYNC; - if (test_opt(inode->i_sb, XIP)) + if (test_opt(inode->i_sb, DAX)) inode->i_flags |= S_DAX; } @@ -1388,9 +1388,9 @@ struct inode *ext2_iget (struct super_block *sb, unsigned long ino) if (S_ISREG(inode->i_mode)) { inode->i_op = _file_inode_operations; - if (test_opt(inode->i_sb, XIP)) { + if (test_opt(inode->i_sb, DAX)) { inode->i_mapping->a_ops = _aops; - inode->i_fop = _xip_file_operations; + inode->i_fop = _dax_file_operations; } else if (test_opt(inode->i_sb, NOBH)) { inode->i_mapping->a_ops = _nobh_aops; inode->i_fop = _file_operations; diff --git a/fs/ext2/namei.c b/fs/ext2/namei.c index 0db888c..148f6e3 100644 --- a/fs/ext2/namei.c +++ b/fs/ext2/namei.c @@ -104,9 +104,9 @@ static int ext2_create (struct inode * dir, struct dentry * dentry, umode_t mode return PTR_ERR(inode); inode->i_op = _file_inode_operations; - if (test_opt(inode->i_sb, XIP)) { + if (test_opt(inode->i_sb, DAX)) { inode->i_mapping->a_ops = _aops; - inode->i_fop = _xip_file_operations; + inode->i_fop = _dax_file_operations; } else if (test_opt(inode->i_sb, NOBH)) { inode->i_mapping->a_ops = _nobh_aops; inode->i_fop = _file_operations; @@ -125,9 +125,9 @@ static int ext2_tmpfile(struct inode *dir, struct dentry *dentry, umode_t mode) return PTR_ERR(inode); inode->i_op = _file_inode_operations; - if (test_opt(inode->i_sb, XIP)) { + if (test_opt(inode->i_sb, DAX)) { inode->i_mapping->a_ops = _aops; - inode->i_fop = _xip_file_operations; + inode->i_fop = _dax_file_operations; } else if (test_opt(inode->i_sb, NOBH)) { inode->i_mapping->a_ops = _nobh_aops; inode->i_fop = _file_operations; diff --git a/fs/ext2/super.c b/fs/ext2/super.c index feb53d8..8b9debf 100644 --- a/fs/ext2/super.c +++ 
b/fs/ext2/super.c @@ -290,6 +290,8 @@ static int ext2_show_options(struct seq_file *seq, struct dentry *root) #ifdef CONFIG_FS_DAX if (sbi->s_mount_opt &
[PATCH v10 09/21] Replace the XIP page fault handler with the DAX page fault handler
Instead of calling aops->get_xip_mem from the fault handler, the
filesystem passes a get_block_t that is used to find the appropriate
blocks.

Signed-off-by: Matthew Wilcox
Reviewed-by: Jan Kara
---
 fs/dax.c           | 215 +
 fs/ext2/file.c     | 35 -
 include/linux/fs.h | 4 +-
 mm/filemap_xip.c   | 206 --
 4 files changed, 251 insertions(+), 209 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 02e226f..f134078 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -19,9 +19,13 @@
 #include
 #include
 #include
+#include
+#include
+#include
 #include
 #include
 #include
+#include
 
 int dax_clear_blocks(struct inode *inode, sector_t block, long size)
 {
@@ -64,6 +68,14 @@ static long dax_get_addr(struct buffer_head *bh, void **addr, unsigned blkbits)
 	return bdev_direct_access(bh->b_bdev, sector, addr, &pfn, bh->b_size);
 }
 
+static long dax_get_pfn(struct buffer_head *bh, unsigned long *pfn,
+						unsigned blkbits)
+{
+	void *addr;
+	sector_t sector = bh->b_blocknr << (blkbits - 9);
+	return bdev_direct_access(bh->b_bdev, sector, &addr, pfn, bh->b_size);
+}
+
 static void dax_new_buf(void *addr, unsigned size, unsigned first, loff_t pos,
 			loff_t end)
 {
@@ -228,3 +240,206 @@ ssize_t dax_do_io(int rw, struct kiocb *iocb, struct inode *inode,
 	return retval;
 }
 EXPORT_SYMBOL_GPL(dax_do_io);
+
+/*
+ * The user has performed a load from a hole in the file.  Allocating
+ * a new page in the file would cause excessive storage usage for
+ * workloads with sparse files.  We allocate a page cache page instead.
+ * We'll kick it out of the page cache if it's ever written to,
+ * otherwise it will simply fall out of the page cache under memory
+ * pressure without ever having been dirtied.
+ */
+static int dax_load_hole(struct address_space *mapping, struct page *page,
+							struct vm_fault *vmf)
+{
+	unsigned long size;
+	struct inode *inode = mapping->host;
+	if (!page)
+		page = find_or_create_page(mapping, vmf->pgoff,
+						GFP_KERNEL | __GFP_ZERO);
+	if (!page)
+		return VM_FAULT_OOM;
+	/* Recheck i_size under page lock to avoid truncate race */
+	size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
+	if (vmf->pgoff >= size) {
+		unlock_page(page);
+		page_cache_release(page);
+		return VM_FAULT_SIGBUS;
+	}
+
+	vmf->page = page;
+	return VM_FAULT_LOCKED;
+}
+
+static int copy_user_bh(struct page *to, struct buffer_head *bh,
+			unsigned blkbits, unsigned long vaddr)
+{
+	void *vfrom, *vto;
+	if (dax_get_addr(bh, &vfrom, blkbits) < 0)
+		return -EIO;
+	vto = kmap_atomic(to);
+	copy_user_page(vto, vfrom, vaddr, to);
+	kunmap_atomic(vto);
+	return 0;
+}
+
+static int do_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
+			get_block_t get_block)
+{
+	struct file *file = vma->vm_file;
+	struct inode *inode = file_inode(file);
+	struct address_space *mapping = file->f_mapping;
+	struct page *page;
+	struct buffer_head bh;
+	unsigned long vaddr = (unsigned long)vmf->virtual_address;
+	unsigned blkbits = inode->i_blkbits;
+	sector_t block;
+	pgoff_t size;
+	unsigned long pfn;
+	int error;
+	int major = 0;
+
+	size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
+	if (vmf->pgoff >= size)
+		return VM_FAULT_SIGBUS;
+
+	memset(&bh, 0, sizeof(bh));
+	block = (sector_t)vmf->pgoff << (PAGE_SHIFT - blkbits);
+	bh.b_size = PAGE_SIZE;
+
+ repeat:
+	page = find_get_page(mapping, vmf->pgoff);
+	if (page) {
+		if (!lock_page_or_retry(page, vma->vm_mm, vmf->flags)) {
+			page_cache_release(page);
+			return VM_FAULT_RETRY;
+		}
+		if (unlikely(page->mapping != mapping)) {
+			unlock_page(page);
+			page_cache_release(page);
+			goto repeat;
+		}
+	}
+
+	error = get_block(inode, block, &bh, 0);
+	if (!error && (bh.b_size < PAGE_SIZE))
+		error = -EIO;
+	if (error)
+		goto unlock_page;
+
+	if (!buffer_written(&bh) && !vmf->cow_page) {
+		if (vmf->flags & FAULT_FLAG_WRITE) {
+			error = get_block(inode, block, &bh, 1);
+			count_vm_event(PGMAJFAULT);
+			mem_cgroup_count_vm_event(vma->vm_mm, PGMAJFAULT);
+			major = VM_FAULT_MAJOR;
+			if (!error && (bh.b_size < PAGE_SIZE))
+
[PATCH v10 14/21] ext2: Remove ext2_use_xip
Replace ext2_use_xip() with test_opt(XIP), which expands to the same code.

Signed-off-by: Matthew Wilcox
---
 fs/ext2/ext2.h  | 4
 fs/ext2/inode.c | 2 +-
 fs/ext2/namei.c | 4 ++--
 3 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/fs/ext2/ext2.h b/fs/ext2/ext2.h
index d9a17d0..5ecf570 100644
--- a/fs/ext2/ext2.h
+++ b/fs/ext2/ext2.h
@@ -380,7 +380,11 @@ struct ext2_inode {
 #define EXT2_MOUNT_NO_UID32		0x000200  /* Disable 32-bit UIDs */
 #define EXT2_MOUNT_XATTR_USER		0x004000  /* Extended user attributes */
 #define EXT2_MOUNT_POSIX_ACL		0x008000  /* POSIX Access Control Lists */
+#ifdef CONFIG_FS_XIP
 #define EXT2_MOUNT_XIP			0x010000  /* Execute in place */
+#else
+#define EXT2_MOUNT_XIP			0
+#endif
 #define EXT2_MOUNT_USRQUOTA		0x020000  /* user quota */
 #define EXT2_MOUNT_GRPQUOTA		0x040000  /* group quota */
 #define EXT2_MOUNT_RESERVATION		0x080000  /* Preallocation */
diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index 59d6c7d..cba3833 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -1394,7 +1394,7 @@ struct inode *ext2_iget (struct super_block *sb, unsigned long ino)
 
 	if (S_ISREG(inode->i_mode)) {
 		inode->i_op = &ext2_file_inode_operations;
-		if (ext2_use_xip(inode->i_sb)) {
+		if (test_opt(inode->i_sb, XIP)) {
 			inode->i_mapping->a_ops = &ext2_aops_xip;
 			inode->i_fop = &ext2_xip_file_operations;
 		} else if (test_opt(inode->i_sb, NOBH)) {
diff --git a/fs/ext2/namei.c b/fs/ext2/namei.c
index c268d0a..846c356 100644
--- a/fs/ext2/namei.c
+++ b/fs/ext2/namei.c
@@ -105,7 +105,7 @@ static int ext2_create (struct inode * dir, struct dentry * dentry, umode_t mode
 		return PTR_ERR(inode);
 
 	inode->i_op = &ext2_file_inode_operations;
-	if (ext2_use_xip(inode->i_sb)) {
+	if (test_opt(inode->i_sb, XIP)) {
 		inode->i_mapping->a_ops = &ext2_aops_xip;
 		inode->i_fop = &ext2_xip_file_operations;
 	} else if (test_opt(inode->i_sb, NOBH)) {
@@ -126,7 +126,7 @@ static int ext2_tmpfile(struct inode *dir, struct dentry *dentry, umode_t mode)
 		return PTR_ERR(inode);
 
 	inode->i_op = &ext2_file_inode_operations;
-	if (ext2_use_xip(inode->i_sb)) {
+	if (test_opt(inode->i_sb, XIP)) {
 		inode->i_mapping->a_ops = &ext2_aops_xip;
 		inode->i_fop = &ext2_xip_file_operations;
 	} else if (test_opt(inode->i_sb, NOBH)) {
-- 
2.0.0
[PATCH v10 08/21] Replace ext2_clear_xip_target with dax_clear_blocks
This is practically generic code; other filesystems will want to call
it from other places, but there's nothing ext2-specific about it.  Make
it a little more generic by allowing it to take a count of the number
of bytes to zero rather than fixing it to a single page.

Thanks to Dave Hansen for suggesting that I need to call cond_resched()
if zeroing more than one page.

Signed-off-by: Matthew Wilcox
---
 fs/dax.c           | 35 +++
 fs/ext2/inode.c    | 8 +---
 fs/ext2/xip.c      | 14 --
 fs/ext2/xip.h      | 3 ---
 include/linux/fs.h | 6 ++
 5 files changed, 46 insertions(+), 20 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 108c68e..02e226f 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -20,8 +20,43 @@
 #include
 #include
 #include
+#include
 #include
 
+int dax_clear_blocks(struct inode *inode, sector_t block, long size)
+{
+	struct block_device *bdev = inode->i_sb->s_bdev;
+	sector_t sector = block << (inode->i_blkbits - 9);
+
+	might_sleep();
+	do {
+		void *addr;
+		unsigned long pfn;
+		long count;
+
+		count = bdev_direct_access(bdev, sector, &addr, &pfn, size);
+		if (count < 0)
+			return count;
+		while (count > 0) {
+			unsigned pgsz = PAGE_SIZE - offset_in_page(addr);
+			if (pgsz > count)
+				pgsz = count;
+			if (pgsz < PAGE_SIZE)
+				memset(addr, 0, pgsz);
+			else
+				clear_page(addr);
+			addr += pgsz;
+			size -= pgsz;
+			count -= pgsz;
+			sector += pgsz / 512;
+			cond_resched();
+		}
+	} while (size);
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(dax_clear_blocks);
+
 static long dax_get_addr(struct buffer_head *bh, void **addr, unsigned blkbits)
 {
 	unsigned long pfn;
diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index 3ccd5fd..52978b8 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -733,10 +733,12 @@ static int ext2_get_blocks(struct inode *inode,
 
 		if (IS_DAX(inode)) {
 			/*
-			 * we need to clear the block
+			 * block must be initialised before we put it in the
+			 * tree so that it's not found by another thread
+			 * before it's initialised
 			 */
-			err = ext2_clear_xip_target (inode,
-					le32_to_cpu(chain[depth-1].key));
+			err = dax_clear_blocks(inode, le32_to_cpu(chain[depth-1].key),
+						1 << inode->i_blkbits);
 			if (err) {
 				mutex_unlock(&ei->truncate_mutex);
 				goto cleanup;
diff --git a/fs/ext2/xip.c b/fs/ext2/xip.c
index bbc5fec..8cfca3a 100644
--- a/fs/ext2/xip.c
+++ b/fs/ext2/xip.c
@@ -42,20 +42,6 @@ __ext2_get_block(struct inode *inode, pgoff_t pgoff, int create,
 	return rc;
 }
 
-int
-ext2_clear_xip_target(struct inode *inode, sector_t block)
-{
-	void *kaddr;
-	unsigned long pfn;
-	long size;
-
-	size = __inode_direct_access(inode, block, &kaddr, &pfn, PAGE_SIZE);
-	if (size < 0)
-		return size;
-	clear_page(kaddr);
-	return 0;
-}
-
 void ext2_xip_verify_sb(struct super_block *sb)
 {
 	struct ext2_sb_info *sbi = EXT2_SB(sb);
diff --git a/fs/ext2/xip.h b/fs/ext2/xip.h
index 29be737..b2592f2 100644
--- a/fs/ext2/xip.h
+++ b/fs/ext2/xip.h
@@ -7,8 +7,6 @@
 #ifdef CONFIG_EXT2_FS_XIP
 extern void ext2_xip_verify_sb (struct super_block *);
 
-extern int ext2_clear_xip_target (struct inode *, sector_t);
-
 static inline int ext2_use_xip (struct super_block *sb)
 {
 	struct ext2_sb_info *sbi = EXT2_SB(sb);
@@ -19,6 +17,5 @@ int ext2_get_xip_mem(struct address_space *, pgoff_t, int,
 #else
 #define ext2_xip_verify_sb(sb) do { } while (0)
 #define ext2_use_xip(sb) 0
-#define ext2_clear_xip_target(inode, chain)	0
 #define ext2_get_xip_mem NULL
 #endif
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 45839e8..c04d371 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2490,11 +2490,17 @@ extern int generic_file_open(struct inode * inode, struct file * filp);
 extern int nonseekable_open(struct inode * inode, struct file * filp);
 
 #ifdef CONFIG_FS_XIP
+int dax_clear_blocks(struct inode *, sector_t block, long size);
 extern int xip_file_mmap(struct file * file, struct vm_area_struct * vma);
 extern int xip_truncate_page(struct address_space *mapping, loff_t from);
 ssize_t dax_do_io(int rw, struct kiocb *, struct inode *, struct iov_iter *,
 		loff_t, get_block_t, dio_iodone_t, int flags);
 #else
+static inline int
Re: [PATCH RFC v7 net-next 00/28] BPF syscall
On Tue, Aug 26, 2014 at 8:56 PM, Andy Lutomirski wrote:
> On Aug 26, 2014 7:29 PM, "Alexei Starovoitov" wrote:
>>
>> Hi Ingo, David,
>>
>> posting whole thing again as RFC to get feedback on syscall only.
>> If syscall bpf(int cmd, union bpf_attr *attr, unsigned int size) is ok,
>> I'll split them into small chunks as requested and will repost without RFC.
>
> IMO it's much easier to review a syscall if we just look at a
> specification of what it does.  The code is, in some sense, secondary.

'specification of what it does'... hmm, you mean beyond what's there in
commit logs and in Documentation/networking/filter.txt?
Aren't samples at the end give an idea on 'what it does'?
I'm happy to add 'specification', I just don't understand yet what it
suppose to talk about beyond what's already written.
I understand that the patches are missing explanation on 'why' the
syscall is being added, but I don't think it's what you're asking...
[PATCH v10 04/21] Allow page fault handlers to perform the COW
Currently COW of an XIP file is done by first bringing in a read-only mapping, then retrying the fault and copying the page. It is much more efficient to tell the fault handler that a COW is being attempted (by passing in the pre-allocated page in the vm_fault structure), and allow the handler to perform the COW operation itself. The handler cannot insert the page itself if there is already a read-only mapping at that address, so allow the handler to return VM_FAULT_LOCKED and set the fault_page to be NULL. This indicates to the MM code that the i_mmap_mutex is held instead of the page lock. Signed-off-by: Matthew Wilcox Acked-by: Kirill A. Shutemov --- include/linux/mm.h | 1 + mm/memory.c| 33 - 2 files changed, 25 insertions(+), 9 deletions(-) diff --git a/include/linux/mm.h b/include/linux/mm.h index 8981cc8..0a47817 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -208,6 +208,7 @@ struct vm_fault { pgoff_t pgoff; /* Logical page offset based on vma */ void __user *virtual_address; /* Faulting virtual address */ + struct page *cow_page; /* Handler may choose to COW */ struct page *page; /* ->fault handlers should return a * page here, unless VM_FAULT_NOPAGE * is set (which is also implied by diff --git a/mm/memory.c b/mm/memory.c index adeac30..3368785 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -2000,6 +2000,7 @@ static int do_page_mkwrite(struct vm_area_struct *vma, struct page *page, vmf.pgoff = page->index; vmf.flags = FAULT_FLAG_WRITE|FAULT_FLAG_MKWRITE; vmf.page = page; + vmf.cow_page = NULL; ret = vma->vm_ops->page_mkwrite(vma, ); if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE))) @@ -2698,7 +2699,8 @@ oom: * See filemap_fault() and __lock_page_retry(). 
 */
static int __do_fault(struct vm_area_struct *vma, unsigned long address,
-			pgoff_t pgoff, unsigned int flags, struct page **page)
+			pgoff_t pgoff, unsigned int flags,
+			struct page *cow_page, struct page **page)
 {
 	struct vm_fault vmf;
 	int ret;

@@ -2707,10 +2709,13 @@ static int __do_fault(struct vm_area_struct *vma, unsigned long address,
 	vmf.pgoff = pgoff;
 	vmf.flags = flags;
 	vmf.page = NULL;
+	vmf.cow_page = cow_page;

 	ret = vma->vm_ops->fault(vma, &vmf);
 	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
 		return ret;
+	if (!vmf.page)
+		goto out;

 	if (unlikely(PageHWPoison(vmf.page))) {
 		if (ret & VM_FAULT_LOCKED)
@@ -2724,6 +2729,7 @@ static int __do_fault(struct vm_area_struct *vma, unsigned long address,
 	else
 		VM_BUG_ON_PAGE(!PageLocked(vmf.page), vmf.page);

+ out:
 	*page = vmf.page;
 	return ret;
 }
@@ -2897,7 +2903,7 @@ static int do_read_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 		pte_unmap_unlock(pte, ptl);
 	}

-	ret = __do_fault(vma, address, pgoff, flags, &fault_page);
+	ret = __do_fault(vma, address, pgoff, flags, NULL, &fault_page);
 	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
 		return ret;

@@ -2937,26 +2943,35 @@ static int do_cow_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 		return VM_FAULT_OOM;
 	}

-	ret = __do_fault(vma, address, pgoff, flags, &fault_page);
+	ret = __do_fault(vma, address, pgoff, flags, new_page, &fault_page);
 	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
 		goto uncharge_out;

-	copy_user_highpage(new_page, fault_page, address, vma);
+	if (fault_page)
+		copy_user_highpage(new_page, fault_page, address, vma);
 	__SetPageUptodate(new_page);

 	pte = pte_offset_map_lock(mm, pmd, address, &ptl);
 	if (unlikely(!pte_same(*pte, orig_pte))) {
 		pte_unmap_unlock(pte, ptl);
-		unlock_page(fault_page);
-		page_cache_release(fault_page);
+		if (fault_page) {
+			unlock_page(fault_page);
+			page_cache_release(fault_page);
+		} else {
+			mutex_unlock(&vma->vm_file->f_mapping->i_mmap_mutex);
+		}
 		goto uncharge_out;
 	}
 	do_set_pte(vma, address, new_page, pte, true, true);
 	mem_cgroup_commit_charge(new_page, memcg, false);
 	lru_cache_add_active_or_unevictable(new_page, vma);
 	pte_unmap_unlock(pte, ptl);
-	unlock_page(fault_page);
-	page_cache_release(fault_page);
+	if (fault_page) {
+		unlock_page(fault_page);
+		page_cache_release(fault_page);
+	} else {
+
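The cow_page plumbing above can be illustrated outside the kernel. The sketch below is a userspace model (all names and the fixed page size are my own simplifications, not kernel API): a fault handler either returns a backing page for the caller to copy, or fills the pre-allocated COW page itself and returns no page, which is what a DAX-style handler with no struct page needs.

```c
#include <assert.h>
#include <string.h>

#define PAGE_SIZE_SIM 4096

struct vm_fault_sim {
	char *cow_page;		/* pre-allocated destination page */
	const char *page;	/* backing page, or NULL if the handler
				 * copied into cow_page itself */
};

/* A DAX-like fault handler: no struct page exists for the storage, so
 * it copies the data into cow_page directly and reports no page. */
static void dax_like_fault(struct vm_fault_sim *vmf, const char *backing)
{
	memcpy(vmf->cow_page, backing, PAGE_SIZE_SIM);
	vmf->page = NULL;
}

/* Caller-side logic mirroring the do_cow_fault() hunk: only copy when
 * the handler actually returned a page to copy from. */
static void finish_cow_fault(struct vm_fault_sim *vmf)
{
	if (vmf->page)
		memcpy(vmf->cow_page, vmf->page, PAGE_SIZE_SIM);
	/* else: handler already filled cow_page, nothing to do */
}
```

Either way the COW page ends up with the file contents; the kernel patch additionally handles the locking and refcounting this sketch omits.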
[RFC PATCH v2] tpm_tis: verify interrupt during init
On Mon, 25 Aug 2014, Jason Gunthorpe wrote:
> On Mon, Aug 25, 2014, Scot Doyle wrote:
>> 3. Custom SeaBIOS. Blacklist the tpm_tis module so that it doesn't load
>>    and therefore doesn't issue startup(clear) to the TPM chip.
>
> It seems to me at least in this case you should be able to get rid of
> the IRQ entry, people are going to be flashing the custom SeaBIOS
> anyhow.

The person building many of these custom SeaBIOS packages has removed
the TPM section from the DSDT, so this may be addressed.

On Mon, 25 Aug 2014, Jason Gunthorpe wrote:
> I think you'll have to directly test in the tis driver if the
> interrupt is working.
>
> The ordering in the TIS driver is wrong, interrupts should be turned
> on before any TPM commands are issued. This is what other drivers are
> doing.
>
> If you fix this, tis can then just count interrupts received and check
> if that is 0 to detect failure and then turn them off.

How about something like this? It doesn't enable stock SeaBIOS machines
to suspend/resume before the 30 second interrupt timeout, unless using
interrupts=0 or force=1.
---
diff --git a/drivers/char/tpm/tpm_tis.c b/drivers/char/tpm/tpm_tis.c
index 2c46734..ae701d8 100644
--- a/drivers/char/tpm/tpm_tis.c
+++ b/drivers/char/tpm/tpm_tis.c
@@ -493,6 +493,8 @@ static irqreturn_t tis_int_probe(int irq, void *dev_id)
 	return IRQ_HANDLED;
 }

+static bool interrupted = false;
+
 static irqreturn_t tis_int_handler(int dummy, void *dev_id)
 {
 	struct tpm_chip *chip = dev_id;
@@ -511,6 +513,8 @@ static irqreturn_t tis_int_handler(int dummy, void *dev_id)
 		for (i = 0; i < 5; i++)
 			if (check_locality(chip, i) >= 0)
 				break;
+	if (interrupt & TPM_INTF_CMD_READY_INT)
+		interrupted = true;
 	if (interrupt &
 	    (TPM_INTF_LOCALITY_CHANGE_INT | TPM_INTF_STS_VALID_INT |
 	     TPM_INTF_CMD_READY_INT))
@@ -612,12 +616,6 @@ static int tpm_tis_init(struct device *dev, resource_size_t start,
 		goto out_err;
 	}

-	if (tpm_do_selftest(chip)) {
-		dev_err(dev, "TPM self test failed\n");
-		rc = -ENODEV;
-		goto out_err;
-	}
-
 	/* INTERRUPT Setup */
 	init_waitqueue_head(&chip->vendor.read_queue);
 	init_waitqueue_head(&chip->vendor.int_queue);
@@ -693,7 +691,7 @@ static int tpm_tis_init(struct device *dev, resource_size_t start,
 			free_irq(i, chip);
 		}
 	}
-	if (chip->vendor.irq) {
+	if (interrupts && chip->vendor.irq) {
 		iowrite8(chip->vendor.irq,
 			 chip->vendor.iobase +
 			 TPM_INT_VECTOR(chip->vendor.locality));
@@ -719,6 +717,32 @@ static int tpm_tis_init(struct device *dev, resource_size_t start,
 		}
 	}

+	/* Test interrupt and/or prepare for later save state */
+	interrupted = false;
+	if (tpm_do_selftest(chip)) {
+		if (!interrupts || interrupted) {
+			dev_err(dev, "TPM self test failed\n");
+			rc = -ENODEV;
+			goto out_err;
+		} else {
+			/* Turn off interrupt */
+			iowrite32(intmask,
+				  chip->vendor.iobase +
+				  TPM_INT_ENABLE(chip->vendor.locality));
+			free_irq(chip->vendor.irq, chip);
+
+			/* Retry in polling mode */
+			chip->vendor.irq = 0;
+			if (tpm_do_selftest(chip)) {
+				dev_err(dev, "TPM self test failed\n");
+				rc = -ENODEV;
+				goto out_err;
+			} else {
+				dev_err(dev, "ACPI DSDT entry incorrect, polling instead\n");
+			}
+		}
+	}
+
 	INIT_LIST_HEAD(&chip->vendor.list);
 	mutex_lock(&tis_lock);
 	list_add(&chip->vendor.list, &tis_chips);

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
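The decision the patch makes at init time can be modeled in plain C. This is a userspace sketch of the control flow only (names and the enum are mine): a selftest failure is fatal if interrupts were disabled or at least one interrupt actually arrived; otherwise the likely culprit is a wrong DSDT IRQ entry, so the driver falls back to polling and retries.

```c
#include <assert.h>

enum tis_action { TIS_OK, TIS_FAIL, TIS_RETRY_POLLING };

/* Userspace model of the selftest outcome handling in the patch. */
static enum tis_action selftest_outcome(int selftest_failed,
					int interrupts_enabled,
					int interrupt_seen)
{
	if (!selftest_failed)
		return TIS_OK;
	if (!interrupts_enabled || interrupt_seen)
		return TIS_FAIL;		/* genuine selftest failure */
	return TIS_RETRY_POLLING;	/* irq never fired: suspect DSDT */
}
```

The retry path mirrors the patch: mask the interrupt, free the IRQ, clear chip->vendor.irq, and run the selftest again in polling mode.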
[PATCH v10 19/21] xip: Add xip_zero_page_range
This new function allows us to support hole-punch for XIP files by
zeroing a partial page, as opposed to the xip_truncate_page() function
which can only truncate to the end of the page.  Reimplement
xip_truncate_page() as a macro that calls xip_zero_page_range().

Signed-off-by: Matthew Wilcox
[ported to 3.13-rc2]
Signed-off-by: Ross Zwisler
---
 Documentation/filesystems/dax.txt |  1 +
 fs/dax.c                          | 20 ++--
 include/linux/fs.h                |  9 -
 3 files changed, 23 insertions(+), 7 deletions(-)

diff --git a/Documentation/filesystems/dax.txt b/Documentation/filesystems/dax.txt
index 635adaa..ebcd97f 100644
--- a/Documentation/filesystems/dax.txt
+++ b/Documentation/filesystems/dax.txt
@@ -62,6 +62,7 @@ Filesystem support consists of
   for fault and page_mkwrite (which should probably call dax_fault() and
   dax_mkwrite(), passing the appropriate get_block() callback)
 - calling dax_truncate_page() instead of block_truncate_page() for DAX files
+- calling dax_zero_page_range() instead of zero_user() for DAX files
 - ensuring that there is sufficient locking between reads, writes,
   truncates and page faults

diff --git a/fs/dax.c b/fs/dax.c
index d54f7d3..96c4fed 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -445,13 +445,16 @@ int dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 EXPORT_SYMBOL_GPL(dax_fault);

 /**
- * dax_truncate_page - handle a partial page being truncated in a DAX file
+ * dax_zero_page_range - zero a range within a page of a DAX file
  * @inode: The file being truncated
  * @from: The file offset that is being truncated to
+ * @length: The number of bytes to zero
  * @get_block: The filesystem method used to translate file offsets to blocks
  *
- * Similar to block_truncate_page(), this function can be called by a
- * filesystem when it is truncating an DAX file to handle the partial page.
+ * This function can be called by a filesystem when it is zeroing part of a
+ * page in a DAX file.  This is intended for hole-punch operations.  If
+ * you are truncating a file, the helper function dax_truncate_page() may be
+ * more convenient.
  *
  * We work in terms of PAGE_CACHE_SIZE here for commonality with
  * block_truncate_page(), but we could go down to PAGE_SIZE if the filesystem
@@ -459,12 +462,12 @@ EXPORT_SYMBOL_GPL(dax_fault);
  * block size is smaller than PAGE_SIZE, we have to zero the rest of the page
  * since the file might be mmaped.
  */
-int dax_truncate_page(struct inode *inode, loff_t from, get_block_t get_block)
+int dax_zero_page_range(struct inode *inode, loff_t from, unsigned length,
+			get_block_t get_block)
 {
 	struct buffer_head bh;
 	pgoff_t index = from >> PAGE_CACHE_SHIFT;
 	unsigned offset = from & (PAGE_CACHE_SIZE-1);
-	unsigned length = PAGE_CACHE_ALIGN(from) - from;
 	int err;

 	/* Block boundary? Nothing to do */
@@ -481,9 +484,14 @@ int dax_truncate_page(struct inode *inode, loff_t from, get_block_t get_block)
 		err = dax_get_addr(&bh, &addr, inode->i_blkbits);
 		if (err < 0)
 			return err;
+		/*
+		 * ext4 sometimes asks to zero past the end of a block.  It
+		 * really just wants to zero to the end of the block.
+		 */
+		length = min_t(unsigned, length, PAGE_CACHE_SIZE - offset);
 		memset(addr + offset, 0, length);
 	}

 	return 0;
 }
-EXPORT_SYMBOL_GPL(dax_truncate_page);
+EXPORT_SYMBOL_GPL(dax_zero_page_range);

diff --git a/include/linux/fs.h b/include/linux/fs.h
index e6b48cc..b0078df 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2490,6 +2490,7 @@ extern int nonseekable_open(struct inode * inode, struct file * filp);

 #ifdef CONFIG_FS_DAX
 int dax_clear_blocks(struct inode *, sector_t block, long size);
+int dax_zero_page_range(struct inode *, loff_t from, unsigned len, get_block_t);
 int dax_truncate_page(struct inode *, loff_t from, get_block_t);
 ssize_t dax_do_io(int rw, struct kiocb *, struct inode *, struct iov_iter *,
 		loff_t, get_block_t, dio_iodone_t, int flags);
@@ -2501,7 +2502,8 @@ static inline int dax_clear_blocks(struct inode *i, sector_t blk, long sz)
 	return 0;
 }

-static inline int dax_truncate_page(struct inode *i, loff_t frm, get_block_t gb)
+static inline int dax_zero_page_range(struct inode *inode, loff_t from,
+		unsigned len, get_block_t gb)
 {
 	return 0;
 }
@@ -2514,6 +2516,11 @@ static inline ssize_t dax_do_io(int rw, struct kiocb *iocb,
 }
 #endif

+/* Can't be a function because PAGE_CACHE_SIZE is defined in pagemap.h */
+#define dax_truncate_page(inode, from, get_block)	\
+	dax_zero_page_range(inode, from, PAGE_CACHE_SIZE, get_block)
+
+
 #ifdef CONFIG_BLOCK
 typedef void (dio_submit_t)(int rw, struct bio *bio, struct inode
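The partial-page arithmetic in dax_zero_page_range() can be checked in isolation. The sketch below is a userspace model (my own simplification, with a fixed 4096-byte page standing in for PAGE_CACHE_SIZE) of the offset mask and the ext4 length clamp described in the comment:

```c
#include <assert.h>

#define PAGE_CACHE_SIZE_SIM 4096u

/* Mirror of the offset/length computation in dax_zero_page_range():
 * 'from' is a file offset; 'length' is a requested zeroing length that
 * may run past the end of the page and must be clamped. */
static unsigned zero_extent(unsigned long long from, unsigned length,
			    unsigned *offset_out)
{
	unsigned offset = from & (PAGE_CACHE_SIZE_SIM - 1);
	unsigned max = PAGE_CACHE_SIZE_SIM - offset;

	*offset_out = offset;
	return length < max ? length : max;	/* the min_t() clamp */
}
```

With this in place, dax_truncate_page() really is just dax_zero_page_range() with length = PAGE_CACHE_SIZE: the clamp reduces it to "zero from the offset to the end of the page".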
[PATCH v10 05/21] Introduce IS_DAX(inode)
Use an inode flag to tag inodes which should avoid using the page cache.
Convert ext2 to use it instead of mapping_is_xip().

Signed-off-by: Matthew Wilcox
Reviewed-by: Jan Kara
---
 fs/ext2/inode.c    | 9 ++---
 fs/ext2/xip.h      | 2 --
 include/linux/fs.h | 6 ++
 3 files changed, 12 insertions(+), 5 deletions(-)

diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index 36d35c3..0cb0448 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -731,7 +731,7 @@ static int ext2_get_blocks(struct inode *inode,
 			goto cleanup;
 		}

-		if (ext2_use_xip(inode->i_sb)) {
+		if (IS_DAX(inode)) {
 			/*
 			 * we need to clear the block
 			 */
@@ -1201,7 +1201,7 @@ static int ext2_setsize(struct inode *inode, loff_t newsize)

 	inode_dio_wait(inode);

-	if (mapping_is_xip(inode->i_mapping))
+	if (IS_DAX(inode))
 		error = xip_truncate_page(inode->i_mapping, newsize);
 	else if (test_opt(inode->i_sb, NOBH))
 		error = nobh_truncate_page(inode->i_mapping,
@@ -1273,7 +1273,8 @@ void ext2_set_inode_flags(struct inode *inode)
 {
 	unsigned int flags = EXT2_I(inode)->i_flags;

-	inode->i_flags &= ~(S_SYNC|S_APPEND|S_IMMUTABLE|S_NOATIME|S_DIRSYNC);
+	inode->i_flags &= ~(S_SYNC | S_APPEND | S_IMMUTABLE | S_NOATIME |
+				S_DIRSYNC | S_DAX);
 	if (flags & EXT2_SYNC_FL)
 		inode->i_flags |= S_SYNC;
 	if (flags & EXT2_APPEND_FL)
@@ -1284,6 +1285,8 @@ void ext2_set_inode_flags(struct inode *inode)
 		inode->i_flags |= S_NOATIME;
 	if (flags & EXT2_DIRSYNC_FL)
 		inode->i_flags |= S_DIRSYNC;
+	if (test_opt(inode->i_sb, XIP))
+		inode->i_flags |= S_DAX;
 }

 /* Propagate flags from i_flags to EXT2_I(inode)->i_flags */
diff --git a/fs/ext2/xip.h b/fs/ext2/xip.h
index 18b34d2..29be737 100644
--- a/fs/ext2/xip.h
+++ b/fs/ext2/xip.h
@@ -16,9 +16,7 @@ static inline int ext2_use_xip (struct super_block *sb)
 }
 int ext2_get_xip_mem(struct address_space *, pgoff_t, int,
 				void **, unsigned long *);
-#define mapping_is_xip(map)	unlikely(map->a_ops->get_xip_mem)
 #else
-#define mapping_is_xip(map)			0
 #define ext2_xip_verify_sb(sb)			do { } while (0)
 #define ext2_use_xip(sb)			0
 #define ext2_clear_xip_target(inode, chain)	0
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 9418772..e99e5c4 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1605,6 +1605,7 @@ struct super_operations {
 #define S_IMA		1024	/* Inode has an associated IMA struct */
 #define S_AUTOMOUNT	2048	/* Automount/referral quasi-directory */
 #define S_NOSEC	4096	/* no suid or xattr security attributes */
+#define S_DAX		8192	/* Direct Access, avoiding the page cache */

 /*
  * Note that nosuid etc flags are inode-specific: setting some file-system
@@ -1642,6 +1643,11 @@ struct super_operations {
 #define IS_IMA(inode)		((inode)->i_flags & S_IMA)
 #define IS_AUTOMOUNT(inode)	((inode)->i_flags & S_AUTOMOUNT)
 #define IS_NOSEC(inode)	((inode)->i_flags & S_NOSEC)
+#ifdef CONFIG_FS_XIP
+#define IS_DAX(inode)		((inode)->i_flags & S_DAX)
+#else
+#define IS_DAX(inode)		0
+#endif

 /*
  * Inode state bits.  Protected by inode->i_lock
--
2.0.0
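The flag-propagation pattern in ext2_set_inode_flags() — clear every flag this function manages, then re-derive each one from the filesystem's own flag word — can be modeled in plain C. All constants and names below are illustrative stand-ins, not the kernel's:

```c
#include <assert.h>

/* Kernel-visible inode flags (illustrative values) */
#define S_SYNC_SIM	1u
#define S_NOATIME_SIM	2u
#define S_DAX_SIM	4u

/* On-disk ext2-style flags (illustrative values) */
#define EXT2_SYNC_FL_SIM	1u
#define EXT2_NOATIME_FL_SIM	2u

/* Userspace model of ext2_set_inode_flags(): note that clearing the
 * managed flags first makes the function idempotent and lets a flag be
 * dropped when the on-disk flag or mount option goes away. */
static unsigned set_inode_flags(unsigned i_flags, unsigned fs_flags,
				int mount_opt_xip)
{
	i_flags &= ~(S_SYNC_SIM | S_NOATIME_SIM | S_DAX_SIM);
	if (fs_flags & EXT2_SYNC_FL_SIM)
		i_flags |= S_SYNC_SIM;
	if (fs_flags & EXT2_NOATIME_FL_SIM)
		i_flags |= S_NOATIME_SIM;
	if (mount_opt_xip)	/* mount -o xip => mark the inode DAX */
		i_flags |= S_DAX_SIM;
	return i_flags;
}
```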
[PATCH v10 11/21] Replace XIP documentation with DAX documentation
From: Matthew Wilcox

Based on the original XIP documentation, this documents the current
state of affairs, and includes instructions on how users can enable DAX
if their devices and kernel support it.

Signed-off-by: Matthew Wilcox
Reviewed-by: Randy Dunlap
---
 Documentation/filesystems/dax.txt | 89 +++
 Documentation/filesystems/xip.txt | 71 ---
 2 files changed, 89 insertions(+), 71 deletions(-)
 create mode 100644 Documentation/filesystems/dax.txt
 delete mode 100644 Documentation/filesystems/xip.txt

diff --git a/Documentation/filesystems/dax.txt b/Documentation/filesystems/dax.txt
new file mode 100644
index 000..635adaa
--- /dev/null
+++ b/Documentation/filesystems/dax.txt
@@ -0,0 +1,89 @@
+Direct Access for files
+-----------------------
+
+Motivation
+----------
+
+The page cache is usually used to buffer reads and writes to files.
+It is also used to provide the pages which are mapped into userspace
+by a call to mmap.
+
+For block devices that are memory-like, the page cache pages would be
+unnecessary copies of the original storage.  The DAX code removes the
+extra copy by performing reads and writes directly to the storage device.
+For file mappings, the storage device is mapped directly into userspace.
+
+
+Usage
+-----
+
+If you have a block device which supports DAX, you can make a filesystem
+on it as usual.  When mounting it, use the -o dax option manually
+or add 'dax' to the options in /etc/fstab.
+
+
+Implementation Tips for Block Driver Writers
+--------------------------------------------
+
+To support DAX in your block driver, implement the 'direct_access'
+block device operation.  It is used to translate the sector number
+(expressed in units of 512-byte sectors) to a page frame number (pfn)
+that identifies the physical page for the memory.  It also returns a
+kernel virtual address that can be used to access the memory.
+
+The direct_access method takes a 'size' parameter that indicates the
+number of bytes being requested.  The function should return the number
+of bytes that can be contiguously accessed at that offset.  It may also
+return a negative errno if an error occurs.
+
+In order to support this method, the storage must be byte-accessible by
+the CPU at all times.  If your device uses paging techniques to expose
+a large amount of memory through a smaller window, then you cannot
+implement direct_access.  Equally, if your device can occasionally
+stall the CPU for an extended period, you should also not attempt to
+implement direct_access.
+
+These block devices may be used for inspiration:
+- axonram: Axon DDR2 device driver
+- brd: RAM backed block device driver
+- dcssblk: s390 dcss block device driver
+
+
+Implementation Tips for Filesystem Writers
+------------------------------------------
+
+Filesystem support consists of
+- adding support to mark inodes as being DAX by setting the S_DAX flag in
+  i_flags
+- implementing the direct_IO address space operation, and calling
+  dax_do_io() instead of blockdev_direct_IO() if S_DAX is set
+- implementing an mmap file operation for DAX files which sets the
+  VM_MIXEDMAP flag on the VMA, and setting the vm_ops to include handlers
+  for fault and page_mkwrite (which should probably call dax_fault() and
+  dax_mkwrite(), passing the appropriate get_block() callback)
+- calling dax_truncate_page() instead of block_truncate_page() for DAX files
+- ensuring that there is sufficient locking between reads, writes,
+  truncates and page faults
+
+The get_block() callback passed to the DAX functions may return
+uninitialised extents.  If it does, it must ensure that simultaneous
+calls to get_block() (for example by a page-fault racing with a read()
+or a write()) work correctly.
+
+These filesystems may be used for inspiration:
+- ext2: the second extended filesystem, see Documentation/filesystems/ext2.txt
+
+
+Shortcomings
+------------
+
+Even if the kernel or its modules are stored on a filesystem that supports
+DAX on a block device that supports DAX, they will still be copied into RAM.
+
+Calling get_user_pages() on a range of user memory that has been mmaped
+from a DAX file will fail as there are no 'struct page' to describe
+those pages.  This problem is being worked on.  That means that O_DIRECT
+reads/writes to those memory ranges from a non-DAX file will fail (note
+that O_DIRECT reads/writes _of a DAX file_ do work, it is the memory
+that is being accessed that is key here).  Other things that will not
+work include RDMA, sendfile() and splice().
diff --git a/Documentation/filesystems/xip.txt b/Documentation/filesystems/xip.txt
deleted file mode 100644
index b774729..000
--- a/Documentation/filesystems/xip.txt
+++ /dev/null
@@ -1,71 +0,0 @@
-Execute-in-place for file mappings
-----------------------------------
-
-Motivation
-----------
-File mappings are performed by mapping page cache pages to userspace. In
-addition, read type file
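The direct_access contract described above — sector in, contiguously accessible byte count out, plus a usable address and pfn — can be sketched for a brd-style RAM-backed device. This is a userspace model under my own assumptions (one contiguous buffer, a fake 4KB page shift, ERANGE spelled as -34), not the kernel's block-layer API:

```c
#include <stddef.h>
#include <stdint.h>

#define SECTOR_SHIFT	9	/* direct_access sectors are 512 bytes */
#define PAGE_SHIFT_SIM	12

/* Model of a brd-style direct_access: the whole device is one
 * contiguous buffer, so any in-range request is accessible all the way
 * to the end of the device.  Returns the contiguously accessible byte
 * count, or a negative errno-style value. */
static long direct_access_sim(char *base, size_t dev_bytes,
			      unsigned long long sector,
			      void **kaddr, unsigned long *pfn)
{
	unsigned long long off = sector << SECTOR_SHIFT;

	if (off >= dev_bytes)
		return -34;	/* past end of device (ERANGE-like) */
	*kaddr = base + off;
	*pfn = (unsigned long)((uintptr_t)(base + off) >> PAGE_SHIFT_SIM);
	return (long)(dev_bytes - off);	/* contiguous to end of device */
}
```

A windowed or paging device could not be written this way: there would be no single address valid at all times, which is exactly why the document rules such devices out.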
[PATCH v10 00/21] Support ext4 on NV-DIMMs
One of the primary uses for NV-DIMMs is to expose them as a block device
and use a filesystem to store files on the NV-DIMM.  While that works, it
currently wastes memory and CPU time buffering the files in the page
cache.  We have support in ext2 for bypassing the page cache, but it has
some races which are unfixable in the current design.  This series of
patches rewrites the underlying support, and adds support for direct
access to ext4.

Note that patch 6/21 has been included in
https://git.kernel.org/cgit/linux/kernel/git/viro/vfs.git/log/?h=for-next-candidate

This iteration of the patchset rebases to 3.17-rc2, changes the page
fault locking, fixes a couple of bugs and makes a few other minor
changes.

 - Move the calculation of the maximum size available at the requested
   location from the ->direct_access implementations to
   bdev_direct_access()
 - Fix a comment typo (Ross Zwisler)
 - Check that the requested length is positive in bdev_direct_access().
   If it is not, assume that it's an errno, and just return it.
 - Fix some whitespace issues flagged by checkpatch
 - Added the Acked-by responses from Kirill that I forgot in the last
   round
 - Added myself to MAINTAINERS for DAX
 - Fixed compilation with !CONFIG_DAX (Vishal Verma)
 - Revert the locking in the page fault handler back to an earlier
   version.  If we hit the race that we were trying to protect against,
   we will leave blocks allocated past the end of the file.  They will
   be removed on file removal, the next truncate, or fsck.
Matthew Wilcox (20):
  axonram: Fix bug in direct_access
  Change direct_access calling convention
  Fix XIP fault vs truncate race
  Allow page fault handlers to perform the COW
  Introduce IS_DAX(inode)
  Add copy_to_iter(), copy_from_iter() and iov_iter_zero()
  Replace XIP read and write with DAX I/O
  Replace ext2_clear_xip_target with dax_clear_blocks
  Replace the XIP page fault handler with the DAX page fault handler
  Replace xip_truncate_page with dax_truncate_page
  Replace XIP documentation with DAX documentation
  Remove get_xip_mem
  ext2: Remove ext2_xip_verify_sb()
  ext2: Remove ext2_use_xip
  ext2: Remove xip.c and xip.h
  Remove CONFIG_EXT2_FS_XIP and rename CONFIG_FS_XIP to CONFIG_FS_DAX
  ext2: Remove ext2_aops_xip
  Get rid of most mentions of XIP in ext2
  xip: Add xip_zero_page_range
  brd: Rename XIP to DAX

Ross Zwisler (1):
  ext4: Add DAX functionality

 Documentation/filesystems/Locking  |   3 -
 Documentation/filesystems/dax.txt  |  91 +++
 Documentation/filesystems/ext4.txt |   2 +
 Documentation/filesystems/xip.txt  |  68 -
 MAINTAINERS                        |   6 +
 arch/powerpc/sysdev/axonram.c      |  19 +-
 drivers/block/Kconfig              |  13 +-
 drivers/block/brd.c                |  26 +-
 drivers/s390/block/dcssblk.c       |  21 +-
 fs/Kconfig                         |  21 +-
 fs/Makefile                        |   1 +
 fs/block_dev.c                     |  40 +++
 fs/dax.c                           | 497 +
 fs/exofs/inode.c                   |   1 -
 fs/ext2/Kconfig                    |  11 -
 fs/ext2/Makefile                   |   1 -
 fs/ext2/ext2.h                     |  10 +-
 fs/ext2/file.c                     |  45 +++-
 fs/ext2/inode.c                    |  38 +--
 fs/ext2/namei.c                    |  13 +-
 fs/ext2/super.c                    |  53 ++--
 fs/ext2/xip.c                      |  91 ---
 fs/ext2/xip.h                      |  26 --
 fs/ext4/ext4.h                     |   6 +
 fs/ext4/file.c                     |  49 +++-
 fs/ext4/indirect.c                 |  18 +-
 fs/ext4/inode.c                    |  51 ++--
 fs/ext4/namei.c                    |  10 +-
 fs/ext4/super.c                    |  39 ++-
 fs/open.c                          |   5 +-
 include/linux/blkdev.h             |   6 +-
 include/linux/fs.h                 |  49 +++-
 include/linux/mm.h                 |   1 +
 include/linux/uio.h                |   3 +
 mm/Makefile                        |   1 -
 mm/fadvise.c                       |   6 +-
 mm/filemap.c                       |   6 +-
 mm/filemap_xip.c                   | 483 ---
 mm/iov_iter.c                      | 237 --
 mm/madvise.c                       |   2 +-
 mm/memory.c                        |  33 ++-
 41 files changed, 1229 insertions(+), 873 deletions(-)
 create mode 100644 Documentation/filesystems/dax.txt
 delete mode 100644 Documentation/filesystems/xip.txt
 create mode 100644 fs/dax.c
 delete mode 100644 fs/ext2/xip.c
 delete mode 100644 fs/ext2/xip.h
 delete mode 100644 mm/filemap_xip.c

--
2.0.0
[PATCH v10 06/21] Add copy_to_iter(), copy_from_iter() and iov_iter_zero()
From: Matthew Wilcox

For DAX, we want to be able to copy between iovecs and kernel addresses
that don't necessarily have a struct page.  This is a fairly simple
rearrangement for bvec iters to kmap the pages outside and pass them in,
but for user iovecs it gets more complicated because we might try various
different ways to kmap the memory.  Duplicating the existing logic works
out best in this case.

We need to be able to write zeroes to an iovec for reads from unwritten
ranges in a file.  This is performed by the new iov_iter_zero() function,
again patterned after the existing code that handles iovec iterators.

Signed-off-by: Matthew Wilcox
---
 include/linux/uio.h |   3 +
 mm/iov_iter.c       | 237 
 2 files changed, 226 insertions(+), 14 deletions(-)

diff --git a/include/linux/uio.h b/include/linux/uio.h
index 48d64e6..1863ddd 100644
--- a/include/linux/uio.h
+++ b/include/linux/uio.h
@@ -80,6 +80,9 @@ size_t copy_page_to_iter(struct page *page, size_t offset, size_t bytes,
 			 struct iov_iter *i);
 size_t copy_page_from_iter(struct page *page, size_t offset, size_t bytes,
 			 struct iov_iter *i);
+size_t copy_to_iter(void *addr, size_t bytes, struct iov_iter *i);
+size_t copy_from_iter(void *addr, size_t bytes, struct iov_iter *i);
+size_t iov_iter_zero(size_t bytes, struct iov_iter *);
 unsigned long iov_iter_alignment(const struct iov_iter *i);
 void iov_iter_init(struct iov_iter *i, int direction, const struct iovec *iov,
 			unsigned long nr_segs, size_t count);
diff --git a/mm/iov_iter.c b/mm/iov_iter.c
index ab88dc0..d481fd8 100644
--- a/mm/iov_iter.c
+++ b/mm/iov_iter.c
@@ -4,6 +4,96 @@
 #include
 #include

+static size_t copy_to_iter_iovec(void *from, size_t bytes, struct iov_iter *i)
+{
+	size_t skip, copy, left, wanted;
+	const struct iovec *iov;
+	char __user *buf;
+
+	if (unlikely(bytes > i->count))
+		bytes = i->count;
+
+	if (unlikely(!bytes))
+		return 0;
+
+	wanted = bytes;
+	iov = i->iov;
+	skip = i->iov_offset;
+	buf = iov->iov_base + skip;
+	copy = min(bytes, iov->iov_len - skip);
+
+	left = __copy_to_user(buf, from, copy);
+	copy -= left;
+	skip += copy;
+	from += copy;
+	bytes -= copy;
+	while (unlikely(!left && bytes)) {
+		iov++;
+		buf = iov->iov_base;
+		copy = min(bytes, iov->iov_len);
+		left = __copy_to_user(buf, from, copy);
+		copy -= left;
+		skip = copy;
+		from += copy;
+		bytes -= copy;
+	}
+
+	if (skip == iov->iov_len) {
+		iov++;
+		skip = 0;
+	}
+	i->count -= wanted - bytes;
+	i->nr_segs -= iov - i->iov;
+	i->iov = iov;
+	i->iov_offset = skip;
+	return wanted - bytes;
+}
+
+static size_t copy_from_iter_iovec(void *to, size_t bytes, struct iov_iter *i)
+{
+	size_t skip, copy, left, wanted;
+	const struct iovec *iov;
+	char __user *buf;
+
+	if (unlikely(bytes > i->count))
+		bytes = i->count;
+
+	if (unlikely(!bytes))
+		return 0;
+
+	wanted = bytes;
+	iov = i->iov;
+	skip = i->iov_offset;
+	buf = iov->iov_base + skip;
+	copy = min(bytes, iov->iov_len - skip);
+
+	left = __copy_from_user(to, buf, copy);
+	copy -= left;
+	skip += copy;
+	to += copy;
+	bytes -= copy;
+	while (unlikely(!left && bytes)) {
+		iov++;
+		buf = iov->iov_base;
+		copy = min(bytes, iov->iov_len);
+		left = __copy_from_user(to, buf, copy);
+		copy -= left;
+		skip = copy;
+		to += copy;
+		bytes -= copy;
+	}
+
+	if (skip == iov->iov_len) {
+		iov++;
+		skip = 0;
+	}
+	i->count -= wanted - bytes;
+	i->nr_segs -= iov - i->iov;
+	i->iov = iov;
+	i->iov_offset = skip;
+	return wanted - bytes;
+}
+
 static size_t copy_page_to_iter_iovec(struct page *page, size_t offset, size_t bytes,
 			 struct iov_iter *i)
 {
@@ -166,6 +256,50 @@ done:
 	return wanted - bytes;
 }

+static size_t zero_iovec(size_t bytes, struct iov_iter *i)
+{
+	size_t skip, copy, left, wanted;
+	const struct iovec *iov;
+	char __user *buf;
+
+	if (unlikely(bytes > i->count))
+		bytes = i->count;
+
+	if (unlikely(!bytes))
+		return 0;
+
+	wanted = bytes;
+	iov = i->iov;
+	skip = i->iov_offset;
+	buf = iov->iov_base + skip;
+	copy = min(bytes, iov->iov_len - skip);
+
+	left = __clear_user(buf, copy);
+	copy -= left;
+	skip += copy;
+	bytes -= copy;
+
+	while (unlikely(!left &&
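The skip/copy/left walk that these functions all repeat can be modeled in userspace. In the sketch below (my own simplification), memcpy stands in for __copy_to_user(), which returns the number of bytes it could *not* copy — so 'left' is always zero here and the walk simply runs until the data or the segments are exhausted:

```c
#include <assert.h>
#include <string.h>
#include <sys/uio.h>	/* struct iovec */

/* Userspace model of copy_to_iter_iovec(): walk an iovec array,
 * copying from a flat source buffer, honouring an initial offset into
 * the first segment.  Returns the number of bytes actually copied. */
static size_t copy_to_iovec_sim(const char *from, size_t bytes,
				const struct iovec *iov, unsigned long nr_segs,
				size_t iov_offset)
{
	size_t wanted = bytes;
	size_t skip = iov_offset;

	while (bytes && nr_segs) {
		size_t avail = iov->iov_len - skip;
		size_t copy = bytes < avail ? bytes : avail;

		memcpy((char *)iov->iov_base + skip, from, copy);
		from += copy;
		bytes -= copy;
		skip += copy;
		if (skip == iov->iov_len) {	/* segment exhausted */
			iov++;
			nr_segs--;
			skip = 0;
		}
	}
	return wanted - bytes;
}
```

The kernel versions additionally stop early when __copy_to_user() faults (left != 0) and write the final iov/skip position back into the iov_iter so the next call resumes where this one stopped.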
[PATCH v10 13/21] ext2: Remove ext2_xip_verify_sb()
Jan Kara pointed out that calling ext2_xip_verify_sb() in ext2_remount()
doesn't make sense, since changing the XIP option on remount isn't
allowed.  It also doesn't make sense to re-check whether the blocksize
is supported, since it can't change between mounts.  Replace the call to
ext2_xip_verify_sb() in ext2_fill_super() with the equivalent check and
delete the definition.

Signed-off-by: Matthew Wilcox
---
 fs/ext2/super.c | 33 -
 fs/ext2/xip.c   | 12 
 fs/ext2/xip.h   |  2 --
 3 files changed, 12 insertions(+), 35 deletions(-)

diff --git a/fs/ext2/super.c b/fs/ext2/super.c
index b88edc0..d862031 100644
--- a/fs/ext2/super.c
+++ b/fs/ext2/super.c
@@ -868,9 +868,6 @@ static int ext2_fill_super(struct super_block *sb, void *data, int silent)
 		((EXT2_SB(sb)->s_mount_opt & EXT2_MOUNT_POSIX_ACL) ?
 		 MS_POSIXACL : 0);

-	ext2_xip_verify_sb(sb); /* see if bdev supports xip, unset
-				    EXT2_MOUNT_XIP if not */
-
 	if (le32_to_cpu(es->s_rev_level) == EXT2_GOOD_OLD_REV &&
 	    (EXT2_HAS_COMPAT_FEATURE(sb, ~0U) ||
 	     EXT2_HAS_RO_COMPAT_FEATURE(sb, ~0U) ||
@@ -900,11 +897,17 @@ static int ext2_fill_super(struct super_block *sb, void *data, int silent)

 	blocksize = BLOCK_SIZE << le32_to_cpu(sbi->s_es->s_log_block_size);

-	if (ext2_use_xip(sb) && blocksize != PAGE_SIZE) {
-		if (!silent)
+	if (sbi->s_mount_opt & EXT2_MOUNT_XIP) {
+		if (blocksize != PAGE_SIZE) {
 			ext2_msg(sb, KERN_ERR,
-				"error: unsupported blocksize for xip");
-		goto failed_mount;
+					"error: unsupported blocksize for xip");
+			goto failed_mount;
+		}
+		if (!sb->s_bdev->bd_disk->fops->direct_access) {
+			ext2_msg(sb, KERN_ERR,
+					"error: device does not support xip");
+			goto failed_mount;
+		}
 	}

 	/* If the blocksize doesn't match, re-read the thing.. */
@@ -1249,7 +1252,6 @@ static int ext2_remount (struct super_block * sb, int * flags, char * data)
 {
 	struct ext2_sb_info * sbi = EXT2_SB(sb);
 	struct ext2_super_block * es;
-	unsigned long old_mount_opt = sbi->s_mount_opt;
 	struct ext2_mount_options old_opts;
 	unsigned long old_sb_flags;
 	int err;
@@ -1274,22 +1276,11 @@ static int ext2_remount (struct super_block * sb, int * flags, char * data)
 	sb->s_flags = (sb->s_flags & ~MS_POSIXACL) |
 		((sbi->s_mount_opt & EXT2_MOUNT_POSIX_ACL) ? MS_POSIXACL : 0);

-	ext2_xip_verify_sb(sb); /* see if bdev supports xip, unset
-				    EXT2_MOUNT_XIP if not */
-
-	if ((ext2_use_xip(sb)) && (sb->s_blocksize != PAGE_SIZE)) {
-		ext2_msg(sb, KERN_WARNING,
-			"warning: unsupported blocksize for xip");
-		err = -EINVAL;
-		goto restore_opts;
-	}
-
 	es = sbi->s_es;
-	if ((sbi->s_mount_opt ^ old_mount_opt) & EXT2_MOUNT_XIP) {
+	if ((sbi->s_mount_opt ^ old_opts.s_mount_opt) & EXT2_MOUNT_XIP) {
 		ext2_msg(sb, KERN_WARNING, "warning: refusing change of "
 			 "xip flag with busy inodes while remounting");
-		sbi->s_mount_opt &= ~EXT2_MOUNT_XIP;
-		sbi->s_mount_opt |= old_mount_opt & EXT2_MOUNT_XIP;
+		sbi->s_mount_opt ^= EXT2_MOUNT_XIP;
 	}
 	if ((*flags & MS_RDONLY) == (sb->s_flags & MS_RDONLY)) {
 		spin_unlock(&sbi->s_lock);
diff --git a/fs/ext2/xip.c b/fs/ext2/xip.c
index 132d4da..66ca113 100644
--- a/fs/ext2/xip.c
+++ b/fs/ext2/xip.c
@@ -13,15 +13,3 @@
 #include "ext2.h"
 #include "xip.h"

-void ext2_xip_verify_sb(struct super_block *sb)
-{
-	struct ext2_sb_info *sbi = EXT2_SB(sb);
-
-	if ((sbi->s_mount_opt & EXT2_MOUNT_XIP) &&
-	    !sb->s_bdev->bd_disk->fops->direct_access) {
-		sbi->s_mount_opt &= (~EXT2_MOUNT_XIP);
-		ext2_msg(sb, KERN_WARNING,
-			 "warning: ignoring xip option - "
-			 "not supported by bdev");
-	}
-}
diff --git a/fs/ext2/xip.h b/fs/ext2/xip.h
index e7b9f0a..87eeb04 100644
--- a/fs/ext2/xip.h
+++ b/fs/ext2/xip.h
@@ -6,13 +6,11 @@
  */

 #ifdef CONFIG_EXT2_FS_XIP
-extern void ext2_xip_verify_sb (struct super_block *);
 static inline int ext2_use_xip (struct super_block *sb)
 {
 	struct ext2_sb_info *sbi = EXT2_SB(sb);
 	return (sbi->s_mount_opt & EXT2_MOUNT_XIP);
 }
 #else
-#define ext2_xip_verify_sb(sb)			do { } while (0)
 #define ext2_use_xip(sb)			0
 #endif
--
2.0.0
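The remount hunk reverts a disallowed XIP flag change with a single XOR: if the new and old mount options differ in the XIP bit, flipping that bit in the new options restores the old value. A small userspace model (constant value is illustrative):

```c
#include <assert.h>

#define EXT2_MOUNT_XIP_SIM 0x10u

/* Model of the remount fixup: refuse an XIP flag change by toggling
 * the bit back, leaving every other mount option untouched. */
static unsigned revert_xip_change(unsigned new_opt, unsigned old_opt)
{
	if ((new_opt ^ old_opt) & EXT2_MOUNT_XIP_SIM)
		new_opt ^= EXT2_MOUNT_XIP_SIM;
	return new_opt;
}
```

This replaces the original two-statement clear-then-restore sequence with one operation and, as the patch shows, lets old_mount_opt disappear in favour of the already-saved old_opts.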
[PATCH V2] regulator: DA9211 : support device tree
This is a patch for supporting device tree of DA9211/DA9213.

Signed-off-by: James Ban
---
This patch is relative to linux-next repository tag next-20140826.

Changes in V2:
- defined what the valid regulators for the device are and where their
  configuration should be specified in the device tree.

 .../devicetree/bindings/regulator/da9211.txt | 63 +++
 drivers/regulator/da9211-regulator.c         | 85 ++--
 include/linux/regulator/da9211.h             |  2 +-
 3 files changed, 142 insertions(+), 8 deletions(-)
 create mode 100644 Documentation/devicetree/bindings/regulator/da9211.txt

diff --git a/Documentation/devicetree/bindings/regulator/da9211.txt b/Documentation/devicetree/bindings/regulator/da9211.txt
new file mode 100644
index 000..240019a
--- /dev/null
+++ b/Documentation/devicetree/bindings/regulator/da9211.txt
@@ -0,0 +1,63 @@
+* Dialog Semiconductor DA9211/DA9213 Voltage Regulator
+
+Required properties:
+- compatible: "dlg,da9211" or "dlg,da9213".
+- reg: I2C slave address, usually 0x68.
+- interrupts: the interrupt outputs of the controller
+- regulators: A node that houses a sub-node for each regulator within the
+  device. Each sub-node is identified using the node's name, with valid
+  values listed below. The content of each sub-node is defined by the
+  standard binding for regulators; see regulator.txt.
+  BUCKA and BUCKB.
+
+Optional properties:
+- Any optional property defined in regulator.txt
+
+Example 1) DA9211
+
+	pmic: da9211@68 {
+		compatible = "dlg,da9211";
+		reg = <0x68>;
+		interrupts = <3 27>;
+
+		regulators {
+			BUCKA {
+				regulator-name = "VBUCKA";
+				regulator-min-microvolt = < 300000>;
+				regulator-max-microvolt = <1570000>;
+				regulator-min-microamp	= <2000000>;
+				regulator-max-microamp	= <5000000>;
+			};
+			BUCKB {
+				regulator-name = "VBUCKB";
+				regulator-min-microvolt = < 300000>;
+				regulator-max-microvolt = <1570000>;
+				regulator-min-microamp	= <2000000>;
+				regulator-max-microamp	= <5000000>;
+			};
+		};
+	};
+
+Example 2) DA9213
+
+	pmic: da9213@68 {
+		compatible = "dlg,da9213";
+		reg = <0x68>;
+		interrupts = <3 27>;
+
+		regulators {
+			BUCKA {
+				regulator-name = "VBUCKA";
+				regulator-min-microvolt = < 300000>;
+				regulator-max-microvolt = <1570000>;
+				regulator-min-microamp	= <3000000>;
+				regulator-max-microamp	= <6000000>;
+			};
+			BUCKB {
+				regulator-name = "VBUCKB";
+				regulator-min-microvolt = < 300000>;
+				regulator-max-microvolt = <1570000>;
+				regulator-min-microamp	= <3000000>;
+				regulator-max-microamp	= <6000000>;
+			};
+		};
+	};
diff --git a/drivers/regulator/da9211-regulator.c b/drivers/regulator/da9211-regulator.c
index a26f1d2..5aabbac 100644
--- a/drivers/regulator/da9211-regulator.c
+++ b/drivers/regulator/da9211-regulator.c
@@ -24,6 +24,7 @@
 #include
 #include
 #include
+#include
 #include
 #include "da9211-regulator.h"
@@ -236,6 +237,59 @@ static struct regulator_desc da9211_regulators[] = {
 	DA9211_BUCK(BUCKB),
 };

+#ifdef CONFIG_OF
+static struct of_regulator_match da9211_matches[] = {
+	[DA9211_ID_BUCKA] = { .name = "BUCKA" },
+	[DA9211_ID_BUCKB] = { .name = "BUCKB" },
+	};
+
+static struct da9211_pdata *da9211_parse_regulators_dt(
+		struct device *dev)
+{
+	struct da9211_pdata *pdata;
+	struct device_node *node;
+	int i, num, n;
+
+	node = of_get_child_by_name(dev->of_node, "regulators");
+	if (!node) {
+		dev_err(dev, "regulators node not found\n");
+		return ERR_PTR(-ENODEV);
+	}
+
+	num = of_regulator_match(dev, node, da9211_matches,
+				 ARRAY_SIZE(da9211_matches));
+	of_node_put(node);
+	if (num < 0) {
+		dev_err(dev, "Failed to match reg
[PATCH V2 3/6] arm64: LLVMLinux: Calculate current_thread_info from current_stack_pointer
From: Behan Webster Use the global current_stack_pointer to get the value of the stack pointer. This change supports being able to compile the kernel with both gcc and clang. Signed-off-by: Behan Webster Signed-off-by: Mark Charlebois Reviewed-by: Jan-Simon Möller Reviewed-by: Olof Johansson Acked-by: Will Deacon --- arch/arm64/include/asm/thread_info.h | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/arch/arm64/include/asm/thread_info.h b/arch/arm64/include/asm/thread_info.h index 356e037..459bf8e 100644 --- a/arch/arm64/include/asm/thread_info.h +++ b/arch/arm64/include/asm/thread_info.h @@ -80,8 +80,8 @@ static inline struct thread_info *current_thread_info(void) __attribute_const__; static inline struct thread_info *current_thread_info(void) { - register unsigned long sp asm ("sp"); - return (struct thread_info *)(sp & ~(THREAD_SIZE - 1)); + return (struct thread_info *) + (current_stack_pointer & ~(THREAD_SIZE - 1)); } #define thread_saved_pc(tsk) \ -- 1.9.1 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH V2 2/6] arm64: LLVMLinux: Use current_stack_pointer in save_stack_trace_tsk
From: Behan Webster Use the global current_stack_pointer to get the value of the stack pointer. This change supports being able to compile the kernel with both gcc and clang. Signed-off-by: Behan Webster Signed-off-by: Mark Charlebois Reviewed-by: Jan-Simon Möller Reviewed-by: Olof Johansson Acked-by: Will Deacon --- arch/arm64/kernel/stacktrace.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/arch/arm64/kernel/stacktrace.c b/arch/arm64/kernel/stacktrace.c index 55437ba..407991b 100644 --- a/arch/arm64/kernel/stacktrace.c +++ b/arch/arm64/kernel/stacktrace.c @@ -111,10 +111,9 @@ void save_stack_trace_tsk(struct task_struct *tsk, struct stack_trace *trace) frame.sp = thread_saved_sp(tsk); frame.pc = thread_saved_pc(tsk); } else { - register unsigned long current_sp asm("sp"); data.no_sched_functions = 0; frame.fp = (unsigned long)__builtin_frame_address(0); - frame.sp = current_sp; + frame.sp = current_stack_pointer; frame.pc = (unsigned long)save_stack_trace_tsk; } -- 1.9.1 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH V2 5/6] arm64: LLVMLinux: Use global stack register variable for aarch64
From: Mark Charlebois To support both Clang and GCC, use the global stack register variable vs a local register variable. Author: Mark Charlebois Signed-off-by: Mark Charlebois Signed-off-by: Behan Webster --- arch/arm64/include/asm/percpu.h | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/arch/arm64/include/asm/percpu.h b/arch/arm64/include/asm/percpu.h index 453a179..5279e57 100644 --- a/arch/arm64/include/asm/percpu.h +++ b/arch/arm64/include/asm/percpu.h @@ -26,13 +26,13 @@ static inline void set_my_cpu_offset(unsigned long off) static inline unsigned long __my_cpu_offset(void) { unsigned long off; - register unsigned long *sp asm ("sp"); /* * We want to allow caching the value, so avoid using volatile and * instead use a fake stack read to hazard against barrier(). */ - asm("mrs %0, tpidr_el1" : "=r" (off) : "Q" (*sp)); + asm("mrs %0, tpidr_el1" : "=r" (off) : + "Q" (*(const unsigned long *)current_stack_pointer)); return off; } -- 1.9.1 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH V2 4/6] arm64: LLVMLinux: Use current_stack_pointer in kernel/traps.c
From: Behan Webster Use the global current_stack_pointer to get the value of the stack pointer. This change supports being able to compile the kernel with both gcc and clang. Signed-off-by: Behan Webster Signed-off-by: Mark Charlebois Reviewed-by: Olof Johansson Acked-by: Will Deacon --- arch/arm64/kernel/traps.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/arch/arm64/kernel/traps.c b/arch/arm64/kernel/traps.c index 02cd3f0..de1b085 100644 --- a/arch/arm64/kernel/traps.c +++ b/arch/arm64/kernel/traps.c @@ -132,7 +132,6 @@ static void dump_instr(const char *lvl, struct pt_regs *regs) static void dump_backtrace(struct pt_regs *regs, struct task_struct *tsk) { struct stackframe frame; - const register unsigned long current_sp asm ("sp"); pr_debug("%s(regs = %p tsk = %p)\n", __func__, regs, tsk); @@ -145,7 +144,7 @@ static void dump_backtrace(struct pt_regs *regs, struct task_struct *tsk) frame.pc = regs->pc; } else if (tsk == current) { frame.fp = (unsigned long)__builtin_frame_address(0); - frame.sp = current_sp; + frame.sp = current_stack_pointer; frame.pc = (unsigned long)dump_backtrace; } else { /* -- 1.9.1 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH V2 6/6] arm64: LLVMLinux: Use global stack pointer in return_address()
From: Behan Webster The global register current_stack_pointer holds the current stack pointer. This change supports being able to compile the kernel with both gcc and clang. Author: Mark Charlebois Signed-off-by: Mark Charlebois Signed-off-by: Behan Webster --- arch/arm64/kernel/return_address.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/arch/arm64/kernel/return_address.c b/arch/arm64/kernel/return_address.c index 89102a6..6c4fd28 100644 --- a/arch/arm64/kernel/return_address.c +++ b/arch/arm64/kernel/return_address.c @@ -36,13 +36,12 @@ void *return_address(unsigned int level) { struct return_address_data data; struct stackframe frame; - register unsigned long current_sp asm ("sp"); data.level = level + 2; data.addr = NULL; frame.fp = (unsigned long)__builtin_frame_address(0); - frame.sp = current_sp; + frame.sp = current_stack_pointer; frame.pc = (unsigned long)return_address; /* dummy */ walk_stackframe(&frame, save_return_addr, &data); -- 1.9.1 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH V2 0/6] LLVMLinux: Patches to enable the kernel to be compiled with clang/LLVM
From: Behan Webster This patch set moves from using locally defined named registers to access the stack pointer to using a globally defined named register. This allows the code to work both with gcc and clang. The LLVMLinux project aims to fully build the Linux kernel using both gcc and clang (the C front end for the LLVM compiler infrastructure project). Behan Webster (5): arm64: LLVMLinux: Add current_stack_pointer() for arm64 arm64: LLVMLinux: Use current_stack_pointer in save_stack_trace_tsk arm64: LLVMLinux: Calculate current_thread_info from current_stack_pointer arm64: LLVMLinux: Use current_stack_pointer in kernel/traps.c arm64: LLVMLinux: Use global stack pointer in return_address() Mark Charlebois (1): arm64: LLVMLinux: Use global stack register variable for aarch64 arch/arm64/include/asm/percpu.h | 4 ++-- arch/arm64/include/asm/thread_info.h | 9 +++-- arch/arm64/kernel/return_address.c | 3 +-- arch/arm64/kernel/stacktrace.c | 3 +-- arch/arm64/kernel/traps.c| 3 +-- 5 files changed, 12 insertions(+), 10 deletions(-) -- 1.9.1 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH V2 1/6] arm64: LLVMLinux: Add current_stack_pointer() for arm64
From: Behan Webster Define a global named register for current_stack_pointer. The use of this new variable guarantees that both gcc and clang can access this register in C code. Signed-off-by: Behan Webster Reviewed-by: Jan-Simon Möller Reviewed-by: Mark Charlebois Reviewed-by: Olof Johansson Acked-by: Will Deacon --- arch/arm64/include/asm/thread_info.h | 5 + 1 file changed, 5 insertions(+) diff --git a/arch/arm64/include/asm/thread_info.h b/arch/arm64/include/asm/thread_info.h index 45108d8..356e037 100644 --- a/arch/arm64/include/asm/thread_info.h +++ b/arch/arm64/include/asm/thread_info.h @@ -69,6 +69,11 @@ struct thread_info { #define init_stack (init_thread_union.stack) /* + * how to get the current stack pointer from C + */ +register unsigned long current_stack_pointer asm ("sp"); + +/* * how to get the thread information struct from C */ static inline struct thread_info *current_thread_info(void) __attribute_const__; -- 1.9.1 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 0/4] LLVMLinux: Patches to enable the kernel to be compiled with clang/LLVM
On 08/26/14 07:16, Will Deacon wrote: Hi Behan, On Fri, Aug 01, 2014 at 05:11:59AM +0100, Behan Webster wrote: On 07/31/14 03:33, Will Deacon wrote: On Thu, Jul 31, 2014 at 12:57:25AM +0100, beh...@converseincode.com wrote: From: Behan Webster This patch set moves from using locally defined named registers to access the stack pointer to using a globally defined named register. This allows the code to work both with gcc and clang. The LLVMLinux project aims to fully build the Linux kernel using both gcc and clang (the C front end for the LLVM compiler infrastructure project). Behan Webster (4): arm64: LLVMLinux: Add current_stack_pointer() for arm64 arm64: LLVMLinux: Use current_stack_pointer in save_stack_trace_tsk arm64: LLVMLinux: Calculate current_thread_info from current_stack_pointer arm64: LLVMLinux: Use current_stack_pointer in kernel/traps.c Once Andreas's comments have been addressed: Acked-by: Will Deacon Please can you send a new series after the merge window? Pity. I was hoping to get it in this merge window. However, will resubmit for 3.18. Any chance of a v2 for this series, please? If you address the comments pending for v1, I think it's good to merge. Sure thing. 2 more named register patches added. Look for them at the end of the new patch series. I kept missing you in Chicago. I was hoping to say "hi". Behan -- Behan Webster beh...@converseincode.com -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
linux-next: build failure after merge of the percpu tree
Hi all, After merging the percpu tree, today's linux-next build (powerpc ppc64_defconfig) failed like this: In file included from arch/powerpc/include/asm/xics.h:9:0, from arch/powerpc/kernel/asm-offsets.c:47: include/linux/interrupt.h:372:0: warning: "set_softirq_pending" redefined #define set_softirq_pending(x) (local_softirq_pending() = (x)) ^ In file included from include/linux/hardirq.h:8:0, from include/linux/memcontrol.h:24, from include/linux/swap.h:8, from include/linux/suspend.h:4, from arch/powerpc/kernel/asm-offsets.c:24: arch/powerpc/include/asm/hardirq.h:25:0: note: this is the location of the previous definition #define set_softirq_pending(x) __this_cpu_write(irq_stat._softirq_pending, (x)) ^ I got lots (and lots :-() of these and some were considered errors (powerpc is built with -Werror in arch/powerpc). Caused by commit 5828f666c069 ("powerpc: Replace __get_cpu_var uses"). I have used the percpu tree from next-20140826 for today. -- Cheers, Stephen Rothwell s...@canb.auug.org.au
[PATCH 1/1] ice1712: Replacing hex with #defines
Adds to the readability of the ice1712 driver. Signed-off-by: Konstantinos Tsimpoukas --- sound/pci/ice1712/ice1712.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/sound/pci/ice1712/ice1712.c b/sound/pci/ice1712/ice1712.c index 87f7fc4..206ed2c 100644 --- a/sound/pci/ice1712/ice1712.c +++ b/sound/pci/ice1712/ice1712.c @@ -2528,7 +2528,7 @@ static int snd_ice1712_free(struct snd_ice1712 *ice) if (!ice->port) goto __hw_end; /* mask all interrupts */ - outb(0xc0, ICEMT(ice, IRQ)); + outb(ICE1712_MULTI_CAPTURE | ICE1712_MULTI_PLAYBACK, ICEMT(ice, IRQ)); outb(0xff, ICEREG(ice, IRQMASK)); /* --- */ __hw_end: -- 1.9.1 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCHv2 00/14] arm64: eBPF JIT compiler
Hi Will, Catalin, This is a respin of series implementing eBPF JIT compiler for arm64, on top of 3.17-rc2. The v1 series [1] missed the previous merge window. Patches [1-13/14] implement code generation functions. Unchanged from v1. Patch [14/14] implements the actual eBPF JIT compiler. Updated from v1: straightforward fixups due to changes from net. Please see [14/14] for detailed change log. This series is applies cleanly against 3.17-rc2 and is tested working with lib/test_bpf on ARMv8 Foundation Model. Will had previously reported that v1 series works on Juno platform. Since v2 only involves straightforward renaming in [14/14], I don't anticipate any regressions. Thanks, z [1] https://lkml.org/lkml/2014/7/18/683 The following changes since commit 52addcf9d6669fa439387610bc65c92fa0980cef: Linux 3.17-rc2 (2014-08-25 15:36:20 -0700) are available in the git repository at: https://github.com/zlim/linux.git tags/arm64/bpf-v2 for you to fetch changes up to 2f4a4b8df4ba1cbd24957fb1a8371d30b1976174: arm64: eBPF JIT compiler (2014-08-26 19:04:43 -0700) Documentation/networking/filter.txt | 6 +- arch/arm64/Kconfig | 1 + arch/arm64/Makefile | 1 + arch/arm64/include/asm/insn.h | 249 + arch/arm64/kernel/insn.c| 646 +- arch/arm64/net/Makefile | 4 + arch/arm64/net/bpf_jit.h| 169 + arch/arm64/net/bpf_jit_comp.c | 677 8 files changed, 1743 insertions(+), 10 deletions(-) create mode 100644 arch/arm64/net/Makefile create mode 100644 arch/arm64/net/bpf_jit.h create mode 100644 arch/arm64/net/bpf_jit_comp.c Zi Shen Lim (14): arm64: introduce aarch64_insn_gen_comp_branch_imm() arm64: introduce aarch64_insn_gen_branch_reg() arm64: introduce aarch64_insn_gen_cond_branch_imm() arm64: introduce aarch64_insn_gen_load_store_reg() arm64: introduce aarch64_insn_gen_load_store_pair() arm64: introduce aarch64_insn_gen_add_sub_imm() arm64: introduce aarch64_insn_gen_bitfield() arm64: introduce aarch64_insn_gen_movewide() arm64: introduce aarch64_insn_gen_add_sub_shifted_reg() arm64: introduce 
aarch64_insn_gen_data1() arm64: introduce aarch64_insn_gen_data2() arm64: introduce aarch64_insn_gen_data3() arm64: introduce aarch64_insn_gen_logical_shifted_reg() arm64: eBPF JIT compiler Documentation/networking/filter.txt | 6 +- arch/arm64/Kconfig | 1 + arch/arm64/Makefile | 1 + arch/arm64/include/asm/insn.h | 249 + arch/arm64/kernel/insn.c| 646 +- arch/arm64/net/Makefile | 4 + arch/arm64/net/bpf_jit.h| 169 + arch/arm64/net/bpf_jit_comp.c | 677 8 files changed, 1743 insertions(+), 10 deletions(-) create mode 100644 arch/arm64/net/Makefile create mode 100644 arch/arm64/net/bpf_jit.h create mode 100644 arch/arm64/net/bpf_jit_comp.c -- 1.9.1 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCHv2 04/14] arm64: introduce aarch64_insn_gen_load_store_reg()
Introduce function to generate load/store (register offset) instructions. Signed-off-by: Zi Shen Lim Acked-by: Will Deacon --- arch/arm64/include/asm/insn.h | 20 ++ arch/arm64/kernel/insn.c | 62 +++ 2 files changed, 82 insertions(+) diff --git a/arch/arm64/include/asm/insn.h b/arch/arm64/include/asm/insn.h index 86a8a9c..5bc1cc3 100644 --- a/arch/arm64/include/asm/insn.h +++ b/arch/arm64/include/asm/insn.h @@ -72,6 +72,7 @@ enum aarch64_insn_imm_type { enum aarch64_insn_register_type { AARCH64_INSN_REGTYPE_RT, AARCH64_INSN_REGTYPE_RN, + AARCH64_INSN_REGTYPE_RM, }; enum aarch64_insn_register { @@ -143,12 +144,26 @@ enum aarch64_insn_branch_type { AARCH64_INSN_BRANCH_COMP_NONZERO, }; +enum aarch64_insn_size_type { + AARCH64_INSN_SIZE_8, + AARCH64_INSN_SIZE_16, + AARCH64_INSN_SIZE_32, + AARCH64_INSN_SIZE_64, +}; + +enum aarch64_insn_ldst_type { + AARCH64_INSN_LDST_LOAD_REG_OFFSET, + AARCH64_INSN_LDST_STORE_REG_OFFSET, +}; + #define__AARCH64_INSN_FUNCS(abbr, mask, val) \ static __always_inline bool aarch64_insn_is_##abbr(u32 code) \ { return (code & (mask)) == (val); } \ static __always_inline u32 aarch64_insn_get_##abbr##_value(void) \ { return (val); } +__AARCH64_INSN_FUNCS(str_reg, 0x3FE0EC00, 0x38206800) +__AARCH64_INSN_FUNCS(ldr_reg, 0x3FE0EC00, 0x38606800) __AARCH64_INSN_FUNCS(b,0xFC00, 0x1400) __AARCH64_INSN_FUNCS(bl, 0xFC00, 0x9400) __AARCH64_INSN_FUNCS(cbz, 0xFE00, 0x3400) @@ -184,6 +199,11 @@ u32 aarch64_insn_gen_hint(enum aarch64_insn_hint_op op); u32 aarch64_insn_gen_nop(void); u32 aarch64_insn_gen_branch_reg(enum aarch64_insn_register reg, enum aarch64_insn_branch_type type); +u32 aarch64_insn_gen_load_store_reg(enum aarch64_insn_register reg, + enum aarch64_insn_register base, + enum aarch64_insn_register offset, + enum aarch64_insn_size_type size, + enum aarch64_insn_ldst_type type); bool aarch64_insn_hotpatch_safe(u32 old_insn, u32 new_insn); diff --git a/arch/arm64/kernel/insn.c b/arch/arm64/kernel/insn.c index b65edc0..b882c85 100644 --- 
a/arch/arm64/kernel/insn.c +++ b/arch/arm64/kernel/insn.c @@ -286,6 +286,9 @@ static u32 aarch64_insn_encode_register(enum aarch64_insn_register_type type, case AARCH64_INSN_REGTYPE_RN: shift = 5; break; + case AARCH64_INSN_REGTYPE_RM: + shift = 16; + break; default: pr_err("%s: unknown register type encoding %d\n", __func__, type); @@ -298,6 +301,35 @@ static u32 aarch64_insn_encode_register(enum aarch64_insn_register_type type, return insn; } +static u32 aarch64_insn_encode_ldst_size(enum aarch64_insn_size_type type, +u32 insn) +{ + u32 size; + + switch (type) { + case AARCH64_INSN_SIZE_8: + size = 0; + break; + case AARCH64_INSN_SIZE_16: + size = 1; + break; + case AARCH64_INSN_SIZE_32: + size = 2; + break; + case AARCH64_INSN_SIZE_64: + size = 3; + break; + default: + pr_err("%s: unknown size encoding %d\n", __func__, type); + return 0; + } + + insn &= ~GENMASK(31, 30); + insn |= size << 30; + + return insn; +} + static inline long branch_imm_common(unsigned long pc, unsigned long addr, long range) { @@ -428,3 +460,33 @@ u32 aarch64_insn_gen_branch_reg(enum aarch64_insn_register reg, return aarch64_insn_encode_register(AARCH64_INSN_REGTYPE_RN, insn, reg); } + +u32 aarch64_insn_gen_load_store_reg(enum aarch64_insn_register reg, + enum aarch64_insn_register base, + enum aarch64_insn_register offset, + enum aarch64_insn_size_type size, + enum aarch64_insn_ldst_type type) +{ + u32 insn; + + switch (type) { + case AARCH64_INSN_LDST_LOAD_REG_OFFSET: + insn = aarch64_insn_get_ldr_reg_value(); + break; + case AARCH64_INSN_LDST_STORE_REG_OFFSET: + insn = aarch64_insn_get_str_reg_value(); + break; + default: + BUG_ON(1); + } + + insn = aarch64_insn_encode_ldst_size(size, insn); + + insn = aarch64_insn_encode_register(AARCH64_INSN_REGTYPE_RT, insn, reg); + + insn = aarch64_insn_encode_register(AARCH64_INSN_REGTYPE_RN, insn, + base); + + return
[PATCHv2 03/14] arm64: introduce aarch64_insn_gen_cond_branch_imm()
Introduce function to generate conditional branch (immediate) instructions. Signed-off-by: Zi Shen Lim Acked-by: Will Deacon --- arch/arm64/include/asm/insn.h | 21 + arch/arm64/kernel/insn.c | 17 + 2 files changed, 38 insertions(+) diff --git a/arch/arm64/include/asm/insn.h b/arch/arm64/include/asm/insn.h index 5080962..86a8a9c 100644 --- a/arch/arm64/include/asm/insn.h +++ b/arch/arm64/include/asm/insn.h @@ -117,6 +117,24 @@ enum aarch64_insn_variant { AARCH64_INSN_VARIANT_64BIT }; +enum aarch64_insn_condition { + AARCH64_INSN_COND_EQ = 0x0, /* == */ + AARCH64_INSN_COND_NE = 0x1, /* != */ + AARCH64_INSN_COND_CS = 0x2, /* unsigned >= */ + AARCH64_INSN_COND_CC = 0x3, /* unsigned < */ + AARCH64_INSN_COND_MI = 0x4, /* < 0 */ + AARCH64_INSN_COND_PL = 0x5, /* >= 0 */ + AARCH64_INSN_COND_VS = 0x6, /* overflow */ + AARCH64_INSN_COND_VC = 0x7, /* no overflow */ + AARCH64_INSN_COND_HI = 0x8, /* unsigned > */ + AARCH64_INSN_COND_LS = 0x9, /* unsigned <= */ + AARCH64_INSN_COND_GE = 0xa, /* signed >= */ + AARCH64_INSN_COND_LT = 0xb, /* signed < */ + AARCH64_INSN_COND_GT = 0xc, /* signed > */ + AARCH64_INSN_COND_LE = 0xd, /* signed <= */ + AARCH64_INSN_COND_AL = 0xe, /* always */ +}; + enum aarch64_insn_branch_type { AARCH64_INSN_BRANCH_NOLINK, AARCH64_INSN_BRANCH_LINK, @@ -135,6 +153,7 @@ __AARCH64_INSN_FUNCS(b, 0xFC00, 0x1400) __AARCH64_INSN_FUNCS(bl, 0xFC00, 0x9400) __AARCH64_INSN_FUNCS(cbz, 0xFE00, 0x3400) __AARCH64_INSN_FUNCS(cbnz, 0xFE00, 0x3500) +__AARCH64_INSN_FUNCS(bcond,0xFF10, 0x5400) __AARCH64_INSN_FUNCS(svc, 0xFFE0001F, 0xD401) __AARCH64_INSN_FUNCS(hvc, 0xFFE0001F, 0xD402) __AARCH64_INSN_FUNCS(smc, 0xFFE0001F, 0xD403) @@ -159,6 +178,8 @@ u32 aarch64_insn_gen_comp_branch_imm(unsigned long pc, unsigned long addr, enum aarch64_insn_register reg, enum aarch64_insn_variant variant, enum aarch64_insn_branch_type type); +u32 aarch64_insn_gen_cond_branch_imm(unsigned long pc, unsigned long addr, +enum aarch64_insn_condition cond); u32 aarch64_insn_gen_hint(enum 
aarch64_insn_hint_op op); u32 aarch64_insn_gen_nop(void); u32 aarch64_insn_gen_branch_reg(enum aarch64_insn_register reg, diff --git a/arch/arm64/kernel/insn.c b/arch/arm64/kernel/insn.c index 6797936..b65edc0 100644 --- a/arch/arm64/kernel/insn.c +++ b/arch/arm64/kernel/insn.c @@ -380,6 +380,23 @@ u32 aarch64_insn_gen_comp_branch_imm(unsigned long pc, unsigned long addr, offset >> 2); } +u32 aarch64_insn_gen_cond_branch_imm(unsigned long pc, unsigned long addr, +enum aarch64_insn_condition cond) +{ + u32 insn; + long offset; + + offset = branch_imm_common(pc, addr, SZ_1M); + + insn = aarch64_insn_get_bcond_value(); + + BUG_ON(cond < AARCH64_INSN_COND_EQ || cond > AARCH64_INSN_COND_AL); + insn |= cond; + + return aarch64_insn_encode_immediate(AARCH64_INSN_IMM_19, insn, +offset >> 2); +} + u32 __kprobes aarch64_insn_gen_hint(enum aarch64_insn_hint_op op) { return aarch64_insn_get_hint_value() | op; -- 1.9.1 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCHv2 01/14] arm64: introduce aarch64_insn_gen_comp_branch_imm()
Introduce function to generate compare & branch (immediate) instructions. Signed-off-by: Zi Shen Lim Acked-by: Will Deacon --- arch/arm64/include/asm/insn.h | 57 arch/arm64/kernel/insn.c | 88 --- 2 files changed, 140 insertions(+), 5 deletions(-) diff --git a/arch/arm64/include/asm/insn.h b/arch/arm64/include/asm/insn.h index dc1f73b..a98c495 100644 --- a/arch/arm64/include/asm/insn.h +++ b/arch/arm64/include/asm/insn.h @@ -2,6 +2,8 @@ * Copyright (C) 2013 Huawei Ltd. * Author: Jiang Liu * + * Copyright (C) 2014 Zi Shen Lim + * * This program is free software; you can redistribute it and/or modify * it under the terms of the GNU General Public License version 2 as * published by the Free Software Foundation. @@ -67,9 +69,58 @@ enum aarch64_insn_imm_type { AARCH64_INSN_IMM_MAX }; +enum aarch64_insn_register_type { + AARCH64_INSN_REGTYPE_RT, +}; + +enum aarch64_insn_register { + AARCH64_INSN_REG_0 = 0, + AARCH64_INSN_REG_1 = 1, + AARCH64_INSN_REG_2 = 2, + AARCH64_INSN_REG_3 = 3, + AARCH64_INSN_REG_4 = 4, + AARCH64_INSN_REG_5 = 5, + AARCH64_INSN_REG_6 = 6, + AARCH64_INSN_REG_7 = 7, + AARCH64_INSN_REG_8 = 8, + AARCH64_INSN_REG_9 = 9, + AARCH64_INSN_REG_10 = 10, + AARCH64_INSN_REG_11 = 11, + AARCH64_INSN_REG_12 = 12, + AARCH64_INSN_REG_13 = 13, + AARCH64_INSN_REG_14 = 14, + AARCH64_INSN_REG_15 = 15, + AARCH64_INSN_REG_16 = 16, + AARCH64_INSN_REG_17 = 17, + AARCH64_INSN_REG_18 = 18, + AARCH64_INSN_REG_19 = 19, + AARCH64_INSN_REG_20 = 20, + AARCH64_INSN_REG_21 = 21, + AARCH64_INSN_REG_22 = 22, + AARCH64_INSN_REG_23 = 23, + AARCH64_INSN_REG_24 = 24, + AARCH64_INSN_REG_25 = 25, + AARCH64_INSN_REG_26 = 26, + AARCH64_INSN_REG_27 = 27, + AARCH64_INSN_REG_28 = 28, + AARCH64_INSN_REG_29 = 29, + AARCH64_INSN_REG_FP = 29, /* Frame pointer */ + AARCH64_INSN_REG_30 = 30, + AARCH64_INSN_REG_LR = 30, /* Link register */ + AARCH64_INSN_REG_ZR = 31, /* Zero: as source register */ + AARCH64_INSN_REG_SP = 31 /* Stack pointer: as load/store base reg */ +}; + +enum aarch64_insn_variant { + 
AARCH64_INSN_VARIANT_32BIT, + AARCH64_INSN_VARIANT_64BIT +}; + enum aarch64_insn_branch_type { AARCH64_INSN_BRANCH_NOLINK, AARCH64_INSN_BRANCH_LINK, + AARCH64_INSN_BRANCH_COMP_ZERO, + AARCH64_INSN_BRANCH_COMP_NONZERO, }; #define__AARCH64_INSN_FUNCS(abbr, mask, val) \ @@ -80,6 +131,8 @@ static __always_inline u32 aarch64_insn_get_##abbr##_value(void) \ __AARCH64_INSN_FUNCS(b,0xFC00, 0x1400) __AARCH64_INSN_FUNCS(bl, 0xFC00, 0x9400) +__AARCH64_INSN_FUNCS(cbz, 0xFE00, 0x3400) +__AARCH64_INSN_FUNCS(cbnz, 0xFE00, 0x3500) __AARCH64_INSN_FUNCS(svc, 0xFFE0001F, 0xD401) __AARCH64_INSN_FUNCS(hvc, 0xFFE0001F, 0xD402) __AARCH64_INSN_FUNCS(smc, 0xFFE0001F, 0xD403) @@ -97,6 +150,10 @@ u32 aarch64_insn_encode_immediate(enum aarch64_insn_imm_type type, u32 insn, u64 imm); u32 aarch64_insn_gen_branch_imm(unsigned long pc, unsigned long addr, enum aarch64_insn_branch_type type); +u32 aarch64_insn_gen_comp_branch_imm(unsigned long pc, unsigned long addr, +enum aarch64_insn_register reg, +enum aarch64_insn_variant variant, +enum aarch64_insn_branch_type type); u32 aarch64_insn_gen_hint(enum aarch64_insn_hint_op op); u32 aarch64_insn_gen_nop(void); diff --git a/arch/arm64/kernel/insn.c b/arch/arm64/kernel/insn.c index 92f3683..d9f7827 100644 --- a/arch/arm64/kernel/insn.c +++ b/arch/arm64/kernel/insn.c @@ -2,6 +2,8 @@ * Copyright (C) 2013 Huawei Ltd. * Author: Jiang Liu * + * Copyright (C) 2014 Zi Shen Lim + * * This program is free software; you can redistribute it and/or modify * it under the terms of the GNU General Public License version 2 as * published by the Free Software Foundation. 
@@ -23,6 +25,8 @@ #include #include +#define AARCH64_INSN_SF_BITBIT(31) + static int aarch64_insn_encoding_class[] = { AARCH64_INSN_CLS_UNKNOWN, AARCH64_INSN_CLS_UNKNOWN, @@ -264,10 +268,36 @@ u32 __kprobes aarch64_insn_encode_immediate(enum aarch64_insn_imm_type type, return insn; } -u32 __kprobes aarch64_insn_gen_branch_imm(unsigned long pc, unsigned long addr, - enum aarch64_insn_branch_type type) +static u32 aarch64_insn_encode_register(enum aarch64_insn_register_type type, + u32 insn, + enum aarch64_insn_register reg) +{ + int shift; + + if (reg
[PATCHv2 10/14] arm64: introduce aarch64_insn_gen_data1()
Introduce function to generate data-processing (1 source) instructions. Signed-off-by: Zi Shen Lim Acked-by: Will Deacon --- arch/arm64/include/asm/insn.h | 13 + arch/arm64/kernel/insn.c | 37 + 2 files changed, 50 insertions(+) diff --git a/arch/arm64/include/asm/insn.h b/arch/arm64/include/asm/insn.h index c0a765d..246d214 100644 --- a/arch/arm64/include/asm/insn.h +++ b/arch/arm64/include/asm/insn.h @@ -185,6 +185,12 @@ enum aarch64_insn_bitfield_type { AARCH64_INSN_BITFIELD_MOVE_SIGNED }; +enum aarch64_insn_data1_type { + AARCH64_INSN_DATA1_REVERSE_16, + AARCH64_INSN_DATA1_REVERSE_32, + AARCH64_INSN_DATA1_REVERSE_64, +}; + #define__AARCH64_INSN_FUNCS(abbr, mask, val) \ static __always_inline bool aarch64_insn_is_##abbr(u32 code) \ { return (code & (mask)) == (val); } \ @@ -211,6 +217,9 @@ __AARCH64_INSN_FUNCS(add, 0x7F20, 0x0B00) __AARCH64_INSN_FUNCS(adds, 0x7F20, 0x2B00) __AARCH64_INSN_FUNCS(sub, 0x7F20, 0x4B00) __AARCH64_INSN_FUNCS(subs, 0x7F20, 0x6B00) +__AARCH64_INSN_FUNCS(rev16,0x7C00, 0x5AC00400) +__AARCH64_INSN_FUNCS(rev32,0x7C00, 0x5AC00800) +__AARCH64_INSN_FUNCS(rev64,0x7C00, 0x5AC00C00) __AARCH64_INSN_FUNCS(b,0xFC00, 0x1400) __AARCH64_INSN_FUNCS(bl, 0xFC00, 0x9400) __AARCH64_INSN_FUNCS(cbz, 0xFE00, 0x3400) @@ -276,6 +285,10 @@ u32 aarch64_insn_gen_add_sub_shifted_reg(enum aarch64_insn_register dst, int shift, enum aarch64_insn_variant variant, enum aarch64_insn_adsb_type type); +u32 aarch64_insn_gen_data1(enum aarch64_insn_register dst, + enum aarch64_insn_register src, + enum aarch64_insn_variant variant, + enum aarch64_insn_data1_type type); bool aarch64_insn_hotpatch_safe(u32 old_insn, u32 new_insn); diff --git a/arch/arm64/kernel/insn.c b/arch/arm64/kernel/insn.c index d7a4dd4..81ef3b5 100644 --- a/arch/arm64/kernel/insn.c +++ b/arch/arm64/kernel/insn.c @@ -747,3 +747,40 @@ u32 aarch64_insn_gen_add_sub_shifted_reg(enum aarch64_insn_register dst, return aarch64_insn_encode_immediate(AARCH64_INSN_IMM_6, insn, shift); } + +u32 
aarch64_insn_gen_data1(enum aarch64_insn_register dst, + enum aarch64_insn_register src, + enum aarch64_insn_variant variant, + enum aarch64_insn_data1_type type) +{ + u32 insn; + + switch (type) { + case AARCH64_INSN_DATA1_REVERSE_16: + insn = aarch64_insn_get_rev16_value(); + break; + case AARCH64_INSN_DATA1_REVERSE_32: + insn = aarch64_insn_get_rev32_value(); + break; + case AARCH64_INSN_DATA1_REVERSE_64: + BUG_ON(variant != AARCH64_INSN_VARIANT_64BIT); + insn = aarch64_insn_get_rev64_value(); + break; + default: + BUG_ON(1); + } + + switch (variant) { + case AARCH64_INSN_VARIANT_32BIT: + break; + case AARCH64_INSN_VARIANT_64BIT: + insn |= AARCH64_INSN_SF_BIT; + break; + default: + BUG_ON(1); + } + + insn = aarch64_insn_encode_register(AARCH64_INSN_REGTYPE_RD, insn, dst); + + return aarch64_insn_encode_register(AARCH64_INSN_REGTYPE_RN, insn, src); +} -- 1.9.1 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCHv2 06/14] arm64: introduce aarch64_insn_gen_add_sub_imm()
Introduce function to generate add/subtract (immediate) instructions. Signed-off-by: Zi Shen Lim Acked-by: Will Deacon --- arch/arm64/include/asm/insn.h | 16 arch/arm64/kernel/insn.c | 44 +++ 2 files changed, 60 insertions(+) diff --git a/arch/arm64/include/asm/insn.h b/arch/arm64/include/asm/insn.h index eef8f1e..29386aa 100644 --- a/arch/arm64/include/asm/insn.h +++ b/arch/arm64/include/asm/insn.h @@ -75,6 +75,7 @@ enum aarch64_insn_register_type { AARCH64_INSN_REGTYPE_RN, AARCH64_INSN_REGTYPE_RT2, AARCH64_INSN_REGTYPE_RM, + AARCH64_INSN_REGTYPE_RD, }; enum aarch64_insn_register { @@ -162,6 +163,13 @@ enum aarch64_insn_ldst_type { AARCH64_INSN_LDST_STORE_PAIR_POST_INDEX, }; +enum aarch64_insn_adsb_type { + AARCH64_INSN_ADSB_ADD, + AARCH64_INSN_ADSB_SUB, + AARCH64_INSN_ADSB_ADD_SETFLAGS, + AARCH64_INSN_ADSB_SUB_SETFLAGS +}; + #define__AARCH64_INSN_FUNCS(abbr, mask, val) \ static __always_inline bool aarch64_insn_is_##abbr(u32 code) \ { return (code & (mask)) == (val); } \ @@ -174,6 +182,10 @@ __AARCH64_INSN_FUNCS(stp_post, 0x7FC0, 0x2880) __AARCH64_INSN_FUNCS(ldp_post, 0x7FC0, 0x28C0) __AARCH64_INSN_FUNCS(stp_pre, 0x7FC0, 0x2980) __AARCH64_INSN_FUNCS(ldp_pre, 0x7FC0, 0x29C0) +__AARCH64_INSN_FUNCS(add_imm, 0x7F00, 0x1100) +__AARCH64_INSN_FUNCS(adds_imm, 0x7F00, 0x3100) +__AARCH64_INSN_FUNCS(sub_imm, 0x7F00, 0x5100) +__AARCH64_INSN_FUNCS(subs_imm, 0x7F00, 0x7100) __AARCH64_INSN_FUNCS(b,0xFC00, 0x1400) __AARCH64_INSN_FUNCS(bl, 0xFC00, 0x9400) __AARCH64_INSN_FUNCS(cbz, 0xFE00, 0x3400) @@ -220,6 +232,10 @@ u32 aarch64_insn_gen_load_store_pair(enum aarch64_insn_register reg1, int offset, enum aarch64_insn_variant variant, enum aarch64_insn_ldst_type type); +u32 aarch64_insn_gen_add_sub_imm(enum aarch64_insn_register dst, +enum aarch64_insn_register src, +int imm, enum aarch64_insn_variant variant, +enum aarch64_insn_adsb_type type); bool aarch64_insn_hotpatch_safe(u32 old_insn, u32 new_insn); diff --git a/arch/arm64/kernel/insn.c b/arch/arm64/kernel/insn.c index 
7880c06..ec3a902 100644 --- a/arch/arm64/kernel/insn.c +++ b/arch/arm64/kernel/insn.c @@ -285,6 +285,7 @@ static u32 aarch64_insn_encode_register(enum aarch64_insn_register_type type, switch (type) { case AARCH64_INSN_REGTYPE_RT: + case AARCH64_INSN_REGTYPE_RD: shift = 0; break; case AARCH64_INSN_REGTYPE_RN: @@ -555,3 +556,46 @@ u32 aarch64_insn_gen_load_store_pair(enum aarch64_insn_register reg1, return aarch64_insn_encode_immediate(AARCH64_INSN_IMM_7, insn, offset >> shift); } + +u32 aarch64_insn_gen_add_sub_imm(enum aarch64_insn_register dst, +enum aarch64_insn_register src, +int imm, enum aarch64_insn_variant variant, +enum aarch64_insn_adsb_type type) +{ + u32 insn; + + switch (type) { + case AARCH64_INSN_ADSB_ADD: + insn = aarch64_insn_get_add_imm_value(); + break; + case AARCH64_INSN_ADSB_SUB: + insn = aarch64_insn_get_sub_imm_value(); + break; + case AARCH64_INSN_ADSB_ADD_SETFLAGS: + insn = aarch64_insn_get_adds_imm_value(); + break; + case AARCH64_INSN_ADSB_SUB_SETFLAGS: + insn = aarch64_insn_get_subs_imm_value(); + break; + default: + BUG_ON(1); + } + + switch (variant) { + case AARCH64_INSN_VARIANT_32BIT: + break; + case AARCH64_INSN_VARIANT_64BIT: + insn |= AARCH64_INSN_SF_BIT; + break; + default: + BUG_ON(1); + } + + BUG_ON(imm & ~(SZ_4K - 1)); + + insn = aarch64_insn_encode_register(AARCH64_INSN_REGTYPE_RD, insn, dst); + + insn = aarch64_insn_encode_register(AARCH64_INSN_REGTYPE_RN, insn, src); + + return aarch64_insn_encode_immediate(AARCH64_INSN_IMM_12, insn, imm); +} -- 1.9.1 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCHv2 08/14] arm64: introduce aarch64_insn_gen_movewide()
Introduce function to generate move wide (immediate) instructions. Signed-off-by: Zi Shen Lim Acked-by: Will Deacon --- arch/arm64/include/asm/insn.h | 13 + arch/arm64/kernel/insn.c | 43 +++ 2 files changed, 56 insertions(+) diff --git a/arch/arm64/include/asm/insn.h b/arch/arm64/include/asm/insn.h index 8fd31fc..49dec28 100644 --- a/arch/arm64/include/asm/insn.h +++ b/arch/arm64/include/asm/insn.h @@ -172,6 +172,12 @@ enum aarch64_insn_adsb_type { AARCH64_INSN_ADSB_SUB_SETFLAGS }; +enum aarch64_insn_movewide_type { + AARCH64_INSN_MOVEWIDE_ZERO, + AARCH64_INSN_MOVEWIDE_KEEP, + AARCH64_INSN_MOVEWIDE_INVERSE +}; + enum aarch64_insn_bitfield_type { AARCH64_INSN_BITFIELD_MOVE, AARCH64_INSN_BITFIELD_MOVE_UNSIGNED, @@ -194,9 +200,12 @@ __AARCH64_INSN_FUNCS(add_imm, 0x7F00, 0x1100) __AARCH64_INSN_FUNCS(adds_imm, 0x7F00, 0x3100) __AARCH64_INSN_FUNCS(sub_imm, 0x7F00, 0x5100) __AARCH64_INSN_FUNCS(subs_imm, 0x7F00, 0x7100) +__AARCH64_INSN_FUNCS(movn, 0x7F80, 0x1280) __AARCH64_INSN_FUNCS(sbfm, 0x7F80, 0x1300) __AARCH64_INSN_FUNCS(bfm, 0x7F80, 0x3300) +__AARCH64_INSN_FUNCS(movz, 0x7F80, 0x5280) __AARCH64_INSN_FUNCS(ubfm, 0x7F80, 0x5300) +__AARCH64_INSN_FUNCS(movk, 0x7F80, 0x7280) __AARCH64_INSN_FUNCS(b,0xFC00, 0x1400) __AARCH64_INSN_FUNCS(bl, 0xFC00, 0x9400) __AARCH64_INSN_FUNCS(cbz, 0xFE00, 0x3400) @@ -252,6 +261,10 @@ u32 aarch64_insn_gen_bitfield(enum aarch64_insn_register dst, int immr, int imms, enum aarch64_insn_variant variant, enum aarch64_insn_bitfield_type type); +u32 aarch64_insn_gen_movewide(enum aarch64_insn_register dst, + int imm, int shift, + enum aarch64_insn_variant variant, + enum aarch64_insn_movewide_type type); bool aarch64_insn_hotpatch_safe(u32 old_insn, u32 new_insn); diff --git a/arch/arm64/kernel/insn.c b/arch/arm64/kernel/insn.c index e07d026..7aa2784 100644 --- a/arch/arm64/kernel/insn.c +++ b/arch/arm64/kernel/insn.c @@ -655,3 +655,46 @@ u32 aarch64_insn_gen_bitfield(enum aarch64_insn_register dst, return 
aarch64_insn_encode_immediate(AARCH64_INSN_IMM_S, insn, imms); } + +u32 aarch64_insn_gen_movewide(enum aarch64_insn_register dst, + int imm, int shift, + enum aarch64_insn_variant variant, + enum aarch64_insn_movewide_type type) +{ + u32 insn; + + switch (type) { + case AARCH64_INSN_MOVEWIDE_ZERO: + insn = aarch64_insn_get_movz_value(); + break; + case AARCH64_INSN_MOVEWIDE_KEEP: + insn = aarch64_insn_get_movk_value(); + break; + case AARCH64_INSN_MOVEWIDE_INVERSE: + insn = aarch64_insn_get_movn_value(); + break; + default: + BUG_ON(1); + } + + BUG_ON(imm & ~(SZ_64K - 1)); + + switch (variant) { + case AARCH64_INSN_VARIANT_32BIT: + BUG_ON(shift != 0 && shift != 16); + break; + case AARCH64_INSN_VARIANT_64BIT: + insn |= AARCH64_INSN_SF_BIT; + BUG_ON(shift != 0 && shift != 16 && shift != 32 && + shift != 48); + break; + default: + BUG_ON(1); + } + + insn |= (shift >> 4) << 21; + + insn = aarch64_insn_encode_register(AARCH64_INSN_REGTYPE_RD, insn, dst); + + return aarch64_insn_encode_immediate(AARCH64_INSN_IMM_16, insn, imm); +} -- 1.9.1
[PATCHv2 05/14] arm64: introduce aarch64_insn_gen_load_store_pair()
Introduce function to generate load/store pair instructions. Signed-off-by: Zi Shen Lim Acked-by: Will Deacon --- arch/arm64/include/asm/insn.h | 16 +++ arch/arm64/kernel/insn.c | 65 +++ 2 files changed, 81 insertions(+) diff --git a/arch/arm64/include/asm/insn.h b/arch/arm64/include/asm/insn.h index 5bc1cc3..eef8f1e 100644 --- a/arch/arm64/include/asm/insn.h +++ b/arch/arm64/include/asm/insn.h @@ -66,12 +66,14 @@ enum aarch64_insn_imm_type { AARCH64_INSN_IMM_14, AARCH64_INSN_IMM_12, AARCH64_INSN_IMM_9, + AARCH64_INSN_IMM_7, AARCH64_INSN_IMM_MAX }; enum aarch64_insn_register_type { AARCH64_INSN_REGTYPE_RT, AARCH64_INSN_REGTYPE_RN, + AARCH64_INSN_REGTYPE_RT2, AARCH64_INSN_REGTYPE_RM, }; @@ -154,6 +156,10 @@ enum aarch64_insn_size_type { enum aarch64_insn_ldst_type { AARCH64_INSN_LDST_LOAD_REG_OFFSET, AARCH64_INSN_LDST_STORE_REG_OFFSET, + AARCH64_INSN_LDST_LOAD_PAIR_PRE_INDEX, + AARCH64_INSN_LDST_STORE_PAIR_PRE_INDEX, + AARCH64_INSN_LDST_LOAD_PAIR_POST_INDEX, + AARCH64_INSN_LDST_STORE_PAIR_POST_INDEX, }; #define__AARCH64_INSN_FUNCS(abbr, mask, val) \ @@ -164,6 +170,10 @@ static __always_inline u32 aarch64_insn_get_##abbr##_value(void) \ __AARCH64_INSN_FUNCS(str_reg, 0x3FE0EC00, 0x38206800) __AARCH64_INSN_FUNCS(ldr_reg, 0x3FE0EC00, 0x38606800) +__AARCH64_INSN_FUNCS(stp_post, 0x7FC0, 0x2880) +__AARCH64_INSN_FUNCS(ldp_post, 0x7FC0, 0x28C0) +__AARCH64_INSN_FUNCS(stp_pre, 0x7FC0, 0x2980) +__AARCH64_INSN_FUNCS(ldp_pre, 0x7FC0, 0x29C0) __AARCH64_INSN_FUNCS(b,0xFC00, 0x1400) __AARCH64_INSN_FUNCS(bl, 0xFC00, 0x9400) __AARCH64_INSN_FUNCS(cbz, 0xFE00, 0x3400) @@ -204,6 +214,12 @@ u32 aarch64_insn_gen_load_store_reg(enum aarch64_insn_register reg, enum aarch64_insn_register offset, enum aarch64_insn_size_type size, enum aarch64_insn_ldst_type type); +u32 aarch64_insn_gen_load_store_pair(enum aarch64_insn_register reg1, +enum aarch64_insn_register reg2, +enum aarch64_insn_register base, +int offset, +enum aarch64_insn_variant variant, +enum aarch64_insn_ldst_type type); bool 
aarch64_insn_hotpatch_safe(u32 old_insn, u32 new_insn); diff --git a/arch/arm64/kernel/insn.c b/arch/arm64/kernel/insn.c index b882c85..7880c06 100644 --- a/arch/arm64/kernel/insn.c +++ b/arch/arm64/kernel/insn.c @@ -255,6 +255,10 @@ u32 __kprobes aarch64_insn_encode_immediate(enum aarch64_insn_imm_type type, mask = BIT(9) - 1; shift = 12; break; + case AARCH64_INSN_IMM_7: + mask = BIT(7) - 1; + shift = 15; + break; default: pr_err("aarch64_insn_encode_immediate: unknown immediate encoding %d\n", type); @@ -286,6 +290,9 @@ static u32 aarch64_insn_encode_register(enum aarch64_insn_register_type type, case AARCH64_INSN_REGTYPE_RN: shift = 5; break; + case AARCH64_INSN_REGTYPE_RT2: + shift = 10; + break; case AARCH64_INSN_REGTYPE_RM: shift = 16; break; @@ -490,3 +497,61 @@ u32 aarch64_insn_gen_load_store_reg(enum aarch64_insn_register reg, return aarch64_insn_encode_register(AARCH64_INSN_REGTYPE_RM, insn, offset); } + +u32 aarch64_insn_gen_load_store_pair(enum aarch64_insn_register reg1, +enum aarch64_insn_register reg2, +enum aarch64_insn_register base, +int offset, +enum aarch64_insn_variant variant, +enum aarch64_insn_ldst_type type) +{ + u32 insn; + int shift; + + switch (type) { + case AARCH64_INSN_LDST_LOAD_PAIR_PRE_INDEX: + insn = aarch64_insn_get_ldp_pre_value(); + break; + case AARCH64_INSN_LDST_STORE_PAIR_PRE_INDEX: + insn = aarch64_insn_get_stp_pre_value(); + break; + case AARCH64_INSN_LDST_LOAD_PAIR_POST_INDEX: + insn = aarch64_insn_get_ldp_post_value(); + break; + case AARCH64_INSN_LDST_STORE_PAIR_POST_INDEX: + insn = aarch64_insn_get_stp_post_value(); + break; + default: + BUG_ON(1); + } + + switch (variant) { + case AARCH64_INSN_VARIANT_32BIT: + /* offset must be multiples of 4 in
[PATCHv2 11/14] arm64: introduce aarch64_insn_gen_data2()
Introduce function to generate data-processing (2 source) instructions. Signed-off-by: Zi Shen Lim Acked-by: Will Deacon --- arch/arm64/include/asm/insn.h | 20 ++ arch/arm64/kernel/insn.c | 48 +++ 2 files changed, 68 insertions(+) diff --git a/arch/arm64/include/asm/insn.h b/arch/arm64/include/asm/insn.h index 246d214..367245f 100644 --- a/arch/arm64/include/asm/insn.h +++ b/arch/arm64/include/asm/insn.h @@ -191,6 +191,15 @@ enum aarch64_insn_data1_type { AARCH64_INSN_DATA1_REVERSE_64, }; +enum aarch64_insn_data2_type { + AARCH64_INSN_DATA2_UDIV, + AARCH64_INSN_DATA2_SDIV, + AARCH64_INSN_DATA2_LSLV, + AARCH64_INSN_DATA2_LSRV, + AARCH64_INSN_DATA2_ASRV, + AARCH64_INSN_DATA2_RORV, +}; + #define__AARCH64_INSN_FUNCS(abbr, mask, val) \ static __always_inline bool aarch64_insn_is_##abbr(u32 code) \ { return (code & (mask)) == (val); } \ @@ -217,6 +226,12 @@ __AARCH64_INSN_FUNCS(add, 0x7F20, 0x0B00) __AARCH64_INSN_FUNCS(adds, 0x7F20, 0x2B00) __AARCH64_INSN_FUNCS(sub, 0x7F20, 0x4B00) __AARCH64_INSN_FUNCS(subs, 0x7F20, 0x6B00) +__AARCH64_INSN_FUNCS(udiv, 0x7FE0FC00, 0x1AC00800) +__AARCH64_INSN_FUNCS(sdiv, 0x7FE0FC00, 0x1AC00C00) +__AARCH64_INSN_FUNCS(lslv, 0x7FE0FC00, 0x1AC02000) +__AARCH64_INSN_FUNCS(lsrv, 0x7FE0FC00, 0x1AC02400) +__AARCH64_INSN_FUNCS(asrv, 0x7FE0FC00, 0x1AC02800) +__AARCH64_INSN_FUNCS(rorv, 0x7FE0FC00, 0x1AC02C00) __AARCH64_INSN_FUNCS(rev16,0x7C00, 0x5AC00400) __AARCH64_INSN_FUNCS(rev32,0x7C00, 0x5AC00800) __AARCH64_INSN_FUNCS(rev64,0x7C00, 0x5AC00C00) @@ -289,6 +304,11 @@ u32 aarch64_insn_gen_data1(enum aarch64_insn_register dst, enum aarch64_insn_register src, enum aarch64_insn_variant variant, enum aarch64_insn_data1_type type); +u32 aarch64_insn_gen_data2(enum aarch64_insn_register dst, + enum aarch64_insn_register src, + enum aarch64_insn_register reg, + enum aarch64_insn_variant variant, + enum aarch64_insn_data2_type type); bool aarch64_insn_hotpatch_safe(u32 old_insn, u32 new_insn); diff --git a/arch/arm64/kernel/insn.c b/arch/arm64/kernel/insn.c 
index 81ef3b5..c054164 100644 --- a/arch/arm64/kernel/insn.c +++ b/arch/arm64/kernel/insn.c @@ -784,3 +784,51 @@ u32 aarch64_insn_gen_data1(enum aarch64_insn_register dst, return aarch64_insn_encode_register(AARCH64_INSN_REGTYPE_RN, insn, src); } + +u32 aarch64_insn_gen_data2(enum aarch64_insn_register dst, + enum aarch64_insn_register src, + enum aarch64_insn_register reg, + enum aarch64_insn_variant variant, + enum aarch64_insn_data2_type type) +{ + u32 insn; + + switch (type) { + case AARCH64_INSN_DATA2_UDIV: + insn = aarch64_insn_get_udiv_value(); + break; + case AARCH64_INSN_DATA2_SDIV: + insn = aarch64_insn_get_sdiv_value(); + break; + case AARCH64_INSN_DATA2_LSLV: + insn = aarch64_insn_get_lslv_value(); + break; + case AARCH64_INSN_DATA2_LSRV: + insn = aarch64_insn_get_lsrv_value(); + break; + case AARCH64_INSN_DATA2_ASRV: + insn = aarch64_insn_get_asrv_value(); + break; + case AARCH64_INSN_DATA2_RORV: + insn = aarch64_insn_get_rorv_value(); + break; + default: + BUG_ON(1); + } + + switch (variant) { + case AARCH64_INSN_VARIANT_32BIT: + break; + case AARCH64_INSN_VARIANT_64BIT: + insn |= AARCH64_INSN_SF_BIT; + break; + default: + BUG_ON(1); + } + + insn = aarch64_insn_encode_register(AARCH64_INSN_REGTYPE_RD, insn, dst); + + insn = aarch64_insn_encode_register(AARCH64_INSN_REGTYPE_RN, insn, src); + + return aarch64_insn_encode_register(AARCH64_INSN_REGTYPE_RM, insn, reg); +} -- 1.9.1
[PATCHv2 13/14] arm64: introduce aarch64_insn_gen_logical_shifted_reg()
Introduce function to generate logical (shifted register) instructions. Signed-off-by: Zi Shen Lim Acked-by: Will Deacon --- arch/arm64/include/asm/insn.h | 25 ++ arch/arm64/kernel/insn.c | 60 +++ 2 files changed, 85 insertions(+) diff --git a/arch/arm64/include/asm/insn.h b/arch/arm64/include/asm/insn.h index 36e8465..56a9e63 100644 --- a/arch/arm64/include/asm/insn.h +++ b/arch/arm64/include/asm/insn.h @@ -206,6 +206,17 @@ enum aarch64_insn_data3_type { AARCH64_INSN_DATA3_MSUB, }; +enum aarch64_insn_logic_type { + AARCH64_INSN_LOGIC_AND, + AARCH64_INSN_LOGIC_BIC, + AARCH64_INSN_LOGIC_ORR, + AARCH64_INSN_LOGIC_ORN, + AARCH64_INSN_LOGIC_EOR, + AARCH64_INSN_LOGIC_EON, + AARCH64_INSN_LOGIC_AND_SETFLAGS, + AARCH64_INSN_LOGIC_BIC_SETFLAGS +}; + #define__AARCH64_INSN_FUNCS(abbr, mask, val) \ static __always_inline bool aarch64_insn_is_##abbr(u32 code) \ { return (code & (mask)) == (val); } \ @@ -243,6 +254,14 @@ __AARCH64_INSN_FUNCS(rorv, 0x7FE0FC00, 0x1AC02C00) __AARCH64_INSN_FUNCS(rev16,0x7C00, 0x5AC00400) __AARCH64_INSN_FUNCS(rev32,0x7C00, 0x5AC00800) __AARCH64_INSN_FUNCS(rev64,0x7C00, 0x5AC00C00) +__AARCH64_INSN_FUNCS(and, 0x7F20, 0x0A00) +__AARCH64_INSN_FUNCS(bic, 0x7F20, 0x0A20) +__AARCH64_INSN_FUNCS(orr, 0x7F20, 0x2A00) +__AARCH64_INSN_FUNCS(orn, 0x7F20, 0x2A20) +__AARCH64_INSN_FUNCS(eor, 0x7F20, 0x4A00) +__AARCH64_INSN_FUNCS(eon, 0x7F20, 0x4A20) +__AARCH64_INSN_FUNCS(ands, 0x7F20, 0x6A00) +__AARCH64_INSN_FUNCS(bics, 0x7F20, 0x6A20) __AARCH64_INSN_FUNCS(b,0xFC00, 0x1400) __AARCH64_INSN_FUNCS(bl, 0xFC00, 0x9400) __AARCH64_INSN_FUNCS(cbz, 0xFE00, 0x3400) @@ -323,6 +342,12 @@ u32 aarch64_insn_gen_data3(enum aarch64_insn_register dst, enum aarch64_insn_register reg2, enum aarch64_insn_variant variant, enum aarch64_insn_data3_type type); +u32 aarch64_insn_gen_logical_shifted_reg(enum aarch64_insn_register dst, +enum aarch64_insn_register src, +enum aarch64_insn_register reg, +int shift, +enum aarch64_insn_variant variant, +enum aarch64_insn_logic_type type); bool 
aarch64_insn_hotpatch_safe(u32 old_insn, u32 new_insn); diff --git a/arch/arm64/kernel/insn.c b/arch/arm64/kernel/insn.c index f73a4bf..0668ee5 100644 --- a/arch/arm64/kernel/insn.c +++ b/arch/arm64/kernel/insn.c @@ -874,3 +874,63 @@ u32 aarch64_insn_gen_data3(enum aarch64_insn_register dst, return aarch64_insn_encode_register(AARCH64_INSN_REGTYPE_RM, insn, reg2); } + +u32 aarch64_insn_gen_logical_shifted_reg(enum aarch64_insn_register dst, +enum aarch64_insn_register src, +enum aarch64_insn_register reg, +int shift, +enum aarch64_insn_variant variant, +enum aarch64_insn_logic_type type) +{ + u32 insn; + + switch (type) { + case AARCH64_INSN_LOGIC_AND: + insn = aarch64_insn_get_and_value(); + break; + case AARCH64_INSN_LOGIC_BIC: + insn = aarch64_insn_get_bic_value(); + break; + case AARCH64_INSN_LOGIC_ORR: + insn = aarch64_insn_get_orr_value(); + break; + case AARCH64_INSN_LOGIC_ORN: + insn = aarch64_insn_get_orn_value(); + break; + case AARCH64_INSN_LOGIC_EOR: + insn = aarch64_insn_get_eor_value(); + break; + case AARCH64_INSN_LOGIC_EON: + insn = aarch64_insn_get_eon_value(); + break; + case AARCH64_INSN_LOGIC_AND_SETFLAGS: + insn = aarch64_insn_get_ands_value(); + break; + case AARCH64_INSN_LOGIC_BIC_SETFLAGS: + insn = aarch64_insn_get_bics_value(); + break; + default: + BUG_ON(1); + } + + switch (variant) { + case AARCH64_INSN_VARIANT_32BIT: + BUG_ON(shift & ~(SZ_32 - 1)); + break; + case AARCH64_INSN_VARIANT_64BIT: + insn |= AARCH64_INSN_SF_BIT; + BUG_ON(shift & ~(SZ_64 - 1)); + break; + default: + BUG_ON(1); + } + + + insn = aarch64_insn_encode_register(AARCH64_INSN_REGTYPE_RD, insn, dst); + + insn = aarch64_insn_encode_register(AARCH64_INSN_REGTYPE_RN, insn, src); + +
[PATCHv2 12/14] arm64: introduce aarch64_insn_gen_data3()
Introduce function to generate data-processing (3 source) instructions. Signed-off-by: Zi Shen Lim Acked-by: Will Deacon --- arch/arm64/include/asm/insn.h | 14 ++ arch/arm64/kernel/insn.c | 42 ++ 2 files changed, 56 insertions(+) diff --git a/arch/arm64/include/asm/insn.h b/arch/arm64/include/asm/insn.h index 367245f..36e8465 100644 --- a/arch/arm64/include/asm/insn.h +++ b/arch/arm64/include/asm/insn.h @@ -79,6 +79,7 @@ enum aarch64_insn_register_type { AARCH64_INSN_REGTYPE_RT2, AARCH64_INSN_REGTYPE_RM, AARCH64_INSN_REGTYPE_RD, + AARCH64_INSN_REGTYPE_RA, }; enum aarch64_insn_register { @@ -200,6 +201,11 @@ enum aarch64_insn_data2_type { AARCH64_INSN_DATA2_RORV, }; +enum aarch64_insn_data3_type { + AARCH64_INSN_DATA3_MADD, + AARCH64_INSN_DATA3_MSUB, +}; + #define__AARCH64_INSN_FUNCS(abbr, mask, val) \ static __always_inline bool aarch64_insn_is_##abbr(u32 code) \ { return (code & (mask)) == (val); } \ @@ -226,6 +232,8 @@ __AARCH64_INSN_FUNCS(add, 0x7F20, 0x0B00) __AARCH64_INSN_FUNCS(adds, 0x7F20, 0x2B00) __AARCH64_INSN_FUNCS(sub, 0x7F20, 0x4B00) __AARCH64_INSN_FUNCS(subs, 0x7F20, 0x6B00) +__AARCH64_INSN_FUNCS(madd, 0x7FE08000, 0x1B00) +__AARCH64_INSN_FUNCS(msub, 0x7FE08000, 0x1B008000) __AARCH64_INSN_FUNCS(udiv, 0x7FE0FC00, 0x1AC00800) __AARCH64_INSN_FUNCS(sdiv, 0x7FE0FC00, 0x1AC00C00) __AARCH64_INSN_FUNCS(lslv, 0x7FE0FC00, 0x1AC02000) @@ -309,6 +317,12 @@ u32 aarch64_insn_gen_data2(enum aarch64_insn_register dst, enum aarch64_insn_register reg, enum aarch64_insn_variant variant, enum aarch64_insn_data2_type type); +u32 aarch64_insn_gen_data3(enum aarch64_insn_register dst, + enum aarch64_insn_register src, + enum aarch64_insn_register reg1, + enum aarch64_insn_register reg2, + enum aarch64_insn_variant variant, + enum aarch64_insn_data3_type type); bool aarch64_insn_hotpatch_safe(u32 old_insn, u32 new_insn); diff --git a/arch/arm64/kernel/insn.c b/arch/arm64/kernel/insn.c index c054164..f73a4bf 100644 --- a/arch/arm64/kernel/insn.c +++ b/arch/arm64/kernel/insn.c 
@@ -302,6 +302,7 @@ static u32 aarch64_insn_encode_register(enum aarch64_insn_register_type type, shift = 5; break; case AARCH64_INSN_REGTYPE_RT2: + case AARCH64_INSN_REGTYPE_RA: shift = 10; break; case AARCH64_INSN_REGTYPE_RM: @@ -832,3 +833,44 @@ u32 aarch64_insn_gen_data2(enum aarch64_insn_register dst, return aarch64_insn_encode_register(AARCH64_INSN_REGTYPE_RM, insn, reg); } + +u32 aarch64_insn_gen_data3(enum aarch64_insn_register dst, + enum aarch64_insn_register src, + enum aarch64_insn_register reg1, + enum aarch64_insn_register reg2, + enum aarch64_insn_variant variant, + enum aarch64_insn_data3_type type) +{ + u32 insn; + + switch (type) { + case AARCH64_INSN_DATA3_MADD: + insn = aarch64_insn_get_madd_value(); + break; + case AARCH64_INSN_DATA3_MSUB: + insn = aarch64_insn_get_msub_value(); + break; + default: + BUG_ON(1); + } + + switch (variant) { + case AARCH64_INSN_VARIANT_32BIT: + break; + case AARCH64_INSN_VARIANT_64BIT: + insn |= AARCH64_INSN_SF_BIT; + break; + default: + BUG_ON(1); + } + + insn = aarch64_insn_encode_register(AARCH64_INSN_REGTYPE_RD, insn, dst); + + insn = aarch64_insn_encode_register(AARCH64_INSN_REGTYPE_RA, insn, src); + + insn = aarch64_insn_encode_register(AARCH64_INSN_REGTYPE_RN, insn, + reg1); + + return aarch64_insn_encode_register(AARCH64_INSN_REGTYPE_RM, insn, + reg2); +} -- 1.9.1
[PATCHv2 09/14] arm64: introduce aarch64_insn_gen_add_sub_shifted_reg()
Introduce function to generate add/subtract (shifted register) instructions. Signed-off-by: Zi Shen Lim Acked-by: Will Deacon --- arch/arm64/include/asm/insn.h | 11 ++ arch/arm64/kernel/insn.c | 49 +++ 2 files changed, 60 insertions(+) diff --git a/arch/arm64/include/asm/insn.h b/arch/arm64/include/asm/insn.h index 49dec28..c0a765d 100644 --- a/arch/arm64/include/asm/insn.h +++ b/arch/arm64/include/asm/insn.h @@ -67,6 +67,7 @@ enum aarch64_insn_imm_type { AARCH64_INSN_IMM_12, AARCH64_INSN_IMM_9, AARCH64_INSN_IMM_7, + AARCH64_INSN_IMM_6, AARCH64_INSN_IMM_S, AARCH64_INSN_IMM_R, AARCH64_INSN_IMM_MAX @@ -206,6 +207,10 @@ __AARCH64_INSN_FUNCS(bfm, 0x7F80, 0x3300) __AARCH64_INSN_FUNCS(movz, 0x7F80, 0x5280) __AARCH64_INSN_FUNCS(ubfm, 0x7F80, 0x5300) __AARCH64_INSN_FUNCS(movk, 0x7F80, 0x7280) +__AARCH64_INSN_FUNCS(add, 0x7F20, 0x0B00) +__AARCH64_INSN_FUNCS(adds, 0x7F20, 0x2B00) +__AARCH64_INSN_FUNCS(sub, 0x7F20, 0x4B00) +__AARCH64_INSN_FUNCS(subs, 0x7F20, 0x6B00) __AARCH64_INSN_FUNCS(b,0xFC00, 0x1400) __AARCH64_INSN_FUNCS(bl, 0xFC00, 0x9400) __AARCH64_INSN_FUNCS(cbz, 0xFE00, 0x3400) @@ -265,6 +270,12 @@ u32 aarch64_insn_gen_movewide(enum aarch64_insn_register dst, int imm, int shift, enum aarch64_insn_variant variant, enum aarch64_insn_movewide_type type); +u32 aarch64_insn_gen_add_sub_shifted_reg(enum aarch64_insn_register dst, +enum aarch64_insn_register src, +enum aarch64_insn_register reg, +int shift, +enum aarch64_insn_variant variant, +enum aarch64_insn_adsb_type type); bool aarch64_insn_hotpatch_safe(u32 old_insn, u32 new_insn); diff --git a/arch/arm64/kernel/insn.c b/arch/arm64/kernel/insn.c index 7aa2784..d7a4dd4 100644 --- a/arch/arm64/kernel/insn.c +++ b/arch/arm64/kernel/insn.c @@ -260,6 +260,7 @@ u32 __kprobes aarch64_insn_encode_immediate(enum aarch64_insn_imm_type type, mask = BIT(7) - 1; shift = 15; break; + case AARCH64_INSN_IMM_6: case AARCH64_INSN_IMM_S: mask = BIT(6) - 1; shift = 10; @@ -698,3 +699,51 @@ u32 aarch64_insn_gen_movewide(enum 
aarch64_insn_register dst, return aarch64_insn_encode_immediate(AARCH64_INSN_IMM_16, insn, imm); } + +u32 aarch64_insn_gen_add_sub_shifted_reg(enum aarch64_insn_register dst, +enum aarch64_insn_register src, +enum aarch64_insn_register reg, +int shift, +enum aarch64_insn_variant variant, +enum aarch64_insn_adsb_type type) +{ + u32 insn; + + switch (type) { + case AARCH64_INSN_ADSB_ADD: + insn = aarch64_insn_get_add_value(); + break; + case AARCH64_INSN_ADSB_SUB: + insn = aarch64_insn_get_sub_value(); + break; + case AARCH64_INSN_ADSB_ADD_SETFLAGS: + insn = aarch64_insn_get_adds_value(); + break; + case AARCH64_INSN_ADSB_SUB_SETFLAGS: + insn = aarch64_insn_get_subs_value(); + break; + default: + BUG_ON(1); + } + + switch (variant) { + case AARCH64_INSN_VARIANT_32BIT: + BUG_ON(shift & ~(SZ_32 - 1)); + break; + case AARCH64_INSN_VARIANT_64BIT: + insn |= AARCH64_INSN_SF_BIT; + BUG_ON(shift & ~(SZ_64 - 1)); + break; + default: + BUG_ON(1); + } + + + insn = aarch64_insn_encode_register(AARCH64_INSN_REGTYPE_RD, insn, dst); + + insn = aarch64_insn_encode_register(AARCH64_INSN_REGTYPE_RN, insn, src); + + insn = aarch64_insn_encode_register(AARCH64_INSN_REGTYPE_RM, insn, reg); + + return aarch64_insn_encode_immediate(AARCH64_INSN_IMM_6, insn, shift); +} -- 1.9.1
[PATCHv2 14/14] arm64: eBPF JIT compiler
The JIT compiler emits A64 instructions. It supports eBPF only. Legacy BPF is supported thanks to conversion by BPF core. JIT is enabled in the same way as for other architectures: echo 1 > /proc/sys/net/core/bpf_jit_enable Or for additional compiler output: echo 2 > /proc/sys/net/core/bpf_jit_enable See Documentation/networking/filter.txt for more information. The implementation passes all 57 tests in lib/test_bpf.c on ARMv8 Foundation Model :) Also tested by Will on Juno platform. Signed-off-by: Zi Shen Lim Acked-by: Alexei Starovoitov Acked-by: Will Deacon --- v1->v2: Rebased onto 3.17-rc2, and fixed up changes related to: - sock_filter_int -> bpf_insn: 2695fb552cbe (net: filter: rename 'struct sock_filter_int' into 'struct bpf_insn') - sk_filter -> bpf_prog: 7ae457c1e5b4 (net: filter: split 'struct sk_filter' into socket and bpf parts) RFCv3->v1: Addressed review comments from Will wrt codegen bits: - define and use {SF,N}_BIT - use masks for limit checks Also: - rebase onto net-next RFCv2->RFCv3: - clarify 16B stack alignment requirement - I missed one reference - fixed a couple checks for immediate bits - make bpf_jit.h checkpatch clean - remove stale DW case in LD_IND and LD_ABS (good catch by Alexei) - add Alexei's Acked-by - rebase onto net-next Also, per discussion with Will, consolidated bpf_jit.h into arch/arm64/.../insn.{c,h}: - instruction encoding stuff moved into arch/arm64/kernel/insn.c - bpf_jit.h uses arch/arm64/include/asm/insn.h RFCv1->RFCv2: Addressed review comments from Alexei: - use core-$(CONFIG_NET) - use GENMASK - lower-case function names in header file - drop LD_ABS+DW and LD_IND+DW, which do not exist in eBPF yet - use pr_xxx_once() to prevent spamming logs - clarify 16B stack alignment requirement - drop usage of EMIT macro which was saving just one argument, turns out having additional argument wasn't too much of an eyesore Also, per discussion with Alexei, and additional suggestion from Daniel: - moved load_pointer() from 
net/core/filter.c into filter.h as bpf_load_pointer() which is done as a separate preparatory patch. [1] [1] http://patchwork.ozlabs.org/patch/366906/ NOTES: * The preparatory patch [1] has been merged into net-next 9f12fbe603f7 ("net: filter: move load_pointer() into filter.h"). * bpf_jit_comp.c and bpf_jit.h is checkpatch clean. * The following sparse warning is not applicable: warning: symbol 'bpf_jit_enable' was not declared. Should it be static? FUTURE WORK: 1. Implement remaining classes of eBPF instructions: ST|MEM, STX|XADD which currently do not have corresponding test cases in test_bpf. 2. Further compiler optimization, such as optimization for small immediates. Documentation/networking/filter.txt | 6 +- arch/arm64/Kconfig | 1 + arch/arm64/Makefile | 1 + arch/arm64/net/Makefile | 4 + arch/arm64/net/bpf_jit.h| 169 + arch/arm64/net/bpf_jit_comp.c | 677 6 files changed, 855 insertions(+), 3 deletions(-) create mode 100644 arch/arm64/net/Makefile create mode 100644 arch/arm64/net/bpf_jit.h create mode 100644 arch/arm64/net/bpf_jit_comp.c diff --git a/Documentation/networking/filter.txt b/Documentation/networking/filter.txt index c48a970..1842d4f 100644 --- a/Documentation/networking/filter.txt +++ b/Documentation/networking/filter.txt @@ -462,9 +462,9 @@ JIT compiler The Linux kernel has a built-in BPF JIT compiler for x86_64, SPARC, PowerPC, -ARM and s390 and can be enabled through CONFIG_BPF_JIT. The JIT compiler is -transparently invoked for each attached filter from user space or for internal -kernel users if it has been previously enabled by root: +ARM, ARM64 and s390 and can be enabled through CONFIG_BPF_JIT. 
The JIT compiler +is transparently invoked for each attached filter from user space or for +internal kernel users if it has been previously enabled by root: echo 1 > /proc/sys/net/core/bpf_jit_enable diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig index fd4e81a..cfea623 100644 --- a/arch/arm64/Kconfig +++ b/arch/arm64/Kconfig @@ -35,6 +35,7 @@ config ARM64 select HAVE_ARCH_JUMP_LABEL select HAVE_ARCH_KGDB select HAVE_ARCH_TRACEHOOK + select HAVE_BPF_JIT select HAVE_C_RECORDMCOUNT select HAVE_CC_STACKPROTECTOR select HAVE_DEBUG_BUGVERBOSE diff --git a/arch/arm64/Makefile b/arch/arm64/Makefile index 2df5e5d..59c86b6 100644 --- a/arch/arm64/Makefile +++ b/arch/arm64/Makefile @@ -47,6 +47,7 @@ endif export TEXT_OFFSET GZFLAGS core-y += arch/arm64/kernel/ arch/arm64/mm/ +core-$(CONFIG_NET) += arch/arm64/net/ core-$(CONFIG_KVM) += arch/arm64/kvm/ core-$(CONFIG_XEN) += arch/arm64/xen/ core-$(CONFIG_CRYPTO) += arch/arm64/crypto/ diff --git a/arch/arm64/net/Makefile b/arch/arm64/net/Makefile new
[PATCHv2 07/14] arm64: introduce aarch64_insn_gen_bitfield()
Introduce function to generate bitfield instructions. Signed-off-by: Zi Shen Lim Acked-by: Will Deacon --- arch/arm64/include/asm/insn.h | 16 + arch/arm64/kernel/insn.c | 56 +++ 2 files changed, 72 insertions(+) diff --git a/arch/arm64/include/asm/insn.h b/arch/arm64/include/asm/insn.h index 29386aa..8fd31fc 100644 --- a/arch/arm64/include/asm/insn.h +++ b/arch/arm64/include/asm/insn.h @@ -67,6 +67,8 @@ enum aarch64_insn_imm_type { AARCH64_INSN_IMM_12, AARCH64_INSN_IMM_9, AARCH64_INSN_IMM_7, + AARCH64_INSN_IMM_S, + AARCH64_INSN_IMM_R, AARCH64_INSN_IMM_MAX }; @@ -170,6 +172,12 @@ enum aarch64_insn_adsb_type { AARCH64_INSN_ADSB_SUB_SETFLAGS }; +enum aarch64_insn_bitfield_type { + AARCH64_INSN_BITFIELD_MOVE, + AARCH64_INSN_BITFIELD_MOVE_UNSIGNED, + AARCH64_INSN_BITFIELD_MOVE_SIGNED +}; + #define__AARCH64_INSN_FUNCS(abbr, mask, val) \ static __always_inline bool aarch64_insn_is_##abbr(u32 code) \ { return (code & (mask)) == (val); } \ @@ -186,6 +194,9 @@ __AARCH64_INSN_FUNCS(add_imm, 0x7F00, 0x1100) __AARCH64_INSN_FUNCS(adds_imm, 0x7F00, 0x3100) __AARCH64_INSN_FUNCS(sub_imm, 0x7F00, 0x5100) __AARCH64_INSN_FUNCS(subs_imm, 0x7F00, 0x7100) +__AARCH64_INSN_FUNCS(sbfm, 0x7F80, 0x1300) +__AARCH64_INSN_FUNCS(bfm, 0x7F80, 0x3300) +__AARCH64_INSN_FUNCS(ubfm, 0x7F80, 0x5300) __AARCH64_INSN_FUNCS(b,0xFC00, 0x1400) __AARCH64_INSN_FUNCS(bl, 0xFC00, 0x9400) __AARCH64_INSN_FUNCS(cbz, 0xFE00, 0x3400) @@ -236,6 +247,11 @@ u32 aarch64_insn_gen_add_sub_imm(enum aarch64_insn_register dst, enum aarch64_insn_register src, int imm, enum aarch64_insn_variant variant, enum aarch64_insn_adsb_type type); +u32 aarch64_insn_gen_bitfield(enum aarch64_insn_register dst, + enum aarch64_insn_register src, + int immr, int imms, + enum aarch64_insn_variant variant, + enum aarch64_insn_bitfield_type type); bool aarch64_insn_hotpatch_safe(u32 old_insn, u32 new_insn); diff --git a/arch/arm64/kernel/insn.c b/arch/arm64/kernel/insn.c index ec3a902..e07d026 100644 --- a/arch/arm64/kernel/insn.c +++ 
b/arch/arm64/kernel/insn.c @@ -26,6 +26,7 @@ #include #define AARCH64_INSN_SF_BITBIT(31) +#define AARCH64_INSN_N_BIT BIT(22) static int aarch64_insn_encoding_class[] = { AARCH64_INSN_CLS_UNKNOWN, @@ -259,6 +260,14 @@ u32 __kprobes aarch64_insn_encode_immediate(enum aarch64_insn_imm_type type, mask = BIT(7) - 1; shift = 15; break; + case AARCH64_INSN_IMM_S: + mask = BIT(6) - 1; + shift = 10; + break; + case AARCH64_INSN_IMM_R: + mask = BIT(6) - 1; + shift = 16; + break; default: pr_err("aarch64_insn_encode_immediate: unknown immediate encoding %d\n", type); @@ -599,3 +608,50 @@ u32 aarch64_insn_gen_add_sub_imm(enum aarch64_insn_register dst, return aarch64_insn_encode_immediate(AARCH64_INSN_IMM_12, insn, imm); } + +u32 aarch64_insn_gen_bitfield(enum aarch64_insn_register dst, + enum aarch64_insn_register src, + int immr, int imms, + enum aarch64_insn_variant variant, + enum aarch64_insn_bitfield_type type) +{ + u32 insn; + u32 mask; + + switch (type) { + case AARCH64_INSN_BITFIELD_MOVE: + insn = aarch64_insn_get_bfm_value(); + break; + case AARCH64_INSN_BITFIELD_MOVE_UNSIGNED: + insn = aarch64_insn_get_ubfm_value(); + break; + case AARCH64_INSN_BITFIELD_MOVE_SIGNED: + insn = aarch64_insn_get_sbfm_value(); + break; + default: + BUG_ON(1); + } + + switch (variant) { + case AARCH64_INSN_VARIANT_32BIT: + mask = GENMASK(4, 0); + break; + case AARCH64_INSN_VARIANT_64BIT: + insn |= AARCH64_INSN_SF_BIT | AARCH64_INSN_N_BIT; + mask = GENMASK(5, 0); + break; + default: + BUG_ON(1); + } + + BUG_ON(immr & ~mask); + BUG_ON(imms & ~mask); + + insn = aarch64_insn_encode_register(AARCH64_INSN_REGTYPE_RD, insn, dst); + + insn = aarch64_insn_encode_register(AARCH64_INSN_REGTYPE_RN, insn, src); + + insn = aarch64_insn_encode_immediate(AARCH64_INSN_IMM_R, insn, immr); + + return aarch64_insn_encode_immediate(AARCH64_INSN_IMM_S, insn, imms); +} -- 1.9.1 --
[PATCHv2 02/14] arm64: introduce aarch64_insn_gen_branch_reg()
Introduce function to generate unconditional branch (register) instructions. Signed-off-by: Zi Shen Lim Acked-by: Will Deacon --- arch/arm64/include/asm/insn.h | 7 +++ arch/arm64/kernel/insn.c | 35 +-- 2 files changed, 40 insertions(+), 2 deletions(-) diff --git a/arch/arm64/include/asm/insn.h b/arch/arm64/include/asm/insn.h index a98c495..5080962 100644 --- a/arch/arm64/include/asm/insn.h +++ b/arch/arm64/include/asm/insn.h @@ -71,6 +71,7 @@ enum aarch64_insn_imm_type { enum aarch64_insn_register_type { AARCH64_INSN_REGTYPE_RT, + AARCH64_INSN_REGTYPE_RN, }; enum aarch64_insn_register { @@ -119,6 +120,7 @@ enum aarch64_insn_variant { enum aarch64_insn_branch_type { AARCH64_INSN_BRANCH_NOLINK, AARCH64_INSN_BRANCH_LINK, + AARCH64_INSN_BRANCH_RETURN, AARCH64_INSN_BRANCH_COMP_ZERO, AARCH64_INSN_BRANCH_COMP_NONZERO, }; @@ -138,6 +140,9 @@ __AARCH64_INSN_FUNCS(hvc, 0xFFE0001F, 0xD402) __AARCH64_INSN_FUNCS(smc, 0xFFE0001F, 0xD403) __AARCH64_INSN_FUNCS(brk, 0xFFE0001F, 0xD420) __AARCH64_INSN_FUNCS(hint, 0xF01F, 0xD503201F) +__AARCH64_INSN_FUNCS(br, 0xFC1F, 0xD61F) +__AARCH64_INSN_FUNCS(blr, 0xFC1F, 0xD63F) +__AARCH64_INSN_FUNCS(ret, 0xFC1F, 0xD65F) #undef __AARCH64_INSN_FUNCS @@ -156,6 +161,8 @@ u32 aarch64_insn_gen_comp_branch_imm(unsigned long pc, unsigned long addr, enum aarch64_insn_branch_type type); u32 aarch64_insn_gen_hint(enum aarch64_insn_hint_op op); u32 aarch64_insn_gen_nop(void); +u32 aarch64_insn_gen_branch_reg(enum aarch64_insn_register reg, + enum aarch64_insn_branch_type type); bool aarch64_insn_hotpatch_safe(u32 old_insn, u32 new_insn); diff --git a/arch/arm64/kernel/insn.c b/arch/arm64/kernel/insn.c index d9f7827..6797936 100644 --- a/arch/arm64/kernel/insn.c +++ b/arch/arm64/kernel/insn.c @@ -283,6 +283,9 @@ static u32 aarch64_insn_encode_register(enum aarch64_insn_register_type type, case AARCH64_INSN_REGTYPE_RT: shift = 0; break; + case AARCH64_INSN_REGTYPE_RN: + shift = 5; + break; default: pr_err("%s: unknown register type encoding %d\n", __func__, 
type); @@ -325,10 +328,16 @@ u32 __kprobes aarch64_insn_gen_branch_imm(unsigned long pc, unsigned long addr, */ offset = branch_imm_common(pc, addr, SZ_128M); - if (type == AARCH64_INSN_BRANCH_LINK) + switch (type) { + case AARCH64_INSN_BRANCH_LINK: insn = aarch64_insn_get_bl_value(); - else + break; + case AARCH64_INSN_BRANCH_NOLINK: insn = aarch64_insn_get_b_value(); + break; + default: + BUG_ON(1); + } return aarch64_insn_encode_immediate(AARCH64_INSN_IMM_26, insn, offset >> 2); @@ -380,3 +389,25 @@ u32 __kprobes aarch64_insn_gen_nop(void) { return aarch64_insn_gen_hint(AARCH64_INSN_HINT_NOP); } + +u32 aarch64_insn_gen_branch_reg(enum aarch64_insn_register reg, + enum aarch64_insn_branch_type type) +{ + u32 insn; + + switch (type) { + case AARCH64_INSN_BRANCH_NOLINK: + insn = aarch64_insn_get_br_value(); + break; + case AARCH64_INSN_BRANCH_LINK: + insn = aarch64_insn_get_blr_value(); + break; + case AARCH64_INSN_BRANCH_RETURN: + insn = aarch64_insn_get_ret_value(); + break; + default: + BUG_ON(1); + } + + return aarch64_insn_encode_register(AARCH64_INSN_REGTYPE_RN, insn, reg); +} -- 1.9.1 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] mmc: dw_mmc: Pass back errors from mmc_of_parse()
Looks good to me. Acked-by: Jaehoon Chung Best Regards, Jaehoon Chung On 08/26/2014 03:19 AM, Doug Anderson wrote: > It's possible that mmc_of_parse() could return errors (possibly in > some future version it might return -EPROBE_DEFER even). Let's pass > those errors back. > > Signed-off-by: Doug Anderson > --- > drivers/mmc/host/dw_mmc.c | 10 ++ > 1 file changed, 6 insertions(+), 4 deletions(-) > > diff --git a/drivers/mmc/host/dw_mmc.c b/drivers/mmc/host/dw_mmc.c > index 7f227e9..9ef4df0 100644 > --- a/drivers/mmc/host/dw_mmc.c > +++ b/drivers/mmc/host/dw_mmc.c > @@ -2131,7 +2131,9 @@ static int dw_mci_init_slot(struct dw_mci *host, > unsigned int id) > if (host->pdata->caps2) > mmc->caps2 = host->pdata->caps2; > > - mmc_of_parse(mmc); > + ret = mmc_of_parse(mmc); > + if (ret) > + goto err_host_allocated; > > if (host->pdata->blk_settings) { > mmc->max_segs = host->pdata->blk_settings->max_segs; > @@ -2163,7 +2165,7 @@ static int dw_mci_init_slot(struct dw_mci *host, > unsigned int id) > > ret = mmc_add_host(mmc); > if (ret) > - goto err_setup_bus; > + goto err_host_allocated; > > #if defined(CONFIG_DEBUG_FS) > dw_mci_init_debugfs(slot); > @@ -2174,9 +2176,9 @@ static int dw_mci_init_slot(struct dw_mci *host, > unsigned int id) > > return 0; > > -err_setup_bus: > +err_host_allocated: > mmc_free_host(mmc); > - return -EINVAL; > + return ret; > } > > static void dw_mci_cleanup_slot(struct dw_mci_slot *slot, unsigned int id)
Re: [Xen-devel] [PATCH 1/3] x86: Make page cache mode a real type
On 08/26/2014 09:44 PM, Toshi Kani wrote: On Tue, 2014-08-26 at 08:16 +0200, Juergen Gross wrote: At the moment there are a lot of places that handle setting or getting the page cache mode by treating the pgprot bits equal to the cache mode. This is only true because there are a lot of assumptions about the setup of the PAT MSR. Otherwise the cache type needs to get translated into pgprot bits and vice versa. This patch tries to prepare for that by introducing a separate type for the cache mode and adding functions to translate between those and pgprot values. To avoid too much performance penalty the translation between cache mode and pgprot values is done via tables which contain the relevant information. Write-back cache mode is hard-wired to be 0, all other modes are configurable via those tables. For large pages there are translation functions as the PAT bit is located at different positions in the ptes of 4k and large pages. Signed-off-by: Stefan Bader Signed-off-by: Juergen Gross Hi Juergen, Thanks for the updates! A few comments below... @@ -73,6 +73,9 @@ void *kmap_atomic_prot_pfn(unsigned long pfn, pgprot_t prot) /* * Map 'pfn' using protections 'prot' */ +#define __PAGE_KERNEL_WC (__PAGE_KERNEL | \ +cachemode2protval(_PAGE_CACHE_MODE_WC)) + void __iomem * iomap_atomic_prot_pfn(unsigned long pfn, pgprot_t prot) { @@ -82,12 +85,14 @@ iomap_atomic_prot_pfn(unsigned long pfn, pgprot_t prot) * MTRR is UC or WC. UC_MINUS gets the real intention, of the * user, which is "WC if the MTRR is WC, UC if you can't do that." */ - if (!pat_enabled && pgprot_val(prot) == pgprot_val(PAGE_KERNEL_WC)) - prot = PAGE_KERNEL_UC_MINUS; + if (!pat_enabled && pgprot_val(prot) == __PAGE_KERNEL_WC) + prot = __pgprot(__PAGE_KERNEL | + protval_pagemode(_PAGE_CACHE_MODE_UC_MINUS)); protval_pagemode() should be cachemode2protval(). Obviously, yes.
/* diff --git a/drivers/video/fbdev/vermilion/vermilion.c b/drivers/video/fbdev/vermilion/vermilion.c index 048a666..6bbc559 100644 --- a/drivers/video/fbdev/vermilion/vermilion.c +++ b/drivers/video/fbdev/vermilion/vermilion.c @@ -1004,13 +1004,15 @@ static int vmlfb_mmap(struct fb_info *info, struct vm_area_struct *vma) struct vml_info *vinfo = container_of(info, struct vml_info, info); unsigned long offset = vma->vm_pgoff << PAGE_SHIFT; int ret; + unsigned long prot; ret = vmlfb_vram_offset(vinfo, offset); if (ret) return -EINVAL; - pgprot_val(vma->vm_page_prot) |= _PAGE_PCD; - pgprot_val(vma->vm_page_prot) &= ~_PAGE_PWT; + prot = pgprot_val(vma->vm_page_prot) & ~_PAGE_CACHE_MASK; + pgprot_val(vma->vm_page_prot) = + prot | cachemode2protval(_PAGE_CACHE_MODE_UC); This cache mode should be _PAGE_CACHE_MODE_UC_MINUS as the original code only sets the PCD bit. I'll change it. Thanks, Juergen -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 1/3] Adding Skyworks SKY81452 MFD driver
On Tue, Aug 26, 2014 at 09:22:58AM +0100, Lee Jones wrote: > On Mon, 25 Aug 2014, Gyungoh Yoo wrote: > > On Thu, Aug 21, 2014 at 10:45:02AM +0100, Lee Jones wrote: > > > When you send patch-sets, you should send them connected to one > > > another AKA threaded. That way, when we're reviewing we can look at > > > the other patches in the set for reference. See the man page for `git > > > send-email` for details. > > > > > > > > > > > > > Signed-off-by: Gyungoh Yoo > > > > --- > > [...] > > > > > +static int sky81452_register_devices(struct device *dev, > > > > + const struct sky81452_platform_data *pdata) > > > > +{ > > > > + struct mfd_cell cells[] = { > > > > + { > > > > + .name = "sky81452-bl", > > > > + .platform_data = pdata->bl_pdata, > > > > + .pdata_size = sizeof(*pdata->bl_pdata), > > > > > > Have you tested this with DT? > > > > > > You're not passing the compatible string and not using > > > of_platform_populate() so I'm struggling to see how it would work > > > properly. > > > > sky81452-bl and regulator-sky81452 is parsing the information > > in regulator node of its parent node. So I thought these 2 drivers > > don't need compatible attribute. That is what it didn't have > > compatible string. > > Is is mandatory that all drivers should have compatible attribute? > > How do they obtain their DT nodes? The backlight driver which is one of the child driver is obtain its DT node like this np = of_get_child_by_name(dev->parent->of_node, "backlight"); > > [...] > > > > > + return mfd_add_devices(dev, -1, cells, ARRAY_SIZE(cells), > > > > + NULL, 0, NULL); > > > > > > This doesn't really need to be in a function of its own. Please put > > > it in .probe(). Also check for the return value and present the user > > > with an error message if it fails. > > > > I think this need to be, in case of !CONFIG_OF. > > Can you please explain more in details? > > Then how to you obtain the shared register map you created? regmap is stored in driver data in MFD. 
i2c_set_clientdata(client, regmap); The child drivers obtain the regmap from the parent. struct regmap *regmap = dev_get_drvdata(dev->parent); > > [...] > > -- > Lee Jones > Linaro STMicroelectronics Landing Team Lead > Linaro.org │ Open source software for ARM SoCs > Follow Linaro: Facebook | Twitter | Blog
Re: [PATCH RFC v7 net-next 00/28] BPF syscall
On Aug 26, 2014 7:29 PM, "Alexei Starovoitov" wrote: > > Hi Ingo, David, > > posting whole thing again as RFC to get feedback on syscall only. > If syscall bpf(int cmd, union bpf_attr *attr, unsigned int size) is ok, > I'll split them into small chunks as requested and will repost without RFC. IMO it's much easier to review a syscall if we just look at a specification of what it does. The code is, in some sense, secondary. --Andy
Re: [PATCH net v3 4/4] tg3: Fix tx_pending checks for tg3_tso_bug
> static inline bool tg3_maybe_stop_txq(struct tg3_napi *tnapi, > struct netdev_queue *txq, > @@ -7841,14 +7847,16 @@ static inline bool tg3_maybe_stop_txq(struct tg3_napi > *tnapi, > if (!netif_tx_queue_stopped(txq)) { > stopped = true; > netif_tx_stop_queue(txq); > - BUG_ON(wakeup_thresh >= tnapi->tx_pending); > + tnapi->wakeup_thresh = wakeup_thresh; > + BUG_ON(tnapi->wakeup_thresh >= tnapi->tx_pending); > } > /* netif_tx_stop_queue() must be done before checking tx index >* in tg3_tx_avail(), because in tg3_tx(), we update tx index > - * before checking for netif_tx_queue_stopped(). > + * before checking for netif_tx_queue_stopped(). The memory > + * barrier also synchronizes wakeup_thresh changes. >*/ > smp_mb(); > - if (tg3_tx_avail(tnapi) > wakeup_thresh) > + if (tg3_tx_avail(tnapi) > tnapi->wakeup_thresh) > netif_tx_wake_queue(txq); you can add a comment here... stopped is not set to false even if queue wakes up, to log the netdev_err "BUG! TX Ring.." message. > } > return stopped; > @@ -7861,10 +7869,10 @@ static int tg3_tso_bug(struct tg3 *tp, struct > tg3_napi *tnapi, > struct netdev_queue *txq, struct sk_buff *skb) > @@ -12318,9 +12354,7 @@ static int tg3_set_ringparam(struct net_device *dev, > struct ethtool_ringparam *e > if ((ering->rx_pending > tp->rx_std_ring_mask) || > (ering->rx_jumbo_pending > tp->rx_jmb_ring_mask) || > (ering->tx_pending > TG3_TX_RING_SIZE - 1) || > - (ering->tx_pending <= MAX_SKB_FRAGS + 1) || > - (tg3_flag(tp, TSO_BUG) && > - (ering->tx_pending <= (MAX_SKB_FRAGS * 3 > + (ering->tx_pending <= MAX_SKB_FRAGS + 1)) > return -EINVAL; > > if (netif_running(dev)) { > @@ -12340,6 +12374,7 @@ static int tg3_set_ringparam(struct net_device *dev, > struct ethtool_ringparam *e > if (tg3_flag(tp, JUMBO_RING_ENABLE)) > tp->rx_jumbo_pending = ering->rx_jumbo_pending; > > + dev->gso_max_segs = TG3_TX_SEG_PER_DESC(ering->tx_pending - 1); Assuming a LSO skb of 64k size takes the tg3_tso_bug() code path, if the available TX descriptors is <= 
135 assuming gso_segs is 45 for this skb based on the estimate 45 * 3 driver would stop this TX queue and set the tnapi->wakeup_thresh to 135 and return NETDEV_TX_BUSY. This skb will be queued to be resent when the queue wakes up. Meanwhile if the user changes the TX ring size tx_pending=135, dev->gso_max_segs is modified accordingly to 44, the LSO skb which was queued will now be GSO'ed (in net/dev.c) before calling tg3_start_xmit(). To note tg3_tx() cannot wake the queue as it is expecting to be woken up when available free TX descriptors is 136. So we end up with HW TX ring empty and not able to send any pkts. > for (i = 0; i < tp->irq_max; i++) > tp->napi[i].tx_pending = ering->tx_pending; > > @@ -17816,6 +17851,7 @@ static int tg3_init_one(struct pci_dev *pdev, > else -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: kernel: signal: NULL ptr deref when killing process
On 08/21/2014 01:17 PM, Oleg Nesterov wrote: >> Is there a race between kill() and exit() brought on by the kill path only >> > using the RCU read lock? This doesn't prevent ->real_cred from being >> > modified, but it looks like this should, in combination with >> > delayed_put_task_struct(), prevent it from being cleared. > Yes, rcu should protect us from both delayed_put_pid() and delayed_put_task(). > Everything looks correct... And there are a lot of other similar users of > find_vpid/find_task_by_vpid/pid_task/etc under rcu, I can't recall any bug > in this area. > > I am puzzled. Note also that ->signal == NULL. Will try to think more, > but so far I have no any idea. I've hit something similar earlier today, and it might be related: [ 973.452840] BUG: unable to handle kernel NULL pointer dereference at 02b0 [ 973.455347] IP: flush_sigqueue_mask (include/linux/signal.h:118 kernel/signal.c:715) [ 973.457526] PGD 4dfdc7067 PUD 5f77d9067 PMD 0 [ 973.459216] Oops: [#1] PREEMPT SMP DEBUG_PAGEALLOC [ 973.460086] Dumping ftrace buffer: [ 973.460086](ftrace buffer empty) [ 973.460086] Modules linked in: [ 973.460086] CPU: 4 PID: 13145 Comm: trinity-c767 Not tainted 3.17.0-rc2-next-20140826-sasha-00031-gc48c9ac-dirty #1079 [ 973.460086] task: 88060480 ti: 880586648000 task.ti: 880586648000 [ 973.460086] RIP: flush_sigqueue_mask (include/linux/signal.h:118 kernel/signal.c:715) [ 973.460086] RSP: 0018:88058664bec8 EFLAGS: 00010046 [ 973.460086] RAX: RBX: f730 RCX: 0001 [ 973.460086] RDX: RSI: 02a0 RDI: 88058664bed8 [ 973.460086] RBP: 88058664bf10 R08: 0001 R09: 0001 [ 973.460086] R10: 0002d201 R11: 0254 R12: [ 973.460086] R13: 88058664bf40 R14: 88060480 R15: 0010 [ 973.460086] FS: 7fe3a3045700() GS:880277c0() knlGS: [ 973.460086] CS: 0010 DS: ES: CR0: 8005003b [ 973.460086] CR2: 02b0 CR3: 0004e23d5000 CR4: 06a0 [ 973.460086] Stack: [ 973.460086] ac183690 01017fffb3247180 0001 [ 973.460086] 7fffb3247180 7fffb3247220 0011 [ 973.460086] 88058664bf78 ac183ef5 [ 973.460086] 
Call Trace: [ 973.460086] ? do_sigaction (kernel/signal.c:3124 (discriminator 17)) [ 973.460086] SyS_rt_sigaction (kernel/signal.c:3360 kernel/signal.c:3341) [ 973.460086] tracesys (arch/x86/kernel/entry_64.S:542) [ 973.460086] Code: b7 49 09 d5 4d 89 6e 10 48 83 c4 08 5b 41 5c 41 5d 41 5e 41 5f 5d c3 66 2e 0f 1f 84 00 00 00 00 00 66 66 66 66 90 48 8b 0f 31 c0 <48> 8b 56 10 48 85 ca 74 7b 55 48 f7 d1 48 89 e5 41 56 48 21 ca All code 0: b7 49 mov$0x49,%bh 2: 09 d5 or %edx,%ebp 4: 4d 89 6e 10 mov%r13,0x10(%r14) 8: 48 83 c4 08 add$0x8,%rsp c: 5b pop%rbx d: 41 5c pop%r12 f: 41 5d pop%r13 11: 41 5e pop%r14 13: 41 5f pop%r15 15: 5d pop%rbp 16: c3 retq 17: 66 2e 0f 1f 84 00 00nopw %cs:0x0(%rax,%rax,1) 1e: 00 00 00 21: 66 66 66 66 90 data32 data32 data32 xchg %ax,%ax 26: 48 8b 0fmov(%rdi),%rcx 29: 31 c0 xor%eax,%eax 2b:* 48 8b 56 10 mov0x10(%rsi),%rdx <-- trapping instruction 2f: 48 85 catest %rcx,%rdx 32: 74 7b je 0xaf 34: 55 push %rbp 35: 48 f7 d1not%rcx 38: 48 89 e5mov%rsp,%rbp 3b: 41 56 push %r14 3d: 48 21 caand%rcx,%rdx ... Code starting with the faulting instruction === 0: 48 8b 56 10 mov0x10(%rsi),%rdx 4: 48 85 catest %rcx,%rdx 7: 74 7b je 0x84 9: 55 push %rbp a: 48 f7 d1not%rcx d: 48 89 e5mov%rsp,%rbp 10: 41 56 push %r14 12: 48 21 caand%rcx,%rdx ... [ 973.460086] RIP flush_sigqueue_mask (include/linux/signal.h:118 kernel/signal.c:715) [ 973.460086] RSP [ 973.460086] CR2: 02b0 Thanks, Sasha -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] memory-hotplug: fix not enough check of valid_zones
(2014/08/27 10:55), Zhang Zhen wrote: On 2014/8/26 18:23, Yasuaki Ishimatsu wrote: (2014/08/26 18:57), Zhang Zhen wrote: As Yasuaki Ishimatsu described the check here is not enough if memory has hole as follows: PFN 0x00 0xd0 0xe0 0xf0 +-+-+-+ zone type | Normal| hole| Normal| +-+-+-+ In this case, the check can't guarantee that this is "the last block of memory". The check of ZONE_MOVABLE has the same problem. Change the interface name to valid_zones according to most pepole's suggestion. Sample output of the sysfs files: memory0/valid_zones: none memory1/valid_zones: DMA32 memory2/valid_zones: DMA32 memory3/valid_zones: DMA32 memory4/valid_zones: Normal memory5/valid_zones: Normal memory6/valid_zones: Normal Movable memory7/valid_zones: Movable Normal memory8/valid_zones: Movable The patch has two changes: - change sysfs interface name - change check of ZONE_MOVABLE So please separate them. Ok, i will separate them. Thanks! Signed-off-by: Zhang Zhen --- Documentation/ABI/testing/sysfs-devices-memory | 8 ++--- Documentation/memory-hotplug.txt | 4 +-- drivers/base/memory.c | 42 ++ 3 files changed, 15 insertions(+), 39 deletions(-) diff --git a/Documentation/ABI/testing/sysfs-devices-memory b/Documentation/ABI/testing/sysfs-devices-memory index 2b2a1d7..deef3b5 100644 --- a/Documentation/ABI/testing/sysfs-devices-memory +++ b/Documentation/ABI/testing/sysfs-devices-memory @@ -61,13 +61,13 @@ Users:hotplug memory remove tools http://www.ibm.com/developerworks/wikis/display/LinuxP/powerpc-utils -What: /sys/devices/system/memory/memoryX/zones_online_to +What: /sys/devices/system/memory/memoryX/valid_zones Date: July 2014 Contact:Zhang Zhen Description: -The file /sys/devices/system/memory/memoryX/zones_online_to -is read-only and is designed to show which zone this memory block can -be onlined to. +The file /sys/devices/system/memory/memoryX/valid_zonesis +read-only and is designed to show which zone this memory +block can be onlined to. 
What:/sys/devices/system/memoryX/nodeY Date:October 2009 diff --git a/Documentation/memory-hotplug.txt b/Documentation/memory-hotplug.txt index 5b34e33..947229c 100644 --- a/Documentation/memory-hotplug.txt +++ b/Documentation/memory-hotplug.txt @@ -155,7 +155,7 @@ Under each memory block, you can see 4 files: /sys/devices/system/memory/memoryXXX/phys_device /sys/devices/system/memory/memoryXXX/state /sys/devices/system/memory/memoryXXX/removable -/sys/devices/system/memory/memoryXXX/zones_online_to +/sys/devices/system/memory/memoryXXX/valid_zones 'phys_index' : read-only and contains memory block id, same as XXX. 'state' : read-write @@ -171,7 +171,7 @@ Under each memory block, you can see 4 files: block is removable and a value of 0 indicates that it is not removable. A memory block is removable only if every section in the block is removable. -'zones_online_to' : read-only: designed to show which zone this memory block +'valid_zones' : read-only: designed to show which zone this memory block can be onlined to. NOTE: diff --git a/drivers/base/memory.c b/drivers/base/memory.c index ccaf37c..efd456c 100644 --- a/drivers/base/memory.c +++ b/drivers/base/memory.c @@ -374,21 +374,7 @@ static ssize_t show_phys_device(struct device *dev, } #ifdef CONFIG_MEMORY_HOTREMOVE -static int __zones_online_to(unsigned long end_pfn, -struct page *first_page, unsigned long nr_pages) -{ -struct zone *zone_next; - -/* The mem block is the last block of memory. 
*/ -if (!pfn_valid(end_pfn + 1)) -return 1; -zone_next = page_zone(first_page + nr_pages); -if (zone_idx(zone_next) == ZONE_MOVABLE) -return 1; -return 0; -} - -static ssize_t show_zones_online_to(struct device *dev, +static ssize_t show_valid_zones(struct device *dev, struct device_attribute *attr, char *buf) { struct memory_block *mem = to_memory_block(dev); @@ -407,33 +393,23 @@ static ssize_t show_zones_online_to(struct device *dev, zone = page_zone(first_page); -#ifdef CONFIG_HIGHMEM -if (zone_idx(zone) == ZONE_HIGHMEM) { -if (__zones_online_to(end_pfn, first_page, nr_pages)) +if (zone_idx(zone) == ZONE_MOVABLE - 1) { +/*The mem block is the last memoryblock of this zone.*/ +if (end_pfn == zone_end_pfn(zone)) return sprintf(buf, "%s %s\n", zone->name, (zone + 1)->name); } -#else -if (zone_idx(zone) ==
Re: [PATCH] random: add and use memzero_explicit() for clearing data
On Tue, Aug 26, 2014 at 01:11:30AM +0200, Hannes Frederic Sowa wrote: > On Mo, 2014-08-25 at 22:01 +0200, Daniel Borkmann wrote: > > zatimend has reported that in his environment (3.16/gcc4.8.3/corei7) > > memset() calls which clear out sensitive data in extract_{buf,entropy, > > entropy_user}() in random driver are being optimized away by gcc. > > > > Add a helper memzero_explicit() (similarly as explicit_bzero() variants) > > that can be used in such cases where a variable with sensitive data is > > being cleared out in the end. Other use cases might also be in crypto > > code. [ I have put this into lib/string.c though, as it's always built-in > > and doesn't need any dependencies then. ] > > > > Fixes kernel bugzilla: 82041 > > > > Reported-by: zatim...@hotmail.co.uk > > Signed-off-by: Daniel Borkmann > > Cc: Hannes Frederic Sowa > > Cc: Alexey Dobriyan > > Acked-by: Hannes Frederic Sowa Applied to the random tree, thanks. - Ted
Re: [PATCH 00/16] rcu: Some minor fixes and cleanups
On Tue, Aug 26, 2014 at 09:10:10PM -0400, Pranith Kumar wrote: > On Wed, Jul 23, 2014 at 10:45 AM, Paul E. McKenney > wrote: > > On Wed, Jul 23, 2014 at 01:09:37AM -0400, Pranith Kumar wrote: > >> Hi Paul, > >> > >> This is a series of minor fixes and cleanup patches which I found while > >> studying > >> the code. All my previous pending (but not rejected ;) patches are > >> superseded by > >> this series, expect the rcutorture snprintf changes. I am still waiting > >> for you > >> to decide on that one :) > >> > >> These changes have been tested by the kvm rcutorture test setup. Some > >> tests give > >> me stall warnings, but otherwise have SUCCESS messages in the logs. > > > > For patches 1, 3, 5, 8, 12, and 13, once you get a Reviewed-by from one > > of the co-maintainers or designated reviewers, I will queue them. > > The other patches I have responded to. > > Hi Paul, just a reminder so that these don't get forgotten :) Hello, Pranith, haven't forgotten them, but also haven't seen any reviews. 
Thanx, Paul > >> Pranith Kumar (16): > >> rcu: Use rcu_num_nodes instead of NUM_RCU_NODES > >> rcu: Check return value for cpumask allocation > >> rcu: Fix comment for gp_state field values > >> rcu: Remove redundant check for an online CPU > >> rcu: Add noreturn attribute to boost kthread > >> rcu: Clear gp_flags only when actually starting new gp > >> rcu: Save and restore irq flags in rcu_gp_cleanup() > >> rcu: Clean up rcu_spawn_one_boost_kthread() > >> rcu: Remove redundant check for online cpu > >> rcu: Check for RCU_FLAG_GP_INIT bit in gp_flags for spurious wakeup > >> rcu: Check for spurious wakeup using return value > >> rcu: Rename rcu_spawn_gp_kthread() to rcu_spawn_kthreads() > >> rcu: Spawn nocb kthreads from rcu_prepare_kthreads() > >> rcu: Remove redundant checks for rcu_scheduler_fully_active > >> rcu: Check for a nocb cpu before trying to spawn nocb threads > >> rcu: kvm.sh: Fix error when you pass --cpus argument > >> > >> kernel/rcu/tree.c | 42 > >> ++- > >> kernel/rcu/tree.h | 4 +-- > >> kernel/rcu/tree_plugin.h | 40 > >> + > >> tools/testing/selftests/rcutorture/bin/kvm.sh | 4 +-- > >> 4 files changed, 47 insertions(+), 43 deletions(-) > >> > >> -- > >> 2.0.0.rc2 > >> > > > > > > -- > Pranith > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: mm: BUG in unmap_page_range
On 08/11/2014 11:28 PM, Sasha Levin wrote: > On 08/05/2014 09:04 PM, Sasha Levin wrote: >> > Thanks Hugh, Mel. I've added both patches to my local tree and will update >> > tomorrow >> > with the weather. >> > >> > Also: >> > >> > On 08/05/2014 08:42 PM, Hugh Dickins wrote: >>> >> One thing I did wonder, though: at first I was reassured by the >>> >> VM_BUG_ON(!pte_present(pte)) you add to pte_mknuma(); but then thought >>> >> it would be better as VM_BUG_ON(!(val & _PAGE_PRESENT)), being stronger >>> >> - asserting that indeed we do not put NUMA hints on PROT_NONE areas. >>> >> (But I have not tested, perhaps such a VM_BUG_ON would actually fire.) >> > >> > I've added VM_BUG_ON(!(val & _PAGE_PRESENT)) in just as a curiosity, I'll >> > update how that one looks as well. > Sorry for the rather long delay. > > The patch looks fine, the issue didn't reproduce. > > The added VM_BUG_ON didn't trigger either, so maybe we should consider adding > it in. It took a while, but I've managed to hit that VM_BUG_ON: [ 707.975456] kernel BUG at include/asm-generic/pgtable.h:724! 
[ 707.977147] invalid opcode: [#1] PREEMPT SMP DEBUG_PAGEALLOC [ 707.978974] Dumping ftrace buffer: [ 707.980110](ftrace buffer empty) [ 707.981221] Modules linked in: [ 707.982312] CPU: 18 PID: 9488 Comm: trinity-c538 Not tainted 3.17.0-rc2-next-20140826-sasha-00031-gc48c9ac-dirty #1079 [ 707.982801] task: 880165e28000 ti: 880165e3 task.ti: 880165e3 [ 707.982801] RIP: 0010:[] [] change_protection_range+0x94a/0x970 [ 707.982801] RSP: 0018:880165e33d98 EFLAGS: 00010246 [ 707.982801] RAX: 9d340902 RBX: 880511204a08 RCX: 0100 [ 707.982801] RDX: 9d340902 RSI: 41741000 RDI: 9d340902 [ 707.982801] RBP: 880165e33e88 R08: 880708a23c00 R09: 00b52000 [ 707.982801] R10: 1e01 R11: 0008 R12: 41751000 [ 707.982801] R13: 00f7 R14: 9d340902 R15: 41741000 [ 707.982801] FS: 7f358a9aa700() GS:88071c60() knlGS: [ 707.982801] CS: 0010 DS: ES: CR0: 8005003b [ 707.982801] CR2: 7f3586b69490 CR3: 000165d88000 CR4: 06a0 [ 707.982801] Stack: [ 707.982801] 8804db88d058 88070fb17cf0 [ 707.982801] 880165d88000 8801686a5000 4163e000 [ 707.982801] 8801686a5000 0001 0025 41750fff [ 707.982801] Call Trace: [ 707.982801] [] change_protection+0x14/0x30 [ 707.982801] [] change_prot_numa+0x1b/0x40 [ 707.982801] [] task_numa_work+0x1f6/0x330 [ 707.982801] [] task_work_run+0xc4/0xf0 [ 707.982801] [] do_notify_resume+0x97/0xb0 [ 707.982801] [] int_signal+0x12/0x17 [ 707.982801] Code: e8 2c 84 21 03 e9 72 ff ff ff 0f 1f 80 00 00 00 00 0f 0b 48 8b 7d a8 4c 89 f2 4c 89 fe e8 9f 7b 03 00 e9 47 f9 ff ff 0f 0b 0f 0b <0f> 0b 0f 0b 48 8b b5 70 ff ff ff 4c 89 ea 48 89 c7 e8 10 d5 01 [ 707.982801] RIP [] change_protection_range+0x94a/0x970 [ 707.982801] RSP Thanks, Sasha -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH v4] zram: add num_{discard_req, discarded} for discard stat
Since we have supported handling discard request in this commit f4659d8e620d08bd1a84a8aec5d2f5294a242764 (zram: support REQ_DISCARD), zram got one more chance to free unused memory whenever received discard request. But without stating for discard request, there is no method for user to know whether discard request has been handled by zram or how many blocks were discarded by zram when user wants to know the effect of discard. In this patch, we add num_discard_req to stat discard request and add num_discarded to stat real discarded blocks, and export them to sysfs for users. * From v1 * Update zram document to show num_discards in statistics list. * From v2 * Update description of this patch with clear goal. * From v3 * Stat discard request and discarded pages separately as "previous stat indicates lots of free page discarded without real freeing, so the stat makes our user's misunderstanding" pointed out by Minchan Kim. Signed-off-by: Chao Yu --- Documentation/ABI/testing/sysfs-block-zram | 17 + Documentation/blockdev/zram.txt| 2 ++ drivers/block/zram/zram_drv.c | 17 ++--- drivers/block/zram/zram_drv.h | 2 ++ 4 files changed, 35 insertions(+), 3 deletions(-) diff --git a/Documentation/ABI/testing/sysfs-block-zram b/Documentation/ABI/testing/sysfs-block-zram index 70ec992..805fb11 100644 --- a/Documentation/ABI/testing/sysfs-block-zram +++ b/Documentation/ABI/testing/sysfs-block-zram @@ -57,6 +57,23 @@ Description: The failed_writes file is read-only and specifies the number of failed writes happened on this device. +What: /sys/block/zram/num_discard_req +Date: August 2014 +Contact: Chao Yu +Description: + The num_discard_req file is read-only and specifies the number + of requests received by this device. These requests are sent by + swap layer or filesystem when they want to free blocks which are + no longer used. 
+ +What: /sys/block/zram/num_discarded +Date: August 2014 +Contact: Chao Yu +Description: + The num_discarded file is read-only and specifies the number of + real discarded blocks (pages which are really freed) in this + device after discard request is sent to this device. + What: /sys/block/zram/max_comp_streams Date: February 2014 Contact: Sergey Senozhatsky diff --git a/Documentation/blockdev/zram.txt b/Documentation/blockdev/zram.txt index 0595c3f..f9c1e41 100644 --- a/Documentation/blockdev/zram.txt +++ b/Documentation/blockdev/zram.txt @@ -89,6 +89,8 @@ size of the disk when not in use so a huge zram is wasteful. num_writes failed_reads failed_writes + num_discard_req + num_discarded invalid_io notify_free zero_pages diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c index d00831c..1d012e8 100644 --- a/drivers/block/zram/zram_drv.c +++ b/drivers/block/zram/zram_drv.c @@ -322,7 +322,7 @@ static void handle_zero_page(struct bio_vec *bvec) * caller should hold this table index entry's bit_spinlock to * indicate this index entry is accessing. 
*/ -static void zram_free_page(struct zram *zram, size_t index) +static bool zram_free_page(struct zram *zram, size_t index) { struct zram_meta *meta = zram->meta; unsigned long handle = meta->table[index].handle; @@ -336,7 +336,7 @@ static void zram_free_page(struct zram *zram, size_t index) zram_clear_flag(meta, index, ZRAM_ZERO); atomic64_dec(&zram->stats.zero_pages); } - return; + return false; } zs_free(meta->mem_pool, handle); @@ -347,6 +347,7 @@ static void zram_free_page(struct zram *zram, size_t index) meta->table[index].handle = 0; zram_set_obj_size(meta, index, 0); + return true; } static int zram_decompress_page(struct zram *zram, char *mem, u32 index) @@ -603,12 +604,18 @@ static void zram_bio_discard(struct zram *zram, u32 index, } while (n >= PAGE_SIZE) { + bool discarded; + bit_spin_lock(ZRAM_ACCESS, &meta->table[index].value); - zram_free_page(zram, index); + discarded = zram_free_page(zram, index); bit_spin_unlock(ZRAM_ACCESS, &meta->table[index].value); + if (discarded) + atomic64_inc(&zram->stats.num_discarded); index++; n -= PAGE_SIZE; } + + atomic64_inc(&zram->stats.num_discard_req); } static void zram_reset_device(struct zram *zram, bool reset_capacity) @@ -866,6 +873,8 @@ ZRAM_ATTR_RO(num_reads); ZRAM_ATTR_RO(num_writes); ZRAM_ATTR_RO(failed_reads); ZRAM_ATTR_RO(failed_writes);
Re: [PATCH v3 3/4] thermal: add more description for thermal-zones
On 08/26/2014 08:12 PM, Eduardo Valentin wrote: > On Tue, Aug 26, 2014 at 10:17:29AM +0800, Wei Ni wrote: >> On 08/25/2014 07:07 PM, Eduardo Valentin wrote: >>> Hello Wei Ni, >>> >>> On Mon, Aug 25, 2014 at 02:29:47PM +0800, Wei Ni wrote: Add more description for the "polling-delay" property. Set "trips" and "cooling maps" as optional property, because if missing these two sub-nodes, the thermal zone device still work properly. Signed-off-by: Wei Ni --- Documentation/devicetree/bindings/thermal/thermal.txt | 10 ++ 1 file changed, 6 insertions(+), 4 deletions(-) diff --git a/Documentation/devicetree/bindings/thermal/thermal.txt b/Documentation/devicetree/bindings/thermal/thermal.txt index f5db6b7..e3d3ed9 100644 --- a/Documentation/devicetree/bindings/thermal/thermal.txt +++ b/Documentation/devicetree/bindings/thermal/thermal.txt @@ -136,8 +136,8 @@ containing trip nodes and one sub-node containing all the zone cooling maps. Required properties: - polling-delay: The maximum number of milliseconds to wait between polls - Type: unsigned when checking this thermal zone. - Size: one cell + Type: unsigned when checking this thermal zone. If this value is 0, the + Size: one cell driver will not run polling queue, but just cancel it. >>> >>> The description above is specific to Linux kernel implementation >>> nomenclature. DT description needs to be OS agnostic. >>> - polling-delay-passive: The maximum number of milliseconds to wait Type: unsigned between polls when performing passive cooling. @@ -148,14 +148,16 @@ Required properties: phandles + sensor specifier +Optional property: - trips: A sub-node which is a container of only trip point nodes Type: sub-node required to describe the thermal zone. - cooling-maps: A sub-node which is a container of only cooling device Type: sub-node map nodes, used to describe the relation between trips - and cooling devices. + and cooling devices. 
If the "trips" property is missing, + This sub-node will not be parsed, because no trips can + be bound to cooling devices. >>> >>> Do you mean if the thermal zone misses the "trips" property? Actually, >>> the binding describes both, cooling-maps and trips, as required >>> properties. Thus, both need to be in place to consider the thermal zone >>> a properly described zone. >> >> I moved the "trips" and "cooling-maps" to optional properties, because if >> these two properties are missing, the thermal zone devices can still be >> registered and the driver works properly; it has the basic function of >> reading temperature from thermal sysfs, although it doesn't have trips >> and isn't bound to cooling devices. > > > If a thermal zone is used only for monitoring, then I believe it lost > its purpose. Maybe a different framework shall be used, such as hwmon, > for instance? Yes, if we only use it for monitoring, we can use hwmon. But we have more functions based on these two thermal zone devices. We have a skin-temperature driver, which uses nct1008's remote and local temperatures to estimate the skin temperature. As you know, the thermal framework is more powerful; the remote/local sensors can be registered as thermal zones, then the skin-temp driver can use thermal_zone_get_temp() to read their temperatures and then estimate the skin's temp. We also will set trips and cooling devices for this skin-temp. Wei. > > The purpose of a thermal zone is to describe the thermal behavior of the > hardware. As it is mentioned in the thermal.txt file. > > >> >> Thanks. >> Wei. >> >>> -Optional property: - coefficients: An array of integers (one signed cell) containing Type: array coefficients to compose a linear relation between Elem size: one cell the sensors listed in the thermal-sensors property.
-- 1.8.1.5 >> > -- > To unsubscribe from this list: send the line "unsubscribe linux-tegra" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v5 3/4] zram: zram memory size limitation
Hey Joonsoo, On Wed, Aug 27, 2014 at 10:26:11AM +0900, Joonsoo Kim wrote: > Hello, Minchan and David. > > On Tue, Aug 26, 2014 at 08:22:29AM -0400, David Horner wrote: > > On Tue, Aug 26, 2014 at 3:55 AM, Minchan Kim wrote: > > > Hey Joonsoo, > > > > > > On Tue, Aug 26, 2014 at 04:37:30PM +0900, Joonsoo Kim wrote: > > >> On Mon, Aug 25, 2014 at 09:05:55AM +0900, Minchan Kim wrote: > > >> > @@ -513,6 +540,14 @@ static int zram_bvec_write(struct zram *zram, > > >> > struct bio_vec *bvec, u32 index, > > >> > ret = -ENOMEM; > > >> > goto out; > > >> > } > > >> > + > > >> > + if (zram->limit_pages && > > >> > + zs_get_total_pages(meta->mem_pool) > zram->limit_pages) { > > >> > + zs_free(meta->mem_pool, handle); > > >> > + ret = -ENOMEM; > > >> > + goto out; > > >> > + } > > >> > + > > >> > cmem = zs_map_object(meta->mem_pool, handle, ZS_MM_WO); > > >> > > >> Hello, > > >> > > >> I don't follow up previous discussion, so I could be wrong. > > >> Why this enforcement should be here? > > >> > > >> I think that this has two problems. > > >> 1) alloc/free happens unnecessarilly if we have used memory over the > > >> limitation. > > > > > > True but firstly, I implemented the logic in zsmalloc, not zram but > > > as I described in cover-letter, it's not a requirement of zsmalloc > > > but zram so it should be in there. If every user want it in future, > > > then we could move the function into zsmalloc. That's what we > > > concluded in previous discussion. > > Hmm... > Problem is that we can't avoid these unnecessary overhead in this > implementation. If we can implement this feature in zram efficiently, > it's okay. But, I think that current form isn't. If we can add it in zsmalloc, it would be more clean and efficient for zram but as I said, at the moment, I didn't want to put zram's requirement into zsmalloc because to me, it's weird to enforce max limit to allocator. It's client's role, I think. 
If the current implementation were expensive and rather hard to follow, that would be one reason to move the feature into zsmalloc, but I don't think it causes critical trouble in the zram usecase. See below. But I am still open and will wait for others' opinions. If other guys think zsmalloc is the better place, I am willing to move it into zsmalloc. > > > > > > > Another idea is we could call zs_get_total_pages right before zs_malloc > > > but the problem is we cannot know how many pages are allocated > > > by zsmalloc in advance. > > > IOW, zram should be blind to zsmalloc's internals. > > > > > > > We did however suggest that we could check beforehand to see if > > max was already exceeded as an optimization. > > (possibly with a guess on usage but at least using the minimum of 1 page) > > In the contested case, the max may already be exceeded transiently and > > therefore we know this one _could_ fail (it could also pass, but odds > > aren't good). > > As Minchan mentions this was discussed before - but not in great detail. > > Testing should be done to determine possible benefit. And as he also > > mentions, the better place for it may be in zsmalloc, but that > > requires an ABI change. > > Why do we hesitate to change the zsmalloc API? It is an in-kernel API and there > are just two users now, zswap and zram. We can change it easily. > I think that we just need the following simple API change in zsmalloc.c. > > zs_zpool_create(gfp_t gfp, struct zpool_ops *zpool_op) > => > zs_zpool_create(unsigned long limit, gfp_t gfp, struct zpool_ops > *zpool_op) > > It's a pool allocator, so there is no obstacle to limiting maximum > memory usage in zsmalloc. It's a natural idea to limit memory usage > for a pool allocator. > > > Certainly a detailed suggestion could happen on this thread and I'm > > also interested > > in your thoughts, but this patchset should be able to go in as is. > > Memory exhaustion avoidance probably trumps the possible thrashing at > > threshold.
> > > > > About the alloc/free cost when it is over the limit, > > > I don't think it's important to consider. > > > Do you have any scenario in your mind to consider alloc/free cost > > > when the limit is over? > > > > > >> 2) Even if this request doesn't do a new allocation, it could fail > > >> due to another's allocation. There is a time gap between allocation and > > >> free, so a legitimate user who wants to use preallocated zsmalloc memory > > >> could also see this condition as true and will then fail. > > > > > > Yeb, we already discussed that. :) > > > Such a false positive shouldn't be a severe problem if we can keep a > > > promise that the zram user cannot exceed mem_limit. > > > > > If we can keep such a promise, why do we need to limit memory usage? > I guess that this limit feature is useful for users who can't keep such > a promise. > So, we should assume that this false positive happens frequently. The goal is to limit memory usage within some threshold, so a false positive shouldn't be
[PATCH RFC v7 net-next 02/28] net: filter: split filter.h and expose eBPF to user space
eBPF can be used from user space. uapi/linux/bpf.h: eBPF instruction set definition linux/filter.h: the rest This patch only moves macro definitions, but practically it freezes existing eBPF instruction set, though new instructions can still be added in the future. These eBPF definitions cannot go into uapi/linux/filter.h, since the names may conflict with existing applications. Signed-off-by: Alexei Starovoitov --- include/linux/filter.h| 312 +-- include/uapi/linux/Kbuild |1 + include/uapi/linux/bpf.h | 321 + 3 files changed, 323 insertions(+), 311 deletions(-) create mode 100644 include/uapi/linux/bpf.h diff --git a/include/linux/filter.h b/include/linux/filter.h index f3262b598262..f04793474d16 100644 --- a/include/linux/filter.h +++ b/include/linux/filter.h @@ -9,322 +9,12 @@ #include #include #include - -/* Internally used and optimized filter representation with extended - * instruction set based on top of classic BPF. - */ - -/* instruction classes */ -#define BPF_ALU64 0x07/* alu mode in double word width */ - -/* ld/ldx fields */ -#define BPF_DW 0x18/* double word */ -#define BPF_XADD 0xc0/* exclusive add */ - -/* alu/jmp fields */ -#define BPF_MOV0xb0/* mov reg to reg */ -#define BPF_ARSH 0xc0/* sign extending arithmetic shift right */ - -/* change endianness of a register */ -#define BPF_END0xd0/* flags for endianness conversion: */ -#define BPF_TO_LE 0x00/* convert to little-endian */ -#define BPF_TO_BE 0x08/* convert to big-endian */ -#define BPF_FROM_LEBPF_TO_LE -#define BPF_FROM_BEBPF_TO_BE - -#define BPF_JNE0x50/* jump != */ -#define BPF_JSGT 0x60/* SGT is signed '>', GT in x86 */ -#define BPF_JSGE 0x70/* SGE is signed '>=', GE in x86 */ -#define BPF_CALL 0x80/* function call */ -#define BPF_EXIT 0x90/* function return */ - -/* Register numbers */ -enum { - BPF_REG_0 = 0, - BPF_REG_1, - BPF_REG_2, - BPF_REG_3, - BPF_REG_4, - BPF_REG_5, - BPF_REG_6, - BPF_REG_7, - BPF_REG_8, - BPF_REG_9, - BPF_REG_10, - __MAX_BPF_REG, -}; - -/* BPF has 10 general 
purpose 64-bit registers and stack frame. */ -#define MAX_BPF_REG__MAX_BPF_REG - -/* ArgX, context and stack frame pointer register positions. Note, - * Arg1, Arg2, Arg3, etc are used as argument mappings of function - * calls in BPF_CALL instruction. - */ -#define BPF_REG_ARG1 BPF_REG_1 -#define BPF_REG_ARG2 BPF_REG_2 -#define BPF_REG_ARG3 BPF_REG_3 -#define BPF_REG_ARG4 BPF_REG_4 -#define BPF_REG_ARG5 BPF_REG_5 -#define BPF_REG_CTXBPF_REG_6 -#define BPF_REG_FP BPF_REG_10 - -/* Additional register mappings for converted user programs. */ -#define BPF_REG_A BPF_REG_0 -#define BPF_REG_X BPF_REG_7 -#define BPF_REG_TMPBPF_REG_8 - -/* BPF program can access up to 512 bytes of stack space. */ -#define MAX_BPF_STACK 512 - -/* Helper macros for filter block array initializers. */ - -/* ALU ops on registers, bpf_add|sub|...: dst_reg += src_reg */ - -#define BPF_ALU64_REG(OP, DST, SRC)\ - ((struct bpf_insn) {\ - .code = BPF_ALU64 | BPF_OP(OP) | BPF_X,\ - .dst_reg = DST, \ - .src_reg = SRC, \ - .off = 0, \ - .imm = 0 }) - -#define BPF_ALU32_REG(OP, DST, SRC)\ - ((struct bpf_insn) {\ - .code = BPF_ALU | BPF_OP(OP) | BPF_X, \ - .dst_reg = DST, \ - .src_reg = SRC, \ - .off = 0, \ - .imm = 0 }) - -/* ALU ops on immediates, bpf_add|sub|...: dst_reg += imm32 */ - -#define BPF_ALU64_IMM(OP, DST, IMM)\ - ((struct bpf_insn) {\ - .code = BPF_ALU64 | BPF_OP(OP) | BPF_K,\ - .dst_reg = DST, \ - .src_reg = 0, \ - .off = 0, \ - .imm = IMM }) - -#define BPF_ALU32_IMM(OP, DST, IMM)\ - ((struct bpf_insn) {\ - .code = BPF_ALU | BPF_OP(OP) | BPF_K, \ - .dst_reg = DST, \ - .src_reg = 0, \ - .off = 0, \ -
[PATCH V2 2/3] perf tools: parse the pmu event prefix and suffix
From: Kan Liang There are two types of event formats for PMU events. E.g. el-abort OR cpu/el-abort/. However, the lexer mistakenly recognizes the simple style format as two events. The parse_events_pmu_check function uses bsearch to search for the name in the known pmu event list. It can tell the lexer whether the name is a PE_NAME, a PMU event name prefix, or a PMU event name suffix. All this information will be used to accurately parse kernel PMU events. The pmu events list will be read from sysfs at runtime. Signed-off-by: Kan Liang --- V2: Read kernel PMU events from sysfs at runtime tools/perf/util/parse-events.c | 103 + tools/perf/util/parse-events.h | 15 ++ tools/perf/util/pmu.c | 10 tools/perf/util/pmu.h | 10 4 files changed, 128 insertions(+), 10 deletions(-) diff --git a/tools/perf/util/parse-events.c b/tools/perf/util/parse-events.c index 7a0aa75..5e69e65 100644 --- a/tools/perf/util/parse-events.c +++ b/tools/perf/util/parse-events.c @@ -29,6 +29,9 @@ extern int parse_events_debug; #endif int parse_events_parse(void *data, void *scanner); +static struct kernel_pmu_event_symbol *kernel_pmu_events_list; +static size_t kernel_pmu_events_list_num; + static struct event_symbol event_symbols_hw[PERF_COUNT_HW_MAX] = { [PERF_COUNT_HW_CPU_CYCLES] = { .symbol = "cpu-cycles", @@ -852,6 +855,103 @@ int parse_events_name(struct list_head *list, char *name) return 0; } +static int +comp_pmu(const void *p1, const void *p2) +{ + struct kernel_pmu_event_symbol *pmu1 = + (struct kernel_pmu_event_symbol *) p1; + struct kernel_pmu_event_symbol *pmu2 = + (struct kernel_pmu_event_symbol *) p2; + + return strcmp(pmu1->symbol, pmu2->symbol); +} + +enum kernel_pmu_event_type +parse_events_pmu_check(const char *name) +{ + struct kernel_pmu_event_symbol p, *r; + + /* +* name "cpu" could be prefix of cpu-cycles or cpu// events.
+* cpu-cycles has been handled by hardcode. +* So it must be cpu// events, not kernel pmu event. +*/ + if (!kernel_pmu_events_list_num || !strcmp(name, "cpu")) + return NONE_KERNEL_PMU_EVENT; + + strcpy(p.symbol, name); + r = bsearch(&p, kernel_pmu_events_list, + kernel_pmu_events_list_num, + sizeof(struct kernel_pmu_event_symbol), comp_pmu); + if (r == NULL) + return NONE_KERNEL_PMU_EVENT; + return r->type; +} + +/* + * Read the pmu events list from sysfs + * Save it into kernel_pmu_events_list + */ +static void scan_kernel_pmu_events_list(void) +{ + + struct perf_pmu *pmu = NULL; + struct perf_pmu_alias *alias; + int len = 0; + + while ((pmu = perf_pmu__scan(pmu)) != NULL) + list_for_each_entry(alias, &pmu->aliases, list) { + if (!strcmp(pmu->name, "cpu")) { + if (strchr(alias->name, '-')) + len++; + len++; + } + } + if (len == 0) + return; + kernel_pmu_events_list = + malloc(sizeof(struct kernel_pmu_event_symbol) * len); + kernel_pmu_events_list_num = len; + + pmu = NULL; + len = 0; + while ((pmu = perf_pmu__scan(pmu)) != NULL) + list_for_each_entry(alias, &pmu->aliases, list) { + if (!strcmp(pmu->name, "cpu")) { + struct kernel_pmu_event_symbol *p = + kernel_pmu_events_list + len; + char *tmp = strchr(alias->name, '-'); + + if (tmp != NULL) { + strncpy(p->symbol, alias->name, + tmp - alias->name); + p->type = KERNEL_PMU_EVENT_PREFIX; + tmp++; + p++; + strcpy(p->symbol, tmp); + p->type = KERNEL_PMU_EVENT_SUFFIX; + len += 2; + } else { + strcpy(p->symbol, alias->name); + p->type = KERNEL_PMU_EVENT; + len++; + } + } + } + qsort(kernel_pmu_events_list, len, + sizeof(struct kernel_pmu_event_symbol), comp_pmu); + +} + +static void release_kernel_pmu_events_list(void) +{ + if (kernel_pmu_events_list) { +
[PATCH V2 1/3] Revert "perf tools: Default to cpu// for events v5"
From: Kan Liang This reverts commit 50e200f07948 ("perf tools: Default to cpu// for events v5") The fixup cannot handle the case where the new style format (which is without //) is mixed with other different formats. For example, group events with new style format: {mem-stores,mem-loads} some hardware event + new style event: cycles,mem-loads Cache event + new style event: LLC-loads,mem-loads Raw event + new style event: cpu/event=0xc8,umask=0x08/,mem-loads old style event and new style mixture: mem-stores,cpu/mem-loads/ Signed-off-by: Kan Liang --- tools/perf/util/include/linux/string.h | 1 - tools/perf/util/parse-events.c | 30 +- tools/perf/util/string.c | 24 3 files changed, 1 insertion(+), 54 deletions(-) diff --git a/tools/perf/util/include/linux/string.h b/tools/perf/util/include/linux/string.h index 97a8007..6f19c54 100644 --- a/tools/perf/util/include/linux/string.h +++ b/tools/perf/util/include/linux/string.h @@ -1,4 +1,3 @@ #include <string.h> void *memdup(const void *src, size_t len); -int str_append(char **s, int *len, const char *a); diff --git a/tools/perf/util/parse-events.c b/tools/perf/util/parse-events.c index 1e15df1..7a0aa75 100644 --- a/tools/perf/util/parse-events.c +++ b/tools/perf/util/parse-events.c @@ -6,7 +6,7 @@ #include "parse-options.h" #include "parse-events.h" #include "exec_cmd.h" -#include "linux/string.h" +#include "string.h" #include "symbol.h" #include "cache.h" #include "header.h" @@ -852,32 +852,6 @@ int parse_events_name(struct list_head *list, char *name) return 0; } -static int parse_events__scanner(const char *str, void *data, int start_token); - -static int parse_events_fixup(int ret, const char *str, void *data, - int start_token) -{ - char *o = strdup(str); - char *s = NULL; - char *t = o; - char *p; - int len = 0; - - if (!o) - return ret; - while ((p = strsep(&t, ",")) != NULL) { - if (s) - str_append(&s, &len, ","); - str_append(&s, &len, "cpu/"); - str_append(&s, &len, p); - str_append(&s, &len, "/"); - } - free(o); - if (!s) - return -ENOMEM; - return
parse_events__scanner(s, data, start_token); -} - static int parse_events__scanner(const char *str, void *data, int start_token) { YY_BUFFER_STATE buffer; @@ -898,8 +872,6 @@ static int parse_events__scanner(const char *str, void *data, int start_token) parse_events__flush_buffer(buffer, scanner); parse_events__delete_buffer(buffer, scanner); parse_events_lex_destroy(scanner); - if (ret && !strchr(str, '/')) - ret = parse_events_fixup(ret, str, data, start_token); return ret; } diff --git a/tools/perf/util/string.c b/tools/perf/util/string.c index 2553e5b..4b0ff22 100644 --- a/tools/perf/util/string.c +++ b/tools/perf/util/string.c @@ -387,27 +387,3 @@ void *memdup(const void *src, size_t len) return p; } - -/** - * str_append - reallocate string and append another - * @s: pointer to string pointer - * @len: pointer to len (initialized) - * @a: string to append. - */ -int str_append(char **s, int *len, const char *a) -{ - int olen = *s ? strlen(*s) : 0; - int nlen = olen + strlen(a) + 1; - if (*len < nlen) { - *len = *len * 2; - if (*len < nlen) - *len = nlen; - *s = realloc(*s, *len); - if (!*s) - return -ENOMEM; - if (olen == 0) - **s = 0; - } - strcat(*s, a); - return 0; -} -- 1.8.3.2 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH V2 3/3] perf tools: Add support to new style format of kernel PMU event
From: Kan Liang Add new rules for kernel PMU event. event_pmu: PE_KERNEL_PMU_EVENT | PE_PMU_EVENT_PRE '-' PE_PMU_EVENT_SUF PE_KERNEL_PMU_EVENT token is for cycles-ct/cycles-t/mem-loads/mem-stores. The prefix cycles is mixed up with cpu-cycles. loads and stores are mixed up with cache event So they have to be hardcode in lex. PE_PMU_EVENT_PRE and PE_PMU_EVENT_SUF tokens are for other PMU events. The lex looks generic identifier up in the table and return the matched token. If there is no match, generic PE_NAME token will be return. Using the rules, kernel PMU event could use new style format without // so you can use perf record -e mem-loads ... instead of perf record -e cpu/mem-loads/ Signed-off-by: Kan Liang --- tools/perf/util/parse-events.l | 30 +- tools/perf/util/parse-events.y | 42 ++ 2 files changed, 71 insertions(+), 1 deletion(-) diff --git a/tools/perf/util/parse-events.l b/tools/perf/util/parse-events.l index 3432995..4dd7f04 100644 --- a/tools/perf/util/parse-events.l +++ b/tools/perf/util/parse-events.l @@ -51,6 +51,24 @@ static int str(yyscan_t scanner, int token) return token; } +static int pmu_str_check(yyscan_t scanner) +{ + YYSTYPE *yylval = parse_events_get_lval(scanner); + char *text = parse_events_get_text(scanner); + + yylval->str = strdup(text); + switch (parse_events_pmu_check(text)) { + case KERNEL_PMU_EVENT_PREFIX: + return PE_PMU_EVENT_PRE; + case KERNEL_PMU_EVENT_SUFFIX: + return PE_PMU_EVENT_SUF; + case KERNEL_PMU_EVENT: + return PE_KERNEL_PMU_EVENT; + default: + return PE_NAME; + } +} + static int sym(yyscan_t scanner, int type, int config) { YYSTYPE *yylval = parse_events_get_lval(scanner); @@ -178,6 +196,16 @@ alignment-faults { return sym(yyscanner, PERF_TYPE_SOFTWARE, PERF_COUNT_SW_AL emulation-faults { return sym(yyscanner, PERF_TYPE_SOFTWARE, PERF_COUNT_SW_EMULATION_FAULTS); } dummy { return sym(yyscanner, PERF_TYPE_SOFTWARE, PERF_COUNT_SW_DUMMY); } + /* +* We have to handle the kernel PMU event 
cycles-ct/cycles-t/mem-loads/mem-stores separately. +* Because the prefix cycles is mixed up with cpu-cycles. +* loads and stores are mixed up with cache event +*/ +cycles-ct { return str(yyscanner, PE_KERNEL_PMU_EVENT); } +cycles-t { return str(yyscanner, PE_KERNEL_PMU_EVENT); } +mem-loads { return str(yyscanner, PE_KERNEL_PMU_EVENT); } +mem-stores { return str(yyscanner, PE_KERNEL_PMU_EVENT); } + L1-dcache|l1-d|l1d|L1-data | L1-icache|l1-i|l1i|L1-instruction | LLC|L2 | @@ -199,7 +227,7 @@ r{num_raw_hex} { return raw(yyscanner); } {num_hex} { return value(yyscanner, 16); } {modifier_event} { return str(yyscanner, PE_MODIFIER_EVENT); } -{name} { return str(yyscanner, PE_NAME); } +{name} { return pmu_str_check(yyscanner); } "/"{ BEGIN(config); return '/'; } - { return '-'; } , { BEGIN(event); return ','; } diff --git a/tools/perf/util/parse-events.y b/tools/perf/util/parse-events.y index 0bc87ba..77e01e5 100644 --- a/tools/perf/util/parse-events.y +++ b/tools/perf/util/parse-events.y @@ -47,6 +47,7 @@ static inc_group_count(struct list_head *list, %token PE_NAME_CACHE_TYPE PE_NAME_CACHE_OP_RESULT %token PE_PREFIX_MEM PE_PREFIX_RAW PE_PREFIX_GROUP %token PE_ERROR +%token PE_PMU_EVENT_PRE PE_PMU_EVENT_SUF PE_KERNEL_PMU_EVENT %type PE_VALUE %type PE_VALUE_SYM_HW %type PE_VALUE_SYM_SW @@ -58,6 +59,7 @@ static inc_group_count(struct list_head *list, %type PE_MODIFIER_EVENT %type PE_MODIFIER_BP %type PE_EVENT_NAME +%type PE_PMU_EVENT_PRE PE_PMU_EVENT_SUF PE_KERNEL_PMU_EVENT %type value_sym %type event_config %type event_term @@ -210,6 +212,46 @@ PE_NAME '/' event_config '/' parse_events__free_terms($3); $$ = list; } +| +PE_KERNEL_PMU_EVENT +{ + struct parse_events_evlist *data = _data; + struct list_head *head = malloc(sizeof(*head)); + struct parse_events_term *term; + struct list_head *list; + + ABORT_ON(parse_events_term__num(, PARSE_EVENTS__TERM_TYPE_USER, + $1, 1)); + ABORT_ON(!head); + INIT_LIST_HEAD(head); + list_add_tail(>list, head); + + ALLOC_LIST(list); + 
ABORT_ON(parse_events_add_pmu(list, &data->idx, "cpu", head)); + parse_events__free_terms(head); + $$ = list; +} +| +PE_PMU_EVENT_PRE '-'
Re: [PATCH v1 1/1] power: Add simple gpio-restart driver
On 08/26/2014 04:45 PM, David Riley wrote: This driver registers a restart handler to set a GPIO line high/low to reset a board based on devicetree bindings. Signed-off-by: David Riley --- .../devicetree/bindings/gpio/gpio-restart.txt | 48 +++ drivers/power/reset/Kconfig| 8 ++ drivers/power/reset/Makefile | 1 + drivers/power/reset/gpio-restart.c | 142 + 4 files changed, 199 insertions(+) create mode 100644 Documentation/devicetree/bindings/gpio/gpio-restart.txt create mode 100644 drivers/power/reset/gpio-restart.c diff --git a/Documentation/devicetree/bindings/gpio/gpio-restart.txt b/Documentation/devicetree/bindings/gpio/gpio-restart.txt new file mode 100644 index 000..7cd58788 --- /dev/null +++ b/Documentation/devicetree/bindings/gpio/gpio-restart.txt @@ -0,0 +1,48 @@ +Drive a GPIO line that can be used to restart the system as a +restart handler. + +The driver supports both level triggered and edge triggered power off. +At driver load time, the driver will request the given gpio line and +install a restart handler. If the optional property 'input' is +not found, the GPIO line will be driven in the inactive state. +Otherwise it is configured as an input. + +When do_kernel_restart is called the various restart handlers will be tried +in order. The gpio is configured as an output, and driven active, so +triggering a level triggered power off condition. This will also cause an +inactive->active edge condition, so triggering positive edge triggered +power off. After a delay of 100ms, the GPIO is set to inactive, thus +causing an active->inactive edge, triggering negative edge triggered power +off. After another 100ms delay the GPIO is driven active again. If the +power is still on and the CPU still running after a 3000ms delay, a +WARN_ON(1) is emitted. + +Required properties: +- compatible : should be "gpio-restart". +- gpios : The GPIO to set high/low, see "gpios property" in + Documentation/devicetree/bindings/gpio/gpio.txt.
If the pin should be + low to power down the board set it to "Active Low", otherwise set + gpio to "Active High". + +Optional properties: +- input : Initially configure the GPIO line as an input. Only reconfigure + it to an output when the machine_restart function is called. If this optional + property is not specified, the GPIO is initialized as an output in its + inactive state. Maybe describe this as open source ? +- priority : A priority ranging from 0 to 255 (default 128) according to + the following guidelines: + 0: Restart handler of last resort, with limited restart + capabilities + 128:Default restart handler; use if no other restart handler is + expected to be available, and/or if restart functionality is + sufficient to restart the entire system + 255:Highest priority restart handler, will preempt all other + restart handlers + +Examples: + +gpio-restart { + compatible = "gpio-restart"; + gpios = < 4 0>; + priority = /bits/ 8 <200>; +}; diff --git a/drivers/power/reset/Kconfig b/drivers/power/reset/Kconfig index ca41523..f07e26c 100644 --- a/drivers/power/reset/Kconfig +++ b/drivers/power/reset/Kconfig @@ -39,6 +39,14 @@ config POWER_RESET_GPIO If your board needs a GPIO high/low to power down, say Y and create a binding in your devicetree. +config POWER_RESET_GPIO_RESTART + bool "GPIO restart driver" + depends on OF_GPIO && POWER_RESET + help + This driver supports restarting your board via a GPIO line. + If your board needs a GPIO high/low to restart, say Y and + create a binding in your devicetree. 
+ config POWER_RESET_HISI bool "Hisilicon power-off driver" depends on POWER_RESET && ARCH_HISI diff --git a/drivers/power/reset/Makefile b/drivers/power/reset/Makefile index a42e70e..199cb6e 100644 --- a/drivers/power/reset/Makefile +++ b/drivers/power/reset/Makefile @@ -2,6 +2,7 @@ obj-$(CONFIG_POWER_RESET_AS3722) += as3722-poweroff.o obj-$(CONFIG_POWER_RESET_AXXIA) += axxia-reset.o obj-$(CONFIG_POWER_RESET_BRCMSTB) += brcmstb-reboot.o obj-$(CONFIG_POWER_RESET_GPIO) += gpio-poweroff.o +obj-$(CONFIG_POWER_RESET_GPIO_RESTART) += gpio-restart.o obj-$(CONFIG_POWER_RESET_HISI) += hisi-reboot.o obj-$(CONFIG_POWER_RESET_MSM) += msm-poweroff.o obj-$(CONFIG_POWER_RESET_QNAP) += qnap-poweroff.o diff --git a/drivers/power/reset/gpio-restart.c b/drivers/power/reset/gpio-restart.c new file mode 100644 index 000..2cbff64 --- /dev/null +++ b/drivers/power/reset/gpio-restart.c @@ -0,0 +1,142 @@ +/* + * Toggles a GPIO pin to restart a device + * + * Copyright (C) 2014 Google, Inc. + * + * This software is licensed under the terms of the GNU General Public + * License version 2, as published by the Free Software Foundation, and + * may be copied, distributed, and
[PATCH RFC v7 net-next 01/28] net: filter: add "load 64-bit immediate" eBPF instruction
add BPF_LD_IMM64 instruction to load 64-bit immediate value into a register. All previous instructions were 8-byte. This is first 16-byte instruction. Two consecutive 'struct bpf_insn' blocks are interpreted as single instruction: insn[0].code = BPF_LD | BPF_DW | BPF_IMM insn[0].dst_reg = destination register insn[0].imm = lower 32-bit insn[1].code = 0 insn[1].imm = upper 32-bit All unused fields must be zero. Classic BPF has similar instruction: BPF_LD | BPF_W | BPF_IMM which loads 32-bit immediate value into a register. x64 JITs it as single 'movabsq %rax, imm64' arm64 may JIT as sequence of four 'movk x0, #imm16, lsl #shift' insn Note that old eBPF programs are binary compatible with new interpreter. In the following patches this instruction is used to store eBPF map pointers: BPF_LD_IMM64(R1, const_imm_map_ptr) BPF_CALL(BPF_FUNC_map_lookup_elem) and verifier is introduced to check validity of the programs. Later LLVM compiler is using this insn as generic load of 64-bit immediate constant and as a load of map pointer with relocation. Signed-off-by: Alexei Starovoitov --- Documentation/networking/filter.txt |8 +++- arch/x86/net/bpf_jit_comp.c | 17 + include/linux/filter.h | 18 ++ kernel/bpf/core.c |5 + lib/test_bpf.c | 21 + 5 files changed, 68 insertions(+), 1 deletion(-) diff --git a/Documentation/networking/filter.txt b/Documentation/networking/filter.txt index c48a9704bda8..81916ab5d96f 100644 --- a/Documentation/networking/filter.txt +++ b/Documentation/networking/filter.txt @@ -951,7 +951,7 @@ Size modifier is one of ... Mode modifier is one of: - BPF_IMM 0x00 /* classic BPF only, reserved in eBPF */ + BPF_IMM 0x00 /* used for 32-bit mov in classic BPF and 64-bit in eBPF */ BPF_ABS 0x20 BPF_IND 0x40 BPF_MEM 0x60 @@ -995,6 +995,12 @@ BPF_XADD | BPF_DW | BPF_STX: lock xadd *(u64 *)(dst_reg + off16) += src_reg Where size is one of: BPF_B or BPF_H or BPF_W or BPF_DW. Note that 1 and 2 byte atomic increments are not supported. 
+eBPF has one 16-byte instruction: BPF_LD | BPF_DW | BPF_IMM which consists +of two consecutive 'struct bpf_insn' 8-byte blocks and interpreted as single +instruction that loads 64-bit immediate value into a dst_reg. +Classic BPF has similar instruction: BPF_LD | BPF_W | BPF_IMM which loads +32-bit immediate value into a register. + Testing --- diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c index b08a98c59530..98837147ee57 100644 --- a/arch/x86/net/bpf_jit_comp.c +++ b/arch/x86/net/bpf_jit_comp.c @@ -393,6 +393,23 @@ static int do_jit(struct bpf_prog *bpf_prog, int *addrs, u8 *image, EMIT1_off32(add_1reg(0xB8, dst_reg), imm32); break; + case BPF_LD | BPF_IMM | BPF_DW: + if (insn[1].code != 0 || insn[1].src_reg != 0 || + insn[1].dst_reg != 0 || insn[1].off != 0) { + /* verifier must catch invalid insns */ + pr_err("invalid BPF_LD_IMM64 insn\n"); + return -EINVAL; + } + + /* movabsq %rax, imm64 */ + EMIT2(add_1mod(0x48, dst_reg), add_1reg(0xB8, dst_reg)); + EMIT(insn[0].imm, 4); + EMIT(insn[1].imm, 4); + + insn++; + i++; + break; + /* dst %= src, dst /= src, dst %= imm32, dst /= imm32 */ case BPF_ALU | BPF_MOD | BPF_X: case BPF_ALU | BPF_DIV | BPF_X: diff --git a/include/linux/filter.h b/include/linux/filter.h index a5227ab8ccb1..f3262b598262 100644 --- a/include/linux/filter.h +++ b/include/linux/filter.h @@ -161,6 +161,24 @@ enum { .off = 0, \ .imm = IMM }) +/* BPF_LD_IMM64 macro encodes single 'load 64-bit immediate' insn */ +#define BPF_LD_IMM64(DST, IMM) \ + BPF_LD_IMM64_RAW(DST, 0, IMM) + +#define BPF_LD_IMM64_RAW(DST, SRC, IMM)\ + ((struct bpf_insn) {\ + .code = BPF_LD | BPF_DW | BPF_IMM, \ + .dst_reg = DST, \ + .src_reg = SRC, \ + .off = 0, \ + .imm = (__u32) (IMM) }), \ + ((struct bpf_insn) {\ + .code = 0, /* zero is reserved opcode */ \ + .dst_reg = 0, \ + .src_reg = 0, \ + .off =
[PATCH RFC v7 net-next 06/28] bpf: add hashtable type of BPF maps
add new map type BPF_MAP_TYPE_HASH and its implementation - key/value are opaque range of bytes - user space provides 3 configuration attributes via BPF syscall: key_size, value_size, max_entries - if value_size == 0, the map is used as a set - map_update_elem() must fail to insert new element when max_entries limit is reached - map takes care of allocating/freeing key/value pairs - update/lookup/delete methods may be called from eBPF program attached to kprobes, so use spin_lock_irqsave() mechanism for concurrent updates - optimized for speed of lookup() which can be called multiple times from eBPF program which itself is triggered by high volume of events - in the future JIT compiler may recognize lookup() call and optimize it further, since key_size is constant for life of eBPF program Signed-off-by: Alexei Starovoitov --- include/uapi/linux/bpf.h |1 + kernel/bpf/Makefile |2 +- kernel/bpf/hashtab.c | 365 ++ 3 files changed, 367 insertions(+), 1 deletion(-) create mode 100644 kernel/bpf/hashtab.c diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index 1602de6423b5..ad0a5a495ec3 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -359,6 +359,7 @@ enum bpf_cmd { enum bpf_map_type { BPF_MAP_TYPE_UNSPEC, + BPF_MAP_TYPE_HASH, }; union bpf_attr { diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile index e9f7334ed07a..558e12712ebc 100644 --- a/kernel/bpf/Makefile +++ b/kernel/bpf/Makefile @@ -1 +1 @@ -obj-y := core.o syscall.o +obj-y := core.o syscall.o hashtab.o diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c new file mode 100644 index ..4d131c86821c --- /dev/null +++ b/kernel/bpf/hashtab.c @@ -0,0 +1,365 @@ +/* Copyright (c) 2011-2014 PLUMgrid, http://plumgrid.com + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of version 2 of the GNU General Public + * License as published by the Free Software Foundation. 
+ * + * This program is distributed in the hope that it will be useful, but + * WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + * General Public License for more details. + */ +#include +#include +#include + +struct bpf_htab { + struct bpf_map map; + struct hlist_head *buckets; + struct kmem_cache *elem_cache; + spinlock_t lock; + u32 count; /* number of elements in this hashtable */ + u32 n_buckets; /* number of hash buckets */ + u32 elem_size; /* size of each element in bytes */ +}; + +/* each htab element is struct htab_elem + key + value */ +struct htab_elem { + struct hlist_node hash_node; + struct rcu_head rcu; + struct bpf_htab *htab; + u32 hash; + u32 pad; + char key[0]; +}; + +#define BPF_MAP_MAX_KEY_SIZE 256 +static struct bpf_map *htab_map_alloc(union bpf_attr *attr) +{ + struct bpf_htab *htab; + int err, i; + + htab = kzalloc(sizeof(*htab), GFP_USER); + if (!htab) + return ERR_PTR(-ENOMEM); + + /* mandatory map attributes */ + htab->map.key_size = attr->key_size; + htab->map.value_size = attr->value_size; + htab->map.max_entries = attr->max_entries; + + /* check sanity of attributes. 
+* value_size == 0 is allowed, in this case map is used as a set +*/ + err = -EINVAL; + if (htab->map.max_entries == 0 || htab->map.key_size == 0) + goto free_htab; + + /* hash table size must be power of 2 */ + htab->n_buckets = roundup_pow_of_two(htab->map.max_entries); + + err = -E2BIG; + if (htab->map.key_size > BPF_MAP_MAX_KEY_SIZE) + goto free_htab; + + err = -ENOMEM; + htab->buckets = kmalloc_array(htab->n_buckets, + sizeof(struct hlist_head), GFP_USER); + + if (!htab->buckets) + goto free_htab; + + for (i = 0; i < htab->n_buckets; i++) + INIT_HLIST_HEAD(&htab->buckets[i]); + + spin_lock_init(&htab->lock); + htab->count = 0; + + htab->elem_size = sizeof(struct htab_elem) + + round_up(htab->map.key_size, 8) + + htab->map.value_size; + + htab->elem_cache = kmem_cache_create("bpf_htab", htab->elem_size, 0, 0, +NULL); + if (!htab->elem_cache) + goto free_buckets; + + return &htab->map; + +free_buckets: + kfree(htab->buckets); +free_htab: + kfree(htab); + return ERR_PTR(err); +} + +static inline u32 htab_map_hash(const void *key, u32 key_len) +{ + return jhash(key, key_len, 0); +} + +static inline struct hlist_head *select_bucket(struct bpf_htab *htab, u32 hash) +{ + return &htab->buckets[hash & (htab->n_buckets - 1)]; +} + +static struct htab_elem
[PATCH RFC v7 net-next 04/28] bpf: enable bpf syscall on x64 and i386
done as separate commit to ease conflict resolution Signed-off-by: Alexei Starovoitov --- arch/x86/syscalls/syscall_32.tbl |1 + arch/x86/syscalls/syscall_64.tbl |1 + include/linux/syscalls.h |3 ++- include/uapi/asm-generic/unistd.h |4 +++- kernel/sys_ni.c |3 +++ 5 files changed, 10 insertions(+), 2 deletions(-) diff --git a/arch/x86/syscalls/syscall_32.tbl b/arch/x86/syscalls/syscall_32.tbl index 028b78168d85..9fe1b5d002f0 100644 --- a/arch/x86/syscalls/syscall_32.tbl +++ b/arch/x86/syscalls/syscall_32.tbl @@ -363,3 +363,4 @@ 354i386seccomp sys_seccomp 355i386getrandom sys_getrandom 356i386memfd_createsys_memfd_create +357i386bpf sys_bpf diff --git a/arch/x86/syscalls/syscall_64.tbl b/arch/x86/syscalls/syscall_64.tbl index 35dd922727b9..281150b539a2 100644 --- a/arch/x86/syscalls/syscall_64.tbl +++ b/arch/x86/syscalls/syscall_64.tbl @@ -327,6 +327,7 @@ 318common getrandom sys_getrandom 319common memfd_createsys_memfd_create 320common kexec_file_load sys_kexec_file_load +321common bpf sys_bpf # # x32-specific system call numbers start at 512 to avoid cache impact diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h index 0f86d85a9ce4..bda9b81357cc 100644 --- a/include/linux/syscalls.h +++ b/include/linux/syscalls.h @@ -65,6 +65,7 @@ struct old_linux_dirent; struct perf_event_attr; struct file_handle; struct sigaltstack; +union bpf_attr; #include #include @@ -875,5 +876,5 @@ asmlinkage long sys_seccomp(unsigned int op, unsigned int flags, const char __user *uargs); asmlinkage long sys_getrandom(char __user *buf, size_t count, unsigned int flags); - +asmlinkage long sys_bpf(int cmd, union bpf_attr *attr, unsigned int size); #endif diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h index 11d11bc5c78f..22749c134117 100644 --- a/include/uapi/asm-generic/unistd.h +++ b/include/uapi/asm-generic/unistd.h @@ -705,9 +705,11 @@ __SYSCALL(__NR_seccomp, sys_seccomp) __SYSCALL(__NR_getrandom, sys_getrandom) #define __NR_memfd_create 279 
__SYSCALL(__NR_memfd_create, sys_memfd_create) +#define __NR_bpf 280 +__SYSCALL(__NR_bpf, sys_bpf) #undef __NR_syscalls -#define __NR_syscalls 280 +#define __NR_syscalls 281 /* * All syscalls below here should go away really, diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c index 391d4ddb6f4b..b4b5083f5f5e 100644 --- a/kernel/sys_ni.c +++ b/kernel/sys_ni.c @@ -218,3 +218,6 @@ cond_syscall(sys_kcmp); /* operate on Secure Computing state */ cond_syscall(sys_seccomp); + +/* access BPF programs and maps */ +cond_syscall(sys_bpf); -- 1.7.9.5
[PATCH RFC v7 net-next 10/28] bpf: verifier (add ability to receive verification log)
add optional attributes for BPF_PROG_LOAD syscall: struct { ... __u32 log_level;/* verbosity level of eBPF verifier */ __u32 log_size; /* size of user buffer */ void __user *log_buf; /* user supplied buffer */ }; In such case the verifier will return its verification log in the user supplied buffer which can be used by humans to analyze why verifier rejected given program Signed-off-by: Alexei Starovoitov --- include/uapi/linux/bpf.h |5 +- kernel/bpf/verifier.c| 235 ++ 2 files changed, 239 insertions(+), 1 deletion(-) diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index ac272bd7a884..a6fa0416f2bd 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -400,7 +400,10 @@ union bpf_attr { __u32 insn_cnt; const struct bpf_insn __user *insns; const char __user *license; -#defineBPF_PROG_LOAD_LAST_FIELD license + __u32 log_level;/* verbosity level of eBPF verifier */ + __u32 log_size; /* size of user buffer */ + void __user *log_buf; /* user supplied buffer */ +#defineBPF_PROG_LOAD_LAST_FIELD log_buf }; }; diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index 3d22b19c5fe0..81a64a50e48d 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -144,9 +144,244 @@ * load/store to bpf_context are checked against known fields */ +/* single container for all structs + * one verifier_env per bpf_check() call + */ +struct verifier_env { +}; + +/* verbose verifier prints what it's seeing + * bpf_check() is called under lock, so no race to access these global vars + */ +static u32 log_level, log_size, log_len; +static void *log_buf; + +static DEFINE_MUTEX(bpf_verifier_lock); + +/* log_level controls verbosity level of eBPF verifier. + * verbose() is used to dump the verification trace to the log, so the user + * can figure out what's wrong with the program + */ +static void verbose(const char *fmt, ...) 
+{ + va_list args; + + if (log_level == 0 || log_len >= log_size - 1) + return; + + va_start(args, fmt); + log_len += vscnprintf(log_buf + log_len, log_size - log_len, fmt, args); + va_end(args); +} + +static const char *const bpf_class_string[] = { + [BPF_LD]= "ld", + [BPF_LDX] = "ldx", + [BPF_ST]= "st", + [BPF_STX] = "stx", + [BPF_ALU] = "alu", + [BPF_JMP] = "jmp", + [BPF_RET] = "BUG", + [BPF_ALU64] = "alu64", +}; + +static const char *const bpf_alu_string[] = { + [BPF_ADD >> 4] = "+=", + [BPF_SUB >> 4] = "-=", + [BPF_MUL >> 4] = "*=", + [BPF_DIV >> 4] = "/=", + [BPF_OR >> 4] = "|=", + [BPF_AND >> 4] = "&=", + [BPF_LSH >> 4] = "<<=", + [BPF_RSH >> 4] = ">>=", + [BPF_NEG >> 4] = "neg", + [BPF_MOD >> 4] = "%=", + [BPF_XOR >> 4] = "^=", + [BPF_MOV >> 4] = "=", + [BPF_ARSH >> 4] = "s>>=", + [BPF_END >> 4] = "endian", +}; + +static const char *const bpf_ldst_string[] = { + [BPF_W >> 3] = "u32", + [BPF_H >> 3] = "u16", + [BPF_B >> 3] = "u8", + [BPF_DW >> 3] = "u64", +}; + +static const char *const bpf_jmp_string[] = { + [BPF_JA >> 4] = "jmp", + [BPF_JEQ >> 4] = "==", + [BPF_JGT >> 4] = ">", + [BPF_JGE >> 4] = ">=", + [BPF_JSET >> 4] = "&", + [BPF_JNE >> 4] = "!=", + [BPF_JSGT >> 4] = "s>", + [BPF_JSGE >> 4] = "s>=", + [BPF_CALL >> 4] = "call", + [BPF_EXIT >> 4] = "exit", +}; + +static void print_bpf_insn(struct bpf_insn *insn) +{ + u8 class = BPF_CLASS(insn->code); + + if (class == BPF_ALU || class == BPF_ALU64) { + if (BPF_SRC(insn->code) == BPF_X) + verbose("(%02x) %sr%d %s %sr%d\n", + insn->code, class == BPF_ALU ? "(u32) " : "", + insn->dst_reg, + bpf_alu_string[BPF_OP(insn->code) >> 4], + class == BPF_ALU ? "(u32) " : "", + insn->src_reg); + else + verbose("(%02x) %sr%d %s %s%d\n", + insn->code, class == BPF_ALU ? "(u32) " : "", + insn->dst_reg, + bpf_alu_string[BPF_OP(insn->code) >> 4], + class == BPF_ALU ? 
"(u32) " : "", + insn->imm); + } else if (class == BPF_STX) { + if (BPF_MODE(insn->code) == BPF_MEM) + verbose("(%02x) *(%s *)(r%d %+d) = r%d\n", + insn->code, + bpf_ldst_string[BPF_SIZE(insn->code) >> 3], +
[PATCH RFC v7 net-next 08/28] bpf: handle pseudo BPF_CALL insn
in native eBPF programs userspace is using pseudo BPF_CALL instructions which encode one of 'enum bpf_func_id' inside insn->imm field. Verifier checks that program using correct function arguments to given func_id. If all checks passed, kernel needs to fixup BPF_CALL->imm fields by replacing func_id with in-kernel function pointer. eBPF interpreter just calls the function. In-kernel eBPF users continue to use generic BPF_CALL. Signed-off-by: Alexei Starovoitov --- kernel/bpf/syscall.c | 37 + 1 file changed, 37 insertions(+) diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c index c316f7c28895..9dbf7bd42ccf 100644 --- a/kernel/bpf/syscall.c +++ b/kernel/bpf/syscall.c @@ -339,6 +339,40 @@ void bpf_register_prog_type(struct bpf_prog_type_list *tl) list_add(&tl->list_node, &bpf_prog_types); } +/* fixup insn->imm field of bpf_call instructions: + * if (insn->imm == BPF_FUNC_map_lookup_elem) + * insn->imm = bpf_map_lookup_elem - __bpf_call_base; + * else if (insn->imm == BPF_FUNC_map_update_elem) + * insn->imm = bpf_map_update_elem - __bpf_call_base; + * else ...
+ * + * this function is called after eBPF program passed verification + */ +static void fixup_bpf_calls(struct bpf_prog *prog) +{ + const struct bpf_func_proto *fn; + int i; + + for (i = 0; i < prog->len; i++) { + struct bpf_insn *insn = &prog->insnsi[i]; + + if (insn->code == (BPF_JMP | BPF_CALL)) { + /* we reach here when program has bpf_call instructions +* and it passed bpf_check(), means that +* ops->get_func_proto must have been supplied, check it +*/ + BUG_ON(!prog->info->ops->get_func_proto); + + fn = prog->info->ops->get_func_proto(insn->imm); + /* all functions that have prototype and verifier allowed +* programs to call them, must be real in-kernel functions +*/ + BUG_ON(!fn->func); + insn->imm = fn->func - __bpf_call_base; + } + } +} + /* drop refcnt on maps used by eBPF program and free auxilary data */ static void free_bpf_prog_info(struct bpf_prog_info *info) { @@ -465,6 +499,9 @@ static int bpf_prog_load(union bpf_attr *attr) if (err < 0) goto free_prog_info; + /* fixup BPF_CALL->imm field */ + fixup_bpf_calls(prog); + /* eBPF program is ready to be JITed */ bpf_prog_select_runtime(prog); -- 1.7.9.5
[PATCH RFC v7 net-next 09/28] bpf: verifier (add docs)
this patch adds all of eBPF verifier documentation and empty bpf_check() The end goal for the verifier is to statically check safety of the program. Verifier will catch: - loops - out of range jumps - unreachable instructions - invalid instructions - uninitialized register access - uninitialized stack access - misaligned stack access - out of range stack access - invalid calling convention More details in Documentation/networking/filter.txt Signed-off-by: Alexei Starovoitov --- Documentation/networking/filter.txt | 230 +++ include/linux/bpf.h |2 + kernel/bpf/Makefile |2 +- kernel/bpf/syscall.c|2 +- kernel/bpf/verifier.c | 152 +++ 5 files changed, 386 insertions(+), 2 deletions(-) create mode 100644 kernel/bpf/verifier.c diff --git a/Documentation/networking/filter.txt b/Documentation/networking/filter.txt index 30c142b58936..713e71f9f5dd 100644 --- a/Documentation/networking/filter.txt +++ b/Documentation/networking/filter.txt @@ -1001,6 +1001,105 @@ instruction that loads 64-bit immediate value into a dst_reg. Classic BPF has similar instruction: BPF_LD | BPF_W | BPF_IMM which loads 32-bit immediate value into a register. +eBPF verifier +- +The safety of the eBPF program is determined in two steps. + +First step does DAG check to disallow loops and other CFG validation. +In particular it will detect programs that have unreachable instructions. +(though classic BPF checker allows them) + +Second step starts from the first insn and descends all possible paths. +It simulates execution of every insn and observes the state change of +registers and stack. + +At the start of the program the register R1 contains a pointer to context +and has type PTR_TO_CTX. +If verifier sees an insn that does R2=R1, then R2 has now type +PTR_TO_CTX as well and can be used on the right hand side of expression. +If R1=PTR_TO_CTX and insn is R2=R1+R1, then R2=UNKNOWN_VALUE, +since addition of two valid pointers makes invalid pointer.
+(In 'secure' mode verifier will reject any type of pointer arithmetic to make +sure that kernel addresses don't leak to unprivileged users) + +If register was never written to, it's not readable: + bpf_mov R0 = R2 + bpf_exit +will be rejected, since R2 is unreadable at the start of the program. + +After kernel function call, R1-R5 are reset to unreadable and +R0 has a return type of the function. + +Since R6-R9 are callee saved, their state is preserved across the call. + bpf_mov R6 = 1 + bpf_call foo + bpf_mov R0 = R6 + bpf_exit +is a correct program. If there was R1 instead of R6, it would have +been rejected. + +Classic BPF register X is mapped to eBPF register R7 inside sk_convert_filter(), +so that its state is preserved across calls. + +load/store instructions are allowed only with registers of valid types, which +are PTR_TO_CTX, PTR_TO_MAP, FRAME_PTR. They are bounds and alignment checked. +For example: + bpf_mov R1 = 1 + bpf_mov R2 = 2 + bpf_xadd *(u32 *)(R1 + 3) += R2 + bpf_exit +will be rejected, since R1 doesn't have a valid pointer type at the time of +execution of instruction bpf_xadd. + +At the start R1 contains pointer to ctx and R1 type is PTR_TO_CTX. +ctx is generic. verifier is configured to know what context is for particular +class of bpf programs. For example, context == skb (for socket filters) and +ctx == seccomp_data for seccomp filters. +A callback is used to customize verifier to restrict eBPF program access to only +certain fields within ctx structure with specified size and alignment. + +For example, the following insn: + bpf_ld R0 = *(u32 *)(R6 + 8) +intends to load a word from address R6 + 8 and store it into R0 +If R6=PTR_TO_CTX, via is_valid_access() callback the verifier will know +that offset 8 of size 4 bytes can be accessed for reading, otherwise +the verifier will reject the program. +If R6=FRAME_PTR, then access should be aligned and be within +stack bounds, which are [-MAX_BPF_STACK, 0).
In this example offset is 8, +so it will fail verification, since it's out of bounds. + +The verifier will allow eBPF program to read data from stack only after +it wrote into it. +Classic BPF verifier does similar check with M[0-15] memory slots. +For example: + bpf_ld R0 = *(u32 *)(R10 - 4) + bpf_exit +is invalid program. +Though R10 is correct read-only register and has type FRAME_PTR +and R10 - 4 is within stack bounds, there were no stores into that location. + +Pointer register spill/fill is tracked as well, since four (R6-R9) +callee saved registers may not be enough for some programs. + +Allowed function calls are customized with bpf_verifier_ops->get_func_proto() +The eBPF verifier will check that registers match argument constraints. +After the call register R0 will be set to return type of the function. + +Function calls are the main mechanism to extend functionality of eBPF programs. +Socket filters may
[PATCH RFC v7 net-next 07/28] bpf: expand BPF syscall with program load/unload
eBPF programs are safe run-to-completion functions with load/unload methods from userspace similar to kernel modules. User space API: - load eBPF program fd = bpf(BPF_PROG_LOAD, union bpf_attr *attr, u32 size) where 'attr' is struct { enum bpf_prog_type prog_type; __u32 insn_cnt; struct bpf_insn __user *insns; const char __user *license; }; insns - array of eBPF instructions license - must be GPL compatible to call helper functions marked gpl_only - unload eBPF program close(fd) User space tests and examples follow in the later patches Signed-off-by: Alexei Starovoitov --- include/linux/bpf.h | 36 ++ include/linux/filter.h |9 ++- include/uapi/linux/bpf.h | 27 kernel/bpf/syscall.c | 170 ++ net/core/filter.c|2 + 5 files changed, 242 insertions(+), 2 deletions(-) diff --git a/include/linux/bpf.h b/include/linux/bpf.h index 2887f3f9da59..8ea6f9923ff2 100644 --- a/include/linux/bpf.h +++ b/include/linux/bpf.h @@ -46,4 +46,40 @@ void bpf_register_map_type(struct bpf_map_type_list *tl); void bpf_map_put(struct bpf_map *map); struct bpf_map *bpf_map_get(struct fd f); +/* eBPF function prototype used by verifier to allow BPF_CALLs from eBPF programs + * to in-kernel helper functions and for adjusting imm32 field in BPF_CALL + * instructions after verifying + */ +struct bpf_func_proto { + u64 (*func)(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5); + bool gpl_only; +}; + +struct bpf_verifier_ops { + /* return eBPF function prototype for verification */ + const struct bpf_func_proto *(*get_func_proto)(enum bpf_func_id func_id); +}; + +struct bpf_prog_type_list { + struct list_head list_node; + struct bpf_verifier_ops *ops; + enum bpf_prog_type type; +}; + +void bpf_register_prog_type(struct bpf_prog_type_list *tl); + +struct bpf_prog_info { + atomic_t refcnt; + bool is_gpl_compatible; + enum bpf_prog_type prog_type; + struct bpf_verifier_ops *ops; + struct bpf_map **used_maps; + u32 used_map_cnt; +}; + +struct bpf_prog; + +void bpf_prog_put(struct bpf_prog *prog); +struct bpf_prog 
*bpf_prog_get(u32 ufd); + #endif /* _LINUX_BPF_H */ diff --git a/include/linux/filter.h b/include/linux/filter.h index f04793474d16..f06913b29861 100644 --- a/include/linux/filter.h +++ b/include/linux/filter.h @@ -31,11 +31,16 @@ struct sock_fprog_kern { struct sk_buff; struct sock; struct seccomp_data; +struct bpf_prog_info; struct bpf_prog { u32 jited:1,/* Is our filter JIT'ed? */ - len:31; /* Number of filter blocks */ - struct sock_fprog_kern *orig_prog; /* Original BPF program */ + has_info:1, /* whether 'info' is valid */ + len:30; /* Number of filter blocks */ + union { + struct sock_fprog_kern *orig_prog; /* Original BPF program */ + struct bpf_prog_info*info; + }; unsigned int(*bpf_func)(const struct sk_buff *skb, const struct bpf_insn *filter); union { diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index ad0a5a495ec3..ac272bd7a884 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -355,6 +355,13 @@ enum bpf_cmd { * returns zero and stores next key or negative error */ BPF_MAP_GET_NEXT_KEY, + + /* verify and load eBPF program +* prog_fd = bpf(BPF_PROG_LOAD, union bpf_attr *attr, u32 size) +* Using attr->prog_type, attr->insns, attr->license +* returns fd or negative error +*/ + BPF_PROG_LOAD, }; enum bpf_map_type { @@ -362,6 +369,10 @@ enum bpf_map_type { BPF_MAP_TYPE_HASH, }; +enum bpf_prog_type { + BPF_PROG_TYPE_UNSPEC, +}; + union bpf_attr { struct { /* anonymous struct used by BPF_MAP_CREATE command */ enum bpf_map_type map_type; @@ -383,6 +394,22 @@ union bpf_attr { #define BPF_MAP_DELETE_ELEM_LAST_FIELD key #define BPF_MAP_GET_NEXT_KEY_LAST_FIELD next_key }; + + struct { /* anonymous struct used by BPF_PROG_LOAD command */ + enum bpf_prog_type prog_type; + __u32 insn_cnt; + const struct bpf_insn __user *insns; + const char __user *license; +#defineBPF_PROG_LOAD_LAST_FIELD license + }; +}; + +/* integer value in 'imm' field of BPF_CALL instruction selects which helper + * function eBPF program intends to call + 
*/ +enum bpf_func_id { + BPF_FUNC_unspec, + __BPF_FUNC_MAX_ID, }; #endif /* _UAPI__LINUX_BPF_H__ */ diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c index b863976741d4..c316f7c28895 100644 --- a/kernel/bpf/syscall.c +++ b/kernel/bpf/syscall.c @@ -14,6 +14,8 @@
[PATCH RFC v7 net-next 13/28] bpf: verifier (add verifier core)
This patch adds verifier core which simulates execution of every insn and records the state of registers and program stack. Every branch instruction seen during simulation is pushed into state stack. When verifier reaches BPF_EXIT, it pops the state from the stack and continues until it reaches BPF_EXIT again. For program: 1: bpf_mov r1, xxx 2: if (r1 == 0) goto 5 3: bpf_mov r0, 1 4: goto 6 5: bpf_mov r0, 2 6: bpf_exit The verifier will walk insns: 1, 2, 3, 4, 6 then it will pop the state recorded at insn#2 and will continue: 5, 6 This way it walks all possible paths through the program and checks all possible values of registers. While doing so, it checks for: - invalid instructions - uninitialized register access - uninitialized stack access - misaligned stack access - out of range stack access - invalid calling convention - BPF_LD_ABS|IND instructions are only used in socket filters - instruction encoding is not using reserved fields Kernel subsystem configures the verifier with two callbacks: - bool (*is_valid_access)(int off, int size, enum bpf_access_type type); that provides information to the verifier which fields of 'ctx' are accessible (remember 'ctx' is the first argument to eBPF program) - const struct bpf_func_proto *(*get_func_proto)(enum bpf_func_id func_id); returns argument constraints of kernel helper functions that eBPF program may call, so that verifier can check that R1-R5 types match the prototype More details in Documentation/networking/filter.txt and in kernel/bpf/verifier.c Signed-off-by: Alexei Starovoitov --- include/linux/bpf.h | 47 ++ include/uapi/linux/bpf.h |1 + kernel/bpf/verifier.c| 1061 +- 3 files changed, 1108 insertions(+), 1 deletion(-) diff --git a/include/linux/bpf.h b/include/linux/bpf.h index 490551e17c15..ad1bda7ece35 100644 --- a/include/linux/bpf.h +++ b/include/linux/bpf.h @@ -46,6 +46,31 @@ void bpf_register_map_type(struct bpf_map_type_list *tl); void bpf_map_put(struct bpf_map *map); struct bpf_map *bpf_map_get(struct
fd f); +/* function argument constraints */ +enum bpf_arg_type { + ARG_ANYTHING = 0, /* any argument is ok */ + + /* the following constraints used to prototype +* bpf_map_lookup/update/delete_elem() functions +*/ + ARG_CONST_MAP_PTR, /* const argument used as pointer to bpf_map */ + ARG_PTR_TO_MAP_KEY, /* pointer to stack used as map key */ + ARG_PTR_TO_MAP_VALUE, /* pointer to stack used as map value */ + + /* the following constraints used to prototype bpf_memcmp() and other +* functions that access data on eBPF program stack +*/ + ARG_PTR_TO_STACK, /* any pointer to eBPF program stack */ + ARG_CONST_STACK_SIZE, /* number of bytes accessed from stack */ +}; + +/* type of values returned from helper functions */ +enum bpf_return_type { + RET_INTEGER,/* function returns integer */ + RET_VOID, /* function doesn't return anything */ + RET_PTR_TO_MAP_VALUE_OR_NULL, /* returns a pointer to map elem value or NULL */ +}; + /* eBPF function prototype used by verifier to allow BPF_CALLs from eBPF programs * to in-kernel helper functions and for adjusting imm32 field in BPF_CALL * instructions after verifying @@ -53,11 +78,33 @@ struct bpf_map *bpf_map_get(struct fd f); struct bpf_func_proto { u64 (*func)(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5); bool gpl_only; + enum bpf_return_type ret_type; + enum bpf_arg_type arg1_type; + enum bpf_arg_type arg2_type; + enum bpf_arg_type arg3_type; + enum bpf_arg_type arg4_type; + enum bpf_arg_type arg5_type; +}; + +/* bpf_context is intentionally undefined structure. Pointer to bpf_context is + * the first argument to eBPF programs. 
+ * For socket filters: 'struct bpf_context *' == 'struct sk_buff *' + */ +struct bpf_context; + +enum bpf_access_type { + BPF_READ = 1, + BPF_WRITE = 2 }; struct bpf_verifier_ops { /* return eBPF function prototype for verification */ const struct bpf_func_proto *(*get_func_proto)(enum bpf_func_id func_id); + + /* return true if 'size' wide access at offset 'off' within bpf_context +* with 'type' (read or write) is allowed +*/ + bool (*is_valid_access)(int off, int size, enum bpf_access_type type); }; struct bpf_prog_type_list { diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index 04aaaef0daa7..dee7a2469b8d 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -377,6 +377,7 @@ enum bpf_map_type { enum bpf_prog_type { BPF_PROG_TYPE_UNSPEC, + BPF_PROG_TYPE_SOCKET_FILTER, }; union bpf_attr { diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index 7365a190cbd6..eb19f753d4d7 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -144,6 +144,72 @@
[PATCH RFC v7 net-next 12/28] bpf: verifier (add branch/goto checks)
check that control flow graph of eBPF program is a directed acyclic graph check_cfg() does: - detect loops - detect unreachable instructions - check that program terminates with BPF_EXIT insn - check that all branches are within program boundary Signed-off-by: Alexei Starovoitov --- kernel/bpf/verifier.c | 183 + 1 file changed, 183 insertions(+) diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index 73811d69e7be..7365a190cbd6 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -332,6 +332,185 @@ static struct bpf_map *ld_imm64_to_map_ptr(struct bpf_insn *insn) return (struct bpf_map *) (unsigned long) imm64; } +/* non-recursive DFS pseudo code + * 1 procedure DFS-iterative(G,v): + * 2 label v as discovered + * 3 let S be a stack + * 4 S.push(v) + * 5 while S is not empty + * 6t <- S.pop() + * 7if t is what we're looking for: + * 8return t + * 9for all edges e in G.adjacentEdges(t) do + * 10 if edge e is already labelled + * 11 continue with the next edge + * 12 w <- G.adjacentVertex(t,e) + * 13 if vertex w is not discovered and not explored + * 14 label e as tree-edge + * 15 label w as discovered + * 16 S.push(w) + * 17 continue at 5 + * 18 else if vertex w is discovered + * 19 label e as back-edge + * 20 else + * 21 // vertex w is explored + * 22 label e as forward- or cross-edge + * 23 label t as explored + * 24 S.pop() + * + * convention: + * 0x10 - discovered + * 0x11 - discovered and fall-through edge labelled + * 0x12 - discovered and fall-through and branch edges labelled + * 0x20 - explored + */ + +enum { + DISCOVERED = 0x10, + EXPLORED = 0x20, + FALLTHROUGH = 1, + BRANCH = 2, +}; + +#define PUSH_INT(I) \ + do { \ + if (cur_stack >= insn_cnt) { \ + ret = -E2BIG; \ + goto free_st; \ + } \ + stack[cur_stack++] = I; \ + } while (0) + +#define PEEK_INT() \ + ({ \ + int _ret; \ + if (cur_stack == 0) \ + _ret = -1; \ + else \ + _ret = stack[cur_stack - 1]; \ + _ret; \ +}) + +#define POP_INT() \ + ({ \ + int _ret; \ + if (cur_stack == 0) \ 
+ _ret = -1; \ + else \ + _ret = stack[--cur_stack]; \ + _ret; \ +}) + +#define PUSH_INSN(T, W, E) \ + do { \ + int w = W; \ + if (E == FALLTHROUGH && st[T] >= (DISCOVERED | FALLTHROUGH)) \ + break; \ + if (E == BRANCH && st[T] >= (DISCOVERED | BRANCH)) \ + break; \ + if (w < 0 || w >= insn_cnt) { \ + verbose("jump out of range from insn %d to %d\n", T, w); \ + ret = -EINVAL; \ + goto free_st; \ + } \ + if (st[w] == 0) { \ + /* tree-edge */ \ + st[T] = DISCOVERED | E; \ + st[w] = DISCOVERED; \ + PUSH_INT(w); \ + goto peek_stack; \ + } else if ((st[w] & 0xF0) == DISCOVERED) { \ + verbose("back-edge from insn %d to %d\n", T, w); \ + ret = -EINVAL; \ + goto free_st; \ + } else if (st[w] == EXPLORED) { \ + /* forward- or cross-edge */ \ + st[T] = DISCOVERED | E; \ + } else { \ + verbose("insn state internal bug\n"); \ + ret = -EFAULT; \ + goto free_st; \ + } \ + } while (0) + +/* non-recursive depth-first-search to detect loops in BPF program + * loop == back-edge in directed graph + */ +static int check_cfg(struct verifier_env *env) +{ + struct bpf_insn *insns = env->prog->insnsi; + int insn_cnt = env->prog->len; + int cur_stack = 0; + int *stack; + int ret = 0; + int *st; + int i, t; + + st = kzalloc(sizeof(int) * insn_cnt, GFP_KERNEL); + if (!st) + return -ENOMEM; + + stack = kzalloc(sizeof(int) * insn_cnt, GFP_KERNEL); + if (!stack) { + kfree(st); + return -ENOMEM; + } + + st[0] = DISCOVERED; /* mark 1st insn as discovered */ + PUSH_INT(0); + +peek_stack: + while ((t = PEEK_INT()) != -1) { + if (BPF_CLASS(insns[t].code) == BPF_JMP) { +
[PATCH RFC v7 net-next 15/28] bpf: allow eBPF programs to use maps
expose bpf_map_lookup_elem(), bpf_map_update_elem(), bpf_map_delete_elem() map accessors to eBPF programs Signed-off-by: Alexei Starovoitov --- include/linux/bpf.h |5 include/uapi/linux/bpf.h |3 ++ kernel/bpf/syscall.c | 68 ++ 3 files changed, 76 insertions(+) diff --git a/include/linux/bpf.h b/include/linux/bpf.h index ad1bda7ece35..14e23bb10b2d 100644 --- a/include/linux/bpf.h +++ b/include/linux/bpf.h @@ -131,4 +131,9 @@ struct bpf_prog *bpf_prog_get(u32 ufd); /* verify correctness of eBPF program */ int bpf_check(struct bpf_prog *fp, union bpf_attr *attr); +/* in-kernel helper functions called from eBPF programs */ +u64 bpf_map_lookup_elem(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5); +u64 bpf_map_update_elem(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5); +u64 bpf_map_delete_elem(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5); + #endif /* _LINUX_BPF_H */ diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index dee7a2469b8d..f87b501b2e1b 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -419,6 +419,9 @@ union bpf_attr { */ enum bpf_func_id { BPF_FUNC_unspec, + BPF_FUNC_map_lookup_elem, /* void *map_lookup_elem(&map, &key) */ + BPF_FUNC_map_update_elem, /* int map_update_elem(&map, &key, &value) */ + BPF_FUNC_map_delete_elem, /* int map_delete_elem(&map, &key) */ __BPF_FUNC_MAX_ID, }; diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c index 8f11d1549cfc..641bb9e6709c 100644 --- a/kernel/bpf/syscall.c +++ b/kernel/bpf/syscall.c @@ -575,3 +575,71 @@ free_attr: kfree(attr); return err; } + +/* called from eBPF program under rcu lock + * + * if kernel subsystem is allowing eBPF programs to call this function, + * inside its own verifier_ops->get_func_proto() callback it should return + * (struct bpf_func_proto) { + *.ret_type = RET_PTR_TO_MAP_VALUE_OR_NULL, + *.arg1_type = ARG_CONST_MAP_PTR, + *.arg2_type = ARG_PTR_TO_MAP_KEY, + * } + * so that eBPF verifier properly checks the arguments + */ +u64 bpf_map_lookup_elem(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5) +{ + struct
bpf_map *map = (struct bpf_map *) (unsigned long) r1; + void *key = (void *) (unsigned long) r2; + void *value; + + WARN_ON_ONCE(!rcu_read_lock_held()); + + value = map->ops->map_lookup_elem(map, key); + + return (unsigned long) value; +} + +/* called from eBPF program under rcu lock + * + * if kernel subsystem is allowing eBPF programs to call this function, + * inside its own verifier_ops->get_func_proto() callback it should return + * (struct bpf_func_proto) { + *.ret_type = RET_INTEGER, + *.arg1_type = ARG_CONST_MAP_PTR, + *.arg2_type = ARG_PTR_TO_MAP_KEY, + *.arg3_type = ARG_PTR_TO_MAP_VALUE, + * } + * so that eBPF verifier properly checks the arguments + */ +u64 bpf_map_update_elem(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5) +{ + struct bpf_map *map = (struct bpf_map *) (unsigned long) r1; + void *key = (void *) (unsigned long) r2; + void *value = (void *) (unsigned long) r3; + + WARN_ON_ONCE(!rcu_read_lock_held()); + + return map->ops->map_update_elem(map, key, value); +} + +/* called from eBPF program under rcu lock + * + * if kernel subsystem is allowing eBPF programs to call this function, + * inside its own verifier_ops->get_func_proto() callback it should return + * (struct bpf_func_proto) { + *.ret_type = RET_INTEGER, + *.arg1_type = ARG_CONST_MAP_PTR, + *.arg2_type = ARG_PTR_TO_MAP_KEY, + * } + * so that eBPF verifier properly checks the arguments + */ +u64 bpf_map_delete_elem(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5) +{ + struct bpf_map *map = (struct bpf_map *) (unsigned long) r1; + void *key = (void *) (unsigned long) r2; + + WARN_ON_ONCE(!rcu_read_lock_held()); + + return map->ops->map_delete_elem(map, key); +} -- 1.7.9.5 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
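The helpers above all share the uniform five-u64-register signature, so pointers travel through u64 casts in r1/r2/r3. A self-contained sketch of that calling convention (toy_map, toy_lookup_elem and helper_lookup are made-up names; the kernel of course calls map->ops->map_lookup_elem() on a real bpf_map):

```c
#include <assert.h>
#include <stddef.h>

struct toy_map { int key[4]; int val[4]; int cnt; };

static void *toy_lookup_elem(struct toy_map *map, const int *key)
{
	int i;

	for (i = 0; i < map->cnt; i++)
		if (map->key[i] == *key)
			return &map->val[i];
	/* misses return NULL, hence RET_PTR_TO_MAP_VALUE_OR_NULL */
	return NULL;
}

/* shape of bpf_map_lookup_elem(): unpack r1/r2, return pointer as u64 */
static unsigned long long helper_lookup(unsigned long long r1,
					unsigned long long r2,
					unsigned long long r3,
					unsigned long long r4,
					unsigned long long r5)
{
	struct toy_map *map = (struct toy_map *)(unsigned long)r1;
	const int *key = (const int *)(unsigned long)r2;

	(void)r3; (void)r4; (void)r5;	/* unused, as in the kernel helpers */
	return (unsigned long)toy_lookup_elem(map, key);
}
```

The verifier-facing proto quoted in the comments is what makes these casts safe: ARG_CONST_MAP_PTR guarantees r1 really is a map pointer and ARG_PTR_TO_MAP_KEY that r2 points to key_size initialized bytes.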
[PATCH RFC v7 net-next 11/28] bpf: handle pseudo BPF_LD_IMM64 insn
eBPF programs passed from userspace are using pseudo BPF_LD_IMM64 instructions to refer to process-local map_fd. Scan the program for such instructions and if FDs are valid, convert them to 'struct bpf_map' pointers which will be used by verifier to check access to maps in bpf_map_lookup/update() calls. If program passes verifier, convert pseudo BPF_LD_IMM64 into generic by dropping BPF_PSEUDO_MAP_FD flag. Note that eBPF interpreter is generic and knows nothing about pseudo insns. Signed-off-by: Alexei Starovoitov --- include/uapi/linux/bpf.h |6 ++ kernel/bpf/verifier.c| 147 ++ 2 files changed, 153 insertions(+) diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index a6fa0416f2bd..04aaaef0daa7 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -176,6 +176,12 @@ enum { .off = 0, \ .imm = ((__u64) (IMM)) >> 32 }) +#define BPF_PSEUDO_MAP_FD 1 + +/* pseudo BPF_LD_IMM64 insn used to refer to process-local map_fd */ +#define BPF_LD_MAP_FD(DST, MAP_FD) \ + BPF_LD_IMM64_RAW(DST, BPF_PSEUDO_MAP_FD, MAP_FD) + /* Short form of mov based on type, BPF_X: dst_reg = src_reg, BPF_K: dst_reg = imm32 */ #define BPF_MOV64_RAW(TYPE, DST, SRC, IMM) \ diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index 81a64a50e48d..73811d69e7be 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -144,10 +144,15 @@ * load/store to bpf_context are checked against known fields */ +#define MAX_USED_MAPS 64 /* max number of maps accessed by one eBPF program */ + /* single container for all structs * one verifier_env per bpf_check() call */ struct verifier_env { + struct bpf_prog *prog; /* eBPF program being verified */ + struct bpf_map *used_maps[MAX_USED_MAPS]; /* array of map's used by eBPF program */ + u32 used_map_cnt; /* number of used maps */ }; /* verbose verifier prints what it's seeing @@ -319,6 +324,115 @@ static void print_bpf_insn(struct bpf_insn *insn) } } +/* return the map pointer stored inside BPF_LD_IMM64 instruction */ +static 
struct bpf_map *ld_imm64_to_map_ptr(struct bpf_insn *insn) +{ + u64 imm64 = ((u64) (u32) insn[0].imm) | ((u64) (u32) insn[1].imm) << 32; + + return (struct bpf_map *) (unsigned long) imm64; +} + +/* look for pseudo eBPF instructions that access map FDs and + * replace them with actual map pointers + */ +static int replace_map_fd_with_map_ptr(struct verifier_env *env) +{ + struct bpf_insn *insn = env->prog->insnsi; + int insn_cnt = env->prog->len; + int i, j; + + for (i = 0; i < insn_cnt; i++, insn++) { + if (insn[0].code == (BPF_LD | BPF_IMM | BPF_DW)) { + struct bpf_map *map; + struct fd f; + + if (i == insn_cnt - 1 || insn[1].code != 0 || + insn[1].dst_reg != 0 || insn[1].src_reg != 0 || + insn[1].off != 0) { + verbose("invalid bpf_ld_imm64 insn\n"); + return -EINVAL; + } + + if (insn->src_reg == 0) + /* valid generic load 64-bit imm */ + goto next_insn; + + if (insn->src_reg != BPF_PSEUDO_MAP_FD) { + verbose("unrecognized bpf_ld_imm64 insn\n"); + return -EINVAL; + } + + f = fdget(insn->imm); + + map = bpf_map_get(f); + if (IS_ERR(map)) { + verbose("fd %d is not pointing to valid bpf_map\n", + insn->imm); + fdput(f); + return PTR_ERR(map); + } + + /* store map pointer inside BPF_LD_IMM64 instruction */ + insn[0].imm = (u32) (unsigned long) map; + insn[1].imm = ((u64) (unsigned long) map) >> 32; + + /* check whether we recorded this map already */ + for (j = 0; j < env->used_map_cnt; j++) + if (env->used_maps[j] == map) { + fdput(f); + goto next_insn; + } + + if (env->used_map_cnt >= MAX_USED_MAPS) { + fdput(f); + return -E2BIG; + } + + /* remember this map */ + env->used_maps[env->used_map_cnt++] = map; + + /* hold the map. If the
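The insn[0].imm/insn[1].imm stores above work because BPF_LD_IMM64 occupies two 8-byte instructions, each with a 32-bit imm field. A minimal sketch of the split and the ld_imm64_to_map_ptr()-style reassembly (toy_insn, store_map_ptr and load_map_ptr are made-up names):

```c
#include <assert.h>
#include <stdint.h>

struct toy_insn { int32_t imm; };

/* split a 64-bit host pointer across the two imm fields, as the
 * verifier does after resolving a map fd */
static void store_map_ptr(struct toy_insn insn[2], void *map)
{
	uint64_t v = (uint64_t)(unsigned long)map;

	insn[0].imm = (int32_t)(uint32_t)v;		/* low 32 bits */
	insn[1].imm = (int32_t)(uint32_t)(v >> 32);	/* high 32 bits */
}

/* reassemble it, mirroring ld_imm64_to_map_ptr() */
static void *load_map_ptr(const struct toy_insn insn[2])
{
	uint64_t v = ((uint64_t)(uint32_t)insn[0].imm) |
		     ((uint64_t)(uint32_t)insn[1].imm) << 32;

	return (void *)(unsigned long)v;
}
```

The (uint32_t) casts before widening matter: imm is signed, and sign-extending the low half would corrupt the upper bits of the reassembled pointer.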
Re: [PATCH] net: stmmac: add dcrs parameter
On Tue, Aug 26, 2014 at 9:20 PM, Giuseppe CAVALLARO wrote: > On 8/26/2014 2:35 PM, Vince Bridgers wrote: >> >> Hi Peppe, >> In the Synopsys EMAC case, carrier sense is used to stop transmitting if no carrier is sensed during a transmission. This is only useful if the media in use is true half duplex media (like obsolete 10Base2 or 10Base5). If no one is using true half duplex media, then is it possible to disable this by default? If we're not sure, then having an option feels like the right thing to do. >>> >>> Indeed this is what I had done in the patch. >>> >>> http://git.stlinux.com/?p=stm/linux-sh4-2.6.32.y.git;a=commit;h=b0b863bf65c36dc593f6b7b4b418394fd880dae2 >>> >>> Also in case of carrier sense the frame will be dropped in any case later. >>> >>> Let me know if you Acked this patch so I will rebase it on >>> net.git and send it soon >>> >>> peppe >>> >> >> Yes, this looks good to me. I don't expect anyone is using 10Base2 or >> 10Base5 anymore, so it's ok to disable DCRS by default. >> >> ack >> >> All the best, > > > thx so much, I will send this patch (with your Acked-by) ported to > net.git soon. > > Chen-Yu, Ley Foon, pls let me know if it is ok for you as well Looks good. Thanks! Cheers ChenYu
[PATCH RFC v7 net-next 18/28] tracing: allow eBPF programs call printk()
limited printk() with %d %u %x %p modifiers only Signed-off-by: Alexei Starovoitov --- include/uapi/linux/bpf.h |1 + kernel/trace/bpf_trace.c | 61 ++ 2 files changed, 62 insertions(+) diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index 55adff33083e..1ec3d293d14e 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -430,6 +430,7 @@ enum bpf_func_id { BPF_FUNC_fetch_u8,/* u8 bpf_fetch_u8(void *unsafe_ptr) */ BPF_FUNC_memcmp, /* int bpf_memcmp(void *unsafe_ptr, void *safe_ptr, int size) */ BPF_FUNC_dump_stack, /* void bpf_dump_stack(void) */ + BPF_FUNC_printk, /* int bpf_printk(const char *fmt, int fmt_size, ...) */ __BPF_FUNC_MAX_ID, }; diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c index b4751e2c0d52..ff98be5a24d6 100644 --- a/kernel/trace/bpf_trace.c +++ b/kernel/trace/bpf_trace.c @@ -60,6 +60,60 @@ static u64 bpf_dump_stack(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5) return 0; } +/* limited printk() + * only %d %u %x %ld %lu %lx %lld %llu %llx %p conversion specifiers allowed + */ +static u64 bpf_printk(u64 r1, u64 fmt_size, u64 r3, u64 r4, u64 r5) +{ + char *fmt = (char *) (long) r1; + int fmt_cnt = 0; + bool mod_l[3] = {}; + int i; + + /* bpf_check() guarantees that fmt points to bpf program stack and +* fmt_size bytes of it were initialized by bpf program +*/ + if (fmt[fmt_size - 1] != 0) + return -EINVAL; + + /* check format string for allowed specifiers */ + for (i = 0; i < fmt_size; i++) + if (fmt[i] == '%') { + if (fmt_cnt >= 3) + return -EINVAL; + i++; + if (i >= fmt_size) + return -EINVAL; + + if (fmt[i] == 'l') { + mod_l[fmt_cnt] = true; + i++; + if (i >= fmt_size) + return -EINVAL; + } else if (fmt[i] == 'p') { + mod_l[fmt_cnt] = true; + fmt_cnt++; + continue; + } + + if (fmt[i] == 'l') { + mod_l[fmt_cnt] = true; + i++; + if (i >= fmt_size) + return -EINVAL; + } + + if (fmt[i] != 'd' && fmt[i] != 'u' && fmt[i] != 'x') + return -EINVAL; + fmt_cnt++; + } + + return __trace_printk((unsigned long) 
__builtin_return_address(3), fmt, + mod_l[0] ? r3 : (u32) r3, + mod_l[1] ? r4 : (u32) r4, + mod_l[2] ? r5 : (u32) r5); +} + static struct bpf_func_proto tracing_filter_funcs[] = { #define FETCH(SIZE)\ [BPF_FUNC_fetch_##SIZE] = { \ @@ -86,6 +140,13 @@ static struct bpf_func_proto tracing_filter_funcs[] = { .gpl_only = false, .ret_type = RET_VOID, }, + [BPF_FUNC_printk] = { + .func = bpf_printk, + .gpl_only = true, + .ret_type = RET_INTEGER, + .arg1_type = ARG_PTR_TO_STACK, + .arg2_type = ARG_CONST_STACK_SIZE, + }, [BPF_FUNC_map_lookup_elem] = { .func = bpf_map_lookup_elem, .gpl_only = false, -- 1.7.9.5
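The format-string scan above is easy to get subtly wrong (the doubled 'l' check is what admits %ll), so here is a standalone sketch of the same validator that can be tested in userspace. check_fmt() is an illustrative name; it returns -1 where the kernel returns -EINVAL:

```c
#include <assert.h>
#include <stddef.h>

/* at most three conversions; only %d %u %x with optional l/ll, plus %p */
static int check_fmt(const char *fmt, size_t fmt_size)
{
	int fmt_cnt = 0;

	/* the kernel relies on the verifier to prove fmt_size bytes of
	 * program stack are initialized; here we only demand NUL termination */
	if (fmt_size == 0 || fmt[fmt_size - 1] != '\0')
		return -1;

	for (size_t i = 0; i < fmt_size; i++) {
		if (fmt[i] != '%')
			continue;
		if (fmt_cnt >= 3)	/* only r3/r4/r5 are available */
			return -1;
		if (++i >= fmt_size)
			return -1;
		if (fmt[i] == 'p') {	/* %p consumes the full 64-bit arg */
			fmt_cnt++;
			continue;
		}
		if (fmt[i] == 'l' && ++i >= fmt_size)	/* %l.. */
			return -1;
		if (fmt[i] == 'l' && ++i >= fmt_size)	/* %ll.. */
			return -1;
		if (fmt[i] != 'd' && fmt[i] != 'u' && fmt[i] != 'x')
			return -1;
		fmt_cnt++;
	}
	return 0;
}
```

The three-conversion cap follows directly from the helper ABI: after fmt (r1) and fmt_size (r2), only r3, r4 and r5 remain to carry values.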
[PATCH RFC v7 net-next 20/28] tracing: allow eBPF programs to call ktime_get_ns() and get_current()
Signed-off-by: Alexei Starovoitov --- include/uapi/linux/bpf.h |2 ++ kernel/trace/bpf_trace.c | 20 2 files changed, 22 insertions(+) diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index 1ec3d293d14e..e14e147c8899 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -431,6 +431,8 @@ enum bpf_func_id { BPF_FUNC_memcmp, /* int bpf_memcmp(void *unsafe_ptr, void *safe_ptr, int size) */ BPF_FUNC_dump_stack, /* void bpf_dump_stack(void) */ BPF_FUNC_printk, /* int bpf_printk(const char *fmt, int fmt_size, ...) */ + BPF_FUNC_ktime_get_ns,/* u64 bpf_ktime_get_ns(void) */ + BPF_FUNC_get_current, /* struct task_struct *bpf_get_current(void) */ __BPF_FUNC_MAX_ID, }; diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c index ff98be5a24d6..a98e13e1131b 100644 --- a/kernel/trace/bpf_trace.c +++ b/kernel/trace/bpf_trace.c @@ -114,6 +114,16 @@ static u64 bpf_printk(u64 r1, u64 fmt_size, u64 r3, u64 r4, u64 r5) mod_l[2] ? r5 : (u32) r5); } +static u64 bpf_ktime_get_ns(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5) +{ + return ktime_get_ns(); +} + +static u64 bpf_get_current(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5) +{ + return (u64) (long) current; +} + static struct bpf_func_proto tracing_filter_funcs[] = { #define FETCH(SIZE)\ [BPF_FUNC_fetch_##SIZE] = { \ @@ -169,6 +179,16 @@ static struct bpf_func_proto tracing_filter_funcs[] = { .arg1_type = ARG_CONST_MAP_PTR, .arg2_type = ARG_PTR_TO_MAP_KEY, }, + [BPF_FUNC_ktime_get_ns] = { + .func = bpf_ktime_get_ns, + .gpl_only = true, + .ret_type = RET_INTEGER, + }, + [BPF_FUNC_get_current] = { + .func = bpf_get_current, + .gpl_only = true, + .ret_type = RET_INTEGER, + }, }; static const struct bpf_func_proto *tracing_filter_func_proto(enum bpf_func_id func_id) -- 1.7.9.5
[PATCH RFC v7 net-next 16/28] bpf: split eBPF out of NET
let eBPF have its own CONFIG_BPF, so that tracing and other subsystems don't need to depend on all of NET Signed-off-by: Alexei Starovoitov --- arch/Kconfig |3 +++ kernel/Makefile |2 +- kernel/bpf/core.c | 12 net/Kconfig |1 + 4 files changed, 17 insertions(+), 1 deletion(-) diff --git a/arch/Kconfig b/arch/Kconfig index 0eae9df35b88..80a72f6f6b60 100644 --- a/arch/Kconfig +++ b/arch/Kconfig @@ -333,6 +333,9 @@ config SECCOMP_FILTER See Documentation/prctl/seccomp_filter.txt for details. +config BPF + boolean + config HAVE_CC_STACKPROTECTOR bool help diff --git a/kernel/Makefile b/kernel/Makefile index dc5c77544fd6..17ea6d4a9a24 100644 --- a/kernel/Makefile +++ b/kernel/Makefile @@ -86,7 +86,7 @@ obj-$(CONFIG_RING_BUFFER) += trace/ obj-$(CONFIG_TRACEPOINTS) += trace/ obj-$(CONFIG_IRQ_WORK) += irq_work.o obj-$(CONFIG_CPU_PM) += cpu_pm.o -obj-$(CONFIG_NET) += bpf/ +obj-$(CONFIG_BPF) += bpf/ obj-$(CONFIG_PERF_EVENTS) += events/ diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c index 0434c2170f2b..c17ba0ef3dcf 100644 --- a/kernel/bpf/core.c +++ b/kernel/bpf/core.c @@ -537,3 +537,15 @@ void bpf_prog_free(struct bpf_prog *fp) bpf_jit_free(fp); } EXPORT_SYMBOL_GPL(bpf_prog_free); + +/* To emulate LD_ABS/LD_IND instructions __sk_run_filter() may call + * skb_copy_bits(), so provide a weak definition for it in NET-less config. + * seccomp_check_filter() verifies that seccomp filters are not using + * LD_ABS/LD_IND instructions. Other BPF users (like tracing filters) + * must not use these instructions unless ctx==skb + */ +int __weak skb_copy_bits(const struct sk_buff *skb, int offset, void *to, +int len) +{ + return -EFAULT; +} diff --git a/net/Kconfig b/net/Kconfig index 4051fdfa4367..9a99e16d6f28 100644 --- a/net/Kconfig +++ b/net/Kconfig @@ -6,6 +6,7 @@ menuconfig NET bool "Networking support" select NLATTR select GENERIC_NET_UTILS + select BPF ---help--- Unless you really know what you are doing, you should say Y here. 
The reason is that some programs need kernel networking support even -- 1.7.9.5
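The __weak skb_copy_bits() definition in the patch relies on ELF weak-symbol semantics: the weak body is linked in only when no strong definition of the same symbol exists. A minimal userspace sketch of that trick (nei­ther the function name nor the -14 constant is from the patch; net_stub_copy is made up and -14 merely stands in for -EFAULT):

```c
#include <assert.h>

/* weak fallback: used only because no strong net_stub_copy() is linked,
 * exactly how a NET-less kernel build gets the -EFAULT stub */
int __attribute__((weak)) net_stub_copy(const void *from, int offset,
					void *to, int len)
{
	(void)from; (void)offset; (void)to; (void)len;
	return -14;	/* stands in for -EFAULT */
}
```

In the kernel, CONFIG_NET builds provide the strong skb_copy_bits() from net/core/skbuff.c, which silently overrides the weak stub; only tracing-only configs ever reach the -EFAULT body (and, per the comment, the verifier keeps LD_ABS/LD_IND programs away from it anyway). Note this attribute is a GCC/Clang extension, not standard C.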
[PATCH RFC v7 net-next 19/28] tracing: allow eBPF programs to be attached to kprobe/kretprobe
Signed-off-by: Alexei Starovoitov --- kernel/trace/trace_kprobe.c | 28 1 file changed, 28 insertions(+) diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c index 282f6e4e5539..b6db92207c99 100644 --- a/kernel/trace/trace_kprobe.c +++ b/kernel/trace/trace_kprobe.c @@ -19,6 +19,7 @@ #include #include +#include #include "trace_probe.h" @@ -930,6 +931,22 @@ __kprobe_trace_func(struct trace_kprobe *tk, struct pt_regs *regs, if (ftrace_trigger_soft_disabled(ftrace_file)) return; + if (call->flags & TRACE_EVENT_FL_BPF) { + struct bpf_context ctx = {}; + unsigned long args[3]; + /* get first 3 arguments of the function. x64 syscall ABI uses +* the same 3 registers as x64 calling convention. +* todo: implement it cleanly via arch specific +* regs_get_argument_nth() helper +*/ + syscall_get_arguments(current, regs, 0, 3, args); + ctx.arg1 = args[0]; + ctx.arg2 = args[1]; + ctx.arg3 = args[2]; + trace_filter_call_bpf(ftrace_file->filter, &ctx); + return; + } + local_save_flags(irq_flags); pc = preempt_count(); @@ -978,6 +995,17 @@ __kretprobe_trace_func(struct trace_kprobe *tk, struct kretprobe_instance *ri, if (ftrace_trigger_soft_disabled(ftrace_file)) return; + if (call->flags & TRACE_EVENT_FL_BPF) { + struct bpf_context ctx = {}; + /* assume that register used to return a value from syscall is +* the same as register used to return a value from a function +* todo: provide arch specific helper +*/ + ctx.ret = syscall_get_return_value(current, regs); + trace_filter_call_bpf(ftrace_file->filter, &ctx); + return; + } + local_save_flags(irq_flags); pc = preempt_count(); -- 1.7.9.5
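The two hooks above only marshal register state into a bpf_context before handing it to the filter: the kprobe path copies the first three argument registers, the kretprobe path the return-value register. A toy sketch of that marshalling (toy_regs, toy_ctx and both fill functions are made-up names; the x86-64-ish register names are for illustration only):

```c
#include <assert.h>

struct toy_regs { unsigned long di, si, dx, ax; };	/* x86-64-ish */
struct toy_ctx { unsigned long long arg1, arg2, arg3, ret; };

/* kprobe entry: expose the first three argument registers, as the
 * syscall_get_arguments() call in __kprobe_trace_func() does */
static void fill_entry_ctx(struct toy_ctx *ctx, const struct toy_regs *regs)
{
	ctx->arg1 = regs->di;
	ctx->arg2 = regs->si;
	ctx->arg3 = regs->dx;
}

/* kretprobe: expose only the return-value register, as
 * syscall_get_return_value() does in __kretprobe_trace_func() */
static void fill_ret_ctx(struct toy_ctx *ctx, const struct toy_regs *regs)
{
	ctx->ret = regs->ax;
}
```

This is also the whole attack surface the eBPF program gets in this patch: arg1..arg3 on entry, ret on return, nothing else from pt_regs, pending the arch-specific helpers the todo comments mention.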