Re: [PATCH v6 5/6] arm64: add SIGSYS siginfo for compat task
On 08/27/2014 02:55 AM, Will Deacon wrote:
On Thu, Aug 21, 2014 at 09:56:44AM +0100, AKASHI Takahiro wrote:

SIGSYS is primarily used in secure computing to notify the tracer. This patch
allows a signal handler in a compat task to get correct information, with
SA_SIGINFO specified, when this signal is delivered.

Signed-off-by: AKASHI Takahiro
---
 arch/arm64/include/asm/compat.h | 7 +++++++
 arch/arm64/kernel/signal32.c    | 8 ++++++++
 2 files changed, 15 insertions(+)

diff --git a/arch/arm64/include/asm/compat.h b/arch/arm64/include/asm/compat.h
index 253e33b..c877915 100644
--- a/arch/arm64/include/asm/compat.h
+++ b/arch/arm64/include/asm/compat.h
@@ -205,6 +205,13 @@ typedef struct compat_siginfo {
 			compat_long_t _band;	/* POLL_IN, POLL_OUT, POLL_MSG */
 			int _fd;
 		} _sigpoll;
+
+		/* SIGSYS */
+		struct {
+			compat_uptr_t _call_addr; /* calling user insn */
+			int _syscall;	/* triggering system call number */
+			unsigned int _arch;	/* AUDIT_ARCH_* of syscall */
+		} _sigsys;
 	} _sifields;
 } compat_siginfo_t;

diff --git a/arch/arm64/kernel/signal32.c b/arch/arm64/kernel/signal32.c
index 1b9ad02..aa550d6 100644
--- a/arch/arm64/kernel/signal32.c
+++ b/arch/arm64/kernel/signal32.c
@@ -186,6 +186,14 @@ int copy_siginfo_to_user32(compat_siginfo_t __user *to, const siginfo_t *from)
 		err |= __put_user(from->si_uid, &to->si_uid);
 		err |= __put_user((compat_uptr_t)(unsigned long)from->si_ptr,
 				  &to->si_ptr);
 		break;
+#ifdef __ARCH_SIGSYS
+	case __SI_SYS:
+		err |= __put_user((compat_uptr_t)(unsigned long)
+				  from->si_call_addr, &to->si_call_addr);
+		err |= __put_user(from->si_syscall, &to->si_syscall);
+		err |= __put_user(from->si_arch, &to->si_arch);
+		break;
+#endif

I think you should drop this #ifdef. We care about whether arch/arm/ defines
__ARCH_SIGSYS, not whether arm64 defines it (they both happen to define it
anyway).

Thanks.
Done.

-Takahiro AKASHI

Will

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
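For readers following along: the fields being copied here (si_call_addr, si_syscall, si_arch) are exactly what a userspace SIGSYS handler consumes. A minimal sketch, with helper names invented for illustration, that installs a seccomp filter trapping getpriority(2) with SECCOMP_RET_TRAP and reads si_syscall from the handler:

```c
#define _GNU_SOURCE
#include <signal.h>
#include <stddef.h>
#include <sys/prctl.h>
#include <sys/resource.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <linux/filter.h>
#include <linux/seccomp.h>

/* set by the handler so the caller can see which syscall was trapped */
static volatile sig_atomic_t trapped_syscall = -1;

static void sigsys_handler(int sig, siginfo_t *info, void *ctx)
{
	(void)sig; (void)ctx;
	/* si_syscall (and si_arch) are the fields the patch copies
	 * into the compat siginfo for the __SI_SYS case */
	trapped_syscall = info->si_syscall;
}

/* Install a filter that traps getpriority(2) only, trigger it, and
 * return the syscall number reported via SIGSYS (demo helper). */
static int trap_getpriority_and_report(void)
{
	struct sigaction sa = { 0 };
	sa.sa_sigaction = sigsys_handler;
	sa.sa_flags = SA_SIGINFO;
	sigaction(SIGSYS, &sa, NULL);

	struct sock_filter filter[] = {
		BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
			 offsetof(struct seccomp_data, nr)),
		BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_getpriority, 0, 1),
		BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_TRAP),
		BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
	};
	struct sock_fprog prog = {
		.len = sizeof(filter) / sizeof(filter[0]),
		.filter = filter,
	};

	if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) ||
	    prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog))
		return -1;

	getpriority(PRIO_PROCESS, 0);	/* trapped: delivers SIGSYS */
	return trapped_syscall;
}
```

Without the patch, a compat (AArch32) task would see garbage in these fields; with it, the handler above works the same as on a native task.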
Re: [PATCH v6 4/6] arm64: add seccomp syscall for compat task
On 08/27/2014 02:53 AM, Will Deacon wrote:
On Thu, Aug 21, 2014 at 09:56:43AM +0100, AKASHI Takahiro wrote:

This patch allows a compat task to issue the seccomp() system call.

Signed-off-by: AKASHI Takahiro
---
 arch/arm64/include/asm/unistd.h   | 2 +-
 arch/arm64/include/asm/unistd32.h | 3 +++
 2 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/include/asm/unistd.h b/arch/arm64/include/asm/unistd.h
index 4bc95d2..cf6ee31 100644
--- a/arch/arm64/include/asm/unistd.h
+++ b/arch/arm64/include/asm/unistd.h
@@ -41,7 +41,7 @@
 #define __ARM_NR_compat_cacheflush	(__ARM_NR_COMPAT_BASE+2)
 #define __ARM_NR_compat_set_tls	(__ARM_NR_COMPAT_BASE+5)

-#define __NR_compat_syscalls		383
+#define __NR_compat_syscalls		384
 #endif

 #define __ARCH_WANT_SYS_CLONE

diff --git a/arch/arm64/include/asm/unistd32.h b/arch/arm64/include/asm/unistd32.h
index e242600..2922c40 100644
--- a/arch/arm64/include/asm/unistd32.h
+++ b/arch/arm64/include/asm/unistd32.h
@@ -787,3 +787,6 @@ __SYSCALL(__NR_sched_setattr, sys_sched_setattr)
 __SYSCALL(__NR_sched_getattr, sys_sched_getattr)
 #define __NR_renameat2 382
 __SYSCALL(__NR_renameat2, sys_renameat2)
+#define __NR_seccomp 383
+__SYSCALL(__NR_seccomp, sys_seccomp)

This will need rebasing onto -rc2, as we've hooked up two new compat
syscalls recently.

Thanks for the heads-up. Fixed it.

-Takahiro AKASHI

Will
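A quick way to verify that a seccomp(2) entry is actually wired up, usable from either a native or a compat binary, is to probe it from userspace. A sketch (the 0xffff command is just a bogus operation, chosen here for illustration): the kernel's dispatcher rejects an unknown operation with EINVAL, while a kernel without the syscall table entry fails with ENOSYS.

```c
#include <errno.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Returns 1 if the seccomp(2) entry point exists, 0 otherwise. */
static int have_seccomp_syscall(void)
{
	errno = 0;
	syscall(__NR_seccomp, 0xffff /* bogus op */, 0, (void *)0);
	return errno == EINVAL;	/* ENOSYS would mean: not wired up */
}
```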
Re: [PATCH v6 2/6] arm64: ptrace: allow tracer to skip a system call
On 08/27/2014 02:51 AM, Will Deacon wrote:
On Fri, Aug 22, 2014 at 01:35:17AM +0100, AKASHI Takahiro wrote:
On 08/22/2014 02:08 AM, Kees Cook wrote:
On Thu, Aug 21, 2014 at 3:56 AM, AKASHI Takahiro wrote:

diff --git a/arch/arm64/kernel/ptrace.c b/arch/arm64/kernel/ptrace.c
index 8876049..c54dbcc 100644
--- a/arch/arm64/kernel/ptrace.c
+++ b/arch/arm64/kernel/ptrace.c
@@ -1121,9 +1121,29 @@ static void tracehook_report_syscall(struct pt_regs *regs,

 asmlinkage int syscall_trace_enter(struct pt_regs *regs)
 {
+	unsigned int saved_syscallno = regs->syscallno;
+
 	if (test_thread_flag(TIF_SYSCALL_TRACE))
 		tracehook_report_syscall(regs, PTRACE_SYSCALL_ENTER);

+	if (IS_SKIP_SYSCALL(regs->syscallno)) {
+		/*
+		 * RESTRICTION: we can't modify a return value of user
+		 * issued syscall(-1) here. In order to ease this flavor,
+		 * we need to treat whatever value in x0 as a return value,
+		 * but this might result in a bogus value being returned.
+		 */
+		/*
+		 * NOTE: syscallno may also be set to -1 if fatal signal is
+		 * detected in tracehook_report_syscall_entry(), but since
+		 * a value set to x0 here is not used in this case, we may
+		 * neglect the case.
+		 */
+		if (!test_thread_flag(TIF_SYSCALL_TRACE) ||
+		    (IS_SKIP_SYSCALL(saved_syscallno)))
+			regs->regs[0] = -ENOSYS;
+	}
+

I don't have a runtime environment yet for arm64, so I can't test this
directly myself; I'm just trying to eyeball this. :)

Once the seccomp logic is added here, I don't think using -2 as a special
value will work. Doesn't this mean an Oops is possible when the user issues
a "-2" syscall? As in, if TIF_SYSCALL_WORK is set, and the user passed -2
as the syscall, audit will be called only on entry, and then skipped on
exit?

Oops, you're absolutely right. I didn't think of this case.
syscall_trace_enter() should not return a syscallno directly, but always
return -1 if syscallno < 0 (except when secure_computing() returns -1).
This also implies that tracehook_report_syscall() should also have a
return value.
Will, is this fine with you?

Well, the first thing that jumps out at me is why this is being done
completely differently for arm64 and arm. I thought adding the new ptrace
requests would reconcile the differences?

I'm not sure which portion of my code you regard as "completely different",
but:

1) Setting x0 to -ENOSYS is necessary because, otherwise, a user-issued
   syscall(-1) will return a bogus value when audit tracing is on.
   Please note that, on arm:

                   not traced    traced
                   ----------    ------
   syscall(-1)     aborted       OOPs (BUG_ON)
   syscall(-3000)  aborted       aborted
   syscall(1000)   ENOSYS        ENOSYS

   So, anyhow, it's a bit difficult and meaningless to mimic these invalid
   cases.

2) Branching to a new label, syscall_trace_return_skip (see entry.S), after
   syscall_trace_enter() is necessary in order to avoid an OOPS in
   audit_syscall_enter(), as we discussed.

Did I make it clear?

-Takahiro AKASHI

Will
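The skip-and-fake-return dance being debated here is easiest to see from the tracer's side. A sketch of the equivalent on x86-64, where the tracer rewrites orig_rax instead of using PTRACE_SET_SYSCALL but the control flow (rewrite at syscall-entry stop, fake the result at syscall-exit stop) is the same; the function name and the 12345/42 values are invented for the demo:

```c
#include <signal.h>
#include <sys/ptrace.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/user.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork a child, skip its getppid(2), fake a return value of 12345,
 * and report the child's exit code (42 on success). */
static int skip_getppid_demo(void)
{
	pid_t pid = fork();
	if (pid == 0) {
		ptrace(PTRACE_TRACEME, 0, NULL, NULL);
		raise(SIGSTOP);
		long r = syscall(SYS_getppid);	/* will be skipped */
		_exit(r == 12345 ? 42 : 1);
	}

	int status, pending = 0;
	waitpid(pid, &status, 0);		/* child's SIGSTOP */
	ptrace(PTRACE_SYSCALL, pid, NULL, NULL);

	while (waitpid(pid, &status, 0) == pid && !WIFEXITED(status)) {
		struct user_regs_struct regs;
		ptrace(PTRACE_GETREGS, pid, NULL, &regs);

		if (pending) {
			/* syscall-exit stop of the skipped call */
			regs.rax = 12345;
			ptrace(PTRACE_SETREGS, pid, NULL, &regs);
			pending = 0;
		} else if (regs.orig_rax == SYS_getppid) {
			/* syscall-entry stop: nr -1 tells the kernel
			 * not to execute the syscall at all */
			regs.orig_rax = (unsigned long long)-1;
			ptrace(PTRACE_SETREGS, pid, NULL, &regs);
			pending = 1;
		}
		ptrace(PTRACE_SYSCALL, pid, NULL, NULL);
	}
	return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

This is exactly the situation the thread discusses: without the tracee-side -ENOSYS default (or a tracer that fakes the result at the exit stop), the skipped syscall returns whatever was left in the return register.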
Re: [alsa-devel] [PATCH 1/2] regmap: cache: Fix regcache_sync_block for non-autoincrementing devices
On 08/26/2014 05:21 PM, Takashi Iwai wrote:
At Tue, 26 Aug 2014 17:03:12 +0300, Jarkko Nikula wrote:

Commit 75a5f89f635c ("regmap: cache: Write consecutive registers in a
single block write") expected that autoincrementing writes are supported
if the hardware has a register format which can support raw writes. This
is not necessarily true, and thus for instance an rbtree sync can fail
when there is a need to sync multiple consecutive registers but the block
write to the device fails due to unsupported autoincrementing writes. Fix
this by splitting the raw block sync into a series of single register
writes for devices that don't support autoincrementing writes.

Wouldn't it suffice to correct regmap_can_raw_write() to return false if
map->use_single_rw is set?

I don't know. I was thinking that also, but was unsure about it, since the
regcache_sync_block_raw() and regcache_sync_block_single() code paths use
different regmap write functions. regcache_sync_block_raw() ends up calling
_regmap_raw_write(), which takes care of the page select operation when
needed, and regcache_sync_block_single() uses _regmap_write(), which
doesn't. Which makes me think: should regcache_sync_block_single() also use
_regmap_raw_write() in order to take care of page selects?

--
Jarkko
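For reference, the knob being discussed is declared per driver in its regmap_config. A hypothetical config for a device whose register address does not auto-increment might look like this (sketch only; the foo_* name and register sizes are invented, not from the patch):

```c
static const struct regmap_config foo_regmap_config = {
	.reg_bits	= 8,
	.val_bits	= 8,
	.max_register	= 0x7f,
	/* the device cannot auto-increment the register address, so
	 * raw block writes must be split into single register writes */
	.use_single_rw	= true,
};
```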
[Bugfix] x86, irq: Fix bug in setting IOAPIC pin attributes
Commit 15a3c7cc9154321fc3 "x86, irq: Introduce two helper functions to
support irqdomain map operation" breaks LPSS ACPI enumerated devices.

On startup, the IOAPIC driver preallocates IRQ descriptors and programs
IOAPIC pins with default level and polarity attributes for all legacy
IRQs. Later, legacy IRQ users may fail to set IOAPIC pin attributes if the
requested attributes conflict with the default IOAPIC pin attributes. So
change mp_irqdomain_map() to allow the first legacy IRQ user to reprogram
an IOAPIC pin with different attributes.

Reported-by: Mika Westerberg
Signed-off-by: Jiang Liu
---
Hi Mika,
	We have a plan to kill the function mp_set_gsi_attr() later, so I
have slightly modified your changes. Could you please help to test it
again?
Regards!
Gerry
---
 arch/x86/kernel/apic/io_apic.c | 15 ++++++++++++++-
 1 file changed, 14 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/apic/io_apic.c b/arch/x86/kernel/apic/io_apic.c
index 29290f554e79..40a4aa3f4061 100644
--- a/arch/x86/kernel/apic/io_apic.c
+++ b/arch/x86/kernel/apic/io_apic.c
@@ -1070,6 +1070,11 @@ static int mp_map_pin_to_irq(u32 gsi, int idx, int ioapic, int pin,
 	}

 	if (flags & IOAPIC_MAP_ALLOC) {
+		/* special handling for legacy IRQs */
+		if (irq < nr_legacy_irqs() && info->count == 1 &&
+		    mp_irqdomain_map(domain, irq, pin) != 0)
+			irq = -1;
+
 		if (irq > 0)
 			info->count++;
 		else if (info->count == 0)
@@ -3896,7 +3901,15 @@ int mp_irqdomain_map(struct irq_domain *domain, unsigned int virq,
 			info->polarity = 1;
 		}
 		info->node = NUMA_NO_NODE;
-		info->set = 1;
+
+		/*
+		 * setup_IO_APIC_irqs() programs all legacy IRQs with default
+		 * trigger and polarity attributes. Don't set the flag for that
+		 * case so the first legacy IRQ user could reprogram the pin
+		 * with real trigger and polarity attributes.
+		 */
+		if (virq >= nr_legacy_irqs() || info->count)
+			info->set = 1;
 	}
 	set_io_apic_irq_attr(&attr, ioapic, hwirq, info->trigger,
 			     info->polarity);
--
1.7.10.4
Re: [kernel.org PATCH] Li Zefan is now the 3.4 stable maintainer
On Tue, Aug 26, 2014 at 04:08:58PM -0700, Greg KH wrote:
> Li has agreed to continue to support the 3.4 stable kernel tree until
> September 2016.

Great! Welcome to Li in this strange world of very long term
maintenance :-)

Willy
Re: [PATCH v6 1/6] arm64: ptrace: add PTRACE_SET_SYSCALL
Kees,

On 08/27/2014 02:46 AM, Will Deacon wrote:
On Fri, Aug 22, 2014 at 01:19:13AM +0100, AKASHI Takahiro wrote:
On 08/22/2014 01:47 AM, Kees Cook wrote:
On Thu, Aug 21, 2014 at 3:56 AM, AKASHI Takahiro wrote:

To allow a tracer to change/skip a system call by rewriting the syscall
number, there are several approaches:

(1) modify the x8 register with ptrace(PTRACE_SETREGSET), and handle this
    case later on in syscall_trace_enter(), or
(2) support ptrace(PTRACE_SET_SYSCALL) as on arm

Given the fact that user_pt_regs doesn't expose 'syscallno' to the tracer,
and that secure_computing() expects a changed syscall number to be
visible, especially in the case of -1, before this function returns in
syscall_trace_enter(), we'd better take (2).

Signed-off-by: AKASHI Takahiro

Thanks, I like having this on both arm and arm64.

Yeah, having this simplified the code of syscall_trace_enter() a bit, but
it also imposes some restrictions on arm64, too.

> I wonder if other archs should add this option too.

Do you think so? I assumed that SET_SYSCALL is to be avoided if possible.
I also think that SET_SYSCALL should take an extra argument for a return
value just in case of -1 (or should we have SKIP_SYSCALL?).

I think we should propose this as a new request in the generic ptrace
code. We can have an architecture hook for actually setting the syscall,
and allow architectures to define their own implementation of the request
so they can be moved over one by one.

What do you think about this request?

-Takahiro AKASHI

Will
[PATCH] ARM: probes: return directly when emulate not set
When kprobes decodes an instruction, the original code calls the
instruction-specific decoder if 'emulate' is set to false. However,
instructions with DECODE_TYPE_EMULATE in fact don't have their own
decoders; what is in the action table is in fact a handler. For example:

	/* LDRD (immediate)	000x x1x0 1101 */
	/* STRD (immediate)	000x x1x0 */
	DECODE_EMULATEX	(0x0e5000d0, 0x004000d0, PROBES_LDRSTRD,
						 REGS(NOPCWB, NOPCX, 0, 0, 0)),

and

	const union decode_action kprobes_arm_actions[NUM_PROBES_ARM_ACTIONS] = {
		...
		[PROBES_LDRSTRD] = {.handler = emulate_ldrdstrd},
		...

In this situation, the original code calls 'emulate_ldrdstrd' as a
decoder, which is obviously incorrect. This patch makes it return
INSN_GOOD directly when 'emulate' is not true.

Signed-off-by: Wang Nan
Cc: "David A. Long"
Cc: Russell King
Cc: Jon Medhurst
Cc: Taras Kondratiuk
Cc: Ben Dooks
---
 arch/arm/kernel/probes.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/arch/arm/kernel/probes.c b/arch/arm/kernel/probes.c
index a8ab540..1c77b8d 100644
--- a/arch/arm/kernel/probes.c
+++ b/arch/arm/kernel/probes.c
@@ -436,8 +436,7 @@ probes_decode_insn(probes_opcode_t insn, struct arch_probes_insn *asi,
 			struct decode_emulate *d = (struct decode_emulate *)h;

 			if (!emulate)
-				return actions[d->handler.action].decoder(insn,
-					asi, h);
+				return INSN_GOOD;

 			asi->insn_handler = actions[d->handler.action].handler;
 			set_emulated_insn(insn, asi, thumb);
--
1.8.4
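The table mix-up is easier to see in a standalone model: DECODE_EMULATE slots in the action table hold emulation *handlers*, not decoders, so calling through the decoder member for such a slot invokes a function with the wrong signature. A toy C version of the union-of-slots layout and the fixed probe path (all names are simplified stand-ins for the kernel's):

```c
#include <stddef.h>

typedef int (*decoder_fn)(unsigned int insn);
typedef void (*handler_fn)(unsigned int insn);

/* one slot holds either a decoder or a handler, never both */
union decode_action {
	decoder_fn decoder;	/* valid for custom-decode entries */
	handler_fn handler;	/* valid for DECODE_EMULATE entries */
};

enum { INSN_REJECTED, INSN_GOOD };

static void emulate_ldrdstrd(unsigned int insn) { (void)insn; }

static const union decode_action actions[] = {
	{ .handler = emulate_ldrdstrd },	/* PROBES_LDRSTRD-like slot */
};

/* Mirrors the fixed logic: when not emulating, return INSN_GOOD
 * directly instead of invoking the handler as if it were a decoder. */
static int decode_emulate(unsigned int insn, int emulate, handler_fn *out)
{
	(void)insn;
	if (!emulate)
		return INSN_GOOD;
	*out = actions[0].handler;
	return INSN_GOOD;
}
```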
Re: [PATCH v7 3/8] cpufreq: kirkwood: Remove use of the clk provider API
On Tue, Aug 26, 2014 at 5:35 PM, Andrew Lunn wrote:
>> One final thought I have had is that it might be a good idea to have a
>> mux clock which represents the clock signal that feeds into the cpu. It
>> seems that a mux is exactly what is going on here with cpuclk rate and
>> ddrclk rate.
>
> I did think of this when i implemented the cpufreq driver. What makes
> it hard is that this bit is mixed in the same 32 bit register as the
> gate clocks. It would mean two clock providers sharing the same
> register, sharing a spinlock, etc. And the gating provider is shared
> with all mvebu platforms, dove, kirkwood, orion5x, and four different
> armada SoCs. So i'm pushing complexity into this shared code, which
> none of the others need. Does the standard mux clock do what is
> needed? Or would i have to write a new clock provider?

Well, I think that the mux-clock type should suffice. Both the standard
gate and mux can have a spinlock passed in at registration time, so it
could be a shared spinlock. The standard mux clock expects a bitfield in a
register, similar to the single-bit approach taken by the gate clock. So I
think it could do just fine.

>> I even wonder if it is even appropriate to model this transition with a
>> clock enable operation? Maybe it is only a multiplex operation, or
>> perhaps a combination of enabling the powersave clock and changing the
>> parent input to the cpu?
>>
>> My idea is instead of relying on a cpufreq driver to parse the state of
>> your clocks and understand the multiplexing, you can use the clock
>> framework for that. In fact that might help you get one step closer to
>> using the cpufreq-cpu0.c/cpufreq-generic.c implementation.
>
> So you want the whole disabling of interrupt delivery to the cpu,
> flipping the mux, wait for interrupt and re-enabling of interrupt
> delivery to the cpu inside the clock provider? That is way past a
> simple mux clock.

No way! I said "one step closer" for a reason. The interrupt stuff is
clearly out of scope.

Regards,
Mike

> Andrew
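The arrangement Mike describes, two providers touching one register serialized by one shared lock, fits in a few lines. A toy userspace sketch (register layout, bit positions, and names all invented for illustration) of a mux bitfield and a gate bit guarded by the same lock:

```c
#include <stdatomic.h>

#define MUX_SHIFT 16
#define MUX_WIDTH 1
#define GATE_BIT  3

static unsigned int clk_reg;			/* the shared control register */
static atomic_flag clk_lock = ATOMIC_FLAG_INIT;	/* the shared "spinlock" */

static void clk_lock_acquire(void)
{
	while (atomic_flag_test_and_set(&clk_lock))
		;	/* spin */
}

static void clk_lock_release(void)
{
	atomic_flag_clear(&clk_lock);
}

/* mux provider: read-modify-write of its bitfield under the lock */
static void mux_set_parent(unsigned int index)
{
	unsigned int mask = ((1u << MUX_WIDTH) - 1) << MUX_SHIFT;

	clk_lock_acquire();
	clk_reg = (clk_reg & ~mask) | ((index << MUX_SHIFT) & mask);
	clk_lock_release();
}

static unsigned int mux_get_parent(void)
{
	return (clk_reg >> MUX_SHIFT) & ((1u << MUX_WIDTH) - 1);
}

/* gate provider: same register, same lock, different bit */
static void gate_enable(void)
{
	clk_lock_acquire();
	clk_reg |= 1u << GATE_BIT;
	clk_lock_release();
}
```

Because both providers take the same lock, neither read-modify-write can clobber the other's bits, which is the whole point of passing a shared spinlock at registration time.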
Re: [PATCH v5 3/4] zram: zram memory size limitation
On Wed, Aug 27, 2014 at 11:51:32AM +0900, Minchan Kim wrote:
> Hey Joonsoo,
>
> On Wed, Aug 27, 2014 at 10:26:11AM +0900, Joonsoo Kim wrote:
> > Hello, Minchan and David.
> >
> > On Tue, Aug 26, 2014 at 08:22:29AM -0400, David Horner wrote:
> > > On Tue, Aug 26, 2014 at 3:55 AM, Minchan Kim wrote:
> > > > Hey Joonsoo,
> > > >
> > > > On Tue, Aug 26, 2014 at 04:37:30PM +0900, Joonsoo Kim wrote:
> > > >> On Mon, Aug 25, 2014 at 09:05:55AM +0900, Minchan Kim wrote:
> > > >> > @@ -513,6 +540,14 @@ static int zram_bvec_write(struct zram *zram, struct bio_vec *bvec, u32 index,
> > > >> >  		ret = -ENOMEM;
> > > >> >  		goto out;
> > > >> >  	}
> > > >> > +
> > > >> > +	if (zram->limit_pages &&
> > > >> > +	    zs_get_total_pages(meta->mem_pool) > zram->limit_pages) {
> > > >> > +		zs_free(meta->mem_pool, handle);
> > > >> > +		ret = -ENOMEM;
> > > >> > +		goto out;
> > > >> > +	}
> > > >> > +
> > > >> >  	cmem = zs_map_object(meta->mem_pool, handle, ZS_MM_WO);
> > > >>
> > > >> Hello,
> > > >>
> > > >> I didn't follow up the previous discussion, so I could be wrong.
> > > >> Why should this enforcement be here?
> > > >>
> > > >> I think that this has two problems.
> > > >> 1) alloc/free happens unnecessarily if we have used memory over
> > > >> the limitation.
> > > >
> > > > True, but firstly, I implemented the logic in zsmalloc, not zram,
> > > > but as I described in the cover letter, it's not a requirement of
> > > > zsmalloc but of zram, so it should be in there. If every user wants
> > > > it in future, then we could move the function into zsmalloc.
> > > > That's what we concluded in the previous discussion.
>
> > Hmm...
> > The problem is that we can't avoid this unnecessary overhead in this
> > implementation. If we can implement this feature in zram efficiently,
> > it's okay. But I think that the current form isn't.
>
> If we can add it in zsmalloc, it would be cleaner and more efficient
> for zram, but as I said, at the moment, I didn't want to put zram's
> requirement into zsmalloc because, to me, it's weird to enforce a max
> limit in an allocator. It's the client's role, I think.

AFAIK, many kinds of pools, such as thread pools or memory pools, have
their own limit. It's not weird to me.

> If the current implementation is expensive and rather hard to follow,
> it would be one reason to move the feature into zsmalloc, but
> I don't think it makes critical trouble in the zram usecase.
> See below.
>
> But I am still open and will wait for others' opinions.
> If other guys think zsmalloc is the better place, I am willing to move
> it into zsmalloc.
>
> > > > Another idea is we could call zs_get_total_pages right before
> > > > zs_malloc, but the problem is we cannot know how many pages are
> > > > allocated by zsmalloc in advance.
> > > > IOW, zram should be blind to zsmalloc's internals.
> > >
> > > We did however suggest that we could check beforehand to see if
> > > max was already exceeded as an optimization.
> > > (possibly with a guess on usage but at least using the minimum of 1 page)
> > > In the contested case, the max may already be exceeded transiently and
> > > therefore we know this one _could_ fail (it could also pass, but odds
> > > aren't good).
> > > As Minchan mentions this was discussed before - but not in great detail.
> > > Testing should be done to determine possible benefit. And as he also
> > > mentions, the better place for it may be in zsmalloc, but that
> > > requires an ABI change.
> >
> > Why do we hesitate to change the zsmalloc API? It is an in-kernel API
> > and there are just two users now, zswap and zram. We can change it
> > easily. I think that we just need the following simple API change in
> > zsmalloc.c:
> >
> > zs_zpool_create(gfp_t gfp, struct zpool_ops *zpool_op)
> > =>
> > zs_zpool_create(unsigned long limit, gfp_t gfp, struct zpool_ops *zpool_op)
> >
> > It's a pool allocator, so there is no obstacle for us to limit maximum
> > memory usage in zsmalloc. It's a natural idea to limit memory usage
> > for a pool allocator.
>
> > > Certainly a detailed suggestion could happen on this thread and I'm
> > > also interested in your thoughts, but this patchset should be able
> > > to go in as is.
> > > Memory exhaustion avoidance probably trumps the possible thrashing
> > > at threshold.
> > >
> > > > About the alloc/free cost once it is over the limit,
> > > > I don't think it's important to consider.
> > > > Do you have any scenario in your mind to consider alloc/free cost
> > > > when the limit is over?
> > > >
> > > >> 2) Even if this request doesn't do a new allocation, it could fail
> > > >> due to someone else's allocation. There is a time gap between
> > > >> allocation and free, so a legitimate user who wants to use
> > > >> preallocated zsmalloc memory could also see this condition as true
> > > >> and then fail.
> > > >
> > > > Yeb, we already discussed that. :)
> > > > Such a false positive shouldn't be a severe problem if we
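The check-after-alloc policy being debated is small enough to model directly. A toy userspace sketch of the zram-side approach from the patch, allocate first and back the allocation out if the pool's total exceeded the cap (all toy_* names are invented; PAGE_SIZE_TOY stands in for PAGE_SIZE):

```c
#include <stdlib.h>

#define PAGE_SIZE_TOY 4096

struct toy_pool {
	unsigned long pages_used;
	unsigned long limit_pages;	/* 0 means no limit */
};

static void *toy_alloc(struct toy_pool *pool, size_t bytes)
{
	unsigned long pages = (bytes + PAGE_SIZE_TOY - 1) / PAGE_SIZE_TOY;
	void *obj = malloc(bytes);

	if (!obj)
		return NULL;
	pool->pages_used += pages;

	/* mirrors zram_bvec_write(): enforce the limit after the fact,
	 * paying an alloc/free pair whenever the cap is exceeded */
	if (pool->limit_pages && pool->pages_used > pool->limit_pages) {
		pool->pages_used -= pages;
		free(obj);
		return NULL;
	}
	return obj;
}
```

Joonsoo's counter-proposal amounts to moving this if-block inside the allocator itself (a limit parameter at pool creation), which avoids the wasted alloc/free once the pool is already at its cap.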
Re: [kernel.org PATCH] Li Zefan is now the 3.4 stable maintainer
On Tue, Aug 26, 2014 at 04:08:58PM -0700, Greg KH wrote:
> Li has agreed to continue to support the 3.4 stable kernel tree until
> September 2016. Update the releases.html page on kernel.org to reflect
> this.
>
Li, it would be great if you can send me information about your -stable
queue, ie how you maintain it and where it is located. This will enable me
to continue testing the stable queue for the 3.4 kernel.

Thanks,
Guenter

> Signed-off-by: Greg Kroah-Hartman
>
> diff --git a/content/releases.rst b/content/releases.rst
> index 4a3327f4ca9e..c71a33f34f1b 100644
> --- a/content/releases.rst
> +++ b/content/releases.rst
> @@ -43,7 +43,7 @@ Longterm
>  3.14    Greg Kroah-Hartman  2014-03-30  Aug, 2016
>  3.12    Jiri Slaby          2013-11-03  2016
>  3.10    Greg Kroah-Hartman  2013-06-30  Sep, 2015
> -3.4     Greg Kroah-Hartman  2012-05-20  Oct, 2014
> +3.4     Li Zefan            2012-05-20  Sep, 2016
>  3.2     Ben Hutchings       2012-01-04  2016
>  2.6.32  Willy Tarreau       2009-12-03  Mid-2015
> ==
Re: [PATCH 1/1] add selftest for virtio-net
On 08/27/2014 09:45 AM, Hengjinxiao wrote:
> Selftest is an important part of a network driver; this patch adds
> selftest for virtio-net, including a loopback test, a negotiate test,
> and a reset test. The loopback test checks whether virtio-net can send
> and receive packets normally. The negotiate test executes feature
> negotiation between the virtio-net driver in the guest OS and the
> virtio-net device in the host OS. The reset test resets virtio-net.

Thanks for the patch. The feature negotiation part brings some complexity
and needs more thought. And this could be extended for CVE regression in
the future. And you probably also need to send a patch for the virtio spec
to specify the loopback mode. See comments inline.

> Signed-off-by: Hengjinxiao
>
> ---
>  drivers/net/virtio_net.c        | 233 +++-
>  include/uapi/linux/virtio_net.h |   9 ++
>  2 files changed, 241 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index 59caa06..f83f6e4 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -28,6 +28,7 @@
>  #include
>  #include
>  #include
> +#include
>
>  static int napi_weight = NAPI_POLL_WEIGHT;
>  module_param(napi_weight, int, 0444);
> @@ -51,6 +52,23 @@ module_param(gso, bool, 0444);
>  #define MERGEABLE_BUFFER_ALIGN max(L1_CACHE_BYTES, 256)
>
>  #define VIRTNET_DRIVER_VERSION "1.0.0"
> +#define __VIRTNET_TESTING 0
> +

Why is this macro needed?

> +enum {
> +	VIRTNET_LOOPBACK_TEST,
> +	VIRTNET_FEATURE_NEG_TEST,
> +	VIRTNET_RESET_TEST,
> +};
> +
> +static const struct {
> +	const char string[ETH_GSTRING_LEN];
> +} virtnet_gstrings_test[] = {
> +	[VIRTNET_LOOPBACK_TEST] = { "loopback test (offline)" },
> +	[VIRTNET_FEATURE_NEG_TEST] = { "negotiate test (offline)" },
> +	[VIRTNET_RESET_TEST] = { "reset test (offline)" },
> +};
> +
> +#define VIRTNET_NUM_TEST ARRAY_SIZE(virtnet_gstrings_test)
>
>  struct virtnet_stats {
>  	struct u64_stats_sync tx_syncp;
> @@ -104,6 +122,8 @@ struct virtnet_info {
>  	struct send_queue *sq;
>  	struct receive_queue *rq;
>  	unsigned int status;
> +	unsigned long flags;
> +	atomic_t lb_count;
>
>  	/* Max # of queue pairs supported by the device */
>  	u16 max_queue_pairs;
> @@ -436,6 +456,19 @@ err_buf:
>  	return NULL;
>  }
>
> +void virtnet_check_lb_frame(struct virtnet_info *vi,
> +			    struct sk_buff *skb)
> +{
> +	unsigned int frame_size = skb->len;
> +
> +	if (*(skb->data + 3) == 0xFF) {
> +		if ((*(skb->data + frame_size / 2 + 10) == 0xBE) &&
> +		    (*(skb->data + frame_size / 2 + 12) == 0xAF)) {
> +			atomic_dec(&vi->lb_count);
> +		}
> +	}
> +}
> +
>  static void receive_buf(struct receive_queue *rq, void *buf, unsigned int len)
>  {
>  	struct virtnet_info *vi = rq->vq->vdev->priv;
> @@ -485,7 +518,12 @@ static void receive_buf(struct receive_queue *rq, void *buf, unsigned int len)
>  	} else if (hdr->hdr.flags & VIRTIO_NET_HDR_F_DATA_VALID) {
>  		skb->ip_summed = CHECKSUM_UNNECESSARY;
>  	}
> -
> +	/* loopback self test for ethtool */
> +	if (test_bit(__VIRTNET_TESTING, &vi->flags)) {
> +		virtnet_check_lb_frame(vi, skb);
> +		dev_kfree_skb_any(skb);
> +		return;
> +	}

I'm not sure it's a good choice to add such a check in the fast path. We
may need a test-specific rx interrupt handler (and disable NAPI) for this.

>  	skb->protocol = eth_type_trans(skb, dev);
>  	pr_debug("Receiving skb proto 0x%04x len %i type %i\n",
>  		 ntohs(skb->protocol), skb->len, skb->pkt_type);
> @@ -813,6 +851,9 @@ static int virtnet_open(struct net_device *dev)
>  {
>  	struct virtnet_info *vi = netdev_priv(dev);
>  	int i;
> +	/* disallow open during test */
> +	if (test_bit(__VIRTNET_TESTING, &vi->flags))
> +		return -EBUSY;
>
>  	for (i = 0; i < vi->max_queue_pairs; i++) {
>  		if (i < vi->curr_queue_pairs)
> @@ -1363,12 +1404,158 @@ static void virtnet_get_channels(struct net_device *dev,
>  	channels->other_count = 0;
>  }
>
> +static int virtnet_reset(struct virtnet_info *vi);
> +
> +static void virtnet_create_lb_frame(struct sk_buff *skb,
> +				    unsigned int frame_size)
> +{
> +	memset(skb->data, 0xFF, frame_size);
> +	frame_size &= ~1;
> +	memset(&skb->data[frame_size / 2], 0xAA, frame_size / 2 - 1);
> +	memset(&skb->data[frame_size / 2 + 10], 0xBE, 1);
> +	memset(&skb->data[frame_size / 2 + 12], 0xAF, 1);
> +}
> +
> +static int virtnet_start_loopback(struct virtnet_info *vi)
> +{
> +	if (!virtnet_send_command(vi, VIRTIO_NET_CTRL_LOOPBACK,
> +				  VIRTIO_NET_CTRL_LOOPBACK_SET, NULL, NULL)) {
> +		dev_warn(&vi->dev->dev, "Failed to set loopback.\n");
> +		return -EINVAL;
> +	}
> +	return 0;
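The loopback frame pattern in the patch (0xFF header byte, 0xAA body, 0xBE/0xAF markers at fixed offsets past the midpoint) is self-contained and easy to lift out. A standalone copy of the create/check pair, simplified to plain buffers instead of sk_buffs:

```c
#include <string.h>

static void create_lb_frame(unsigned char *data, unsigned int frame_size)
{
	memset(data, 0xFF, frame_size);
	frame_size &= ~1u;
	memset(&data[frame_size / 2], 0xAA, frame_size / 2 - 1);
	data[frame_size / 2 + 10] = 0xBE;	/* marker bytes the rx side */
	data[frame_size / 2 + 12] = 0xAF;	/* looks for */
}

/* returns nonzero when the buffer carries a valid loopback frame */
static int check_lb_frame(const unsigned char *data, unsigned int frame_size)
{
	return data[3] == 0xFF &&
	       data[frame_size / 2 + 10] == 0xBE &&
	       data[frame_size / 2 + 12] == 0xAF;
}
```

The driver counts down lb_count each time check succeeds; when it reaches zero, every injected frame made the round trip through the device.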
Re: [PATCH] acpi: fan.c: printk replacement
On Tue, Aug 26, 2014 at 11:22:12PM +0200, Rafael J. Wysocki wrote:
> On Tuesday, August 26, 2014 01:59:02 PM Joe Perches wrote:
> > On Tue, 2014-08-26 at 23:02 +0200, Rafael J. Wysocki wrote:
> > > On Tuesday, August 26, 2014 09:00:39 PM Sudip Mukherjee wrote:
> > > > On Tue, Aug 26, 2014 at 12:45:20AM +0200, Rafael J. Wysocki wrote:
> > > > > On Friday, August 22, 2014 05:33:21 PM Sudip Mukherjee wrote:
> > > > > > printk replaced with corresponding dev_err and dev_info
> > > > > > fixed one broken user-visible string
> > > > > > multiline comment edited for correct commenting style
> > > > > > asm/uaccess.h replaced with linux/uaccess.h
> > > > > >
> > > > > > Signed-off-by: Sudip Mukherjee
> > > > > > ---
> > > > > >  drivers/acpi/fan.c | 18 +-
> > > > > >  1 file changed, 9 insertions(+), 9 deletions(-)
> > > > > >
> > > > > > diff --git a/drivers/acpi/fan.c b/drivers/acpi/fan.c
> > > > > > index 8acf53e..7900d55 100644
> > > > > > --- a/drivers/acpi/fan.c
> > > > > > +++ b/drivers/acpi/fan.c
> > > > > > @@ -27,7 +27,7 @@
> > > > > >  #include
> > > > > >  #include
> > > > > >  #include
> > > > > > -#include
> > > > > > +#include
> > > > > >  #include
> > > > > >  #include
> > > > > >
> > > > > > @@ -127,8 +127,9 @@ static const struct thermal_cooling_device_ops fan_cooling_ops = {
> > > > > >  };
> > > > > >
> > > > > >  /* --------------------------------------------------------------
> > > > > > -   Driver Interface
> > > > > > -   -------------------------------------------------------------- */
> > > > > > + * Driver Interface
> > > > > > + * --------------------------------------------------------------
> > > > > > +*/
> > > > > >
> > > > > >  static int acpi_fan_add(struct acpi_device *device)
> > > > > >  {
> > > > > > @@ -143,7 +144,7 @@ static int acpi_fan_add(struct acpi_device *device)
> > > > > >
> > > > > >  	result = acpi_bus_update_power(device->handle, NULL);
> > > > > >  	if (result) {
> > > > > > -		printk(KERN_ERR PREFIX "Setting initial power state\n");
> > > > > > +		dev_err(&device->dev, PREFIX "Setting initial power state\n");
> > > > >
> > > > > While at it, please define a proper pr_fmt() for this file and get
> > > > > rid of PREFIX too.
> > > > >
> > > > > Otherwise I don't see a compelling reason to apply this.
> > > >
> > > > Hi,
> > > > Since in the patch I am not using any pr_*, I am unable to
> > > > understand why you are asking for a proper pr_fmt().
> > >
> > > Never mind, I was confused somehow, not exactly sure why. Sorry about
> > > that.
> > >
> > > > I can get rid of the PREFIX. Then should I use pr_* in the patch
> > > > instead of dev_*? My understanding was that dev_* is preferred
> > > > over pr_*. Waiting for your suggestion on this.
> > >
> > > Well, that really depends on the particular case. It really is
> > > better to use dev_err() here, but then PREFIX with it is not really
> > > useful, so please just drop PREFIX from the new messages.
> >
> > PREFIX is "ACPI: " so I think the idea is to be able to grep for that.
>
> I'm not sure how useful that is in this particular case. You can grep
> for "power state" instead just fine ...
>
> Rafael

Then there is one more printk which prints whether the fan state is on or
off:

	dev_info(&device->dev, PREFIX "%s [%s] (%s)\n",
		 acpi_device_name(device), acpi_device_bid(device),
		 !device->power.state ? "on" : "off");

So if we drop the PREFIX and someone wants to grep for this fan on/off,
then how to do that? After removing PREFIX, in dmesg I am getting it as:

[    2.056204] fan PNP0C0B:00: Fan [FAN0] (off)
[    2.056225] fan PNP0C0B:01: Fan [FAN1] (off)
[    2.056245] fan PNP0C0B:02: Fan [FAN2] (off)
[    2.056263] fan PNP0C0B:03: Fan [FAN3] (off)
[    2.056283] fan PNP0C0B:04: Fan [FAN4] (off)

thanks
sudip
Re: [PATCH RFC v7 net-next 00/28] BPF syscall
On Tue, Aug 26, 2014 at 9:49 PM, Andy Lutomirski wrote: > On Tue, Aug 26, 2014 at 9:35 PM, Alexei Starovoitov wrote: >> On Tue, Aug 26, 2014 at 8:56 PM, Andy Lutomirski wrote: >>> On Aug 26, 2014 7:29 PM, "Alexei Starovoitov" wrote: Hi Ingo, David, posting whole thing again as RFC to get feedback on syscall only. If syscall bpf(int cmd, union bpf_attr *attr, unsigned int size) is ok, I'll split them into small chunks as requested and will repost without RFC. >>> >>> IMO it's much easier to review a syscall if we just look at a >>> specification of what it does. The code is, in some sense, secondary. >> >> 'specification of what it does'... hmm, you mean beyond what's >> there in commit logs and in Documentation/networking/filter.txt ? >> Aren't samples at the end give an idea on 'what it does'? >> I'm happy to add 'specification', I just don't understand yet what >> it suppose to talk about beyond what's already written. >> I understand that the patches are missing explanation on 'why' >> the syscall is being added, but I don't think it's what you're asking... > > I mean a hopefully short document that defines what the syscall does. > It should be precise enough that one could, in principle, implement > the syscall just by reading the document and that one could use the > syscall just by reading the document. > > Given that there's a whole instruction set to go with it, it may end > up being moderately complicated or saying things like "see this other > thing for a description of the instruction set" and "there are some > extensible sets of functions you can call with it". I'm still lost. Here is the quote from Documentation/networking/filter.txt " 'maps' is a generic storage of different types for sharing data between kernel and userspace. 
The maps are accessed from user space via BPF syscall, which has commands: - create a map with given type and attributes map_fd = bpf(BPF_MAP_CREATE, union bpf_attr *attr, u32 size) using attr->map_type, attr->key_size, attr->value_size, attr->max_entries returns process-local file descriptor or negative error - lookup key in a given map err = bpf(BPF_MAP_LOOKUP_ELEM, union bpf_attr *attr, u32 size) using attr->map_fd, attr->key, attr->value returns zero and stores found elem into value or negative error - create or update key/value pair in a given map err = bpf(BPF_MAP_UPDATE_ELEM, union bpf_attr *attr, u32 size) using attr->map_fd, attr->key, attr->value returns zero or negative error - find and delete element by key in a given map err = bpf(BPF_MAP_DELETE_ELEM, union bpf_attr *attr, u32 size) using attr->map_fd, attr->key - to delete map: close(fd) Exiting process will delete maps automatically userspace programs uses this API to create/populate/read maps that eBPF programs are concurrently updating. " and more in commit log: " - load eBPF program fd = bpf(BPF_PROG_LOAD, union bpf_attr *attr, u32 size) where 'attr' is struct { enum bpf_prog_type prog_type; __u32 insn_cnt; struct bpf_insn __user *insns; const char __user *license; }; insns - array of eBPF instructions license - must be GPL compatible to call helper functions marked gpl_only - unload eBPF program close(fd) " Isn't it short and describes what it does? Do you want me to describe what eBPF program can do?
Re: [PATCH RFC v7 net-next 00/28] BPF syscall
On Tue, Aug 26, 2014 at 9:35 PM, Alexei Starovoitov wrote: > On Tue, Aug 26, 2014 at 8:56 PM, Andy Lutomirski wrote: >> On Aug 26, 2014 7:29 PM, "Alexei Starovoitov" wrote: >>> >>> Hi Ingo, David, >>> >>> posting whole thing again as RFC to get feedback on syscall only. >>> If syscall bpf(int cmd, union bpf_attr *attr, unsigned int size) is ok, >>> I'll split them into small chunks as requested and will repost without RFC. >> >> IMO it's much easier to review a syscall if we just look at a >> specification of what it does. The code is, in some sense, secondary. > > 'specification of what it does'... hmm, you mean beyond what's > there in commit logs and in Documentation/networking/filter.txt ? > Aren't samples at the end give an idea on 'what it does'? > I'm happy to add 'specification', I just don't understand yet what > it suppose to talk about beyond what's already written. > I understand that the patches are missing explanation on 'why' > the syscall is being added, but I don't think it's what you're asking... I mean a hopefully short document that defines what the syscall does. It should be precise enough that one could, in principle, implement the syscall just by reading the document and that one could use the syscall just by reading the document. Given that there's a whole instruction set to go with it, it may end up being moderately complicated or saying things like "see this other thing for a description of the instruction set" and "there are some extensible sets of functions you can call with it". --Andy -- Andy Lutomirski AMA Capital Management, LLC
Re: [PATCH tip/core/rcu 1/2] rcu: Parallelize and economize NOCB kthread wakeups
On (Sat) 23 Aug 2014 [03:43:38], Pranith Kumar wrote: > On Fri, Aug 22, 2014 at 5:53 PM, Paul E. McKenney > wrote: > > > > Hmmm... Please try replacing the synchronize_rcu() in > > __sysrq_swap_key_ops() with (say) schedule_timeout_interruptible(HZ / 10). > > I bet that gets rid of the hang. (And also introduces a low-probability > > bug, but should be OK for testing.) > > > > The other thing to try is to revert your patch that turned my event > > traces into printk()s, then put an ftrace_dump(DUMP_ALL); just after > > the synchronize_rcu() -- that might make it so that the ftrace data > > actually gets dumped out. > > > > I was able to reproduce this error on my Ubuntu 14.04 machine. I think > I found the root cause of the problem after several kvm runs. > > The problem is that earlier we were waiting on nocb_head and now we > are waiting on nocb_leader_wake. > > So there are a lot of nocb callbacks which are enqueued before the > nocb thread is spawned. This sets up nocb_head to be non-null, because > of which the nocb kthread used to wake up immediately after sleeping. > > Now that we have switched to nocb_leader_wake, this is not being set > when there are pending callbacks, unless the callbacks overflow the > qhimark. The pending callbacks were around 7000 when the boot hangs. > > So setting the qhimark using the boot parameter rcutree.qhimark=5000 > is one way to allow us to boot past the point by forcefully waking up > the nocb kthread. I am not sure this is fool-proof. > > Another option to start the nocb kthreads with nocb_leader_wake set, > so that it can handle any pending callbacks. The following patch also > allows us to boot properly. > > Phew! 
Let me know if this makes any sense :) > > diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h > index 00dc411..4c397aa 100644 > --- a/kernel/rcu/tree_plugin.h > +++ b/kernel/rcu/tree_plugin.h > @@ -2386,6 +2386,9 @@ static int rcu_nocb_kthread(void *arg) > struct rcu_head **tail; > struct rcu_data *rdp = arg; > > + if (rdp->nocb_leader == rdp) > + rdp->nocb_leader_wake = true; > + > /* Each pass through this loop invokes one batch of callbacks */ > for (;;) { > /* Wait for callbacks. */ Yes, this patch helps my case as well. Thanks! Amit
[PATCH v10 03/21] Fix XIP fault vs truncate race
Pagecache faults recheck i_size after taking the page lock to ensure that the fault didn't race against a truncate. We don't have a page to lock in the XIP case, so use the i_mmap_mutex instead. It is locked in the truncate path in unmap_mapping_range() after updating i_size. So while we hold it in the fault path, we are guaranteed that either i_size has already been updated in the truncate path, or that the truncate will subsequently call zap_page_range_single() and so remove the mapping we have just inserted. There is a window of time in which i_size has been reduced and the thread has a mapping to a page which will be removed from the file, but this is harmless as the page will not be allocated to a different purpose before the thread's access to it is revoked. Signed-off-by: Matthew Wilcox Reviewed-by: Jan Kara Acked-by: Kirill A. Shutemov --- mm/filemap_xip.c | 24 ++-- 1 file changed, 22 insertions(+), 2 deletions(-) diff --git a/mm/filemap_xip.c b/mm/filemap_xip.c index d8d9fe3..c8d23e9 100644 --- a/mm/filemap_xip.c +++ b/mm/filemap_xip.c @@ -260,8 +260,17 @@ again: __xip_unmap(mapping, vmf->pgoff); found: + /* We must recheck i_size under i_mmap_mutex */ + mutex_lock(>i_mmap_mutex); + size = (i_size_read(inode) + PAGE_CACHE_SIZE - 1) >> + PAGE_CACHE_SHIFT; + if (unlikely(vmf->pgoff >= size)) { + mutex_unlock(>i_mmap_mutex); + return VM_FAULT_SIGBUS; + } err = vm_insert_mixed(vma, (unsigned long)vmf->virtual_address, xip_pfn); + mutex_unlock(>i_mmap_mutex); if (err == -ENOMEM) return VM_FAULT_OOM; /* @@ -285,16 +294,27 @@ found: } if (error != -ENODATA) goto out; + + /* We must recheck i_size under i_mmap_mutex */ + mutex_lock(>i_mmap_mutex); + size = (i_size_read(inode) + PAGE_CACHE_SIZE - 1) >> + PAGE_CACHE_SHIFT; + if (unlikely(vmf->pgoff >= size)) { + ret = VM_FAULT_SIGBUS; + goto unlock; + } /* not shared and writable, use xip_sparse_page() */ page = xip_sparse_page(); if (!page) - goto out; + goto unlock; err = vm_insert_page(vma, (unsigned 
long)vmf->virtual_address, page); if (err == -ENOMEM) - goto out; + goto unlock; ret = VM_FAULT_NOPAGE; +unlock: + mutex_unlock(&mapping->i_mmap_mutex); out: write_seqcount_end(&xip_sparse_seq); mutex_unlock(&xip_sparse_mutex); -- 2.0.0
[PATCH v10 17/21] ext2: Remove ext2_aops_xip
We shouldn't need a special address_space_operations any more Signed-off-by: Matthew Wilcox --- fs/ext2/ext2.h | 1 - fs/ext2/inode.c | 7 +-- fs/ext2/namei.c | 4 ++-- 3 files changed, 3 insertions(+), 9 deletions(-) diff --git a/fs/ext2/ext2.h b/fs/ext2/ext2.h index b30c3bd..b8b1c11 100644 --- a/fs/ext2/ext2.h +++ b/fs/ext2/ext2.h @@ -793,7 +793,6 @@ extern const struct file_operations ext2_xip_file_operations; /* inode.c */ extern const struct address_space_operations ext2_aops; -extern const struct address_space_operations ext2_aops_xip; extern const struct address_space_operations ext2_nobh_aops; /* namei.c */ diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c index 154cbcf..034fd42 100644 --- a/fs/ext2/inode.c +++ b/fs/ext2/inode.c @@ -891,11 +891,6 @@ const struct address_space_operations ext2_aops = { .error_remove_page = generic_error_remove_page, }; -const struct address_space_operations ext2_aops_xip = { - .bmap = ext2_bmap, - .direct_IO = ext2_direct_IO, -}; - const struct address_space_operations ext2_nobh_aops = { .readpage = ext2_readpage, .readpages = ext2_readpages, @@ -1394,7 +1389,7 @@ struct inode *ext2_iget (struct super_block *sb, unsigned long ino) if (S_ISREG(inode->i_mode)) { inode->i_op = _file_inode_operations; if (test_opt(inode->i_sb, XIP)) { - inode->i_mapping->a_ops = _aops_xip; + inode->i_mapping->a_ops = _aops; inode->i_fop = _xip_file_operations; } else if (test_opt(inode->i_sb, NOBH)) { inode->i_mapping->a_ops = _nobh_aops; diff --git a/fs/ext2/namei.c b/fs/ext2/namei.c index 7ca803f..0db888c 100644 --- a/fs/ext2/namei.c +++ b/fs/ext2/namei.c @@ -105,7 +105,7 @@ static int ext2_create (struct inode * dir, struct dentry * dentry, umode_t mode inode->i_op = _file_inode_operations; if (test_opt(inode->i_sb, XIP)) { - inode->i_mapping->a_ops = _aops_xip; + inode->i_mapping->a_ops = _aops; inode->i_fop = _xip_file_operations; } else if (test_opt(inode->i_sb, NOBH)) { inode->i_mapping->a_ops = _nobh_aops; @@ -126,7 +126,7 @@ static int 
ext2_tmpfile(struct inode *dir, struct dentry *dentry, umode_t mode) inode->i_op = &ext2_file_inode_operations; if (test_opt(inode->i_sb, XIP)) { - inode->i_mapping->a_ops = &ext2_aops_xip; + inode->i_mapping->a_ops = &ext2_aops; inode->i_fop = &ext2_xip_file_operations; } else if (test_opt(inode->i_sb, NOBH)) { inode->i_mapping->a_ops = &ext2_nobh_aops; -- 2.0.0
[PATCH v10 10/21] Replace xip_truncate_page with dax_truncate_page
It takes a get_block parameter just like nobh_truncate_page() and block_truncate_page() Signed-off-by: Matthew Wilcox --- fs/dax.c | 44 fs/ext2/inode.c| 2 +- include/linux/fs.h | 4 ++-- mm/filemap_xip.c | 40 4 files changed, 47 insertions(+), 43 deletions(-) diff --git a/fs/dax.c b/fs/dax.c index f134078..d54f7d3 100644 --- a/fs/dax.c +++ b/fs/dax.c @@ -443,3 +443,47 @@ int dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf, return result; } EXPORT_SYMBOL_GPL(dax_fault); + +/** + * dax_truncate_page - handle a partial page being truncated in a DAX file + * @inode: The file being truncated + * @from: The file offset that is being truncated to + * @get_block: The filesystem method used to translate file offsets to blocks + * + * Similar to block_truncate_page(), this function can be called by a + * filesystem when it is truncating an DAX file to handle the partial page. + * + * We work in terms of PAGE_CACHE_SIZE here for commonality with + * block_truncate_page(), but we could go down to PAGE_SIZE if the filesystem + * took care of disposing of the unnecessary blocks. Even if the filesystem + * block size is smaller than PAGE_SIZE, we have to zero the rest of the page + * since the file might be mmaped. + */ +int dax_truncate_page(struct inode *inode, loff_t from, get_block_t get_block) +{ + struct buffer_head bh; + pgoff_t index = from >> PAGE_CACHE_SHIFT; + unsigned offset = from & (PAGE_CACHE_SIZE-1); + unsigned length = PAGE_CACHE_ALIGN(from) - from; + int err; + + /* Block boundary? 
Nothing to do */ + if (!length) + return 0; + + memset(, 0, sizeof(bh)); + bh.b_size = PAGE_CACHE_SIZE; + err = get_block(inode, index, , 0); + if (err < 0) + return err; + if (buffer_written()) { + void *addr; + err = dax_get_addr(, , inode->i_blkbits); + if (err < 0) + return err; + memset(addr + offset, 0, length); + } + + return 0; +} +EXPORT_SYMBOL_GPL(dax_truncate_page); diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c index 52978b8..5ac0a34 100644 --- a/fs/ext2/inode.c +++ b/fs/ext2/inode.c @@ -1210,7 +1210,7 @@ static int ext2_setsize(struct inode *inode, loff_t newsize) inode_dio_wait(inode); if (IS_DAX(inode)) - error = xip_truncate_page(inode->i_mapping, newsize); + error = dax_truncate_page(inode, newsize, ext2_get_block); else if (test_opt(inode->i_sb, NOBH)) error = nobh_truncate_page(inode->i_mapping, newsize, ext2_get_block); diff --git a/include/linux/fs.h b/include/linux/fs.h index 338f04b..eee848d 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -2492,7 +2492,7 @@ extern int nonseekable_open(struct inode * inode, struct file * filp); #ifdef CONFIG_FS_XIP int dax_clear_blocks(struct inode *, sector_t block, long size); -extern int xip_truncate_page(struct address_space *mapping, loff_t from); +int dax_truncate_page(struct inode *, loff_t from, get_block_t); ssize_t dax_do_io(int rw, struct kiocb *, struct inode *, struct iov_iter *, loff_t, get_block_t, dio_iodone_t, int flags); int dax_fault(struct vm_area_struct *, struct vm_fault *, get_block_t); @@ -2503,7 +2503,7 @@ static inline int dax_clear_blocks(struct inode *i, sector_t blk, long sz) return 0; } -static inline int xip_truncate_page(struct address_space *mapping, loff_t from) +static inline int dax_truncate_page(struct inode *i, loff_t frm, get_block_t gb) { return 0; } diff --git a/mm/filemap_xip.c b/mm/filemap_xip.c index 9dd45f3..6316578 100644 --- a/mm/filemap_xip.c +++ b/mm/filemap_xip.c @@ -21,43 +21,3 @@ #include #include -/* - * truncate a page used for execute in place 
- * functionality is analog to block_truncate_page but does use get_xip_mem - * to get the page instead of page cache - */ -int -xip_truncate_page(struct address_space *mapping, loff_t from) -{ - pgoff_t index = from >> PAGE_CACHE_SHIFT; - unsigned offset = from & (PAGE_CACHE_SIZE-1); - unsigned blocksize; - unsigned length; - void *xip_mem; - unsigned long xip_pfn; - int err; - - BUG_ON(!mapping->a_ops->get_xip_mem); - - blocksize = 1 << mapping->host->i_blkbits; - length = offset & (blocksize - 1); - - /* Block boundary? Nothing to do */ - if (!length) - return 0; - - length = blocksize - length; - - err = mapping->a_ops->get_xip_mem(mapping, index, 0, - &xip_mem, &xip_pfn); - if (unlikely(err)) { - if (err == -ENODATA) - /* Hole? No need to truncate */ - return 0; - else -
[PATCH v10 02/21] Change direct_access calling convention
In order to support accesses to larger chunks of memory, pass in a 'size' parameter (counted in bytes), and return the amount available at that address. Add a new helper function, bdev_direct_access(), to handle common functionality including partition handling, checking the length requested is positive, checking for the sector being page-aligned, and checking the length of the request does not pass the end of the partition. Signed-off-by: Matthew Wilcox Reviewed-by: Jan Kara Reviewed-by: Boaz Harrosh --- Documentation/filesystems/xip.txt | 15 +-- arch/powerpc/sysdev/axonram.c | 17 - drivers/block/brd.c | 12 +--- drivers/s390/block/dcssblk.c | 21 +--- fs/block_dev.c| 40 +++ fs/ext2/xip.c | 31 +- include/linux/blkdev.h| 6 -- 7 files changed, 84 insertions(+), 58 deletions(-) diff --git a/Documentation/filesystems/xip.txt b/Documentation/filesystems/xip.txt index 0466ee5..b774729 100644 --- a/Documentation/filesystems/xip.txt +++ b/Documentation/filesystems/xip.txt @@ -28,12 +28,15 @@ Implementation Execute-in-place is implemented in three steps: block device operation, address space operation, and file operations. -A block device operation named direct_access is used to retrieve a -reference (pointer) to a block on-disk. The reference is supposed to be -cpu-addressable, physical address and remain valid until the release operation -is performed. A struct block_device reference is used to address the device, -and a sector_t argument is used to identify the individual block. As an -alternative, memory technology devices can be used for this. +A block device operation named direct_access is used to translate the +block device sector number to a page frame number (pfn) that identifies +the physical page for the memory. It also returns a kernel virtual +address that can be used to access the memory. + +The direct_access method takes a 'size' parameter that indicates the +number of bytes being requested. 
The function should return the number +of bytes that can be contiguously accessed at that offset. It may also +return a negative errno if an error occurs. The block device operation is optional, these block devices support it as of today: diff --git a/arch/powerpc/sysdev/axonram.c b/arch/powerpc/sysdev/axonram.c index 830edc8..8709b9f 100644 --- a/arch/powerpc/sysdev/axonram.c +++ b/arch/powerpc/sysdev/axonram.c @@ -139,26 +139,17 @@ axon_ram_make_request(struct request_queue *queue, struct bio *bio) * axon_ram_direct_access - direct_access() method for block device * @device, @sector, @data: see block_device_operations method */ -static int +static long axon_ram_direct_access(struct block_device *device, sector_t sector, - void **kaddr, unsigned long *pfn) + void **kaddr, unsigned long *pfn, long size) { struct axon_ram_bank *bank = device->bd_disk->private_data; - loff_t offset; - - offset = sector; - if (device->bd_part != NULL) - offset += device->bd_part->start_sect; - offset <<= AXON_RAM_SECTOR_SHIFT; - if (offset >= bank->size) { - dev_err(>device->dev, "Access outside of address space\n"); - return -ERANGE; - } + loff_t offset = (loff_t)sector << AXON_RAM_SECTOR_SHIFT; *kaddr = (void *)(bank->ph_addr + offset); *pfn = virt_to_phys(*kaddr) >> PAGE_SHIFT; - return 0; + return bank->size - offset; } static const struct block_device_operations axon_ram_devops = { diff --git a/drivers/block/brd.c b/drivers/block/brd.c index c7d138e..fee10bf 100644 --- a/drivers/block/brd.c +++ b/drivers/block/brd.c @@ -370,25 +370,23 @@ static int brd_rw_page(struct block_device *bdev, sector_t sector, } #ifdef CONFIG_BLK_DEV_XIP -static int brd_direct_access(struct block_device *bdev, sector_t sector, - void **kaddr, unsigned long *pfn) +static long brd_direct_access(struct block_device *bdev, sector_t sector, + void **kaddr, unsigned long *pfn, long size) { struct brd_device *brd = bdev->bd_disk->private_data; struct page *page; if (!brd) return -ENODEV; - if (sector & 
(PAGE_SECTORS-1)) - return -EINVAL; - if (sector + PAGE_SECTORS > get_capacity(bdev->bd_disk)) - return -ERANGE; page = brd_insert_page(brd, sector); if (!page) return -ENOSPC; *kaddr = page_address(page); *pfn = page_to_pfn(page); - return 0; + /* If size > PAGE_SIZE, we could look to see if the next page in the +* file happens to be mapped to the next page of physical RAM */ + return PAGE_SIZE; } #endif diff --git a/drivers/s390/block/dcssblk.c
[PATCH v10 07/21] Replace XIP read and write with DAX I/O
Use the generic AIO infrastructure instead of custom read and write methods. In addition to giving us support for AIO, this adds the missing locking between read() and truncate(). Signed-off-by: Matthew Wilcox Reviewed-by: Ross Zwisler Reviewed-by: Jan Kara --- MAINTAINERS| 6 ++ fs/Makefile| 1 + fs/dax.c | 195 fs/ext2/file.c | 6 +- fs/ext2/inode.c| 8 +- include/linux/fs.h | 18 - mm/filemap.c | 6 +- mm/filemap_xip.c | 234 - 8 files changed, 229 insertions(+), 245 deletions(-) create mode 100644 fs/dax.c diff --git a/MAINTAINERS b/MAINTAINERS index 1ff06de..3f29153 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -2929,6 +2929,12 @@ L: linux-...@vger.kernel.org S: Maintained F: drivers/i2c/busses/i2c-diolan-u2c.c +DIRECT ACCESS (DAX) +M: Matthew Wilcox +L: linux-fsde...@vger.kernel.org +S: Supported +F: fs/dax.c + DIRECTORY NOTIFICATION (DNOTIFY) M: Eric Paris S: Maintained diff --git a/fs/Makefile b/fs/Makefile index 90c8852..0325ec3 100644 --- a/fs/Makefile +++ b/fs/Makefile @@ -28,6 +28,7 @@ obj-$(CONFIG_SIGNALFD)+= signalfd.o obj-$(CONFIG_TIMERFD) += timerfd.o obj-$(CONFIG_EVENTFD) += eventfd.o obj-$(CONFIG_AIO) += aio.o +obj-$(CONFIG_FS_XIP) += dax.o obj-$(CONFIG_FILE_LOCKING) += locks.o obj-$(CONFIG_COMPAT) += compat.o compat_ioctl.o obj-$(CONFIG_BINFMT_AOUT) += binfmt_aout.o diff --git a/fs/dax.c b/fs/dax.c new file mode 100644 index 000..108c68e --- /dev/null +++ b/fs/dax.c @@ -0,0 +1,195 @@ +/* + * fs/dax.c - Direct Access filesystem code + * Copyright (c) 2013-2014 Intel Corporation + * Author: Matthew Wilcox + * Author: Ross Zwisler + * + * This program is free software; you can redistribute it and/or modify it + * under the terms and conditions of the GNU General Public License, + * version 2, as published by the Free Software Foundation. + * + * This program is distributed in the hope it will be useful, but WITHOUT + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or + * FITNESS FOR A PARTICULAR PURPOSE. 
See the GNU General Public License for + * more details. + */ + +#include +#include +#include +#include +#include +#include +#include + +static long dax_get_addr(struct buffer_head *bh, void **addr, unsigned blkbits) +{ + unsigned long pfn; + sector_t sector = bh->b_blocknr << (blkbits - 9); + return bdev_direct_access(bh->b_bdev, sector, addr, , bh->b_size); +} + +static void dax_new_buf(void *addr, unsigned size, unsigned first, loff_t pos, + loff_t end) +{ + loff_t final = end - pos + first; /* The final byte of the buffer */ + + if (first > 0) + memset(addr, 0, first); + if (final < size) + memset(addr + final, 0, size - final); +} + +static bool buffer_written(struct buffer_head *bh) +{ + return buffer_mapped(bh) && !buffer_unwritten(bh); +} + +/* + * When ext4 encounters a hole, it returns without modifying the buffer_head + * which means that we can't trust b_size. To cope with this, we set b_state + * to 0 before calling get_block and, if any bit is set, we know we can trust + * b_size. Unfortunate, really, since ext4 knows precisely how long a hole is + * and would save us time calling get_block repeatedly. 
+ */ +static bool buffer_size_valid(struct buffer_head *bh) +{ + return bh->b_state != 0; +} + +static ssize_t dax_io(int rw, struct inode *inode, struct iov_iter *iter, + loff_t start, loff_t end, get_block_t get_block, + struct buffer_head *bh) +{ + ssize_t retval = 0; + loff_t pos = start; + loff_t max = start; + loff_t bh_max = start; + void *addr; + bool hole = false; + + if (rw != WRITE) + end = min(end, i_size_read(inode)); + + while (pos < end) { + unsigned len; + if (pos == max) { + unsigned blkbits = inode->i_blkbits; + sector_t block = pos >> blkbits; + unsigned first = pos - (block << blkbits); + long size; + + if (pos == bh_max) { + bh->b_size = PAGE_ALIGN(end - pos); + bh->b_state = 0; + retval = get_block(inode, block, bh, + rw == WRITE); + if (retval) + break; + if (!buffer_size_valid(bh)) + bh->b_size = 1 << blkbits; + bh_max = pos - first + bh->b_size; + } else { +
[PATCH v10 20/21] ext4: Add DAX functionality
From: Ross Zwisler This is a port of the DAX functionality found in the current version of ext2. Signed-off-by: Ross Zwisler Reviewed-by: Andreas Dilger [heavily tweaked] Signed-off-by: Matthew Wilcox --- Documentation/filesystems/dax.txt | 1 + Documentation/filesystems/ext4.txt | 2 ++ fs/ext4/ext4.h | 6 + fs/ext4/file.c | 49 ++-- fs/ext4/indirect.c | 18 ++ fs/ext4/inode.c| 51 -- fs/ext4/namei.c| 10 ++-- fs/ext4/super.c| 39 - 8 files changed, 148 insertions(+), 28 deletions(-) diff --git a/Documentation/filesystems/dax.txt b/Documentation/filesystems/dax.txt index ebcd97f..be376d9 100644 --- a/Documentation/filesystems/dax.txt +++ b/Documentation/filesystems/dax.txt @@ -73,6 +73,7 @@ or a write()) work correctly. These filesystems may be used for inspiration: - ext2: the second extended filesystem, see Documentation/filesystems/ext2.txt +- ext4: the fourth extended filesystem, see Documentation/filesystems/ext4.txt Shortcomings diff --git a/Documentation/filesystems/ext4.txt b/Documentation/filesystems/ext4.txt index 919a329..9c511c4 100644 --- a/Documentation/filesystems/ext4.txt +++ b/Documentation/filesystems/ext4.txt @@ -386,6 +386,8 @@ max_dir_size_kb=n This limits the size of directories so that any i_version Enable 64-bit inode version support. This option is off by default. 
+daxUse direct access if possible + Data Mode = There are 3 different data modes: diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h index 5b19760..c065a3e 100644 --- a/fs/ext4/ext4.h +++ b/fs/ext4/ext4.h @@ -969,6 +969,11 @@ struct ext4_inode_info { #define EXT4_MOUNT_ERRORS_MASK 0x00070 #define EXT4_MOUNT_MINIX_DF0x00080 /* Mimics the Minix statfs */ #define EXT4_MOUNT_NOLOAD 0x00100 /* Don't use existing journal*/ +#ifdef CONFIG_FS_DAX +#define EXT4_MOUNT_DAX 0x00200 /* Execute in place */ +#else +#define EXT4_MOUNT_DAX 0 +#endif #define EXT4_MOUNT_DATA_FLAGS 0x00C00 /* Mode for data writes: */ #define EXT4_MOUNT_JOURNAL_DATA0x00400 /* Write data to journal */ #define EXT4_MOUNT_ORDERED_DATA0x00800 /* Flush data before commit */ @@ -2558,6 +2563,7 @@ extern const struct file_operations ext4_dir_operations; /* file.c */ extern const struct inode_operations ext4_file_inode_operations; extern const struct file_operations ext4_file_operations; +extern const struct file_operations ext4_dax_file_operations; extern loff_t ext4_llseek(struct file *file, loff_t offset, int origin); /* inline.c */ diff --git a/fs/ext4/file.c b/fs/ext4/file.c index aca7b24..9c7bde5 100644 --- a/fs/ext4/file.c +++ b/fs/ext4/file.c @@ -95,7 +95,7 @@ ext4_file_write_iter(struct kiocb *iocb, struct iov_iter *from) struct inode *inode = file_inode(iocb->ki_filp); struct mutex *aio_mutex = NULL; struct blk_plug plug; - int o_direct = file->f_flags & O_DIRECT; + int o_direct = io_is_direct(file); int overwrite = 0; size_t length = iov_iter_count(from); ssize_t ret; @@ -191,6 +191,27 @@ errout: return ret; } +#ifdef CONFIG_FS_DAX +static int ext4_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf) +{ + return dax_fault(vma, vmf, ext4_get_block); + /* Is this the right get_block? 
*/ +} + +static int ext4_dax_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf) +{ + return dax_mkwrite(vma, vmf, ext4_get_block); +} + +static const struct vm_operations_struct ext4_dax_vm_ops = { + .fault = ext4_dax_fault, + .page_mkwrite = ext4_dax_mkwrite, + .remap_pages= generic_file_remap_pages, +}; +#else +#define ext4_dax_vm_opsext4_file_vm_ops +#endif + static const struct vm_operations_struct ext4_file_vm_ops = { .fault = filemap_fault, .map_pages = filemap_map_pages, @@ -201,7 +222,12 @@ static const struct vm_operations_struct ext4_file_vm_ops = { static int ext4_file_mmap(struct file *file, struct vm_area_struct *vma) { file_accessed(file); - vma->vm_ops = _file_vm_ops; + if (IS_DAX(file_inode(file))) { + vma->vm_ops = _dax_vm_ops; + vma->vm_flags |= VM_MIXEDMAP; + } else { + vma->vm_ops = _file_vm_ops; + } return 0; } @@ -600,6 +626,25 @@ const struct file_operations ext4_file_operations = { .fallocate = ext4_fallocate, }; +#ifdef CONFIG_FS_DAX +const struct file_operations ext4_dax_file_operations = { + .llseek = ext4_llseek, + .read = new_sync_read, + .write = new_sync_write,
[PATCH v10 21/21] brd: Rename XIP to DAX
From: Matthew Wilcox

Since this relates to FS_XIP, not KERNEL_XIP, it should be called DAX
instead of XIP.

Signed-off-by: Matthew Wilcox
---
 drivers/block/Kconfig | 13 +++--
 drivers/block/brd.c   | 14 +++---
 fs/Kconfig            |  4 ++--
 3 files changed, 16 insertions(+), 15 deletions(-)

diff --git a/drivers/block/Kconfig b/drivers/block/Kconfig
index 014a1cf..1b8094d 100644
--- a/drivers/block/Kconfig
+++ b/drivers/block/Kconfig
@@ -393,14 +393,15 @@ config BLK_DEV_RAM_SIZE
 	  The default value is 4096 kilobytes. Only change this if you know
 	  what you are doing.
 
-config BLK_DEV_XIP
-	bool "Support XIP filesystems on RAM block device"
-	depends on BLK_DEV_RAM
+config BLK_DEV_RAM_DAX
+	bool "Support Direct Access (DAX) to RAM block devices"
+	depends on BLK_DEV_RAM && FS_DAX
 	default n
 	help
-	  Support XIP filesystems (such as ext2 with XIP support on) on
-	  top of block ram device. This will slightly enlarge the kernel, and
-	  will prevent RAM block device backing store memory from being
+	  Support filesystems using DAX to access RAM block devices.  This
+	  avoids double-buffering data in the page cache before copying it
+	  to the block device.  Answering Y will slightly enlarge the kernel,
+	  and will prevent RAM block device backing store memory from being
 	  allocated from highmem (only a problem for highmem systems).
 
 config CDROM_PKTCDVD
diff --git a/drivers/block/brd.c b/drivers/block/brd.c
index fee10bf..344681a 100644
--- a/drivers/block/brd.c
+++ b/drivers/block/brd.c
@@ -97,13 +97,13 @@ static struct page *brd_insert_page(struct brd_device *brd, sector_t sector)
 	 * Must use NOIO because we don't want to recurse back into the
 	 * block or filesystem layers from page reclaim.
 	 *
-	 * Cannot support XIP and highmem, because our ->direct_access
-	 * routine for XIP must return memory that is always addressable.
-	 * If XIP was reworked to use pfns and kmap throughout, this
+	 * Cannot support DAX and highmem, because our ->direct_access
+	 * routine for DAX must return memory that is always addressable.
+	 * If DAX was reworked to use pfns and kmap throughout, this
 	 * restriction might be able to be lifted.
 	 */
 	gfp_flags = GFP_NOIO | __GFP_ZERO;
-#ifndef CONFIG_BLK_DEV_XIP
+#ifndef CONFIG_BLK_DEV_RAM_DAX
 	gfp_flags |= __GFP_HIGHMEM;
 #endif
 	page = alloc_page(gfp_flags);
@@ -369,7 +369,7 @@ static int brd_rw_page(struct block_device *bdev, sector_t sector,
 	return err;
 }
 
-#ifdef CONFIG_BLK_DEV_XIP
+#ifdef CONFIG_BLK_DEV_RAM_DAX
 static long brd_direct_access(struct block_device *bdev, sector_t sector,
 			void **kaddr, unsigned long *pfn, long size)
 {
@@ -388,6 +388,8 @@ static long brd_direct_access(struct block_device *bdev, sector_t sector,
 	 * file happens to be mapped to the next page of physical RAM */
 	return PAGE_SIZE;
 }
+#else
+#define brd_direct_access NULL
 #endif
 
 static int brd_ioctl(struct block_device *bdev, fmode_t mode,
@@ -428,9 +430,7 @@ static const struct block_device_operations brd_fops = {
 	.owner =		THIS_MODULE,
 	.rw_page =		brd_rw_page,
 	.ioctl =		brd_ioctl,
-#ifdef CONFIG_BLK_DEV_XIP
 	.direct_access =	brd_direct_access,
-#endif
 };
 
 /*
diff --git a/fs/Kconfig b/fs/Kconfig
index a9eb53d..117900f 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -34,7 +34,7 @@ source "fs/btrfs/Kconfig"
 source "fs/nilfs2/Kconfig"
 
 config FS_DAX
-	bool "Direct Access support"
+	bool "Direct Access (DAX) support"
 	depends on MMU
 	help
 	  Direct Access (DAX) can be used on memory-backed block devices.
@@ -45,7 +45,7 @@ config FS_DAX
 
 	  If you do not have a block device that is capable of using this,
 	  or if unsure, say N.  Saying Y will increase the size of the kernel
-	  by about 2kB.
+	  by about 5kB.
endif # BLOCK
-- 
2.0.0
[PATCH v10 16/21] Remove CONFIG_EXT2_FS_XIP and rename CONFIG_FS_XIP to CONFIG_FS_DAX
The fewer Kconfig options we have the better. Use the generic CONFIG_FS_DAX to enable XIP support in ext2 as well as in the core. Signed-off-by: Matthew Wilcox --- fs/Kconfig | 21 ++--- fs/Makefile| 2 +- fs/ext2/Kconfig| 11 --- fs/ext2/ext2.h | 2 +- fs/ext2/file.c | 4 ++-- fs/ext2/super.c| 4 ++-- include/linux/fs.h | 4 ++-- 7 files changed, 22 insertions(+), 26 deletions(-) diff --git a/fs/Kconfig b/fs/Kconfig index 312393f..a9eb53d 100644 --- a/fs/Kconfig +++ b/fs/Kconfig @@ -13,13 +13,6 @@ if BLOCK source "fs/ext2/Kconfig" source "fs/ext3/Kconfig" source "fs/ext4/Kconfig" - -config FS_XIP -# execute in place - bool - depends on EXT2_FS_XIP - default y - source "fs/jbd/Kconfig" source "fs/jbd2/Kconfig" @@ -40,6 +33,20 @@ source "fs/ocfs2/Kconfig" source "fs/btrfs/Kconfig" source "fs/nilfs2/Kconfig" +config FS_DAX + bool "Direct Access support" + depends on MMU + help + Direct Access (DAX) can be used on memory-backed block devices. + If the block device supports DAX and the filesystem supports DAX, + then you can avoid using the pagecache to buffer I/Os. Turning + on this option will compile in support for DAX; you will need to + mount the filesystem using the -o xip option. + + If you do not have a block device that is capable of using this, + or if unsure, say N. Saying Y will increase the size of the kernel + by about 2kB. 
+ endif # BLOCK # Posix ACL utility routines diff --git a/fs/Makefile b/fs/Makefile index 0325ec3..df4a4cf 100644 --- a/fs/Makefile +++ b/fs/Makefile @@ -28,7 +28,7 @@ obj-$(CONFIG_SIGNALFD)+= signalfd.o obj-$(CONFIG_TIMERFD) += timerfd.o obj-$(CONFIG_EVENTFD) += eventfd.o obj-$(CONFIG_AIO) += aio.o -obj-$(CONFIG_FS_XIP) += dax.o +obj-$(CONFIG_FS_DAX) += dax.o obj-$(CONFIG_FILE_LOCKING) += locks.o obj-$(CONFIG_COMPAT) += compat.o compat_ioctl.o obj-$(CONFIG_BINFMT_AOUT) += binfmt_aout.o diff --git a/fs/ext2/Kconfig b/fs/ext2/Kconfig index 14a6780..c634874e 100644 --- a/fs/ext2/Kconfig +++ b/fs/ext2/Kconfig @@ -42,14 +42,3 @@ config EXT2_FS_SECURITY If you are not using a security module that requires using extended attributes for file security labels, say N. - -config EXT2_FS_XIP - bool "Ext2 execute in place support" - depends on EXT2_FS && MMU - help - Execute in place can be used on memory-backed block devices. If you - enable this option, you can select to mount block devices which are - capable of this feature without using the page cache. - - If you do not use a block device that is capable of using this, - or if unsure, say N. 
diff --git a/fs/ext2/ext2.h b/fs/ext2/ext2.h index 5ecf570..b30c3bd 100644 --- a/fs/ext2/ext2.h +++ b/fs/ext2/ext2.h @@ -380,7 +380,7 @@ struct ext2_inode { #define EXT2_MOUNT_NO_UID320x000200 /* Disable 32-bit UIDs */ #define EXT2_MOUNT_XATTR_USER 0x004000 /* Extended user attributes */ #define EXT2_MOUNT_POSIX_ACL 0x008000 /* POSIX Access Control Lists */ -#ifdef CONFIG_FS_XIP +#ifdef CONFIG_FS_DAX #define EXT2_MOUNT_XIP 0x01 /* Execute in place */ #else #define EXT2_MOUNT_XIP 0 diff --git a/fs/ext2/file.c b/fs/ext2/file.c index da8dc64..46b333d 100644 --- a/fs/ext2/file.c +++ b/fs/ext2/file.c @@ -25,7 +25,7 @@ #include "xattr.h" #include "acl.h" -#ifdef CONFIG_EXT2_FS_XIP +#ifdef CONFIG_FS_DAX static int ext2_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf) { return dax_fault(vma, vmf, ext2_get_block); @@ -109,7 +109,7 @@ const struct file_operations ext2_file_operations = { .splice_write = iter_file_splice_write, }; -#ifdef CONFIG_EXT2_FS_XIP +#ifdef CONFIG_FS_DAX const struct file_operations ext2_xip_file_operations = { .llseek = generic_file_llseek, .read = new_sync_read, diff --git a/fs/ext2/super.c b/fs/ext2/super.c index 0393c6d..feb53d8 100644 --- a/fs/ext2/super.c +++ b/fs/ext2/super.c @@ -287,7 +287,7 @@ static int ext2_show_options(struct seq_file *seq, struct dentry *root) seq_puts(seq, ",grpquota"); #endif -#if defined(CONFIG_EXT2_FS_XIP) +#ifdef CONFIG_FS_DAX if (sbi->s_mount_opt & EXT2_MOUNT_XIP) seq_puts(seq, ",xip"); #endif @@ -549,7 +549,7 @@ static int parse_options(char *options, struct super_block *sb) break; #endif case Opt_xip: -#ifdef CONFIG_EXT2_FS_XIP +#ifdef CONFIG_FS_DAX set_opt (sbi->s_mount_opt, XIP); #else ext2_msg(sb, KERN_INFO, "xip option not supported"); diff --git a/include/linux/fs.h b/include/linux/fs.h index d73db11..e6b48cc 100644 --- a/include/linux/fs.h +++
[PATCH v10 01/21] axonram: Fix bug in direct_access
The 'pfn' returned by axonram was completely bogus, and has been since
2008.

Signed-off-by: Matthew Wilcox
Reviewed-by: Jan Kara
---
 arch/powerpc/sysdev/axonram.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/sysdev/axonram.c b/arch/powerpc/sysdev/axonram.c
index 47b6b9f..830edc8 100644
--- a/arch/powerpc/sysdev/axonram.c
+++ b/arch/powerpc/sysdev/axonram.c
@@ -156,7 +156,7 @@ axon_ram_direct_access(struct block_device *device, sector_t sector,
 	}
 
 	*kaddr = (void *)(bank->ph_addr + offset);
-	*pfn = virt_to_phys(kaddr) >> PAGE_SHIFT;
+	*pfn = virt_to_phys(*kaddr) >> PAGE_SHIFT;
 
 	return 0;
 }
-- 
2.0.0
[PATCH v10 12/21] Remove get_xip_mem
All callers of get_xip_mem() are now gone. Remove checks for it, initialisers of it, documentation of it and the only implementation of it. Also remove mm/filemap_xip.c as it is now empty. Signed-off-by: Matthew Wilcox --- Documentation/filesystems/Locking | 3 --- fs/exofs/inode.c | 1 - fs/ext2/inode.c | 1 - fs/ext2/xip.c | 45 --- fs/ext2/xip.h | 3 --- fs/open.c | 5 + include/linux/fs.h| 2 -- mm/Makefile | 1 - mm/fadvise.c | 6 -- mm/filemap_xip.c | 23 mm/madvise.c | 2 +- 11 files changed, 6 insertions(+), 86 deletions(-) delete mode 100644 mm/filemap_xip.c diff --git a/Documentation/filesystems/Locking b/Documentation/filesystems/Locking index f1997e9..226ccc3 100644 --- a/Documentation/filesystems/Locking +++ b/Documentation/filesystems/Locking @@ -197,8 +197,6 @@ prototypes: int (*releasepage) (struct page *, int); void (*freepage)(struct page *); int (*direct_IO)(int, struct kiocb *, struct iov_iter *iter, loff_t offset); - int (*get_xip_mem)(struct address_space *, pgoff_t, int, void **, - unsigned long *); int (*migratepage)(struct address_space *, struct page *, struct page *); int (*launder_page)(struct page *); int (*is_partially_uptodate)(struct page *, unsigned long, unsigned long); @@ -223,7 +221,6 @@ invalidatepage: yes releasepage: yes freepage: yes direct_IO: -get_xip_mem: maybe migratepage: yes (both) launder_page: yes is_partially_uptodate: yes diff --git a/fs/exofs/inode.c b/fs/exofs/inode.c index 3f9cafd..c408a53 100644 --- a/fs/exofs/inode.c +++ b/fs/exofs/inode.c @@ -985,7 +985,6 @@ const struct address_space_operations exofs_aops = { .direct_IO = exofs_direct_IO, /* With these NULL has special meaning or default is not exported */ - .get_xip_mem= NULL, .migratepage= NULL, .launder_page = NULL, .is_partially_uptodate = NULL, diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c index 5ac0a34..59d6c7d 100644 --- a/fs/ext2/inode.c +++ b/fs/ext2/inode.c @@ -894,7 +894,6 @@ const struct address_space_operations ext2_aops = { const struct 
address_space_operations ext2_aops_xip = { .bmap = ext2_bmap, - .get_xip_mem= ext2_get_xip_mem, .direct_IO = ext2_direct_IO, }; diff --git a/fs/ext2/xip.c b/fs/ext2/xip.c index 8cfca3a..132d4da 100644 --- a/fs/ext2/xip.c +++ b/fs/ext2/xip.c @@ -13,35 +13,6 @@ #include "ext2.h" #include "xip.h" -static inline long __inode_direct_access(struct inode *inode, sector_t block, - void **kaddr, unsigned long *pfn, long size) -{ - struct block_device *bdev = inode->i_sb->s_bdev; - sector_t sector = block * (PAGE_SIZE / 512); - return bdev_direct_access(bdev, sector, kaddr, pfn, size); -} - -static inline int -__ext2_get_block(struct inode *inode, pgoff_t pgoff, int create, - sector_t *result) -{ - struct buffer_head tmp; - int rc; - - memset(, 0, sizeof(struct buffer_head)); - tmp.b_size = 1 << inode->i_blkbits; - rc = ext2_get_block(inode, pgoff, , create); - *result = tmp.b_blocknr; - - /* did we get a sparse block (hole in the file)? */ - if (!tmp.b_blocknr && !rc) { - BUG_ON(create); - rc = -ENODATA; - } - - return rc; -} - void ext2_xip_verify_sb(struct super_block *sb) { struct ext2_sb_info *sbi = EXT2_SB(sb); @@ -54,19 +25,3 @@ void ext2_xip_verify_sb(struct super_block *sb) "not supported by bdev"); } } - -int ext2_get_xip_mem(struct address_space *mapping, pgoff_t pgoff, int create, - void **kmem, unsigned long *pfn) -{ - long rc; - sector_t block; - - /* first, retrieve the sector number */ - rc = __ext2_get_block(mapping->host, pgoff, create, ); - if (rc) - return rc; - - /* retrieve address of the target data */ - rc = __inode_direct_access(mapping->host, block, kmem, pfn, PAGE_SIZE); - return (rc < 0) ? 
rc : 0; -} diff --git a/fs/ext2/xip.h b/fs/ext2/xip.h index b2592f2..e7b9f0a 100644 --- a/fs/ext2/xip.h +++ b/fs/ext2/xip.h @@ -12,10 +12,7 @@ static inline int ext2_use_xip (struct super_block *sb) struct ext2_sb_info *sbi = EXT2_SB(sb); return (sbi->s_mount_opt & EXT2_MOUNT_XIP); } -int ext2_get_xip_mem(struct address_space *, pgoff_t, int, - void **, unsigned long *); #else #define ext2_xip_verify_sb(sb) do { }
[PATCH v10 15/21] ext2: Remove xip.c and xip.h
These files are now empty, so delete them Signed-off-by: Matthew Wilcox --- fs/ext2/Makefile | 1 - fs/ext2/inode.c | 1 - fs/ext2/namei.c | 1 - fs/ext2/super.c | 1 - fs/ext2/xip.c| 15 --- fs/ext2/xip.h| 16 6 files changed, 35 deletions(-) delete mode 100644 fs/ext2/xip.c delete mode 100644 fs/ext2/xip.h diff --git a/fs/ext2/Makefile b/fs/ext2/Makefile index f42af45..445b0e9 100644 --- a/fs/ext2/Makefile +++ b/fs/ext2/Makefile @@ -10,4 +10,3 @@ ext2-y := balloc.o dir.o file.o ialloc.o inode.o \ ext2-$(CONFIG_EXT2_FS_XATTR)+= xattr.o xattr_user.o xattr_trusted.o ext2-$(CONFIG_EXT2_FS_POSIX_ACL) += acl.o ext2-$(CONFIG_EXT2_FS_SECURITY) += xattr_security.o -ext2-$(CONFIG_EXT2_FS_XIP) += xip.o diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c index cba3833..154cbcf 100644 --- a/fs/ext2/inode.c +++ b/fs/ext2/inode.c @@ -34,7 +34,6 @@ #include #include "ext2.h" #include "acl.h" -#include "xip.h" #include "xattr.h" static int __ext2_write_inode(struct inode *inode, int do_sync); diff --git a/fs/ext2/namei.c b/fs/ext2/namei.c index 846c356..7ca803f 100644 --- a/fs/ext2/namei.c +++ b/fs/ext2/namei.c @@ -35,7 +35,6 @@ #include "ext2.h" #include "xattr.h" #include "acl.h" -#include "xip.h" static inline int ext2_add_nondir(struct dentry *dentry, struct inode *inode) { diff --git a/fs/ext2/super.c b/fs/ext2/super.c index d862031..0393c6d 100644 --- a/fs/ext2/super.c +++ b/fs/ext2/super.c @@ -35,7 +35,6 @@ #include "ext2.h" #include "xattr.h" #include "acl.h" -#include "xip.h" static void ext2_sync_super(struct super_block *sb, struct ext2_super_block *es, int wait); diff --git a/fs/ext2/xip.c b/fs/ext2/xip.c deleted file mode 100644 index 66ca113..000 --- a/fs/ext2/xip.c +++ /dev/null @@ -1,15 +0,0 @@ -/* - * linux/fs/ext2/xip.c - * - * Copyright (C) 2005 IBM Corporation - * Author: Carsten Otte (co...@de.ibm.com) - */ - -#include -#include -#include -#include -#include -#include "ext2.h" -#include "xip.h" - diff --git a/fs/ext2/xip.h b/fs/ext2/xip.h deleted file mode 100644 index 
87eeb04..000
--- a/fs/ext2/xip.h
+++ /dev/null
@@ -1,16 +0,0 @@
-/*
- * linux/fs/ext2/xip.h
- *
- * Copyright (C) 2005 IBM Corporation
- * Author: Carsten Otte (co...@de.ibm.com)
- */
-
-#ifdef CONFIG_EXT2_FS_XIP
-static inline int ext2_use_xip (struct super_block *sb)
-{
-	struct ext2_sb_info *sbi = EXT2_SB(sb);
-	return (sbi->s_mount_opt & EXT2_MOUNT_XIP);
-}
-#else
-#define ext2_use_xip(sb)	0
-#endif
-- 
2.0.0
[PATCH v10 18/21] Get rid of most mentions of XIP in ext2
To help people transition, accept the 'xip' mount option (and report it in /proc/mounts), but print a message encouraging people to switch over to the 'dax' option. --- fs/ext2/ext2.h | 13 +++-- fs/ext2/file.c | 2 +- fs/ext2/inode.c | 6 +++--- fs/ext2/namei.c | 8 fs/ext2/super.c | 25 - 5 files changed, 31 insertions(+), 23 deletions(-) diff --git a/fs/ext2/ext2.h b/fs/ext2/ext2.h index b8b1c11..46133a0 100644 --- a/fs/ext2/ext2.h +++ b/fs/ext2/ext2.h @@ -380,14 +380,15 @@ struct ext2_inode { #define EXT2_MOUNT_NO_UID320x000200 /* Disable 32-bit UIDs */ #define EXT2_MOUNT_XATTR_USER 0x004000 /* Extended user attributes */ #define EXT2_MOUNT_POSIX_ACL 0x008000 /* POSIX Access Control Lists */ -#ifdef CONFIG_FS_DAX -#define EXT2_MOUNT_XIP 0x01 /* Execute in place */ -#else -#define EXT2_MOUNT_XIP 0 -#endif +#define EXT2_MOUNT_XIP 0x01 /* Obsolete, use DAX */ #define EXT2_MOUNT_USRQUOTA0x02 /* user quota */ #define EXT2_MOUNT_GRPQUOTA0x04 /* group quota */ #define EXT2_MOUNT_RESERVATION 0x08 /* Preallocation */ +#ifdef CONFIG_FS_DAX +#define EXT2_MOUNT_DAX 0x10 /* Direct Access */ +#else +#define EXT2_MOUNT_DAX 0 +#endif #define clear_opt(o, opt) o &= ~EXT2_MOUNT_##opt @@ -789,7 +790,7 @@ extern int ext2_fsync(struct file *file, loff_t start, loff_t end, int datasync); extern const struct inode_operations ext2_file_inode_operations; extern const struct file_operations ext2_file_operations; -extern const struct file_operations ext2_xip_file_operations; +extern const struct file_operations ext2_dax_file_operations; /* inode.c */ extern const struct address_space_operations ext2_aops; diff --git a/fs/ext2/file.c b/fs/ext2/file.c index 46b333d..5b8cab5 100644 --- a/fs/ext2/file.c +++ b/fs/ext2/file.c @@ -110,7 +110,7 @@ const struct file_operations ext2_file_operations = { }; #ifdef CONFIG_FS_DAX -const struct file_operations ext2_xip_file_operations = { +const struct file_operations ext2_dax_file_operations = { .llseek = generic_file_llseek, .read = new_sync_read, .write 
= new_sync_write, diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c index 034fd42..6434bc0 100644 --- a/fs/ext2/inode.c +++ b/fs/ext2/inode.c @@ -1286,7 +1286,7 @@ void ext2_set_inode_flags(struct inode *inode) inode->i_flags |= S_NOATIME; if (flags & EXT2_DIRSYNC_FL) inode->i_flags |= S_DIRSYNC; - if (test_opt(inode->i_sb, XIP)) + if (test_opt(inode->i_sb, DAX)) inode->i_flags |= S_DAX; } @@ -1388,9 +1388,9 @@ struct inode *ext2_iget (struct super_block *sb, unsigned long ino) if (S_ISREG(inode->i_mode)) { inode->i_op = _file_inode_operations; - if (test_opt(inode->i_sb, XIP)) { + if (test_opt(inode->i_sb, DAX)) { inode->i_mapping->a_ops = _aops; - inode->i_fop = _xip_file_operations; + inode->i_fop = _dax_file_operations; } else if (test_opt(inode->i_sb, NOBH)) { inode->i_mapping->a_ops = _nobh_aops; inode->i_fop = _file_operations; diff --git a/fs/ext2/namei.c b/fs/ext2/namei.c index 0db888c..148f6e3 100644 --- a/fs/ext2/namei.c +++ b/fs/ext2/namei.c @@ -104,9 +104,9 @@ static int ext2_create (struct inode * dir, struct dentry * dentry, umode_t mode return PTR_ERR(inode); inode->i_op = _file_inode_operations; - if (test_opt(inode->i_sb, XIP)) { + if (test_opt(inode->i_sb, DAX)) { inode->i_mapping->a_ops = _aops; - inode->i_fop = _xip_file_operations; + inode->i_fop = _dax_file_operations; } else if (test_opt(inode->i_sb, NOBH)) { inode->i_mapping->a_ops = _nobh_aops; inode->i_fop = _file_operations; @@ -125,9 +125,9 @@ static int ext2_tmpfile(struct inode *dir, struct dentry *dentry, umode_t mode) return PTR_ERR(inode); inode->i_op = _file_inode_operations; - if (test_opt(inode->i_sb, XIP)) { + if (test_opt(inode->i_sb, DAX)) { inode->i_mapping->a_ops = _aops; - inode->i_fop = _xip_file_operations; + inode->i_fop = _dax_file_operations; } else if (test_opt(inode->i_sb, NOBH)) { inode->i_mapping->a_ops = _nobh_aops; inode->i_fop = _file_operations; diff --git a/fs/ext2/super.c b/fs/ext2/super.c index feb53d8..8b9debf 100644 --- a/fs/ext2/super.c +++ 
b/fs/ext2/super.c @@ -290,6 +290,8 @@ static int ext2_show_options(struct seq_file *seq, struct dentry *root) #ifdef CONFIG_FS_DAX if (sbi->s_mount_opt &
[PATCH v10 09/21] Replace the XIP page fault handler with the DAX page fault handler
Instead of calling aops->get_xip_mem from the fault handler, the
filesystem passes a get_block_t that is used to find the appropriate
blocks.

Signed-off-by: Matthew Wilcox
Reviewed-by: Jan Kara
---
 fs/dax.c           | 215 +
 fs/ext2/file.c     | 35 -
 include/linux/fs.h | 4 +-
 mm/filemap_xip.c   | 206 --
 4 files changed, 251 insertions(+), 209 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 02e226f..f134078 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -19,9 +19,13 @@
 #include
 #include
 #include
+#include
+#include
+#include
 #include
 #include
 #include
+#include
 
 int dax_clear_blocks(struct inode *inode, sector_t block, long size)
 {
@@ -64,6 +68,14 @@ static long dax_get_addr(struct buffer_head *bh, void **addr, unsigned blkbits)
 	return bdev_direct_access(bh->b_bdev, sector, addr, &pfn, bh->b_size);
 }
 
+static long dax_get_pfn(struct buffer_head *bh, unsigned long *pfn,
+						unsigned blkbits)
+{
+	void *addr;
+	sector_t sector = bh->b_blocknr << (blkbits - 9);
+	return bdev_direct_access(bh->b_bdev, sector, &addr, pfn, bh->b_size);
+}
+
 static void dax_new_buf(void *addr, unsigned size, unsigned first, loff_t pos,
 			loff_t end)
 {
@@ -228,3 +240,206 @@ ssize_t dax_do_io(int rw, struct kiocb *iocb, struct inode *inode,
 	return retval;
 }
 EXPORT_SYMBOL_GPL(dax_do_io);
+
+/*
+ * The user has performed a load from a hole in the file.  Allocating
+ * a new page in the file would cause excessive storage usage for
+ * workloads with sparse files.  We allocate a page cache page instead.
+ * We'll kick it out of the page cache if it's ever written to,
+ * otherwise it will simply fall out of the page cache under memory
+ * pressure without ever having been dirtied.
+ */
+static int dax_load_hole(struct address_space *mapping, struct page *page,
+							struct vm_fault *vmf)
+{
+	unsigned long size;
+	struct inode *inode = mapping->host;
+	if (!page)
+		page = find_or_create_page(mapping, vmf->pgoff,
+						GFP_KERNEL | __GFP_ZERO);
+	if (!page)
+		return VM_FAULT_OOM;
+	/* Recheck i_size under page lock to avoid truncate race */
+	size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
+	if (vmf->pgoff >= size) {
+		unlock_page(page);
+		page_cache_release(page);
+		return VM_FAULT_SIGBUS;
+	}
+
+	vmf->page = page;
+	return VM_FAULT_LOCKED;
+}
+
+static int copy_user_bh(struct page *to, struct buffer_head *bh,
+			unsigned blkbits, unsigned long vaddr)
+{
+	void *vfrom, *vto;
+	if (dax_get_addr(bh, &vfrom, blkbits) < 0)
+		return -EIO;
+	vto = kmap_atomic(to);
+	copy_user_page(vto, vfrom, vaddr, to);
+	kunmap_atomic(vto);
+	return 0;
+}
+
+static int do_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
+			get_block_t get_block)
+{
+	struct file *file = vma->vm_file;
+	struct inode *inode = file_inode(file);
+	struct address_space *mapping = file->f_mapping;
+	struct page *page;
+	struct buffer_head bh;
+	unsigned long vaddr = (unsigned long)vmf->virtual_address;
+	unsigned blkbits = inode->i_blkbits;
+	sector_t block;
+	pgoff_t size;
+	unsigned long pfn;
+	int error;
+	int major = 0;
+
+	size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
+	if (vmf->pgoff >= size)
+		return VM_FAULT_SIGBUS;
+
+	memset(&bh, 0, sizeof(bh));
+	block = (sector_t)vmf->pgoff << (PAGE_SHIFT - blkbits);
+	bh.b_size = PAGE_SIZE;
+
+ repeat:
+	page = find_get_page(mapping, vmf->pgoff);
+	if (page) {
+		if (!lock_page_or_retry(page, vma->vm_mm, vmf->flags)) {
+			page_cache_release(page);
+			return VM_FAULT_RETRY;
+		}
+		if (unlikely(page->mapping != mapping)) {
+			unlock_page(page);
+			page_cache_release(page);
+			goto repeat;
+		}
+	}
+
+	error = get_block(inode, block, &bh, 0);
+	if (!error && (bh.b_size < PAGE_SIZE))
+		error = -EIO;
+	if (error)
+		goto unlock_page;
+
+	if (!buffer_written(&bh) && !vmf->cow_page) {
+		if (vmf->flags & FAULT_FLAG_WRITE) {
+			error = get_block(inode, block, &bh, 1);
+			count_vm_event(PGMAJFAULT);
+			mem_cgroup_count_vm_event(vma->vm_mm, PGMAJFAULT);
+			major = VM_FAULT_MAJOR;
+			if (!error && (bh.b_size < PAGE_SIZE))
+
[PATCH v10 14/21] ext2: Remove ext2_use_xip
Replace ext2_use_xip() with test_opt(XIP), which expands to the same code.

Signed-off-by: Matthew Wilcox
---
 fs/ext2/ext2.h  | 4
 fs/ext2/inode.c | 2 +-
 fs/ext2/namei.c | 4 ++--
 3 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/fs/ext2/ext2.h b/fs/ext2/ext2.h
index d9a17d0..5ecf570 100644
--- a/fs/ext2/ext2.h
+++ b/fs/ext2/ext2.h
@@ -380,7 +380,11 @@ struct ext2_inode {
 #define EXT2_MOUNT_NO_UID32		0x000200  /* Disable 32-bit UIDs */
 #define EXT2_MOUNT_XATTR_USER		0x004000  /* Extended user attributes */
 #define EXT2_MOUNT_POSIX_ACL		0x008000  /* POSIX Access Control Lists */
+#ifdef CONFIG_FS_XIP
 #define EXT2_MOUNT_XIP			0x010000  /* Execute in place */
+#else
+#define EXT2_MOUNT_XIP			0
+#endif
 #define EXT2_MOUNT_USRQUOTA		0x020000  /* user quota */
 #define EXT2_MOUNT_GRPQUOTA		0x040000  /* group quota */
 #define EXT2_MOUNT_RESERVATION		0x080000  /* Preallocation */
diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index 59d6c7d..cba3833 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -1394,7 +1394,7 @@ struct inode *ext2_iget (struct super_block *sb, unsigned long ino)
 
 	if (S_ISREG(inode->i_mode)) {
 		inode->i_op = &ext2_file_inode_operations;
-		if (ext2_use_xip(inode->i_sb)) {
+		if (test_opt(inode->i_sb, XIP)) {
 			inode->i_mapping->a_ops = &ext2_aops_xip;
 			inode->i_fop = &ext2_xip_file_operations;
 		} else if (test_opt(inode->i_sb, NOBH)) {
diff --git a/fs/ext2/namei.c b/fs/ext2/namei.c
index c268d0a..846c356 100644
--- a/fs/ext2/namei.c
+++ b/fs/ext2/namei.c
@@ -105,7 +105,7 @@ static int ext2_create (struct inode * dir, struct dentry * dentry, umode_t mode
 		return PTR_ERR(inode);
 
 	inode->i_op = &ext2_file_inode_operations;
-	if (ext2_use_xip(inode->i_sb)) {
+	if (test_opt(inode->i_sb, XIP)) {
 		inode->i_mapping->a_ops = &ext2_aops_xip;
 		inode->i_fop = &ext2_xip_file_operations;
 	} else if (test_opt(inode->i_sb, NOBH)) {
@@ -126,7 +126,7 @@ static int ext2_tmpfile(struct inode *dir, struct dentry *dentry, umode_t mode)
 		return PTR_ERR(inode);
 
 	inode->i_op = &ext2_file_inode_operations;
-	if (ext2_use_xip(inode->i_sb)) {
+	if (test_opt(inode->i_sb, XIP)) {
 		inode->i_mapping->a_ops = &ext2_aops_xip;
 		inode->i_fop = &ext2_xip_file_operations;
 	} else if (test_opt(inode->i_sb, NOBH)) {
-- 
2.0.0
[PATCH v10 08/21] Replace ext2_clear_xip_target with dax_clear_blocks
This is practically generic code; other filesystems will want to call
it from other places, but there's nothing ext2-specific about it.  Make
it a little more generic by allowing it to take a count of the number
of bytes to zero rather than fixing it to a single page.

Thanks to Dave Hansen for suggesting that I need to call cond_resched()
if zeroing more than one page.

Signed-off-by: Matthew Wilcox
---
 fs/dax.c           | 35 +++
 fs/ext2/inode.c    | 8 +---
 fs/ext2/xip.c      | 14 --
 fs/ext2/xip.h      | 3 ---
 include/linux/fs.h | 6 ++
 5 files changed, 46 insertions(+), 20 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 108c68e..02e226f 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -20,8 +20,43 @@
 #include
 #include
 #include
+#include
 #include
 
+int dax_clear_blocks(struct inode *inode, sector_t block, long size)
+{
+	struct block_device *bdev = inode->i_sb->s_bdev;
+	sector_t sector = block << (inode->i_blkbits - 9);
+
+	might_sleep();
+	do {
+		void *addr;
+		unsigned long pfn;
+		long count;
+
+		count = bdev_direct_access(bdev, sector, &addr, &pfn, size);
+		if (count < 0)
+			return count;
+		while (count > 0) {
+			unsigned pgsz = PAGE_SIZE - offset_in_page(addr);
+			if (pgsz > count)
+				pgsz = count;
+			if (pgsz < PAGE_SIZE)
+				memset(addr, 0, pgsz);
+			else
+				clear_page(addr);
+			addr += pgsz;
+			size -= pgsz;
+			count -= pgsz;
+			sector += pgsz / 512;
+			cond_resched();
+		}
+	} while (size);
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(dax_clear_blocks);
+
 static long dax_get_addr(struct buffer_head *bh, void **addr, unsigned blkbits)
 {
 	unsigned long pfn;
diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index 3ccd5fd..52978b8 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -733,10 +733,12 @@ static int ext2_get_blocks(struct inode *inode,
 
 		if (IS_DAX(inode)) {
 			/*
-			 * we need to clear the block
+			 * block must be initialised before we put it in the
+			 * tree so that it's not found by another thread
+			 * before it's initialised
 			 */
-			err = ext2_clear_xip_target (inode,
-					le32_to_cpu(chain[depth-1].key));
+			err = dax_clear_blocks(inode, le32_to_cpu(chain[depth-1].key),
+						1 << inode->i_blkbits);
 			if (err) {
 				mutex_unlock(&ei->truncate_mutex);
 				goto cleanup;
diff --git a/fs/ext2/xip.c b/fs/ext2/xip.c
index bbc5fec..8cfca3a 100644
--- a/fs/ext2/xip.c
+++ b/fs/ext2/xip.c
@@ -42,20 +42,6 @@ __ext2_get_block(struct inode *inode, pgoff_t pgoff, int create,
 	return rc;
 }
 
-int
-ext2_clear_xip_target(struct inode *inode, sector_t block)
-{
-	void *kaddr;
-	unsigned long pfn;
-	long size;
-
-	size = __inode_direct_access(inode, block, &kaddr, &pfn, PAGE_SIZE);
-	if (size < 0)
-		return size;
-	clear_page(kaddr);
-	return 0;
-}
-
 void ext2_xip_verify_sb(struct super_block *sb)
 {
 	struct ext2_sb_info *sbi = EXT2_SB(sb);
diff --git a/fs/ext2/xip.h b/fs/ext2/xip.h
index 29be737..b2592f2 100644
--- a/fs/ext2/xip.h
+++ b/fs/ext2/xip.h
@@ -7,8 +7,6 @@
 #ifdef CONFIG_EXT2_FS_XIP
 extern void ext2_xip_verify_sb (struct super_block *);
 
-extern int ext2_clear_xip_target (struct inode *, sector_t);
-
 static inline int ext2_use_xip (struct super_block *sb)
 {
 	struct ext2_sb_info *sbi = EXT2_SB(sb);
@@ -19,6 +17,5 @@ int ext2_get_xip_mem(struct address_space *, pgoff_t, int,
 #else
 #define ext2_xip_verify_sb(sb) do { } while (0)
 #define ext2_use_xip(sb) 0
-#define ext2_clear_xip_target(inode, chain)	0
 #define ext2_get_xip_mem NULL
 #endif
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 45839e8..c04d371 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2490,11 +2490,17 @@ extern int generic_file_open(struct inode * inode, struct file * filp);
 extern int nonseekable_open(struct inode * inode, struct file * filp);
 
 #ifdef CONFIG_FS_XIP
+int dax_clear_blocks(struct inode *, sector_t block, long size);
 extern int xip_file_mmap(struct file * file, struct vm_area_struct * vma);
 extern int xip_truncate_page(struct address_space *mapping, loff_t from);
 ssize_t dax_do_io(int rw, struct kiocb *, struct inode *, struct iov_iter *,
 		loff_t, get_block_t, dio_iodone_t, int flags);
 #else
+static inline int
Re: [PATCH RFC v7 net-next 00/28] BPF syscall
On Tue, Aug 26, 2014 at 8:56 PM, Andy Lutomirski wrote:
> On Aug 26, 2014 7:29 PM, "Alexei Starovoitov" wrote:
>>
>> Hi Ingo, David,
>>
>> posting whole thing again as RFC to get feedback on syscall only.
>> If syscall bpf(int cmd, union bpf_attr *attr, unsigned int size) is ok,
>> I'll split them into small chunks as requested and will repost without RFC.
>
> IMO it's much easier to review a syscall if we just look at a
> specification of what it does.  The code is, in some sense, secondary.

'specification of what it does'... hmm, you mean beyond what's there in
commit logs and in Documentation/networking/filter.txt?
Aren't samples at the end give an idea on 'what it does'?
I'm happy to add 'specification', I just don't understand yet what it
suppose to talk about beyond what's already written.
I understand that the patches are missing explanation on 'why' the
syscall is being added, but I don't think it's what you're asking...
[PATCH v10 04/21] Allow page fault handlers to perform the COW
Currently COW of an XIP file is done by first bringing in a read-only mapping, then retrying the fault and copying the page. It is much more efficient to tell the fault handler that a COW is being attempted (by passing in the pre-allocated page in the vm_fault structure), and allow the handler to perform the COW operation itself. The handler cannot insert the page itself if there is already a read-only mapping at that address, so allow the handler to return VM_FAULT_LOCKED and set the fault_page to be NULL. This indicates to the MM code that the i_mmap_mutex is held instead of the page lock. Signed-off-by: Matthew Wilcox Acked-by: Kirill A. Shutemov --- include/linux/mm.h | 1 + mm/memory.c| 33 - 2 files changed, 25 insertions(+), 9 deletions(-) diff --git a/include/linux/mm.h b/include/linux/mm.h index 8981cc8..0a47817 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -208,6 +208,7 @@ struct vm_fault { pgoff_t pgoff; /* Logical page offset based on vma */ void __user *virtual_address; /* Faulting virtual address */ + struct page *cow_page; /* Handler may choose to COW */ struct page *page; /* ->fault handlers should return a * page here, unless VM_FAULT_NOPAGE * is set (which is also implied by diff --git a/mm/memory.c b/mm/memory.c index adeac30..3368785 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -2000,6 +2000,7 @@ static int do_page_mkwrite(struct vm_area_struct *vma, struct page *page, vmf.pgoff = page->index; vmf.flags = FAULT_FLAG_WRITE|FAULT_FLAG_MKWRITE; vmf.page = page; + vmf.cow_page = NULL; ret = vma->vm_ops->page_mkwrite(vma, ); if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE))) @@ -2698,7 +2699,8 @@ oom: * See filemap_fault() and __lock_page_retry(). 
 */
static int __do_fault(struct vm_area_struct *vma, unsigned long address,
-			pgoff_t pgoff, unsigned int flags, struct page **page)
+			pgoff_t pgoff, unsigned int flags,
+			struct page *cow_page, struct page **page)
 {
 	struct vm_fault vmf;
 	int ret;

@@ -2707,10 +2709,13 @@ static int __do_fault(struct vm_area_struct *vma, unsigned long address,
 	vmf.pgoff = pgoff;
 	vmf.flags = flags;
 	vmf.page = NULL;
+	vmf.cow_page = cow_page;

 	ret = vma->vm_ops->fault(vma, &vmf);
 	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
 		return ret;
+	if (!vmf.page)
+		goto out;

 	if (unlikely(PageHWPoison(vmf.page))) {
 		if (ret & VM_FAULT_LOCKED)
@@ -2724,6 +2729,7 @@ static int __do_fault(struct vm_area_struct *vma, unsigned long address,
 	else
 		VM_BUG_ON_PAGE(!PageLocked(vmf.page), vmf.page);

+ out:
 	*page = vmf.page;
 	return ret;
 }
@@ -2897,7 +2903,7 @@ static int do_read_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 		pte_unmap_unlock(pte, ptl);
 	}

-	ret = __do_fault(vma, address, pgoff, flags, &fault_page);
+	ret = __do_fault(vma, address, pgoff, flags, NULL, &fault_page);
 	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
 		return ret;

@@ -2937,26 +2943,35 @@ static int do_cow_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 		return VM_FAULT_OOM;
 	}

-	ret = __do_fault(vma, address, pgoff, flags, &fault_page);
+	ret = __do_fault(vma, address, pgoff, flags, new_page, &fault_page);
 	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
 		goto uncharge_out;

-	copy_user_highpage(new_page, fault_page, address, vma);
+	if (fault_page)
+		copy_user_highpage(new_page, fault_page, address, vma);
 	__SetPageUptodate(new_page);

 	pte = pte_offset_map_lock(mm, pmd, address, &ptl);
 	if (unlikely(!pte_same(*pte, orig_pte))) {
 		pte_unmap_unlock(pte, ptl);
-		unlock_page(fault_page);
-		page_cache_release(fault_page);
+		if (fault_page) {
+			unlock_page(fault_page);
+			page_cache_release(fault_page);
+		} else {
+			mutex_unlock(&vma->vm_file->f_mapping->i_mmap_mutex);
+		}
 		goto uncharge_out;
 	}
 	do_set_pte(vma, address, new_page, pte, true, true);
 	mem_cgroup_commit_charge(new_page, memcg, false);
 	lru_cache_add_active_or_unevictable(new_page, vma);
 	pte_unmap_unlock(pte, ptl);
-	unlock_page(fault_page);
-	page_cache_release(fault_page);
+	if (fault_page) {
+		unlock_page(fault_page);
+		page_cache_release(fault_page);
+	} else {
+
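The cow_page plumbing above can be illustrated outside the kernel. The sketch below is a userspace model (all names and the fixed page size are my own simplifications, not kernel API): a fault handler either returns a backing page for the caller to copy, or fills the pre-allocated COW page itself and returns no page, which is what a DAX-style handler with no struct page needs.

```c
#include <assert.h>
#include <string.h>

#define PAGE_SIZE_SIM 4096

struct vm_fault_sim {
	char *cow_page;		/* pre-allocated destination page */
	const char *page;	/* backing page, or NULL if the handler
				 * copied into cow_page itself */
};

/* A DAX-like fault handler: no struct page exists for the storage, so
 * it copies the data into cow_page directly and reports no page. */
static void dax_like_fault(struct vm_fault_sim *vmf, const char *backing)
{
	memcpy(vmf->cow_page, backing, PAGE_SIZE_SIM);
	vmf->page = NULL;
}

/* Caller-side logic mirroring the do_cow_fault() hunk: only copy when
 * the handler actually returned a page to copy from. */
static void finish_cow_fault(struct vm_fault_sim *vmf)
{
	if (vmf->page)
		memcpy(vmf->cow_page, vmf->page, PAGE_SIZE_SIM);
	/* else: handler already filled cow_page, nothing to do */
}
```

Either way the COW page ends up with the file contents; the kernel patch additionally handles the locking and refcounting this sketch omits.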
[RFC PATCH v2] tpm_tis: verify interrupt during init
On Mon, 25 Aug 2014, Jason Gunthorpe wrote:
> On Mon, Aug 25, 2014, Scot Doyle wrote:
>> 3. Custom SeaBIOS. Blacklist the tpm_tis module so that it doesn't load
>>    and therefore doesn't issue startup(clear) to the TPM chip.
>
> It seems to me at least in this case you should be able to get rid of
> the IRQ entry, people are going to be flashing the custom SeaBIOS
> anyhow.

The person building many of these custom SeaBIOS packages has removed
the TPM section from the DSDT, so this may be addressed.

On Mon, 25 Aug 2014, Jason Gunthorpe wrote:
> I think you'll have to directly test in the tis driver if the
> interrupt is working.
>
> The ordering in the TIS driver is wrong, interrupts should be turned
> on before any TPM commands are issued. This is what other drivers are
> doing.
>
> If you fix this, tis can then just count interrupts received and check
> if that is 0 to detect failure and then turn them off.

How about something like this? It doesn't enable stock SeaBIOS machines
to suspend/resume before the 30 second interrupt timeout, unless using
interrupts=0 or force=1.
---
diff --git a/drivers/char/tpm/tpm_tis.c b/drivers/char/tpm/tpm_tis.c
index 2c46734..ae701d8 100644
--- a/drivers/char/tpm/tpm_tis.c
+++ b/drivers/char/tpm/tpm_tis.c
@@ -493,6 +493,8 @@ static irqreturn_t tis_int_probe(int irq, void *dev_id)
 	return IRQ_HANDLED;
 }

+static bool interrupted = false;
+
 static irqreturn_t tis_int_handler(int dummy, void *dev_id)
 {
 	struct tpm_chip *chip = dev_id;
@@ -511,6 +513,8 @@ static irqreturn_t tis_int_handler(int dummy, void *dev_id)
 		for (i = 0; i < 5; i++)
 			if (check_locality(chip, i) >= 0)
 				break;
+	if (interrupt & TPM_INTF_CMD_READY_INT)
+		interrupted = true;
 	if (interrupt &
 	    (TPM_INTF_LOCALITY_CHANGE_INT | TPM_INTF_STS_VALID_INT |
 	     TPM_INTF_CMD_READY_INT))
@@ -612,12 +616,6 @@ static int tpm_tis_init(struct device *dev, resource_size_t start,
 		goto out_err;
 	}

-	if (tpm_do_selftest(chip)) {
-		dev_err(dev, "TPM self test failed\n");
-		rc = -ENODEV;
-		goto out_err;
-	}
-
 	/* INTERRUPT Setup */
 	init_waitqueue_head(&chip->vendor.read_queue);
 	init_waitqueue_head(&chip->vendor.int_queue);
@@ -693,7 +691,7 @@ static int tpm_tis_init(struct device *dev, resource_size_t start,
 			free_irq(i, chip);
 		}
 	}
-	if (chip->vendor.irq) {
+	if (interrupts && chip->vendor.irq) {
 		iowrite8(chip->vendor.irq,
 			 chip->vendor.iobase +
 			 TPM_INT_VECTOR(chip->vendor.locality));
@@ -719,6 +717,32 @@ static int tpm_tis_init(struct device *dev, resource_size_t start,
 		}
 	}

+	/* Test interrupt and/or prepare for later save state */
+	interrupted = false;
+	if (tpm_do_selftest(chip)) {
+		if (!interrupts || interrupted) {
+			dev_err(dev, "TPM self test failed\n");
+			rc = -ENODEV;
+			goto out_err;
+		} else {
+			/* Turn off interrupt */
+			iowrite32(intmask,
+				  chip->vendor.iobase +
+				  TPM_INT_ENABLE(chip->vendor.locality));
+			free_irq(chip->vendor.irq, chip);
+
+			/* Retry in polling mode */
+			chip->vendor.irq = 0;
+			if (tpm_do_selftest(chip)) {
+				dev_err(dev, "TPM self test failed\n");
+				rc = -ENODEV;
+				goto out_err;
+			} else {
+				dev_err(dev, "ACPI DSDT entry incorrect, polling instead\n");
+			}
+		}
+	}
+
 	INIT_LIST_HEAD(&chip->vendor.list);
 	mutex_lock(&tis_lock);
 	list_add(&chip->vendor.list, &tis_chips);

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
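The decision the patch makes at init time can be modeled in plain C. This is a userspace sketch of the control flow only (names and the enum are mine): a selftest failure is fatal if interrupts were disabled or at least one interrupt actually arrived; otherwise the likely culprit is a wrong DSDT IRQ entry, so the driver falls back to polling and retries.

```c
#include <assert.h>

enum tis_action { TIS_OK, TIS_FAIL, TIS_RETRY_POLLING };

/* Userspace model of the selftest outcome handling in the patch. */
static enum tis_action selftest_outcome(int selftest_failed,
					int interrupts_enabled,
					int interrupt_seen)
{
	if (!selftest_failed)
		return TIS_OK;
	if (!interrupts_enabled || interrupt_seen)
		return TIS_FAIL;		/* genuine selftest failure */
	return TIS_RETRY_POLLING;	/* irq never fired: suspect DSDT */
}
```

The retry path mirrors the patch: mask the interrupt, free the IRQ, clear chip->vendor.irq, and run the selftest again in polling mode.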
[PATCH v10 19/21] xip: Add xip_zero_page_range
This new function allows us to support hole-punch for XIP files by
zeroing a partial page, as opposed to the xip_truncate_page() function
which can only truncate to the end of the page.  Reimplement
xip_truncate_page() as a macro that calls xip_zero_page_range().

Signed-off-by: Matthew Wilcox
[ported to 3.13-rc2]
Signed-off-by: Ross Zwisler
---
 Documentation/filesystems/dax.txt |  1 +
 fs/dax.c                          | 20 ++--
 include/linux/fs.h                |  9 -
 3 files changed, 23 insertions(+), 7 deletions(-)

diff --git a/Documentation/filesystems/dax.txt b/Documentation/filesystems/dax.txt
index 635adaa..ebcd97f 100644
--- a/Documentation/filesystems/dax.txt
+++ b/Documentation/filesystems/dax.txt
@@ -62,6 +62,7 @@ Filesystem support consists of
   for fault and page_mkwrite (which should probably call dax_fault() and
   dax_mkwrite(), passing the appropriate get_block() callback)
 - calling dax_truncate_page() instead of block_truncate_page() for DAX files
+- calling dax_zero_page_range() instead of zero_user() for DAX files
 - ensuring that there is sufficient locking between reads, writes,
   truncates and page faults

diff --git a/fs/dax.c b/fs/dax.c
index d54f7d3..96c4fed 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -445,13 +445,16 @@ int dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
 EXPORT_SYMBOL_GPL(dax_fault);

 /**
- * dax_truncate_page - handle a partial page being truncated in a DAX file
+ * dax_zero_page_range - zero a range within a page of a DAX file
  * @inode: The file being truncated
  * @from: The file offset that is being truncated to
+ * @length: The number of bytes to zero
  * @get_block: The filesystem method used to translate file offsets to blocks
  *
- * Similar to block_truncate_page(), this function can be called by a
- * filesystem when it is truncating an DAX file to handle the partial page.
+ * This function can be called by a filesystem when it is zeroing part of a
+ * page in a DAX file.  This is intended for hole-punch operations.  If
+ * you are truncating a file, the helper function dax_truncate_page() may be
+ * more convenient.
  *
  * We work in terms of PAGE_CACHE_SIZE here for commonality with
  * block_truncate_page(), but we could go down to PAGE_SIZE if the filesystem
@@ -459,12 +462,12 @@ EXPORT_SYMBOL_GPL(dax_fault);
  * block size is smaller than PAGE_SIZE, we have to zero the rest of the page
  * since the file might be mmaped.
  */
-int dax_truncate_page(struct inode *inode, loff_t from, get_block_t get_block)
+int dax_zero_page_range(struct inode *inode, loff_t from, unsigned length,
+			get_block_t get_block)
 {
 	struct buffer_head bh;
 	pgoff_t index = from >> PAGE_CACHE_SHIFT;
 	unsigned offset = from & (PAGE_CACHE_SIZE-1);
-	unsigned length = PAGE_CACHE_ALIGN(from) - from;
 	int err;

 	/* Block boundary? Nothing to do */
@@ -481,9 +484,14 @@ int dax_truncate_page(struct inode *inode, loff_t from, get_block_t get_block)
 		err = dax_get_addr(&bh, &addr, inode->i_blkbits);
 		if (err < 0)
 			return err;
+		/*
+		 * ext4 sometimes asks to zero past the end of a block.  It
+		 * really just wants to zero to the end of the block.
+		 */
+		length = min_t(unsigned, length, PAGE_CACHE_SIZE - offset);
 		memset(addr + offset, 0, length);
 	}

 	return 0;
 }
-EXPORT_SYMBOL_GPL(dax_truncate_page);
+EXPORT_SYMBOL_GPL(dax_zero_page_range);

diff --git a/include/linux/fs.h b/include/linux/fs.h
index e6b48cc..b0078df 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2490,6 +2490,7 @@ extern int nonseekable_open(struct inode * inode, struct file * filp);

 #ifdef CONFIG_FS_DAX
 int dax_clear_blocks(struct inode *, sector_t block, long size);
+int dax_zero_page_range(struct inode *, loff_t from, unsigned len, get_block_t);
 int dax_truncate_page(struct inode *, loff_t from, get_block_t);
 ssize_t dax_do_io(int rw, struct kiocb *, struct inode *, struct iov_iter *,
 		loff_t, get_block_t, dio_iodone_t, int flags);
@@ -2501,7 +2502,8 @@ static inline int dax_clear_blocks(struct inode *i, sector_t blk, long sz)
 	return 0;
 }

-static inline int dax_truncate_page(struct inode *i, loff_t frm, get_block_t gb)
+static inline int dax_zero_page_range(struct inode *inode, loff_t from,
+		unsigned len, get_block_t gb)
 {
 	return 0;
 }
@@ -2514,6 +2516,11 @@ static inline ssize_t dax_do_io(int rw, struct kiocb *iocb,
 }
 #endif

+/* Can't be a function because PAGE_CACHE_SIZE is defined in pagemap.h */
+#define dax_truncate_page(inode, from, get_block)	\
+	dax_zero_page_range(inode, from, PAGE_CACHE_SIZE, get_block)
+
+
 #ifdef CONFIG_BLOCK
 typedef void (dio_submit_t)(int rw, struct bio *bio, struct inode
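The partial-page arithmetic in dax_zero_page_range() can be checked in isolation. The sketch below is a userspace model (my own simplification, with a fixed 4096-byte page standing in for PAGE_CACHE_SIZE) of the offset mask and the ext4 length clamp described in the comment:

```c
#include <assert.h>

#define PAGE_CACHE_SIZE_SIM 4096u

/* Mirror of the offset/length computation in dax_zero_page_range():
 * 'from' is a file offset; 'length' is a requested zeroing length that
 * may run past the end of the page and must be clamped. */
static unsigned zero_extent(unsigned long long from, unsigned length,
			    unsigned *offset_out)
{
	unsigned offset = from & (PAGE_CACHE_SIZE_SIM - 1);
	unsigned max = PAGE_CACHE_SIZE_SIM - offset;

	*offset_out = offset;
	return length < max ? length : max;	/* the min_t() clamp */
}
```

With this in place, dax_truncate_page() really is just dax_zero_page_range() with length = PAGE_CACHE_SIZE: the clamp reduces it to "zero from the offset to the end of the page".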
[PATCH v10 05/21] Introduce IS_DAX(inode)
Use an inode flag to tag inodes which should avoid using the page cache.
Convert ext2 to use it instead of mapping_is_xip().

Signed-off-by: Matthew Wilcox
Reviewed-by: Jan Kara
---
 fs/ext2/inode.c    | 9 ++---
 fs/ext2/xip.h      | 2 --
 include/linux/fs.h | 6 ++
 3 files changed, 12 insertions(+), 5 deletions(-)

diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index 36d35c3..0cb0448 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -731,7 +731,7 @@ static int ext2_get_blocks(struct inode *inode,
 			goto cleanup;
 		}

-		if (ext2_use_xip(inode->i_sb)) {
+		if (IS_DAX(inode)) {
 			/*
 			 * we need to clear the block
 			 */
@@ -1201,7 +1201,7 @@ static int ext2_setsize(struct inode *inode, loff_t newsize)

 	inode_dio_wait(inode);

-	if (mapping_is_xip(inode->i_mapping))
+	if (IS_DAX(inode))
 		error = xip_truncate_page(inode->i_mapping, newsize);
 	else if (test_opt(inode->i_sb, NOBH))
 		error = nobh_truncate_page(inode->i_mapping,
@@ -1273,7 +1273,8 @@ void ext2_set_inode_flags(struct inode *inode)
 {
 	unsigned int flags = EXT2_I(inode)->i_flags;

-	inode->i_flags &= ~(S_SYNC|S_APPEND|S_IMMUTABLE|S_NOATIME|S_DIRSYNC);
+	inode->i_flags &= ~(S_SYNC | S_APPEND | S_IMMUTABLE | S_NOATIME |
+				S_DIRSYNC | S_DAX);
 	if (flags & EXT2_SYNC_FL)
 		inode->i_flags |= S_SYNC;
 	if (flags & EXT2_APPEND_FL)
@@ -1284,6 +1285,8 @@ void ext2_set_inode_flags(struct inode *inode)
 		inode->i_flags |= S_NOATIME;
 	if (flags & EXT2_DIRSYNC_FL)
 		inode->i_flags |= S_DIRSYNC;
+	if (test_opt(inode->i_sb, XIP))
+		inode->i_flags |= S_DAX;
 }

 /* Propagate flags from i_flags to EXT2_I(inode)->i_flags */
diff --git a/fs/ext2/xip.h b/fs/ext2/xip.h
index 18b34d2..29be737 100644
--- a/fs/ext2/xip.h
+++ b/fs/ext2/xip.h
@@ -16,9 +16,7 @@ static inline int ext2_use_xip (struct super_block *sb)
 }
 int ext2_get_xip_mem(struct address_space *, pgoff_t, int,
 				void **, unsigned long *);
-#define mapping_is_xip(map)	unlikely(map->a_ops->get_xip_mem)
 #else
-#define mapping_is_xip(map)			0
 #define ext2_xip_verify_sb(sb)			do { } while (0)
 #define ext2_use_xip(sb)			0
 #define ext2_clear_xip_target(inode, chain)	0
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 9418772..e99e5c4 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1605,6 +1605,7 @@ struct super_operations {
 #define S_IMA		1024	/* Inode has an associated IMA struct */
 #define S_AUTOMOUNT	2048	/* Automount/referral quasi-directory */
 #define S_NOSEC	4096	/* no suid or xattr security attributes */
+#define S_DAX		8192	/* Direct Access, avoiding the page cache */

 /*
  * Note that nosuid etc flags are inode-specific: setting some file-system
@@ -1642,6 +1643,11 @@ struct super_operations {
 #define IS_IMA(inode)		((inode)->i_flags & S_IMA)
 #define IS_AUTOMOUNT(inode)	((inode)->i_flags & S_AUTOMOUNT)
 #define IS_NOSEC(inode)	((inode)->i_flags & S_NOSEC)
+#ifdef CONFIG_FS_XIP
+#define IS_DAX(inode)		((inode)->i_flags & S_DAX)
+#else
+#define IS_DAX(inode)		0
+#endif

 /*
  * Inode state bits.  Protected by inode->i_lock
--
2.0.0
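The flag-propagation pattern in ext2_set_inode_flags() — clear every flag this function manages, then re-derive each one from the filesystem's own flag word — can be modeled in plain C. All constants and names below are illustrative stand-ins, not the kernel's:

```c
#include <assert.h>

/* Kernel-visible inode flags (illustrative values) */
#define S_SYNC_SIM	1u
#define S_NOATIME_SIM	2u
#define S_DAX_SIM	4u

/* On-disk ext2-style flags (illustrative values) */
#define EXT2_SYNC_FL_SIM	1u
#define EXT2_NOATIME_FL_SIM	2u

/* Userspace model of ext2_set_inode_flags(): note that clearing the
 * managed flags first makes the function idempotent and lets a flag be
 * dropped when the on-disk flag or mount option goes away. */
static unsigned set_inode_flags(unsigned i_flags, unsigned fs_flags,
				int mount_opt_xip)
{
	i_flags &= ~(S_SYNC_SIM | S_NOATIME_SIM | S_DAX_SIM);
	if (fs_flags & EXT2_SYNC_FL_SIM)
		i_flags |= S_SYNC_SIM;
	if (fs_flags & EXT2_NOATIME_FL_SIM)
		i_flags |= S_NOATIME_SIM;
	if (mount_opt_xip)	/* mount -o xip => mark the inode DAX */
		i_flags |= S_DAX_SIM;
	return i_flags;
}
```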
[PATCH v10 11/21] Replace XIP documentation with DAX documentation
From: Matthew Wilcox

Based on the original XIP documentation, this documents the current
state of affairs, and includes instructions on how users can enable DAX
if their devices and kernel support it.

Signed-off-by: Matthew Wilcox
Reviewed-by: Randy Dunlap
---
 Documentation/filesystems/dax.txt | 89 +++
 Documentation/filesystems/xip.txt | 71 ---
 2 files changed, 89 insertions(+), 71 deletions(-)
 create mode 100644 Documentation/filesystems/dax.txt
 delete mode 100644 Documentation/filesystems/xip.txt

diff --git a/Documentation/filesystems/dax.txt b/Documentation/filesystems/dax.txt
new file mode 100644
index 000..635adaa
--- /dev/null
+++ b/Documentation/filesystems/dax.txt
@@ -0,0 +1,89 @@
+Direct Access for files
+-----------------------
+
+Motivation
+----------
+
+The page cache is usually used to buffer reads and writes to files.
+It is also used to provide the pages which are mapped into userspace
+by a call to mmap.
+
+For block devices that are memory-like, the page cache pages would be
+unnecessary copies of the original storage.  The DAX code removes the
+extra copy by performing reads and writes directly to the storage device.
+For file mappings, the storage device is mapped directly into userspace.
+
+
+Usage
+-----
+
+If you have a block device which supports DAX, you can make a filesystem
+on it as usual.  When mounting it, use the -o dax option manually
+or add 'dax' to the options in /etc/fstab.
+
+
+Implementation Tips for Block Driver Writers
+--------------------------------------------
+
+To support DAX in your block driver, implement the 'direct_access'
+block device operation.  It is used to translate the sector number
+(expressed in units of 512-byte sectors) to a page frame number (pfn)
+that identifies the physical page for the memory.  It also returns a
+kernel virtual address that can be used to access the memory.
+
+The direct_access method takes a 'size' parameter that indicates the
+number of bytes being requested.  The function should return the number
+of bytes that can be contiguously accessed at that offset.  It may also
+return a negative errno if an error occurs.
+
+In order to support this method, the storage must be byte-accessible by
+the CPU at all times.  If your device uses paging techniques to expose
+a large amount of memory through a smaller window, then you cannot
+implement direct_access.  Equally, if your device can occasionally
+stall the CPU for an extended period, you should also not attempt to
+implement direct_access.
+
+These block devices may be used for inspiration:
+- axonram: Axon DDR2 device driver
+- brd: RAM backed block device driver
+- dcssblk: s390 dcss block device driver
+
+
+Implementation Tips for Filesystem Writers
+------------------------------------------
+
+Filesystem support consists of
+- adding support to mark inodes as being DAX by setting the S_DAX flag in
+  i_flags
+- implementing the direct_IO address space operation, and calling
+  dax_do_io() instead of blockdev_direct_IO() if S_DAX is set
+- implementing an mmap file operation for DAX files which sets the
+  VM_MIXEDMAP flag on the VMA, and setting the vm_ops to include handlers
+  for fault and page_mkwrite (which should probably call dax_fault() and
+  dax_mkwrite(), passing the appropriate get_block() callback)
+- calling dax_truncate_page() instead of block_truncate_page() for DAX files
+- ensuring that there is sufficient locking between reads, writes,
+  truncates and page faults
+
+The get_block() callback passed to the DAX functions may return
+uninitialised extents.  If it does, it must ensure that simultaneous
+calls to get_block() (for example by a page-fault racing with a read()
+or a write()) work correctly.
+
+These filesystems may be used for inspiration:
+- ext2: the second extended filesystem, see Documentation/filesystems/ext2.txt
+
+
+Shortcomings
+------------
+
+Even if the kernel or its modules are stored on a filesystem that supports
+DAX on a block device that supports DAX, they will still be copied into RAM.
+
+Calling get_user_pages() on a range of user memory that has been mmaped
+from a DAX file will fail as there are no 'struct page' to describe
+those pages.  This problem is being worked on.  That means that O_DIRECT
+reads/writes to those memory ranges from a non-DAX file will fail (note
+that O_DIRECT reads/writes _of a DAX file_ do work, it is the memory
+that is being accessed that is key here).  Other things that will not
+work include RDMA, sendfile() and splice().
diff --git a/Documentation/filesystems/xip.txt b/Documentation/filesystems/xip.txt
deleted file mode 100644
index b774729..000
--- a/Documentation/filesystems/xip.txt
+++ /dev/null
@@ -1,71 +0,0 @@
-Execute-in-place for file mappings
-----------------------------------
-
-Motivation
-----------
-File mappings are performed by mapping page cache pages to userspace. In
-addition, read type file
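The direct_access contract described above — sector in, contiguously accessible byte count out, plus a usable address and pfn — can be sketched for a brd-style RAM-backed device. This is a userspace model under my own assumptions (one contiguous buffer, a fake 4KB page shift, ERANGE spelled as -34), not the kernel's block-layer API:

```c
#include <stddef.h>
#include <stdint.h>

#define SECTOR_SHIFT	9	/* direct_access sectors are 512 bytes */
#define PAGE_SHIFT_SIM	12

/* Model of a brd-style direct_access: the whole device is one
 * contiguous buffer, so any in-range request is accessible all the way
 * to the end of the device.  Returns the contiguously accessible byte
 * count, or a negative errno-style value. */
static long direct_access_sim(char *base, size_t dev_bytes,
			      unsigned long long sector,
			      void **kaddr, unsigned long *pfn)
{
	unsigned long long off = sector << SECTOR_SHIFT;

	if (off >= dev_bytes)
		return -34;	/* past end of device (ERANGE-like) */
	*kaddr = base + off;
	*pfn = (unsigned long)((uintptr_t)(base + off) >> PAGE_SHIFT_SIM);
	return (long)(dev_bytes - off);	/* contiguous to end of device */
}
```

A windowed or paging device could not be written this way: there would be no single address valid at all times, which is exactly why the document rules such devices out.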
[PATCH v10 00/21] Support ext4 on NV-DIMMs
One of the primary uses for NV-DIMMs is to expose them as a block device
and use a filesystem to store files on the NV-DIMM.  While that works, it
currently wastes memory and CPU time buffering the files in the page
cache.  We have support in ext2 for bypassing the page cache, but it has
some races which are unfixable in the current design.  This series of
patches rewrites the underlying support, and adds support for direct
access to ext4.

Note that patch 6/21 has been included in
https://git.kernel.org/cgit/linux/kernel/git/viro/vfs.git/log/?h=for-next-candidate

This iteration of the patchset rebases to 3.17-rc2, changes the page
fault locking, fixes a couple of bugs and makes a few other minor
changes.

 - Move the calculation of the maximum size available at the requested
   location from the ->direct_access implementations to
   bdev_direct_access()
 - Fix a comment typo (Ross Zwisler)
 - Check that the requested length is positive in bdev_direct_access().
   If it is not, assume that it's an errno, and just return it.
 - Fix some whitespace issues flagged by checkpatch
 - Added the Acked-by responses from Kirill that I forgot in the last
   round
 - Added myself to MAINTAINERS for DAX
 - Fixed compilation with !CONFIG_DAX (Vishal Verma)
 - Revert the locking in the page fault handler back to an earlier
   version.  If we hit the race that we were trying to protect against,
   we will leave blocks allocated past the end of the file.  They will
   be removed on file removal, the next truncate, or fsck.
Matthew Wilcox (20):
  axonram: Fix bug in direct_access
  Change direct_access calling convention
  Fix XIP fault vs truncate race
  Allow page fault handlers to perform the COW
  Introduce IS_DAX(inode)
  Add copy_to_iter(), copy_from_iter() and iov_iter_zero()
  Replace XIP read and write with DAX I/O
  Replace ext2_clear_xip_target with dax_clear_blocks
  Replace the XIP page fault handler with the DAX page fault handler
  Replace xip_truncate_page with dax_truncate_page
  Replace XIP documentation with DAX documentation
  Remove get_xip_mem
  ext2: Remove ext2_xip_verify_sb()
  ext2: Remove ext2_use_xip
  ext2: Remove xip.c and xip.h
  Remove CONFIG_EXT2_FS_XIP and rename CONFIG_FS_XIP to CONFIG_FS_DAX
  ext2: Remove ext2_aops_xip
  Get rid of most mentions of XIP in ext2
  xip: Add xip_zero_page_range
  brd: Rename XIP to DAX

Ross Zwisler (1):
  ext4: Add DAX functionality

 Documentation/filesystems/Locking  |   3 -
 Documentation/filesystems/dax.txt  |  91 +++
 Documentation/filesystems/ext4.txt |   2 +
 Documentation/filesystems/xip.txt  |  68 -
 MAINTAINERS                        |   6 +
 arch/powerpc/sysdev/axonram.c      |  19 +-
 drivers/block/Kconfig              |  13 +-
 drivers/block/brd.c                |  26 +-
 drivers/s390/block/dcssblk.c       |  21 +-
 fs/Kconfig                         |  21 +-
 fs/Makefile                        |   1 +
 fs/block_dev.c                     |  40 +++
 fs/dax.c                           | 497 +
 fs/exofs/inode.c                   |   1 -
 fs/ext2/Kconfig                    |  11 -
 fs/ext2/Makefile                   |   1 -
 fs/ext2/ext2.h                     |  10 +-
 fs/ext2/file.c                     |  45 +++-
 fs/ext2/inode.c                    |  38 +--
 fs/ext2/namei.c                    |  13 +-
 fs/ext2/super.c                    |  53 ++--
 fs/ext2/xip.c                      |  91 ---
 fs/ext2/xip.h                      |  26 --
 fs/ext4/ext4.h                     |   6 +
 fs/ext4/file.c                     |  49 +++-
 fs/ext4/indirect.c                 |  18 +-
 fs/ext4/inode.c                    |  51 ++--
 fs/ext4/namei.c                    |  10 +-
 fs/ext4/super.c                    |  39 ++-
 fs/open.c                          |   5 +-
 include/linux/blkdev.h             |   6 +-
 include/linux/fs.h                 |  49 +++-
 include/linux/mm.h                 |   1 +
 include/linux/uio.h                |   3 +
 mm/Makefile                        |   1 -
 mm/fadvise.c                       |   6 +-
 mm/filemap.c                       |   6 +-
 mm/filemap_xip.c                   | 483 ---
 mm/iov_iter.c                      | 237 --
 mm/madvise.c                       |   2 +-
 mm/memory.c                        |  33 ++-
 41 files changed, 1229 insertions(+), 873 deletions(-)
 create mode 100644 Documentation/filesystems/dax.txt
 delete mode 100644 Documentation/filesystems/xip.txt
 create mode 100644 fs/dax.c
 delete mode 100644 fs/ext2/xip.c
 delete mode 100644 fs/ext2/xip.h
 delete mode 100644 mm/filemap_xip.c

--
2.0.0
[PATCH v10 06/21] Add copy_to_iter(), copy_from_iter() and iov_iter_zero()
From: Matthew Wilcox

For DAX, we want to be able to copy between iovecs and kernel addresses
that don't necessarily have a struct page.  This is a fairly simple
rearrangement for bvec iters to kmap the pages outside and pass them in,
but for user iovecs it gets more complicated because we might try various
different ways to kmap the memory.  Duplicating the existing logic works
out best in this case.

We need to be able to write zeroes to an iovec for reads from unwritten
ranges in a file.  This is performed by the new iov_iter_zero() function,
again patterned after the existing code that handles iovec iterators.

Signed-off-by: Matthew Wilcox
---
 include/linux/uio.h |   3 +
 mm/iov_iter.c       | 237 
 2 files changed, 226 insertions(+), 14 deletions(-)

diff --git a/include/linux/uio.h b/include/linux/uio.h
index 48d64e6..1863ddd 100644
--- a/include/linux/uio.h
+++ b/include/linux/uio.h
@@ -80,6 +80,9 @@ size_t copy_page_to_iter(struct page *page, size_t offset, size_t bytes,
 			 struct iov_iter *i);
 size_t copy_page_from_iter(struct page *page, size_t offset, size_t bytes,
 			 struct iov_iter *i);
+size_t copy_to_iter(void *addr, size_t bytes, struct iov_iter *i);
+size_t copy_from_iter(void *addr, size_t bytes, struct iov_iter *i);
+size_t iov_iter_zero(size_t bytes, struct iov_iter *);
 unsigned long iov_iter_alignment(const struct iov_iter *i);
 void iov_iter_init(struct iov_iter *i, int direction, const struct iovec *iov,
 			unsigned long nr_segs, size_t count);
diff --git a/mm/iov_iter.c b/mm/iov_iter.c
index ab88dc0..d481fd8 100644
--- a/mm/iov_iter.c
+++ b/mm/iov_iter.c
@@ -4,6 +4,96 @@
 #include
 #include

+static size_t copy_to_iter_iovec(void *from, size_t bytes, struct iov_iter *i)
+{
+	size_t skip, copy, left, wanted;
+	const struct iovec *iov;
+	char __user *buf;
+
+	if (unlikely(bytes > i->count))
+		bytes = i->count;
+
+	if (unlikely(!bytes))
+		return 0;
+
+	wanted = bytes;
+	iov = i->iov;
+	skip = i->iov_offset;
+	buf = iov->iov_base + skip;
+	copy = min(bytes, iov->iov_len - skip);
+
+	left = __copy_to_user(buf, from, copy);
+	copy -= left;
+	skip += copy;
+	from += copy;
+	bytes -= copy;
+	while (unlikely(!left && bytes)) {
+		iov++;
+		buf = iov->iov_base;
+		copy = min(bytes, iov->iov_len);
+		left = __copy_to_user(buf, from, copy);
+		copy -= left;
+		skip = copy;
+		from += copy;
+		bytes -= copy;
+	}
+
+	if (skip == iov->iov_len) {
+		iov++;
+		skip = 0;
+	}
+	i->count -= wanted - bytes;
+	i->nr_segs -= iov - i->iov;
+	i->iov = iov;
+	i->iov_offset = skip;
+	return wanted - bytes;
+}
+
+static size_t copy_from_iter_iovec(void *to, size_t bytes, struct iov_iter *i)
+{
+	size_t skip, copy, left, wanted;
+	const struct iovec *iov;
+	char __user *buf;
+
+	if (unlikely(bytes > i->count))
+		bytes = i->count;
+
+	if (unlikely(!bytes))
+		return 0;
+
+	wanted = bytes;
+	iov = i->iov;
+	skip = i->iov_offset;
+	buf = iov->iov_base + skip;
+	copy = min(bytes, iov->iov_len - skip);
+
+	left = __copy_from_user(to, buf, copy);
+	copy -= left;
+	skip += copy;
+	to += copy;
+	bytes -= copy;
+	while (unlikely(!left && bytes)) {
+		iov++;
+		buf = iov->iov_base;
+		copy = min(bytes, iov->iov_len);
+		left = __copy_from_user(to, buf, copy);
+		copy -= left;
+		skip = copy;
+		to += copy;
+		bytes -= copy;
+	}
+
+	if (skip == iov->iov_len) {
+		iov++;
+		skip = 0;
+	}
+	i->count -= wanted - bytes;
+	i->nr_segs -= iov - i->iov;
+	i->iov = iov;
+	i->iov_offset = skip;
+	return wanted - bytes;
+}
+
 static size_t copy_page_to_iter_iovec(struct page *page, size_t offset, size_t bytes,
 			 struct iov_iter *i)
 {
@@ -166,6 +256,50 @@ done:
 	return wanted - bytes;
 }

+static size_t zero_iovec(size_t bytes, struct iov_iter *i)
+{
+	size_t skip, copy, left, wanted;
+	const struct iovec *iov;
+	char __user *buf;
+
+	if (unlikely(bytes > i->count))
+		bytes = i->count;
+
+	if (unlikely(!bytes))
+		return 0;
+
+	wanted = bytes;
+	iov = i->iov;
+	skip = i->iov_offset;
+	buf = iov->iov_base + skip;
+	copy = min(bytes, iov->iov_len - skip);
+
+	left = __clear_user(buf, copy);
+	copy -= left;
+	skip += copy;
+	bytes -= copy;
+
+	while (unlikely(!left &&
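The skip/copy/left walk that these functions all repeat can be modeled in userspace. In the sketch below (my own simplification), memcpy stands in for __copy_to_user(), which returns the number of bytes it could *not* copy — so 'left' is always zero here and the walk simply runs until the data or the segments are exhausted:

```c
#include <assert.h>
#include <string.h>
#include <sys/uio.h>	/* struct iovec */

/* Userspace model of copy_to_iter_iovec(): walk an iovec array,
 * copying from a flat source buffer, honouring an initial offset into
 * the first segment.  Returns the number of bytes actually copied. */
static size_t copy_to_iovec_sim(const char *from, size_t bytes,
				const struct iovec *iov, unsigned long nr_segs,
				size_t iov_offset)
{
	size_t wanted = bytes;
	size_t skip = iov_offset;

	while (bytes && nr_segs) {
		size_t avail = iov->iov_len - skip;
		size_t copy = bytes < avail ? bytes : avail;

		memcpy((char *)iov->iov_base + skip, from, copy);
		from += copy;
		bytes -= copy;
		skip += copy;
		if (skip == iov->iov_len) {	/* segment exhausted */
			iov++;
			nr_segs--;
			skip = 0;
		}
	}
	return wanted - bytes;
}
```

The kernel versions additionally stop early when __copy_to_user() faults (left != 0) and write the final iov/skip position back into the iov_iter so the next call resumes where this one stopped.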
[PATCH v10 13/21] ext2: Remove ext2_xip_verify_sb()
Jan Kara pointed out that calling ext2_xip_verify_sb() in ext2_remount()
doesn't make sense, since changing the XIP option on remount isn't
allowed.  It also doesn't make sense to re-check whether the blocksize
is supported, since it can't change between mounts.  Replace the call to
ext2_xip_verify_sb() in ext2_fill_super() with the equivalent check and
delete the definition.

Signed-off-by: Matthew Wilcox
---
 fs/ext2/super.c | 33 -
 fs/ext2/xip.c   | 12 
 fs/ext2/xip.h   |  2 --
 3 files changed, 12 insertions(+), 35 deletions(-)

diff --git a/fs/ext2/super.c b/fs/ext2/super.c
index b88edc0..d862031 100644
--- a/fs/ext2/super.c
+++ b/fs/ext2/super.c
@@ -868,9 +868,6 @@ static int ext2_fill_super(struct super_block *sb, void *data, int silent)
 		((EXT2_SB(sb)->s_mount_opt & EXT2_MOUNT_POSIX_ACL) ?
 		 MS_POSIXACL : 0);

-	ext2_xip_verify_sb(sb); /* see if bdev supports xip, unset
-				    EXT2_MOUNT_XIP if not */
-
 	if (le32_to_cpu(es->s_rev_level) == EXT2_GOOD_OLD_REV &&
 	    (EXT2_HAS_COMPAT_FEATURE(sb, ~0U) ||
 	     EXT2_HAS_RO_COMPAT_FEATURE(sb, ~0U) ||
@@ -900,11 +897,17 @@ static int ext2_fill_super(struct super_block *sb, void *data, int silent)

 	blocksize = BLOCK_SIZE << le32_to_cpu(sbi->s_es->s_log_block_size);

-	if (ext2_use_xip(sb) && blocksize != PAGE_SIZE) {
-		if (!silent)
+	if (sbi->s_mount_opt & EXT2_MOUNT_XIP) {
+		if (blocksize != PAGE_SIZE) {
 			ext2_msg(sb, KERN_ERR,
-				"error: unsupported blocksize for xip");
-		goto failed_mount;
+					"error: unsupported blocksize for xip");
+			goto failed_mount;
+		}
+		if (!sb->s_bdev->bd_disk->fops->direct_access) {
+			ext2_msg(sb, KERN_ERR,
+					"error: device does not support xip");
+			goto failed_mount;
+		}
 	}

 	/* If the blocksize doesn't match, re-read the thing.. */
@@ -1249,7 +1252,6 @@ static int ext2_remount (struct super_block * sb, int * flags, char * data)
 {
 	struct ext2_sb_info * sbi = EXT2_SB(sb);
 	struct ext2_super_block * es;
-	unsigned long old_mount_opt = sbi->s_mount_opt;
 	struct ext2_mount_options old_opts;
 	unsigned long old_sb_flags;
 	int err;
@@ -1274,22 +1276,11 @@ static int ext2_remount (struct super_block * sb, int * flags, char * data)
 	sb->s_flags = (sb->s_flags & ~MS_POSIXACL) |
 		((sbi->s_mount_opt & EXT2_MOUNT_POSIX_ACL) ? MS_POSIXACL : 0);

-	ext2_xip_verify_sb(sb); /* see if bdev supports xip, unset
-				    EXT2_MOUNT_XIP if not */
-
-	if ((ext2_use_xip(sb)) && (sb->s_blocksize != PAGE_SIZE)) {
-		ext2_msg(sb, KERN_WARNING,
-			"warning: unsupported blocksize for xip");
-		err = -EINVAL;
-		goto restore_opts;
-	}
-
 	es = sbi->s_es;
-	if ((sbi->s_mount_opt ^ old_mount_opt) & EXT2_MOUNT_XIP) {
+	if ((sbi->s_mount_opt ^ old_opts.s_mount_opt) & EXT2_MOUNT_XIP) {
 		ext2_msg(sb, KERN_WARNING, "warning: refusing change of "
 			 "xip flag with busy inodes while remounting");
-		sbi->s_mount_opt &= ~EXT2_MOUNT_XIP;
-		sbi->s_mount_opt |= old_mount_opt & EXT2_MOUNT_XIP;
+		sbi->s_mount_opt ^= EXT2_MOUNT_XIP;
 	}
 	if ((*flags & MS_RDONLY) == (sb->s_flags & MS_RDONLY)) {
 		spin_unlock(&sbi->s_lock);
diff --git a/fs/ext2/xip.c b/fs/ext2/xip.c
index 132d4da..66ca113 100644
--- a/fs/ext2/xip.c
+++ b/fs/ext2/xip.c
@@ -13,15 +13,3 @@
 #include "ext2.h"
 #include "xip.h"

-void ext2_xip_verify_sb(struct super_block *sb)
-{
-	struct ext2_sb_info *sbi = EXT2_SB(sb);
-
-	if ((sbi->s_mount_opt & EXT2_MOUNT_XIP) &&
-	    !sb->s_bdev->bd_disk->fops->direct_access) {
-		sbi->s_mount_opt &= (~EXT2_MOUNT_XIP);
-		ext2_msg(sb, KERN_WARNING,
-			 "warning: ignoring xip option - "
-			 "not supported by bdev");
-	}
-}
diff --git a/fs/ext2/xip.h b/fs/ext2/xip.h
index e7b9f0a..87eeb04 100644
--- a/fs/ext2/xip.h
+++ b/fs/ext2/xip.h
@@ -6,13 +6,11 @@
  */

 #ifdef CONFIG_EXT2_FS_XIP
-extern void ext2_xip_verify_sb (struct super_block *);
 static inline int ext2_use_xip (struct super_block *sb)
 {
 	struct ext2_sb_info *sbi = EXT2_SB(sb);
 	return (sbi->s_mount_opt & EXT2_MOUNT_XIP);
 }
 #else
-#define ext2_xip_verify_sb(sb)			do { } while (0)
 #define ext2_use_xip(sb)			0
 #endif
--
2.0.0
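The remount hunk reverts a disallowed XIP flag change with a single XOR: if the new and old mount options differ in the XIP bit, flipping that bit in the new options restores the old value. A small userspace model (constant value is illustrative):

```c
#include <assert.h>

#define EXT2_MOUNT_XIP_SIM 0x10u

/* Model of the remount fixup: refuse an XIP flag change by toggling
 * the bit back, leaving every other mount option untouched. */
static unsigned revert_xip_change(unsigned new_opt, unsigned old_opt)
{
	if ((new_opt ^ old_opt) & EXT2_MOUNT_XIP_SIM)
		new_opt ^= EXT2_MOUNT_XIP_SIM;
	return new_opt;
}
```

This replaces the original two-statement clear-then-restore sequence with one operation and, as the patch shows, lets old_mount_opt disappear in favour of the already-saved old_opts.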
[PATCH V2] regulator: DA9211 : support device tree
This is a patch for supporting device tree of DA9211/DA9213.

Signed-off-by: James Ban
---
This patch is relative to linux-next repository tag next-20140826.

Changes in V2:
- defined what the valid regulators for the device are and where their
  configuration should be specified in the device tree.

 .../devicetree/bindings/regulator/da9211.txt | 63 +++
 drivers/regulator/da9211-regulator.c         | 85 ++--
 include/linux/regulator/da9211.h             |  2 +-
 3 files changed, 142 insertions(+), 8 deletions(-)
 create mode 100644 Documentation/devicetree/bindings/regulator/da9211.txt

diff --git a/Documentation/devicetree/bindings/regulator/da9211.txt b/Documentation/devicetree/bindings/regulator/da9211.txt
new file mode 100644
index 000..240019a
--- /dev/null
+++ b/Documentation/devicetree/bindings/regulator/da9211.txt
@@ -0,0 +1,63 @@
+* Dialog Semiconductor DA9211/DA9213 Voltage Regulator
+
+Required properties:
+- compatible: "dlg,da9211" or "dlg,da9213".
+- reg: I2C slave address, usually 0x68.
+- interrupts: the interrupt outputs of the controller
+- regulators: A node that houses a sub-node for each regulator within the
+  device. Each sub-node is identified using the node's name, with valid
+  values listed below. The content of each sub-node is defined by the
+  standard binding for regulators; see regulator.txt.
+  BUCKA and BUCKB.
+
+Optional properties:
+- Any optional property defined in regulator.txt
+
+Example 1) DA9211
+
+	pmic: da9211@68 {
+		compatible = "dlg,da9211";
+		reg = <0x68>;
+		interrupts = <3 27>;
+
+		regulators {
+			BUCKA {
+				regulator-name = "VBUCKA";
+				regulator-min-microvolt = < 300000>;
+				regulator-max-microvolt = <1570000>;
+				regulator-min-microamp	= <2000000>;
+				regulator-max-microamp	= <5000000>;
+			};
+			BUCKB {
+				regulator-name = "VBUCKB";
+				regulator-min-microvolt = < 300000>;
+				regulator-max-microvolt = <1570000>;
+				regulator-min-microamp	= <2000000>;
+				regulator-max-microamp	= <5000000>;
+			};
+		};
+	};
+
+Example 2) DA9213
+
+	pmic: da9213@68 {
+		compatible = "dlg,da9213";
+		reg = <0x68>;
+		interrupts = <3 27>;
+
+		regulators {
+			BUCKA {
+				regulator-name = "VBUCKA";
+				regulator-min-microvolt = < 300000>;
+				regulator-max-microvolt = <1570000>;
+				regulator-min-microamp	= <3000000>;
+				regulator-max-microamp	= <6000000>;
+			};
+			BUCKB {
+				regulator-name = "VBUCKB";
+				regulator-min-microvolt = < 300000>;
+				regulator-max-microvolt = <1570000>;
+				regulator-min-microamp	= <3000000>;
+				regulator-max-microamp	= <6000000>;
+			};
+		};
+	};
diff --git a/drivers/regulator/da9211-regulator.c b/drivers/regulator/da9211-regulator.c
index a26f1d2..5aabbac 100644
--- a/drivers/regulator/da9211-regulator.c
+++ b/drivers/regulator/da9211-regulator.c
@@ -24,6 +24,7 @@
 #include
 #include
 #include
+#include
 #include
 #include "da9211-regulator.h"
@@ -236,6 +237,59 @@ static struct regulator_desc da9211_regulators[] = {
 	DA9211_BUCK(BUCKB),
 };

+#ifdef CONFIG_OF
+static struct of_regulator_match da9211_matches[] = {
+	[DA9211_ID_BUCKA] = { .name = "BUCKA" },
+	[DA9211_ID_BUCKB] = { .name = "BUCKB" },
+	};
+
+static struct da9211_pdata *da9211_parse_regulators_dt(
+		struct device *dev)
+{
+	struct da9211_pdata *pdata;
+	struct device_node *node;
+	int i, num, n;
+
+	node = of_get_child_by_name(dev->of_node, "regulators");
+	if (!node) {
+		dev_err(dev, "regulators node not found\n");
+		return ERR_PTR(-ENODEV);
+	}
+
+	num = of_regulator_match(dev, node, da9211_matches,
+				 ARRAY_SIZE(da9211_matches));
+	of_node_put(node);
+	if (num < 0) {
+		dev_err(dev, "Failed to match reg
[PATCH V2 3/6] arm64: LLVMLinux: Calculate current_thread_info from current_stack_pointer
From: Behan Webster Use the global current_stack_pointer to get the value of the stack pointer. This change supports being able to compile the kernel with both gcc and clang. Signed-off-by: Behan Webster Signed-off-by: Mark Charlebois Reviewed-by: Jan-Simon Möller Reviewed-by: Olof Johansson Acked-by: Will Deacon --- arch/arm64/include/asm/thread_info.h | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/arch/arm64/include/asm/thread_info.h b/arch/arm64/include/asm/thread_info.h index 356e037..459bf8e 100644 --- a/arch/arm64/include/asm/thread_info.h +++ b/arch/arm64/include/asm/thread_info.h @@ -80,8 +80,8 @@ static inline struct thread_info *current_thread_info(void) __attribute_const__; static inline struct thread_info *current_thread_info(void) { - register unsigned long sp asm ("sp"); - return (struct thread_info *)(sp & ~(THREAD_SIZE - 1)); + return (struct thread_info *) + (current_stack_pointer & ~(THREAD_SIZE - 1)); } #define thread_saved_pc(tsk) \ -- 1.9.1 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH V2 2/6] arm64: LLVMLinux: Use current_stack_pointer in save_stack_trace_tsk
From: Behan Webster Use the global current_stack_pointer to get the value of the stack pointer. This change supports being able to compile the kernel with both gcc and clang. Signed-off-by: Behan Webster Signed-off-by: Mark Charlebois Reviewed-by: Jan-Simon Möller Reviewed-by: Olof Johansson Acked-by: Will Deacon --- arch/arm64/kernel/stacktrace.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/arch/arm64/kernel/stacktrace.c b/arch/arm64/kernel/stacktrace.c index 55437ba..407991b 100644 --- a/arch/arm64/kernel/stacktrace.c +++ b/arch/arm64/kernel/stacktrace.c @@ -111,10 +111,9 @@ void save_stack_trace_tsk(struct task_struct *tsk, struct stack_trace *trace) frame.sp = thread_saved_sp(tsk); frame.pc = thread_saved_pc(tsk); } else { - register unsigned long current_sp asm("sp"); data.no_sched_functions = 0; frame.fp = (unsigned long)__builtin_frame_address(0); - frame.sp = current_sp; + frame.sp = current_stack_pointer; frame.pc = (unsigned long)save_stack_trace_tsk; } -- 1.9.1 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH V2 5/6] arm64: LLVMLinux: Use global stack register variable for aarch64
From: Mark Charlebois To support both Clang and GCC, use the global stack register variable vs a local register variable. Author: Mark Charlebois Signed-off-by: Mark Charlebois Signed-off-by: Behan Webster --- arch/arm64/include/asm/percpu.h | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/arch/arm64/include/asm/percpu.h b/arch/arm64/include/asm/percpu.h index 453a179..5279e57 100644 --- a/arch/arm64/include/asm/percpu.h +++ b/arch/arm64/include/asm/percpu.h @@ -26,13 +26,13 @@ static inline void set_my_cpu_offset(unsigned long off) static inline unsigned long __my_cpu_offset(void) { unsigned long off; - register unsigned long *sp asm ("sp"); /* * We want to allow caching the value, so avoid using volatile and * instead use a fake stack read to hazard against barrier(). */ - asm("mrs %0, tpidr_el1" : "=r" (off) : "Q" (*sp)); + asm("mrs %0, tpidr_el1" : "=r" (off) : + "Q" (*(const unsigned long *)current_stack_pointer)); return off; } -- 1.9.1 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH V2 4/6] arm64: LLVMLinux: Use current_stack_pointer in kernel/traps.c
From: Behan Webster Use the global current_stack_pointer to get the value of the stack pointer. This change supports being able to compile the kernel with both gcc and clang. Signed-off-by: Behan Webster Signed-off-by: Mark Charlebois Reviewed-by: Olof Johansson Acked-by: Will Deacon --- arch/arm64/kernel/traps.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/arch/arm64/kernel/traps.c b/arch/arm64/kernel/traps.c index 02cd3f0..de1b085 100644 --- a/arch/arm64/kernel/traps.c +++ b/arch/arm64/kernel/traps.c @@ -132,7 +132,6 @@ static void dump_instr(const char *lvl, struct pt_regs *regs) static void dump_backtrace(struct pt_regs *regs, struct task_struct *tsk) { struct stackframe frame; - const register unsigned long current_sp asm ("sp"); pr_debug("%s(regs = %p tsk = %p)\n", __func__, regs, tsk); @@ -145,7 +144,7 @@ static void dump_backtrace(struct pt_regs *regs, struct task_struct *tsk) frame.pc = regs->pc; } else if (tsk == current) { frame.fp = (unsigned long)__builtin_frame_address(0); - frame.sp = current_sp; + frame.sp = current_stack_pointer; frame.pc = (unsigned long)dump_backtrace; } else { /* -- 1.9.1 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH V2 6/6] arm64: LLVMLinux: Use global stack pointer in return_address()
From: Behan Webster The global register current_stack_pointer holds the current stack pointer. This change supports being able to compile the kernel with both gcc and clang. Author: Mark Charlebois Signed-off-by: Mark Charlebois Signed-off-by: Behan Webster --- arch/arm64/kernel/return_address.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/arch/arm64/kernel/return_address.c b/arch/arm64/kernel/return_address.c index 89102a6..6c4fd28 100644 --- a/arch/arm64/kernel/return_address.c +++ b/arch/arm64/kernel/return_address.c @@ -36,13 +36,12 @@ void *return_address(unsigned int level) { struct return_address_data data; struct stackframe frame; - register unsigned long current_sp asm ("sp"); data.level = level + 2; data.addr = NULL; frame.fp = (unsigned long)__builtin_frame_address(0); - frame.sp = current_sp; + frame.sp = current_stack_pointer; frame.pc = (unsigned long)return_address; /* dummy */ walk_stackframe(&frame, save_return_addr, &data); -- 1.9.1 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH V2 0/6] LLVMLinux: Patches to enable the kernel to be compiled with clang/LLVM
From: Behan Webster This patch set moves from using locally defined named registers to access the stack pointer to using a globally defined named register. This allows the code to work both with gcc and clang. The LLVMLinux project aims to fully build the Linux kernel using both gcc and clang (the C front end for the LLVM compiler infrastructure project). Behan Webster (5): arm64: LLVMLinux: Add current_stack_pointer() for arm64 arm64: LLVMLinux: Use current_stack_pointer in save_stack_trace_tsk arm64: LLVMLinux: Calculate current_thread_info from current_stack_pointer arm64: LLVMLinux: Use current_stack_pointer in kernel/traps.c arm64: LLVMLinux: Use global stack pointer in return_address() Mark Charlebois (1): arm64: LLVMLinux: Use global stack register variable for aarch64 arch/arm64/include/asm/percpu.h | 4 ++-- arch/arm64/include/asm/thread_info.h | 9 +++-- arch/arm64/kernel/return_address.c | 3 +-- arch/arm64/kernel/stacktrace.c | 3 +-- arch/arm64/kernel/traps.c| 3 +-- 5 files changed, 12 insertions(+), 10 deletions(-) -- 1.9.1 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH V2 1/6] arm64: LLVMLinux: Add current_stack_pointer() for arm64
From: Behan Webster Define a global named register for current_stack_pointer. The use of this new variable guarantees that both gcc and clang can access this register in C code. Signed-off-by: Behan Webster Reviewed-by: Jan-Simon Möller Reviewed-by: Mark Charlebois Reviewed-by: Olof Johansson Acked-by: Will Deacon --- arch/arm64/include/asm/thread_info.h | 5 + 1 file changed, 5 insertions(+) diff --git a/arch/arm64/include/asm/thread_info.h b/arch/arm64/include/asm/thread_info.h index 45108d8..356e037 100644 --- a/arch/arm64/include/asm/thread_info.h +++ b/arch/arm64/include/asm/thread_info.h @@ -69,6 +69,11 @@ struct thread_info { #define init_stack (init_thread_union.stack) /* + * how to get the current stack pointer from C + */ +register unsigned long current_stack_pointer asm ("sp"); + +/* * how to get the thread information struct from C */ static inline struct thread_info *current_thread_info(void) __attribute_const__; -- 1.9.1 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 0/4] LLVMLinux: Patches to enable the kernel to be compiled with clang/LLVM
On 08/26/14 07:16, Will Deacon wrote: Hi Behan, On Fri, Aug 01, 2014 at 05:11:59AM +0100, Behan Webster wrote: On 07/31/14 03:33, Will Deacon wrote: On Thu, Jul 31, 2014 at 12:57:25AM +0100, beh...@converseincode.com wrote: From: Behan Webster This patch set moves from using locally defined named registers to access the stack pointer to using a globally defined named register. This allows the code to work both with gcc and clang. The LLVMLinux project aims to fully build the Linux kernel using both gcc and clang (the C front end for the LLVM compiler infrastructure project). Behan Webster (4): arm64: LLVMLinux: Add current_stack_pointer() for arm64 arm64: LLVMLinux: Use current_stack_pointer in save_stack_trace_tsk arm64: LLVMLinux: Calculate current_thread_info from current_stack_pointer arm64: LLVMLinux: Use current_stack_pointer in kernel/traps.c Once Andreas's comments have been addressed: Acked-by: Will Deacon Please can you send a new series after the merge window? Pity. I was hoping to get it in this merge window. However, will resubmit for 3.18. Any chance of a v2 for this series, please? If you address the comments pending for v1, I think it's good to merge. Sure thing. 2 more named register patches added. Look for them at the end of the new patch series. I kept missing you in Chicago. I was hoping to say "hi". Behan -- Behan Webster beh...@converseincode.com -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
linux-next: build failure after merge of the percpu tree
Hi all, After merging the percpu tree, today's linux-next build (powerpc ppc64_defconfig) failed like this: In file included from arch/powerpc/include/asm/xics.h:9:0, from arch/powerpc/kernel/asm-offsets.c:47: include/linux/interrupt.h:372:0: warning: "set_softirq_pending" redefined #define set_softirq_pending(x) (local_softirq_pending() = (x)) ^ In file included from include/linux/hardirq.h:8:0, from include/linux/memcontrol.h:24, from include/linux/swap.h:8, from include/linux/suspend.h:4, from arch/powerpc/kernel/asm-offsets.c:24: arch/powerpc/include/asm/hardirq.h:25:0: note: this is the location of the previous definition #define set_softirq_pending(x) __this_cpu_write(irq_stat._softirq_pending, (x)) ^ I got lots (and lots :-() of these and some were considered errors (powerpc is built with -Werror in arch/powerpc). Caused by commit 5828f666c069 ("powerpc: Replace __get_cpu_var uses"). I have used the percpu tree from next-20140826 for today. -- Cheers, Stephen Rothwell s...@canb.auug.org.au
[PATCH 1/1] ice1712: Replacing hex with #defines
Adds to the readability of the ice1712 driver. Signed-off-by: Konstantinos Tsimpoukas --- sound/pci/ice1712/ice1712.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/sound/pci/ice1712/ice1712.c b/sound/pci/ice1712/ice1712.c index 87f7fc4..206ed2c 100644 --- a/sound/pci/ice1712/ice1712.c +++ b/sound/pci/ice1712/ice1712.c @@ -2528,7 +2528,7 @@ static int snd_ice1712_free(struct snd_ice1712 *ice) if (!ice->port) goto __hw_end; /* mask all interrupts */ - outb(0xc0, ICEMT(ice, IRQ)); + outb(ICE1712_MULTI_CAPTURE | ICE1712_MULTI_PLAYBACK, ICEMT(ice, IRQ)); outb(0xff, ICEREG(ice, IRQMASK)); /* --- */ __hw_end: -- 1.9.1 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCHv2 00/14] arm64: eBPF JIT compiler
Hi Will, Catalin, This is a respin of series implementing eBPF JIT compiler for arm64, on top of 3.17-rc2. The v1 series [1] missed the previous merge window. Patches [1-13/14] implement code generation functions. Unchanged from v1. Patch [14/14] implements the actual eBPF JIT compiler. Updated from v1: straightforward fixups due to changes from net. Please see [14/14] for detailed change log. This series is applies cleanly against 3.17-rc2 and is tested working with lib/test_bpf on ARMv8 Foundation Model. Will had previously reported that v1 series works on Juno platform. Since v2 only involves straightforward renaming in [14/14], I don't anticipate any regressions. Thanks, z [1] https://lkml.org/lkml/2014/7/18/683 The following changes since commit 52addcf9d6669fa439387610bc65c92fa0980cef: Linux 3.17-rc2 (2014-08-25 15:36:20 -0700) are available in the git repository at: https://github.com/zlim/linux.git tags/arm64/bpf-v2 for you to fetch changes up to 2f4a4b8df4ba1cbd24957fb1a8371d30b1976174: arm64: eBPF JIT compiler (2014-08-26 19:04:43 -0700) Documentation/networking/filter.txt | 6 +- arch/arm64/Kconfig | 1 + arch/arm64/Makefile | 1 + arch/arm64/include/asm/insn.h | 249 + arch/arm64/kernel/insn.c| 646 +- arch/arm64/net/Makefile | 4 + arch/arm64/net/bpf_jit.h| 169 + arch/arm64/net/bpf_jit_comp.c | 677 8 files changed, 1743 insertions(+), 10 deletions(-) create mode 100644 arch/arm64/net/Makefile create mode 100644 arch/arm64/net/bpf_jit.h create mode 100644 arch/arm64/net/bpf_jit_comp.c Zi Shen Lim (14): arm64: introduce aarch64_insn_gen_comp_branch_imm() arm64: introduce aarch64_insn_gen_branch_reg() arm64: introduce aarch64_insn_gen_cond_branch_imm() arm64: introduce aarch64_insn_gen_load_store_reg() arm64: introduce aarch64_insn_gen_load_store_pair() arm64: introduce aarch64_insn_gen_add_sub_imm() arm64: introduce aarch64_insn_gen_bitfield() arm64: introduce aarch64_insn_gen_movewide() arm64: introduce aarch64_insn_gen_add_sub_shifted_reg() arm64: introduce 
aarch64_insn_gen_data1() arm64: introduce aarch64_insn_gen_data2() arm64: introduce aarch64_insn_gen_data3() arm64: introduce aarch64_insn_gen_logical_shifted_reg() arm64: eBPF JIT compiler Documentation/networking/filter.txt | 6 +- arch/arm64/Kconfig | 1 + arch/arm64/Makefile | 1 + arch/arm64/include/asm/insn.h | 249 + arch/arm64/kernel/insn.c| 646 +- arch/arm64/net/Makefile | 4 + arch/arm64/net/bpf_jit.h| 169 + arch/arm64/net/bpf_jit_comp.c | 677 8 files changed, 1743 insertions(+), 10 deletions(-) create mode 100644 arch/arm64/net/Makefile create mode 100644 arch/arm64/net/bpf_jit.h create mode 100644 arch/arm64/net/bpf_jit_comp.c -- 1.9.1 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCHv2 04/14] arm64: introduce aarch64_insn_gen_load_store_reg()
Introduce function to generate load/store (register offset) instructions. Signed-off-by: Zi Shen Lim Acked-by: Will Deacon --- arch/arm64/include/asm/insn.h | 20 ++ arch/arm64/kernel/insn.c | 62 +++ 2 files changed, 82 insertions(+) diff --git a/arch/arm64/include/asm/insn.h b/arch/arm64/include/asm/insn.h index 86a8a9c..5bc1cc3 100644 --- a/arch/arm64/include/asm/insn.h +++ b/arch/arm64/include/asm/insn.h @@ -72,6 +72,7 @@ enum aarch64_insn_imm_type { enum aarch64_insn_register_type { AARCH64_INSN_REGTYPE_RT, AARCH64_INSN_REGTYPE_RN, + AARCH64_INSN_REGTYPE_RM, }; enum aarch64_insn_register { @@ -143,12 +144,26 @@ enum aarch64_insn_branch_type { AARCH64_INSN_BRANCH_COMP_NONZERO, }; +enum aarch64_insn_size_type { + AARCH64_INSN_SIZE_8, + AARCH64_INSN_SIZE_16, + AARCH64_INSN_SIZE_32, + AARCH64_INSN_SIZE_64, +}; + +enum aarch64_insn_ldst_type { + AARCH64_INSN_LDST_LOAD_REG_OFFSET, + AARCH64_INSN_LDST_STORE_REG_OFFSET, +}; + #define__AARCH64_INSN_FUNCS(abbr, mask, val) \ static __always_inline bool aarch64_insn_is_##abbr(u32 code) \ { return (code & (mask)) == (val); } \ static __always_inline u32 aarch64_insn_get_##abbr##_value(void) \ { return (val); } +__AARCH64_INSN_FUNCS(str_reg, 0x3FE0EC00, 0x38206800) +__AARCH64_INSN_FUNCS(ldr_reg, 0x3FE0EC00, 0x38606800) __AARCH64_INSN_FUNCS(b,0xFC00, 0x1400) __AARCH64_INSN_FUNCS(bl, 0xFC00, 0x9400) __AARCH64_INSN_FUNCS(cbz, 0xFE00, 0x3400) @@ -184,6 +199,11 @@ u32 aarch64_insn_gen_hint(enum aarch64_insn_hint_op op); u32 aarch64_insn_gen_nop(void); u32 aarch64_insn_gen_branch_reg(enum aarch64_insn_register reg, enum aarch64_insn_branch_type type); +u32 aarch64_insn_gen_load_store_reg(enum aarch64_insn_register reg, + enum aarch64_insn_register base, + enum aarch64_insn_register offset, + enum aarch64_insn_size_type size, + enum aarch64_insn_ldst_type type); bool aarch64_insn_hotpatch_safe(u32 old_insn, u32 new_insn); diff --git a/arch/arm64/kernel/insn.c b/arch/arm64/kernel/insn.c index b65edc0..b882c85 100644 --- 
a/arch/arm64/kernel/insn.c +++ b/arch/arm64/kernel/insn.c @@ -286,6 +286,9 @@ static u32 aarch64_insn_encode_register(enum aarch64_insn_register_type type, case AARCH64_INSN_REGTYPE_RN: shift = 5; break; + case AARCH64_INSN_REGTYPE_RM: + shift = 16; + break; default: pr_err("%s: unknown register type encoding %d\n", __func__, type); @@ -298,6 +301,35 @@ static u32 aarch64_insn_encode_register(enum aarch64_insn_register_type type, return insn; } +static u32 aarch64_insn_encode_ldst_size(enum aarch64_insn_size_type type, +u32 insn) +{ + u32 size; + + switch (type) { + case AARCH64_INSN_SIZE_8: + size = 0; + break; + case AARCH64_INSN_SIZE_16: + size = 1; + break; + case AARCH64_INSN_SIZE_32: + size = 2; + break; + case AARCH64_INSN_SIZE_64: + size = 3; + break; + default: + pr_err("%s: unknown size encoding %d\n", __func__, type); + return 0; + } + + insn &= ~GENMASK(31, 30); + insn |= size << 30; + + return insn; +} + static inline long branch_imm_common(unsigned long pc, unsigned long addr, long range) { @@ -428,3 +460,33 @@ u32 aarch64_insn_gen_branch_reg(enum aarch64_insn_register reg, return aarch64_insn_encode_register(AARCH64_INSN_REGTYPE_RN, insn, reg); } + +u32 aarch64_insn_gen_load_store_reg(enum aarch64_insn_register reg, + enum aarch64_insn_register base, + enum aarch64_insn_register offset, + enum aarch64_insn_size_type size, + enum aarch64_insn_ldst_type type) +{ + u32 insn; + + switch (type) { + case AARCH64_INSN_LDST_LOAD_REG_OFFSET: + insn = aarch64_insn_get_ldr_reg_value(); + break; + case AARCH64_INSN_LDST_STORE_REG_OFFSET: + insn = aarch64_insn_get_str_reg_value(); + break; + default: + BUG_ON(1); + } + + insn = aarch64_insn_encode_ldst_size(size, insn); + + insn = aarch64_insn_encode_register(AARCH64_INSN_REGTYPE_RT, insn, reg); + + insn = aarch64_insn_encode_register(AARCH64_INSN_REGTYPE_RN, insn, + base); + + return
[PATCHv2 03/14] arm64: introduce aarch64_insn_gen_cond_branch_imm()
Introduce function to generate conditional branch (immediate) instructions. Signed-off-by: Zi Shen Lim Acked-by: Will Deacon --- arch/arm64/include/asm/insn.h | 21 + arch/arm64/kernel/insn.c | 17 + 2 files changed, 38 insertions(+) diff --git a/arch/arm64/include/asm/insn.h b/arch/arm64/include/asm/insn.h index 5080962..86a8a9c 100644 --- a/arch/arm64/include/asm/insn.h +++ b/arch/arm64/include/asm/insn.h @@ -117,6 +117,24 @@ enum aarch64_insn_variant { AARCH64_INSN_VARIANT_64BIT }; +enum aarch64_insn_condition { + AARCH64_INSN_COND_EQ = 0x0, /* == */ + AARCH64_INSN_COND_NE = 0x1, /* != */ + AARCH64_INSN_COND_CS = 0x2, /* unsigned >= */ + AARCH64_INSN_COND_CC = 0x3, /* unsigned < */ + AARCH64_INSN_COND_MI = 0x4, /* < 0 */ + AARCH64_INSN_COND_PL = 0x5, /* >= 0 */ + AARCH64_INSN_COND_VS = 0x6, /* overflow */ + AARCH64_INSN_COND_VC = 0x7, /* no overflow */ + AARCH64_INSN_COND_HI = 0x8, /* unsigned > */ + AARCH64_INSN_COND_LS = 0x9, /* unsigned <= */ + AARCH64_INSN_COND_GE = 0xa, /* signed >= */ + AARCH64_INSN_COND_LT = 0xb, /* signed < */ + AARCH64_INSN_COND_GT = 0xc, /* signed > */ + AARCH64_INSN_COND_LE = 0xd, /* signed <= */ + AARCH64_INSN_COND_AL = 0xe, /* always */ +}; + enum aarch64_insn_branch_type { AARCH64_INSN_BRANCH_NOLINK, AARCH64_INSN_BRANCH_LINK, @@ -135,6 +153,7 @@ __AARCH64_INSN_FUNCS(b, 0xFC00, 0x1400) __AARCH64_INSN_FUNCS(bl, 0xFC00, 0x9400) __AARCH64_INSN_FUNCS(cbz, 0xFE00, 0x3400) __AARCH64_INSN_FUNCS(cbnz, 0xFE00, 0x3500) +__AARCH64_INSN_FUNCS(bcond,0xFF10, 0x5400) __AARCH64_INSN_FUNCS(svc, 0xFFE0001F, 0xD401) __AARCH64_INSN_FUNCS(hvc, 0xFFE0001F, 0xD402) __AARCH64_INSN_FUNCS(smc, 0xFFE0001F, 0xD403) @@ -159,6 +178,8 @@ u32 aarch64_insn_gen_comp_branch_imm(unsigned long pc, unsigned long addr, enum aarch64_insn_register reg, enum aarch64_insn_variant variant, enum aarch64_insn_branch_type type); +u32 aarch64_insn_gen_cond_branch_imm(unsigned long pc, unsigned long addr, +enum aarch64_insn_condition cond); u32 aarch64_insn_gen_hint(enum 
aarch64_insn_hint_op op); u32 aarch64_insn_gen_nop(void); u32 aarch64_insn_gen_branch_reg(enum aarch64_insn_register reg, diff --git a/arch/arm64/kernel/insn.c b/arch/arm64/kernel/insn.c index 6797936..b65edc0 100644 --- a/arch/arm64/kernel/insn.c +++ b/arch/arm64/kernel/insn.c @@ -380,6 +380,23 @@ u32 aarch64_insn_gen_comp_branch_imm(unsigned long pc, unsigned long addr, offset >> 2); } +u32 aarch64_insn_gen_cond_branch_imm(unsigned long pc, unsigned long addr, +enum aarch64_insn_condition cond) +{ + u32 insn; + long offset; + + offset = branch_imm_common(pc, addr, SZ_1M); + + insn = aarch64_insn_get_bcond_value(); + + BUG_ON(cond < AARCH64_INSN_COND_EQ || cond > AARCH64_INSN_COND_AL); + insn |= cond; + + return aarch64_insn_encode_immediate(AARCH64_INSN_IMM_19, insn, +offset >> 2); +} + u32 __kprobes aarch64_insn_gen_hint(enum aarch64_insn_hint_op op) { return aarch64_insn_get_hint_value() | op; -- 1.9.1 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCHv2 01/14] arm64: introduce aarch64_insn_gen_comp_branch_imm()
Introduce function to generate compare & branch (immediate) instructions. Signed-off-by: Zi Shen Lim Acked-by: Will Deacon --- arch/arm64/include/asm/insn.h | 57 arch/arm64/kernel/insn.c | 88 --- 2 files changed, 140 insertions(+), 5 deletions(-) diff --git a/arch/arm64/include/asm/insn.h b/arch/arm64/include/asm/insn.h index dc1f73b..a98c495 100644 --- a/arch/arm64/include/asm/insn.h +++ b/arch/arm64/include/asm/insn.h @@ -2,6 +2,8 @@ * Copyright (C) 2013 Huawei Ltd. * Author: Jiang Liu * + * Copyright (C) 2014 Zi Shen Lim + * * This program is free software; you can redistribute it and/or modify * it under the terms of the GNU General Public License version 2 as * published by the Free Software Foundation. @@ -67,9 +69,58 @@ enum aarch64_insn_imm_type { AARCH64_INSN_IMM_MAX }; +enum aarch64_insn_register_type { + AARCH64_INSN_REGTYPE_RT, +}; + +enum aarch64_insn_register { + AARCH64_INSN_REG_0 = 0, + AARCH64_INSN_REG_1 = 1, + AARCH64_INSN_REG_2 = 2, + AARCH64_INSN_REG_3 = 3, + AARCH64_INSN_REG_4 = 4, + AARCH64_INSN_REG_5 = 5, + AARCH64_INSN_REG_6 = 6, + AARCH64_INSN_REG_7 = 7, + AARCH64_INSN_REG_8 = 8, + AARCH64_INSN_REG_9 = 9, + AARCH64_INSN_REG_10 = 10, + AARCH64_INSN_REG_11 = 11, + AARCH64_INSN_REG_12 = 12, + AARCH64_INSN_REG_13 = 13, + AARCH64_INSN_REG_14 = 14, + AARCH64_INSN_REG_15 = 15, + AARCH64_INSN_REG_16 = 16, + AARCH64_INSN_REG_17 = 17, + AARCH64_INSN_REG_18 = 18, + AARCH64_INSN_REG_19 = 19, + AARCH64_INSN_REG_20 = 20, + AARCH64_INSN_REG_21 = 21, + AARCH64_INSN_REG_22 = 22, + AARCH64_INSN_REG_23 = 23, + AARCH64_INSN_REG_24 = 24, + AARCH64_INSN_REG_25 = 25, + AARCH64_INSN_REG_26 = 26, + AARCH64_INSN_REG_27 = 27, + AARCH64_INSN_REG_28 = 28, + AARCH64_INSN_REG_29 = 29, + AARCH64_INSN_REG_FP = 29, /* Frame pointer */ + AARCH64_INSN_REG_30 = 30, + AARCH64_INSN_REG_LR = 30, /* Link register */ + AARCH64_INSN_REG_ZR = 31, /* Zero: as source register */ + AARCH64_INSN_REG_SP = 31 /* Stack pointer: as load/store base reg */ +}; + +enum aarch64_insn_variant { + 
AARCH64_INSN_VARIANT_32BIT, + AARCH64_INSN_VARIANT_64BIT +}; + enum aarch64_insn_branch_type { AARCH64_INSN_BRANCH_NOLINK, AARCH64_INSN_BRANCH_LINK, + AARCH64_INSN_BRANCH_COMP_ZERO, + AARCH64_INSN_BRANCH_COMP_NONZERO, }; #define__AARCH64_INSN_FUNCS(abbr, mask, val) \ @@ -80,6 +131,8 @@ static __always_inline u32 aarch64_insn_get_##abbr##_value(void) \ __AARCH64_INSN_FUNCS(b,0xFC00, 0x1400) __AARCH64_INSN_FUNCS(bl, 0xFC00, 0x9400) +__AARCH64_INSN_FUNCS(cbz, 0xFE00, 0x3400) +__AARCH64_INSN_FUNCS(cbnz, 0xFE00, 0x3500) __AARCH64_INSN_FUNCS(svc, 0xFFE0001F, 0xD401) __AARCH64_INSN_FUNCS(hvc, 0xFFE0001F, 0xD402) __AARCH64_INSN_FUNCS(smc, 0xFFE0001F, 0xD403) @@ -97,6 +150,10 @@ u32 aarch64_insn_encode_immediate(enum aarch64_insn_imm_type type, u32 insn, u64 imm); u32 aarch64_insn_gen_branch_imm(unsigned long pc, unsigned long addr, enum aarch64_insn_branch_type type); +u32 aarch64_insn_gen_comp_branch_imm(unsigned long pc, unsigned long addr, +enum aarch64_insn_register reg, +enum aarch64_insn_variant variant, +enum aarch64_insn_branch_type type); u32 aarch64_insn_gen_hint(enum aarch64_insn_hint_op op); u32 aarch64_insn_gen_nop(void); diff --git a/arch/arm64/kernel/insn.c b/arch/arm64/kernel/insn.c index 92f3683..d9f7827 100644 --- a/arch/arm64/kernel/insn.c +++ b/arch/arm64/kernel/insn.c @@ -2,6 +2,8 @@ * Copyright (C) 2013 Huawei Ltd. * Author: Jiang Liu * + * Copyright (C) 2014 Zi Shen Lim + * * This program is free software; you can redistribute it and/or modify * it under the terms of the GNU General Public License version 2 as * published by the Free Software Foundation. 
@@ -23,6 +25,8 @@ #include #include +#define AARCH64_INSN_SF_BITBIT(31) + static int aarch64_insn_encoding_class[] = { AARCH64_INSN_CLS_UNKNOWN, AARCH64_INSN_CLS_UNKNOWN, @@ -264,10 +268,36 @@ u32 __kprobes aarch64_insn_encode_immediate(enum aarch64_insn_imm_type type, return insn; } -u32 __kprobes aarch64_insn_gen_branch_imm(unsigned long pc, unsigned long addr, - enum aarch64_insn_branch_type type) +static u32 aarch64_insn_encode_register(enum aarch64_insn_register_type type, + u32 insn, + enum aarch64_insn_register reg) +{ + int shift; + + if (reg
[PATCHv2 10/14] arm64: introduce aarch64_insn_gen_data1()
Introduce function to generate data-processing (1 source) instructions. Signed-off-by: Zi Shen Lim Acked-by: Will Deacon --- arch/arm64/include/asm/insn.h | 13 + arch/arm64/kernel/insn.c | 37 + 2 files changed, 50 insertions(+) diff --git a/arch/arm64/include/asm/insn.h b/arch/arm64/include/asm/insn.h index c0a765d..246d214 100644 --- a/arch/arm64/include/asm/insn.h +++ b/arch/arm64/include/asm/insn.h @@ -185,6 +185,12 @@ enum aarch64_insn_bitfield_type { AARCH64_INSN_BITFIELD_MOVE_SIGNED }; +enum aarch64_insn_data1_type { + AARCH64_INSN_DATA1_REVERSE_16, + AARCH64_INSN_DATA1_REVERSE_32, + AARCH64_INSN_DATA1_REVERSE_64, +}; + #define__AARCH64_INSN_FUNCS(abbr, mask, val) \ static __always_inline bool aarch64_insn_is_##abbr(u32 code) \ { return (code & (mask)) == (val); } \ @@ -211,6 +217,9 @@ __AARCH64_INSN_FUNCS(add, 0x7F20, 0x0B00) __AARCH64_INSN_FUNCS(adds, 0x7F20, 0x2B00) __AARCH64_INSN_FUNCS(sub, 0x7F20, 0x4B00) __AARCH64_INSN_FUNCS(subs, 0x7F20, 0x6B00) +__AARCH64_INSN_FUNCS(rev16,0x7C00, 0x5AC00400) +__AARCH64_INSN_FUNCS(rev32,0x7C00, 0x5AC00800) +__AARCH64_INSN_FUNCS(rev64,0x7C00, 0x5AC00C00) __AARCH64_INSN_FUNCS(b,0xFC00, 0x1400) __AARCH64_INSN_FUNCS(bl, 0xFC00, 0x9400) __AARCH64_INSN_FUNCS(cbz, 0xFE00, 0x3400) @@ -276,6 +285,10 @@ u32 aarch64_insn_gen_add_sub_shifted_reg(enum aarch64_insn_register dst, int shift, enum aarch64_insn_variant variant, enum aarch64_insn_adsb_type type); +u32 aarch64_insn_gen_data1(enum aarch64_insn_register dst, + enum aarch64_insn_register src, + enum aarch64_insn_variant variant, + enum aarch64_insn_data1_type type); bool aarch64_insn_hotpatch_safe(u32 old_insn, u32 new_insn); diff --git a/arch/arm64/kernel/insn.c b/arch/arm64/kernel/insn.c index d7a4dd4..81ef3b5 100644 --- a/arch/arm64/kernel/insn.c +++ b/arch/arm64/kernel/insn.c @@ -747,3 +747,40 @@ u32 aarch64_insn_gen_add_sub_shifted_reg(enum aarch64_insn_register dst, return aarch64_insn_encode_immediate(AARCH64_INSN_IMM_6, insn, shift); } + +u32 
aarch64_insn_gen_data1(enum aarch64_insn_register dst, + enum aarch64_insn_register src, + enum aarch64_insn_variant variant, + enum aarch64_insn_data1_type type) +{ + u32 insn; + + switch (type) { + case AARCH64_INSN_DATA1_REVERSE_16: + insn = aarch64_insn_get_rev16_value(); + break; + case AARCH64_INSN_DATA1_REVERSE_32: + insn = aarch64_insn_get_rev32_value(); + break; + case AARCH64_INSN_DATA1_REVERSE_64: + BUG_ON(variant != AARCH64_INSN_VARIANT_64BIT); + insn = aarch64_insn_get_rev64_value(); + break; + default: + BUG_ON(1); + } + + switch (variant) { + case AARCH64_INSN_VARIANT_32BIT: + break; + case AARCH64_INSN_VARIANT_64BIT: + insn |= AARCH64_INSN_SF_BIT; + break; + default: + BUG_ON(1); + } + + insn = aarch64_insn_encode_register(AARCH64_INSN_REGTYPE_RD, insn, dst); + + return aarch64_insn_encode_register(AARCH64_INSN_REGTYPE_RN, insn, src); +} -- 1.9.1 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCHv2 06/14] arm64: introduce aarch64_insn_gen_add_sub_imm()
Introduce function to generate add/subtract (immediate) instructions. Signed-off-by: Zi Shen Lim Acked-by: Will Deacon --- arch/arm64/include/asm/insn.h | 16 arch/arm64/kernel/insn.c | 44 +++ 2 files changed, 60 insertions(+) diff --git a/arch/arm64/include/asm/insn.h b/arch/arm64/include/asm/insn.h index eef8f1e..29386aa 100644 --- a/arch/arm64/include/asm/insn.h +++ b/arch/arm64/include/asm/insn.h @@ -75,6 +75,7 @@ enum aarch64_insn_register_type { AARCH64_INSN_REGTYPE_RN, AARCH64_INSN_REGTYPE_RT2, AARCH64_INSN_REGTYPE_RM, + AARCH64_INSN_REGTYPE_RD, }; enum aarch64_insn_register { @@ -162,6 +163,13 @@ enum aarch64_insn_ldst_type { AARCH64_INSN_LDST_STORE_PAIR_POST_INDEX, }; +enum aarch64_insn_adsb_type { + AARCH64_INSN_ADSB_ADD, + AARCH64_INSN_ADSB_SUB, + AARCH64_INSN_ADSB_ADD_SETFLAGS, + AARCH64_INSN_ADSB_SUB_SETFLAGS +}; + #define__AARCH64_INSN_FUNCS(abbr, mask, val) \ static __always_inline bool aarch64_insn_is_##abbr(u32 code) \ { return (code & (mask)) == (val); } \ @@ -174,6 +182,10 @@ __AARCH64_INSN_FUNCS(stp_post, 0x7FC0, 0x2880) __AARCH64_INSN_FUNCS(ldp_post, 0x7FC0, 0x28C0) __AARCH64_INSN_FUNCS(stp_pre, 0x7FC0, 0x2980) __AARCH64_INSN_FUNCS(ldp_pre, 0x7FC0, 0x29C0) +__AARCH64_INSN_FUNCS(add_imm, 0x7F00, 0x1100) +__AARCH64_INSN_FUNCS(adds_imm, 0x7F00, 0x3100) +__AARCH64_INSN_FUNCS(sub_imm, 0x7F00, 0x5100) +__AARCH64_INSN_FUNCS(subs_imm, 0x7F00, 0x7100) __AARCH64_INSN_FUNCS(b,0xFC00, 0x1400) __AARCH64_INSN_FUNCS(bl, 0xFC00, 0x9400) __AARCH64_INSN_FUNCS(cbz, 0xFE00, 0x3400) @@ -220,6 +232,10 @@ u32 aarch64_insn_gen_load_store_pair(enum aarch64_insn_register reg1, int offset, enum aarch64_insn_variant variant, enum aarch64_insn_ldst_type type); +u32 aarch64_insn_gen_add_sub_imm(enum aarch64_insn_register dst, +enum aarch64_insn_register src, +int imm, enum aarch64_insn_variant variant, +enum aarch64_insn_adsb_type type); bool aarch64_insn_hotpatch_safe(u32 old_insn, u32 new_insn); diff --git a/arch/arm64/kernel/insn.c b/arch/arm64/kernel/insn.c index 
7880c06..ec3a902 100644 --- a/arch/arm64/kernel/insn.c +++ b/arch/arm64/kernel/insn.c @@ -285,6 +285,7 @@ static u32 aarch64_insn_encode_register(enum aarch64_insn_register_type type, switch (type) { case AARCH64_INSN_REGTYPE_RT: + case AARCH64_INSN_REGTYPE_RD: shift = 0; break; case AARCH64_INSN_REGTYPE_RN: @@ -555,3 +556,46 @@ u32 aarch64_insn_gen_load_store_pair(enum aarch64_insn_register reg1, return aarch64_insn_encode_immediate(AARCH64_INSN_IMM_7, insn, offset >> shift); } + +u32 aarch64_insn_gen_add_sub_imm(enum aarch64_insn_register dst, +enum aarch64_insn_register src, +int imm, enum aarch64_insn_variant variant, +enum aarch64_insn_adsb_type type) +{ + u32 insn; + + switch (type) { + case AARCH64_INSN_ADSB_ADD: + insn = aarch64_insn_get_add_imm_value(); + break; + case AARCH64_INSN_ADSB_SUB: + insn = aarch64_insn_get_sub_imm_value(); + break; + case AARCH64_INSN_ADSB_ADD_SETFLAGS: + insn = aarch64_insn_get_adds_imm_value(); + break; + case AARCH64_INSN_ADSB_SUB_SETFLAGS: + insn = aarch64_insn_get_subs_imm_value(); + break; + default: + BUG_ON(1); + } + + switch (variant) { + case AARCH64_INSN_VARIANT_32BIT: + break; + case AARCH64_INSN_VARIANT_64BIT: + insn |= AARCH64_INSN_SF_BIT; + break; + default: + BUG_ON(1); + } + + BUG_ON(imm & ~(SZ_4K - 1)); + + insn = aarch64_insn_encode_register(AARCH64_INSN_REGTYPE_RD, insn, dst); + + insn = aarch64_insn_encode_register(AARCH64_INSN_REGTYPE_RN, insn, src); + + return aarch64_insn_encode_immediate(AARCH64_INSN_IMM_12, insn, imm); +} -- 1.9.1 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCHv2 08/14] arm64: introduce aarch64_insn_gen_movewide()
Introduce function to generate move wide (immediate) instructions. Signed-off-by: Zi Shen Lim Acked-by: Will Deacon --- arch/arm64/include/asm/insn.h | 13 + arch/arm64/kernel/insn.c | 43 +++ 2 files changed, 56 insertions(+) diff --git a/arch/arm64/include/asm/insn.h b/arch/arm64/include/asm/insn.h index 8fd31fc..49dec28 100644 --- a/arch/arm64/include/asm/insn.h +++ b/arch/arm64/include/asm/insn.h @@ -172,6 +172,12 @@ enum aarch64_insn_adsb_type { AARCH64_INSN_ADSB_SUB_SETFLAGS }; +enum aarch64_insn_movewide_type { + AARCH64_INSN_MOVEWIDE_ZERO, + AARCH64_INSN_MOVEWIDE_KEEP, + AARCH64_INSN_MOVEWIDE_INVERSE +}; + enum aarch64_insn_bitfield_type { AARCH64_INSN_BITFIELD_MOVE, AARCH64_INSN_BITFIELD_MOVE_UNSIGNED, @@ -194,9 +200,12 @@ __AARCH64_INSN_FUNCS(add_imm, 0x7F00, 0x1100) __AARCH64_INSN_FUNCS(adds_imm, 0x7F00, 0x3100) __AARCH64_INSN_FUNCS(sub_imm, 0x7F00, 0x5100) __AARCH64_INSN_FUNCS(subs_imm, 0x7F00, 0x7100) +__AARCH64_INSN_FUNCS(movn, 0x7F80, 0x1280) __AARCH64_INSN_FUNCS(sbfm, 0x7F80, 0x1300) __AARCH64_INSN_FUNCS(bfm, 0x7F80, 0x3300) +__AARCH64_INSN_FUNCS(movz, 0x7F80, 0x5280) __AARCH64_INSN_FUNCS(ubfm, 0x7F80, 0x5300) +__AARCH64_INSN_FUNCS(movk, 0x7F80, 0x7280) __AARCH64_INSN_FUNCS(b,0xFC00, 0x1400) __AARCH64_INSN_FUNCS(bl, 0xFC00, 0x9400) __AARCH64_INSN_FUNCS(cbz, 0xFE00, 0x3400) @@ -252,6 +261,10 @@ u32 aarch64_insn_gen_bitfield(enum aarch64_insn_register dst, int immr, int imms, enum aarch64_insn_variant variant, enum aarch64_insn_bitfield_type type); +u32 aarch64_insn_gen_movewide(enum aarch64_insn_register dst, + int imm, int shift, + enum aarch64_insn_variant variant, + enum aarch64_insn_movewide_type type); bool aarch64_insn_hotpatch_safe(u32 old_insn, u32 new_insn); diff --git a/arch/arm64/kernel/insn.c b/arch/arm64/kernel/insn.c index e07d026..7aa2784 100644 --- a/arch/arm64/kernel/insn.c +++ b/arch/arm64/kernel/insn.c @@ -655,3 +655,46 @@ u32 aarch64_insn_gen_bitfield(enum aarch64_insn_register dst, return 
aarch64_insn_encode_immediate(AARCH64_INSN_IMM_S, insn, imms); } + +u32 aarch64_insn_gen_movewide(enum aarch64_insn_register dst, + int imm, int shift, + enum aarch64_insn_variant variant, + enum aarch64_insn_movewide_type type) +{ + u32 insn; + + switch (type) { + case AARCH64_INSN_MOVEWIDE_ZERO: + insn = aarch64_insn_get_movz_value(); + break; + case AARCH64_INSN_MOVEWIDE_KEEP: + insn = aarch64_insn_get_movk_value(); + break; + case AARCH64_INSN_MOVEWIDE_INVERSE: + insn = aarch64_insn_get_movn_value(); + break; + default: + BUG_ON(1); + } + + BUG_ON(imm & ~(SZ_64K - 1)); + + switch (variant) { + case AARCH64_INSN_VARIANT_32BIT: + BUG_ON(shift != 0 && shift != 16); + break; + case AARCH64_INSN_VARIANT_64BIT: + insn |= AARCH64_INSN_SF_BIT; + BUG_ON(shift != 0 && shift != 16 && shift != 32 && + shift != 48); + break; + default: + BUG_ON(1); + } + + insn |= (shift >> 4) << 21; + + insn = aarch64_insn_encode_register(AARCH64_INSN_REGTYPE_RD, insn, dst); + + return aarch64_insn_encode_immediate(AARCH64_INSN_IMM_16, insn, imm); +} -- 1.9.1
[PATCHv2 05/14] arm64: introduce aarch64_insn_gen_load_store_pair()
Introduce function to generate load/store pair instructions. Signed-off-by: Zi Shen Lim Acked-by: Will Deacon --- arch/arm64/include/asm/insn.h | 16 +++ arch/arm64/kernel/insn.c | 65 +++ 2 files changed, 81 insertions(+) diff --git a/arch/arm64/include/asm/insn.h b/arch/arm64/include/asm/insn.h index 5bc1cc3..eef8f1e 100644 --- a/arch/arm64/include/asm/insn.h +++ b/arch/arm64/include/asm/insn.h @@ -66,12 +66,14 @@ enum aarch64_insn_imm_type { AARCH64_INSN_IMM_14, AARCH64_INSN_IMM_12, AARCH64_INSN_IMM_9, + AARCH64_INSN_IMM_7, AARCH64_INSN_IMM_MAX }; enum aarch64_insn_register_type { AARCH64_INSN_REGTYPE_RT, AARCH64_INSN_REGTYPE_RN, + AARCH64_INSN_REGTYPE_RT2, AARCH64_INSN_REGTYPE_RM, }; @@ -154,6 +156,10 @@ enum aarch64_insn_size_type { enum aarch64_insn_ldst_type { AARCH64_INSN_LDST_LOAD_REG_OFFSET, AARCH64_INSN_LDST_STORE_REG_OFFSET, + AARCH64_INSN_LDST_LOAD_PAIR_PRE_INDEX, + AARCH64_INSN_LDST_STORE_PAIR_PRE_INDEX, + AARCH64_INSN_LDST_LOAD_PAIR_POST_INDEX, + AARCH64_INSN_LDST_STORE_PAIR_POST_INDEX, }; #define__AARCH64_INSN_FUNCS(abbr, mask, val) \ @@ -164,6 +170,10 @@ static __always_inline u32 aarch64_insn_get_##abbr##_value(void) \ __AARCH64_INSN_FUNCS(str_reg, 0x3FE0EC00, 0x38206800) __AARCH64_INSN_FUNCS(ldr_reg, 0x3FE0EC00, 0x38606800) +__AARCH64_INSN_FUNCS(stp_post, 0x7FC0, 0x2880) +__AARCH64_INSN_FUNCS(ldp_post, 0x7FC0, 0x28C0) +__AARCH64_INSN_FUNCS(stp_pre, 0x7FC0, 0x2980) +__AARCH64_INSN_FUNCS(ldp_pre, 0x7FC0, 0x29C0) __AARCH64_INSN_FUNCS(b,0xFC00, 0x1400) __AARCH64_INSN_FUNCS(bl, 0xFC00, 0x9400) __AARCH64_INSN_FUNCS(cbz, 0xFE00, 0x3400) @@ -204,6 +214,12 @@ u32 aarch64_insn_gen_load_store_reg(enum aarch64_insn_register reg, enum aarch64_insn_register offset, enum aarch64_insn_size_type size, enum aarch64_insn_ldst_type type); +u32 aarch64_insn_gen_load_store_pair(enum aarch64_insn_register reg1, +enum aarch64_insn_register reg2, +enum aarch64_insn_register base, +int offset, +enum aarch64_insn_variant variant, +enum aarch64_insn_ldst_type type); bool 
aarch64_insn_hotpatch_safe(u32 old_insn, u32 new_insn); diff --git a/arch/arm64/kernel/insn.c b/arch/arm64/kernel/insn.c index b882c85..7880c06 100644 --- a/arch/arm64/kernel/insn.c +++ b/arch/arm64/kernel/insn.c @@ -255,6 +255,10 @@ u32 __kprobes aarch64_insn_encode_immediate(enum aarch64_insn_imm_type type, mask = BIT(9) - 1; shift = 12; break; + case AARCH64_INSN_IMM_7: + mask = BIT(7) - 1; + shift = 15; + break; default: pr_err("aarch64_insn_encode_immediate: unknown immediate encoding %d\n", type); @@ -286,6 +290,9 @@ static u32 aarch64_insn_encode_register(enum aarch64_insn_register_type type, case AARCH64_INSN_REGTYPE_RN: shift = 5; break; + case AARCH64_INSN_REGTYPE_RT2: + shift = 10; + break; case AARCH64_INSN_REGTYPE_RM: shift = 16; break; @@ -490,3 +497,61 @@ u32 aarch64_insn_gen_load_store_reg(enum aarch64_insn_register reg, return aarch64_insn_encode_register(AARCH64_INSN_REGTYPE_RM, insn, offset); } + +u32 aarch64_insn_gen_load_store_pair(enum aarch64_insn_register reg1, +enum aarch64_insn_register reg2, +enum aarch64_insn_register base, +int offset, +enum aarch64_insn_variant variant, +enum aarch64_insn_ldst_type type) +{ + u32 insn; + int shift; + + switch (type) { + case AARCH64_INSN_LDST_LOAD_PAIR_PRE_INDEX: + insn = aarch64_insn_get_ldp_pre_value(); + break; + case AARCH64_INSN_LDST_STORE_PAIR_PRE_INDEX: + insn = aarch64_insn_get_stp_pre_value(); + break; + case AARCH64_INSN_LDST_LOAD_PAIR_POST_INDEX: + insn = aarch64_insn_get_ldp_post_value(); + break; + case AARCH64_INSN_LDST_STORE_PAIR_POST_INDEX: + insn = aarch64_insn_get_stp_post_value(); + break; + default: + BUG_ON(1); + } + + switch (variant) { + case AARCH64_INSN_VARIANT_32BIT: + /* offset must be multiples of 4 in
[PATCHv2 11/14] arm64: introduce aarch64_insn_gen_data2()
Introduce function to generate data-processing (2 source) instructions. Signed-off-by: Zi Shen Lim Acked-by: Will Deacon --- arch/arm64/include/asm/insn.h | 20 ++ arch/arm64/kernel/insn.c | 48 +++ 2 files changed, 68 insertions(+) diff --git a/arch/arm64/include/asm/insn.h b/arch/arm64/include/asm/insn.h index 246d214..367245f 100644 --- a/arch/arm64/include/asm/insn.h +++ b/arch/arm64/include/asm/insn.h @@ -191,6 +191,15 @@ enum aarch64_insn_data1_type { AARCH64_INSN_DATA1_REVERSE_64, }; +enum aarch64_insn_data2_type { + AARCH64_INSN_DATA2_UDIV, + AARCH64_INSN_DATA2_SDIV, + AARCH64_INSN_DATA2_LSLV, + AARCH64_INSN_DATA2_LSRV, + AARCH64_INSN_DATA2_ASRV, + AARCH64_INSN_DATA2_RORV, +}; + #define__AARCH64_INSN_FUNCS(abbr, mask, val) \ static __always_inline bool aarch64_insn_is_##abbr(u32 code) \ { return (code & (mask)) == (val); } \ @@ -217,6 +226,12 @@ __AARCH64_INSN_FUNCS(add, 0x7F20, 0x0B00) __AARCH64_INSN_FUNCS(adds, 0x7F20, 0x2B00) __AARCH64_INSN_FUNCS(sub, 0x7F20, 0x4B00) __AARCH64_INSN_FUNCS(subs, 0x7F20, 0x6B00) +__AARCH64_INSN_FUNCS(udiv, 0x7FE0FC00, 0x1AC00800) +__AARCH64_INSN_FUNCS(sdiv, 0x7FE0FC00, 0x1AC00C00) +__AARCH64_INSN_FUNCS(lslv, 0x7FE0FC00, 0x1AC02000) +__AARCH64_INSN_FUNCS(lsrv, 0x7FE0FC00, 0x1AC02400) +__AARCH64_INSN_FUNCS(asrv, 0x7FE0FC00, 0x1AC02800) +__AARCH64_INSN_FUNCS(rorv, 0x7FE0FC00, 0x1AC02C00) __AARCH64_INSN_FUNCS(rev16,0x7C00, 0x5AC00400) __AARCH64_INSN_FUNCS(rev32,0x7C00, 0x5AC00800) __AARCH64_INSN_FUNCS(rev64,0x7C00, 0x5AC00C00) @@ -289,6 +304,11 @@ u32 aarch64_insn_gen_data1(enum aarch64_insn_register dst, enum aarch64_insn_register src, enum aarch64_insn_variant variant, enum aarch64_insn_data1_type type); +u32 aarch64_insn_gen_data2(enum aarch64_insn_register dst, + enum aarch64_insn_register src, + enum aarch64_insn_register reg, + enum aarch64_insn_variant variant, + enum aarch64_insn_data2_type type); bool aarch64_insn_hotpatch_safe(u32 old_insn, u32 new_insn); diff --git a/arch/arm64/kernel/insn.c b/arch/arm64/kernel/insn.c 
index 81ef3b5..c054164 100644 --- a/arch/arm64/kernel/insn.c +++ b/arch/arm64/kernel/insn.c @@ -784,3 +784,51 @@ u32 aarch64_insn_gen_data1(enum aarch64_insn_register dst, return aarch64_insn_encode_register(AARCH64_INSN_REGTYPE_RN, insn, src); } + +u32 aarch64_insn_gen_data2(enum aarch64_insn_register dst, + enum aarch64_insn_register src, + enum aarch64_insn_register reg, + enum aarch64_insn_variant variant, + enum aarch64_insn_data2_type type) +{ + u32 insn; + + switch (type) { + case AARCH64_INSN_DATA2_UDIV: + insn = aarch64_insn_get_udiv_value(); + break; + case AARCH64_INSN_DATA2_SDIV: + insn = aarch64_insn_get_sdiv_value(); + break; + case AARCH64_INSN_DATA2_LSLV: + insn = aarch64_insn_get_lslv_value(); + break; + case AARCH64_INSN_DATA2_LSRV: + insn = aarch64_insn_get_lsrv_value(); + break; + case AARCH64_INSN_DATA2_ASRV: + insn = aarch64_insn_get_asrv_value(); + break; + case AARCH64_INSN_DATA2_RORV: + insn = aarch64_insn_get_rorv_value(); + break; + default: + BUG_ON(1); + } + + switch (variant) { + case AARCH64_INSN_VARIANT_32BIT: + break; + case AARCH64_INSN_VARIANT_64BIT: + insn |= AARCH64_INSN_SF_BIT; + break; + default: + BUG_ON(1); + } + + insn = aarch64_insn_encode_register(AARCH64_INSN_REGTYPE_RD, insn, dst); + + insn = aarch64_insn_encode_register(AARCH64_INSN_REGTYPE_RN, insn, src); + + return aarch64_insn_encode_register(AARCH64_INSN_REGTYPE_RM, insn, reg); +} -- 1.9.1
[PATCHv2 13/14] arm64: introduce aarch64_insn_gen_logical_shifted_reg()
Introduce function to generate logical (shifted register) instructions. Signed-off-by: Zi Shen Lim Acked-by: Will Deacon --- arch/arm64/include/asm/insn.h | 25 ++ arch/arm64/kernel/insn.c | 60 +++ 2 files changed, 85 insertions(+) diff --git a/arch/arm64/include/asm/insn.h b/arch/arm64/include/asm/insn.h index 36e8465..56a9e63 100644 --- a/arch/arm64/include/asm/insn.h +++ b/arch/arm64/include/asm/insn.h @@ -206,6 +206,17 @@ enum aarch64_insn_data3_type { AARCH64_INSN_DATA3_MSUB, }; +enum aarch64_insn_logic_type { + AARCH64_INSN_LOGIC_AND, + AARCH64_INSN_LOGIC_BIC, + AARCH64_INSN_LOGIC_ORR, + AARCH64_INSN_LOGIC_ORN, + AARCH64_INSN_LOGIC_EOR, + AARCH64_INSN_LOGIC_EON, + AARCH64_INSN_LOGIC_AND_SETFLAGS, + AARCH64_INSN_LOGIC_BIC_SETFLAGS +}; + #define__AARCH64_INSN_FUNCS(abbr, mask, val) \ static __always_inline bool aarch64_insn_is_##abbr(u32 code) \ { return (code & (mask)) == (val); } \ @@ -243,6 +254,14 @@ __AARCH64_INSN_FUNCS(rorv, 0x7FE0FC00, 0x1AC02C00) __AARCH64_INSN_FUNCS(rev16,0x7C00, 0x5AC00400) __AARCH64_INSN_FUNCS(rev32,0x7C00, 0x5AC00800) __AARCH64_INSN_FUNCS(rev64,0x7C00, 0x5AC00C00) +__AARCH64_INSN_FUNCS(and, 0x7F20, 0x0A00) +__AARCH64_INSN_FUNCS(bic, 0x7F20, 0x0A20) +__AARCH64_INSN_FUNCS(orr, 0x7F20, 0x2A00) +__AARCH64_INSN_FUNCS(orn, 0x7F20, 0x2A20) +__AARCH64_INSN_FUNCS(eor, 0x7F20, 0x4A00) +__AARCH64_INSN_FUNCS(eon, 0x7F20, 0x4A20) +__AARCH64_INSN_FUNCS(ands, 0x7F20, 0x6A00) +__AARCH64_INSN_FUNCS(bics, 0x7F20, 0x6A20) __AARCH64_INSN_FUNCS(b,0xFC00, 0x1400) __AARCH64_INSN_FUNCS(bl, 0xFC00, 0x9400) __AARCH64_INSN_FUNCS(cbz, 0xFE00, 0x3400) @@ -323,6 +342,12 @@ u32 aarch64_insn_gen_data3(enum aarch64_insn_register dst, enum aarch64_insn_register reg2, enum aarch64_insn_variant variant, enum aarch64_insn_data3_type type); +u32 aarch64_insn_gen_logical_shifted_reg(enum aarch64_insn_register dst, +enum aarch64_insn_register src, +enum aarch64_insn_register reg, +int shift, +enum aarch64_insn_variant variant, +enum aarch64_insn_logic_type type); bool 
aarch64_insn_hotpatch_safe(u32 old_insn, u32 new_insn); diff --git a/arch/arm64/kernel/insn.c b/arch/arm64/kernel/insn.c index f73a4bf..0668ee5 100644 --- a/arch/arm64/kernel/insn.c +++ b/arch/arm64/kernel/insn.c @@ -874,3 +874,63 @@ u32 aarch64_insn_gen_data3(enum aarch64_insn_register dst, return aarch64_insn_encode_register(AARCH64_INSN_REGTYPE_RM, insn, reg2); } + +u32 aarch64_insn_gen_logical_shifted_reg(enum aarch64_insn_register dst, +enum aarch64_insn_register src, +enum aarch64_insn_register reg, +int shift, +enum aarch64_insn_variant variant, +enum aarch64_insn_logic_type type) +{ + u32 insn; + + switch (type) { + case AARCH64_INSN_LOGIC_AND: + insn = aarch64_insn_get_and_value(); + break; + case AARCH64_INSN_LOGIC_BIC: + insn = aarch64_insn_get_bic_value(); + break; + case AARCH64_INSN_LOGIC_ORR: + insn = aarch64_insn_get_orr_value(); + break; + case AARCH64_INSN_LOGIC_ORN: + insn = aarch64_insn_get_orn_value(); + break; + case AARCH64_INSN_LOGIC_EOR: + insn = aarch64_insn_get_eor_value(); + break; + case AARCH64_INSN_LOGIC_EON: + insn = aarch64_insn_get_eon_value(); + break; + case AARCH64_INSN_LOGIC_AND_SETFLAGS: + insn = aarch64_insn_get_ands_value(); + break; + case AARCH64_INSN_LOGIC_BIC_SETFLAGS: + insn = aarch64_insn_get_bics_value(); + break; + default: + BUG_ON(1); + } + + switch (variant) { + case AARCH64_INSN_VARIANT_32BIT: + BUG_ON(shift & ~(SZ_32 - 1)); + break; + case AARCH64_INSN_VARIANT_64BIT: + insn |= AARCH64_INSN_SF_BIT; + BUG_ON(shift & ~(SZ_64 - 1)); + break; + default: + BUG_ON(1); + } + + + insn = aarch64_insn_encode_register(AARCH64_INSN_REGTYPE_RD, insn, dst); + + insn = aarch64_insn_encode_register(AARCH64_INSN_REGTYPE_RN, insn, src); + +
[PATCHv2 12/14] arm64: introduce aarch64_insn_gen_data3()
Introduce function to generate data-processing (3 source) instructions. Signed-off-by: Zi Shen Lim Acked-by: Will Deacon --- arch/arm64/include/asm/insn.h | 14 ++ arch/arm64/kernel/insn.c | 42 ++ 2 files changed, 56 insertions(+) diff --git a/arch/arm64/include/asm/insn.h b/arch/arm64/include/asm/insn.h index 367245f..36e8465 100644 --- a/arch/arm64/include/asm/insn.h +++ b/arch/arm64/include/asm/insn.h @@ -79,6 +79,7 @@ enum aarch64_insn_register_type { AARCH64_INSN_REGTYPE_RT2, AARCH64_INSN_REGTYPE_RM, AARCH64_INSN_REGTYPE_RD, + AARCH64_INSN_REGTYPE_RA, }; enum aarch64_insn_register { @@ -200,6 +201,11 @@ enum aarch64_insn_data2_type { AARCH64_INSN_DATA2_RORV, }; +enum aarch64_insn_data3_type { + AARCH64_INSN_DATA3_MADD, + AARCH64_INSN_DATA3_MSUB, +}; + #define__AARCH64_INSN_FUNCS(abbr, mask, val) \ static __always_inline bool aarch64_insn_is_##abbr(u32 code) \ { return (code & (mask)) == (val); } \ @@ -226,6 +232,8 @@ __AARCH64_INSN_FUNCS(add, 0x7F20, 0x0B00) __AARCH64_INSN_FUNCS(adds, 0x7F20, 0x2B00) __AARCH64_INSN_FUNCS(sub, 0x7F20, 0x4B00) __AARCH64_INSN_FUNCS(subs, 0x7F20, 0x6B00) +__AARCH64_INSN_FUNCS(madd, 0x7FE08000, 0x1B00) +__AARCH64_INSN_FUNCS(msub, 0x7FE08000, 0x1B008000) __AARCH64_INSN_FUNCS(udiv, 0x7FE0FC00, 0x1AC00800) __AARCH64_INSN_FUNCS(sdiv, 0x7FE0FC00, 0x1AC00C00) __AARCH64_INSN_FUNCS(lslv, 0x7FE0FC00, 0x1AC02000) @@ -309,6 +317,12 @@ u32 aarch64_insn_gen_data2(enum aarch64_insn_register dst, enum aarch64_insn_register reg, enum aarch64_insn_variant variant, enum aarch64_insn_data2_type type); +u32 aarch64_insn_gen_data3(enum aarch64_insn_register dst, + enum aarch64_insn_register src, + enum aarch64_insn_register reg1, + enum aarch64_insn_register reg2, + enum aarch64_insn_variant variant, + enum aarch64_insn_data3_type type); bool aarch64_insn_hotpatch_safe(u32 old_insn, u32 new_insn); diff --git a/arch/arm64/kernel/insn.c b/arch/arm64/kernel/insn.c index c054164..f73a4bf 100644 --- a/arch/arm64/kernel/insn.c +++ b/arch/arm64/kernel/insn.c 
@@ -302,6 +302,7 @@ static u32 aarch64_insn_encode_register(enum aarch64_insn_register_type type, shift = 5; break; case AARCH64_INSN_REGTYPE_RT2: + case AARCH64_INSN_REGTYPE_RA: shift = 10; break; case AARCH64_INSN_REGTYPE_RM: @@ -832,3 +833,44 @@ u32 aarch64_insn_gen_data2(enum aarch64_insn_register dst, return aarch64_insn_encode_register(AARCH64_INSN_REGTYPE_RM, insn, reg); } + +u32 aarch64_insn_gen_data3(enum aarch64_insn_register dst, + enum aarch64_insn_register src, + enum aarch64_insn_register reg1, + enum aarch64_insn_register reg2, + enum aarch64_insn_variant variant, + enum aarch64_insn_data3_type type) +{ + u32 insn; + + switch (type) { + case AARCH64_INSN_DATA3_MADD: + insn = aarch64_insn_get_madd_value(); + break; + case AARCH64_INSN_DATA3_MSUB: + insn = aarch64_insn_get_msub_value(); + break; + default: + BUG_ON(1); + } + + switch (variant) { + case AARCH64_INSN_VARIANT_32BIT: + break; + case AARCH64_INSN_VARIANT_64BIT: + insn |= AARCH64_INSN_SF_BIT; + break; + default: + BUG_ON(1); + } + + insn = aarch64_insn_encode_register(AARCH64_INSN_REGTYPE_RD, insn, dst); + + insn = aarch64_insn_encode_register(AARCH64_INSN_REGTYPE_RA, insn, src); + + insn = aarch64_insn_encode_register(AARCH64_INSN_REGTYPE_RN, insn, + reg1); + + return aarch64_insn_encode_register(AARCH64_INSN_REGTYPE_RM, insn, + reg2); +} -- 1.9.1
[PATCHv2 09/14] arm64: introduce aarch64_insn_gen_add_sub_shifted_reg()
Introduce function to generate add/subtract (shifted register) instructions. Signed-off-by: Zi Shen Lim Acked-by: Will Deacon --- arch/arm64/include/asm/insn.h | 11 ++ arch/arm64/kernel/insn.c | 49 +++ 2 files changed, 60 insertions(+) diff --git a/arch/arm64/include/asm/insn.h b/arch/arm64/include/asm/insn.h index 49dec28..c0a765d 100644 --- a/arch/arm64/include/asm/insn.h +++ b/arch/arm64/include/asm/insn.h @@ -67,6 +67,7 @@ enum aarch64_insn_imm_type { AARCH64_INSN_IMM_12, AARCH64_INSN_IMM_9, AARCH64_INSN_IMM_7, + AARCH64_INSN_IMM_6, AARCH64_INSN_IMM_S, AARCH64_INSN_IMM_R, AARCH64_INSN_IMM_MAX @@ -206,6 +207,10 @@ __AARCH64_INSN_FUNCS(bfm, 0x7F80, 0x3300) __AARCH64_INSN_FUNCS(movz, 0x7F80, 0x5280) __AARCH64_INSN_FUNCS(ubfm, 0x7F80, 0x5300) __AARCH64_INSN_FUNCS(movk, 0x7F80, 0x7280) +__AARCH64_INSN_FUNCS(add, 0x7F20, 0x0B00) +__AARCH64_INSN_FUNCS(adds, 0x7F20, 0x2B00) +__AARCH64_INSN_FUNCS(sub, 0x7F20, 0x4B00) +__AARCH64_INSN_FUNCS(subs, 0x7F20, 0x6B00) __AARCH64_INSN_FUNCS(b,0xFC00, 0x1400) __AARCH64_INSN_FUNCS(bl, 0xFC00, 0x9400) __AARCH64_INSN_FUNCS(cbz, 0xFE00, 0x3400) @@ -265,6 +270,12 @@ u32 aarch64_insn_gen_movewide(enum aarch64_insn_register dst, int imm, int shift, enum aarch64_insn_variant variant, enum aarch64_insn_movewide_type type); +u32 aarch64_insn_gen_add_sub_shifted_reg(enum aarch64_insn_register dst, +enum aarch64_insn_register src, +enum aarch64_insn_register reg, +int shift, +enum aarch64_insn_variant variant, +enum aarch64_insn_adsb_type type); bool aarch64_insn_hotpatch_safe(u32 old_insn, u32 new_insn); diff --git a/arch/arm64/kernel/insn.c b/arch/arm64/kernel/insn.c index 7aa2784..d7a4dd4 100644 --- a/arch/arm64/kernel/insn.c +++ b/arch/arm64/kernel/insn.c @@ -260,6 +260,7 @@ u32 __kprobes aarch64_insn_encode_immediate(enum aarch64_insn_imm_type type, mask = BIT(7) - 1; shift = 15; break; + case AARCH64_INSN_IMM_6: case AARCH64_INSN_IMM_S: mask = BIT(6) - 1; shift = 10; @@ -698,3 +699,51 @@ u32 aarch64_insn_gen_movewide(enum 
aarch64_insn_register dst, return aarch64_insn_encode_immediate(AARCH64_INSN_IMM_16, insn, imm); } + +u32 aarch64_insn_gen_add_sub_shifted_reg(enum aarch64_insn_register dst, +enum aarch64_insn_register src, +enum aarch64_insn_register reg, +int shift, +enum aarch64_insn_variant variant, +enum aarch64_insn_adsb_type type) +{ + u32 insn; + + switch (type) { + case AARCH64_INSN_ADSB_ADD: + insn = aarch64_insn_get_add_value(); + break; + case AARCH64_INSN_ADSB_SUB: + insn = aarch64_insn_get_sub_value(); + break; + case AARCH64_INSN_ADSB_ADD_SETFLAGS: + insn = aarch64_insn_get_adds_value(); + break; + case AARCH64_INSN_ADSB_SUB_SETFLAGS: + insn = aarch64_insn_get_subs_value(); + break; + default: + BUG_ON(1); + } + + switch (variant) { + case AARCH64_INSN_VARIANT_32BIT: + BUG_ON(shift & ~(SZ_32 - 1)); + break; + case AARCH64_INSN_VARIANT_64BIT: + insn |= AARCH64_INSN_SF_BIT; + BUG_ON(shift & ~(SZ_64 - 1)); + break; + default: + BUG_ON(1); + } + + + insn = aarch64_insn_encode_register(AARCH64_INSN_REGTYPE_RD, insn, dst); + + insn = aarch64_insn_encode_register(AARCH64_INSN_REGTYPE_RN, insn, src); + + insn = aarch64_insn_encode_register(AARCH64_INSN_REGTYPE_RM, insn, reg); + + return aarch64_insn_encode_immediate(AARCH64_INSN_IMM_6, insn, shift); +} -- 1.9.1
[PATCHv2 14/14] arm64: eBPF JIT compiler
The JIT compiler emits A64 instructions. It supports eBPF only. Legacy BPF is supported thanks to conversion by BPF core. JIT is enabled in the same way as for other architectures: echo 1 > /proc/sys/net/core/bpf_jit_enable Or for additional compiler output: echo 2 > /proc/sys/net/core/bpf_jit_enable See Documentation/networking/filter.txt for more information. The implementation passes all 57 tests in lib/test_bpf.c on ARMv8 Foundation Model :) Also tested by Will on Juno platform. Signed-off-by: Zi Shen Lim Acked-by: Alexei Starovoitov Acked-by: Will Deacon --- v1->v2: Rebased onto 3.17-rc2, and fixed up changes related to: - sock_filter_int -> bpf_insn: 2695fb552cbe (net: filter: rename 'struct sock_filter_int' into 'struct bpf_insn') - sk_filter -> bpf_prog: 7ae457c1e5b4 (net: filter: split 'struct sk_filter' into socket and bpf parts) RFCv3->v1: Addressed review comments from Will wrt codegen bits: - define and use {SF,N}_BIT - use masks for limit checks Also: - rebase onto net-next RFCv2->RFCv3: - clarify 16B stack alignment requirement - I missed one reference - fixed a couple checks for immediate bits - make bpf_jit.h checkpatch clean - remove stale DW case in LD_IND and LD_ABS (good catch by Alexei) - add Alexei's Acked-by - rebase onto net-next Also, per discussion with Will, consolidated bpf_jit.h into arch/arm64/.../insn.{c,h}: - instruction encoding stuff moved into arch/arm64/kernel/insn.c - bpf_jit.h uses arch/arm64/include/asm/insn.h RFCv1->RFCv2: Addressed review comments from Alexei: - use core-$(CONFIG_NET) - use GENMASK - lower-case function names in header file - drop LD_ABS+DW and LD_IND+DW, which do not exist in eBPF yet - use pr_xxx_once() to prevent spamming logs - clarify 16B stack alignment requirement - drop usage of EMIT macro which was saving just one argument, turns out having additional argument wasn't too much of an eyesore Also, per discussion with Alexei, and additional suggestion from Daniel: - moved load_pointer() from 
net/core/filter.c into filter.h as bpf_load_pointer() which is done as a separate preparatory patch. [1] [1] http://patchwork.ozlabs.org/patch/366906/ NOTES: * The preparatory patch [1] has been merged into net-next 9f12fbe603f7 ("net: filter: move load_pointer() into filter.h"). * bpf_jit_comp.c and bpf_jit.h is checkpatch clean. * The following sparse warning is not applicable: warning: symbol 'bpf_jit_enable' was not declared. Should it be static? FUTURE WORK: 1. Implement remaining classes of eBPF instructions: ST|MEM, STX|XADD which currently do not have corresponding test cases in test_bpf. 2. Further compiler optimization, such as optimization for small immediates. Documentation/networking/filter.txt | 6 +- arch/arm64/Kconfig | 1 + arch/arm64/Makefile | 1 + arch/arm64/net/Makefile | 4 + arch/arm64/net/bpf_jit.h| 169 + arch/arm64/net/bpf_jit_comp.c | 677 6 files changed, 855 insertions(+), 3 deletions(-) create mode 100644 arch/arm64/net/Makefile create mode 100644 arch/arm64/net/bpf_jit.h create mode 100644 arch/arm64/net/bpf_jit_comp.c diff --git a/Documentation/networking/filter.txt b/Documentation/networking/filter.txt index c48a970..1842d4f 100644 --- a/Documentation/networking/filter.txt +++ b/Documentation/networking/filter.txt @@ -462,9 +462,9 @@ JIT compiler The Linux kernel has a built-in BPF JIT compiler for x86_64, SPARC, PowerPC, -ARM and s390 and can be enabled through CONFIG_BPF_JIT. The JIT compiler is -transparently invoked for each attached filter from user space or for internal -kernel users if it has been previously enabled by root: +ARM, ARM64 and s390 and can be enabled through CONFIG_BPF_JIT. 
The JIT compiler +is transparently invoked for each attached filter from user space or for +internal kernel users if it has been previously enabled by root: echo 1 > /proc/sys/net/core/bpf_jit_enable diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig index fd4e81a..cfea623 100644 --- a/arch/arm64/Kconfig +++ b/arch/arm64/Kconfig @@ -35,6 +35,7 @@ config ARM64 select HAVE_ARCH_JUMP_LABEL select HAVE_ARCH_KGDB select HAVE_ARCH_TRACEHOOK + select HAVE_BPF_JIT select HAVE_C_RECORDMCOUNT select HAVE_CC_STACKPROTECTOR select HAVE_DEBUG_BUGVERBOSE diff --git a/arch/arm64/Makefile b/arch/arm64/Makefile index 2df5e5d..59c86b6 100644 --- a/arch/arm64/Makefile +++ b/arch/arm64/Makefile @@ -47,6 +47,7 @@ endif export TEXT_OFFSET GZFLAGS core-y += arch/arm64/kernel/ arch/arm64/mm/ +core-$(CONFIG_NET) += arch/arm64/net/ core-$(CONFIG_KVM) += arch/arm64/kvm/ core-$(CONFIG_XEN) += arch/arm64/xen/ core-$(CONFIG_CRYPTO) += arch/arm64/crypto/ diff --git a/arch/arm64/net/Makefile b/arch/arm64/net/Makefile new
[PATCHv2 07/14] arm64: introduce aarch64_insn_gen_bitfield()
Introduce function to generate bitfield instructions. Signed-off-by: Zi Shen Lim Acked-by: Will Deacon --- arch/arm64/include/asm/insn.h | 16 + arch/arm64/kernel/insn.c | 56 +++ 2 files changed, 72 insertions(+) diff --git a/arch/arm64/include/asm/insn.h b/arch/arm64/include/asm/insn.h index 29386aa..8fd31fc 100644 --- a/arch/arm64/include/asm/insn.h +++ b/arch/arm64/include/asm/insn.h @@ -67,6 +67,8 @@ enum aarch64_insn_imm_type { AARCH64_INSN_IMM_12, AARCH64_INSN_IMM_9, AARCH64_INSN_IMM_7, + AARCH64_INSN_IMM_S, + AARCH64_INSN_IMM_R, AARCH64_INSN_IMM_MAX }; @@ -170,6 +172,12 @@ enum aarch64_insn_adsb_type { AARCH64_INSN_ADSB_SUB_SETFLAGS }; +enum aarch64_insn_bitfield_type { + AARCH64_INSN_BITFIELD_MOVE, + AARCH64_INSN_BITFIELD_MOVE_UNSIGNED, + AARCH64_INSN_BITFIELD_MOVE_SIGNED +}; + #define__AARCH64_INSN_FUNCS(abbr, mask, val) \ static __always_inline bool aarch64_insn_is_##abbr(u32 code) \ { return (code & (mask)) == (val); } \ @@ -186,6 +194,9 @@ __AARCH64_INSN_FUNCS(add_imm, 0x7F00, 0x1100) __AARCH64_INSN_FUNCS(adds_imm, 0x7F00, 0x3100) __AARCH64_INSN_FUNCS(sub_imm, 0x7F00, 0x5100) __AARCH64_INSN_FUNCS(subs_imm, 0x7F00, 0x7100) +__AARCH64_INSN_FUNCS(sbfm, 0x7F80, 0x1300) +__AARCH64_INSN_FUNCS(bfm, 0x7F80, 0x3300) +__AARCH64_INSN_FUNCS(ubfm, 0x7F80, 0x5300) __AARCH64_INSN_FUNCS(b,0xFC00, 0x1400) __AARCH64_INSN_FUNCS(bl, 0xFC00, 0x9400) __AARCH64_INSN_FUNCS(cbz, 0xFE00, 0x3400) @@ -236,6 +247,11 @@ u32 aarch64_insn_gen_add_sub_imm(enum aarch64_insn_register dst, enum aarch64_insn_register src, int imm, enum aarch64_insn_variant variant, enum aarch64_insn_adsb_type type); +u32 aarch64_insn_gen_bitfield(enum aarch64_insn_register dst, + enum aarch64_insn_register src, + int immr, int imms, + enum aarch64_insn_variant variant, + enum aarch64_insn_bitfield_type type); bool aarch64_insn_hotpatch_safe(u32 old_insn, u32 new_insn); diff --git a/arch/arm64/kernel/insn.c b/arch/arm64/kernel/insn.c index ec3a902..e07d026 100644 --- a/arch/arm64/kernel/insn.c +++ 
b/arch/arm64/kernel/insn.c @@ -26,6 +26,7 @@ #include #define AARCH64_INSN_SF_BITBIT(31) +#define AARCH64_INSN_N_BIT BIT(22) static int aarch64_insn_encoding_class[] = { AARCH64_INSN_CLS_UNKNOWN, @@ -259,6 +260,14 @@ u32 __kprobes aarch64_insn_encode_immediate(enum aarch64_insn_imm_type type, mask = BIT(7) - 1; shift = 15; break; + case AARCH64_INSN_IMM_S: + mask = BIT(6) - 1; + shift = 10; + break; + case AARCH64_INSN_IMM_R: + mask = BIT(6) - 1; + shift = 16; + break; default: pr_err("aarch64_insn_encode_immediate: unknown immediate encoding %d\n", type); @@ -599,3 +608,50 @@ u32 aarch64_insn_gen_add_sub_imm(enum aarch64_insn_register dst, return aarch64_insn_encode_immediate(AARCH64_INSN_IMM_12, insn, imm); } + +u32 aarch64_insn_gen_bitfield(enum aarch64_insn_register dst, + enum aarch64_insn_register src, + int immr, int imms, + enum aarch64_insn_variant variant, + enum aarch64_insn_bitfield_type type) +{ + u32 insn; + u32 mask; + + switch (type) { + case AARCH64_INSN_BITFIELD_MOVE: + insn = aarch64_insn_get_bfm_value(); + break; + case AARCH64_INSN_BITFIELD_MOVE_UNSIGNED: + insn = aarch64_insn_get_ubfm_value(); + break; + case AARCH64_INSN_BITFIELD_MOVE_SIGNED: + insn = aarch64_insn_get_sbfm_value(); + break; + default: + BUG_ON(1); + } + + switch (variant) { + case AARCH64_INSN_VARIANT_32BIT: + mask = GENMASK(4, 0); + break; + case AARCH64_INSN_VARIANT_64BIT: + insn |= AARCH64_INSN_SF_BIT | AARCH64_INSN_N_BIT; + mask = GENMASK(5, 0); + break; + default: + BUG_ON(1); + } + + BUG_ON(immr & ~mask); + BUG_ON(imms & ~mask); + + insn = aarch64_insn_encode_register(AARCH64_INSN_REGTYPE_RD, insn, dst); + + insn = aarch64_insn_encode_register(AARCH64_INSN_REGTYPE_RN, insn, src); + + insn = aarch64_insn_encode_immediate(AARCH64_INSN_IMM_R, insn, immr); + + return aarch64_insn_encode_immediate(AARCH64_INSN_IMM_S, insn, imms); +} -- 1.9.1 --
[PATCHv2 02/14] arm64: introduce aarch64_insn_gen_branch_reg()
Introduce function to generate unconditional branch (register) instructions. Signed-off-by: Zi Shen Lim Acked-by: Will Deacon --- arch/arm64/include/asm/insn.h | 7 +++ arch/arm64/kernel/insn.c | 35 +-- 2 files changed, 40 insertions(+), 2 deletions(-) diff --git a/arch/arm64/include/asm/insn.h b/arch/arm64/include/asm/insn.h index a98c495..5080962 100644 --- a/arch/arm64/include/asm/insn.h +++ b/arch/arm64/include/asm/insn.h @@ -71,6 +71,7 @@ enum aarch64_insn_imm_type { enum aarch64_insn_register_type { AARCH64_INSN_REGTYPE_RT, + AARCH64_INSN_REGTYPE_RN, }; enum aarch64_insn_register { @@ -119,6 +120,7 @@ enum aarch64_insn_variant { enum aarch64_insn_branch_type { AARCH64_INSN_BRANCH_NOLINK, AARCH64_INSN_BRANCH_LINK, + AARCH64_INSN_BRANCH_RETURN, AARCH64_INSN_BRANCH_COMP_ZERO, AARCH64_INSN_BRANCH_COMP_NONZERO, }; @@ -138,6 +140,9 @@ __AARCH64_INSN_FUNCS(hvc, 0xFFE0001F, 0xD402) __AARCH64_INSN_FUNCS(smc, 0xFFE0001F, 0xD403) __AARCH64_INSN_FUNCS(brk, 0xFFE0001F, 0xD420) __AARCH64_INSN_FUNCS(hint, 0xF01F, 0xD503201F) +__AARCH64_INSN_FUNCS(br, 0xFC1F, 0xD61F) +__AARCH64_INSN_FUNCS(blr, 0xFC1F, 0xD63F) +__AARCH64_INSN_FUNCS(ret, 0xFC1F, 0xD65F) #undef __AARCH64_INSN_FUNCS @@ -156,6 +161,8 @@ u32 aarch64_insn_gen_comp_branch_imm(unsigned long pc, unsigned long addr, enum aarch64_insn_branch_type type); u32 aarch64_insn_gen_hint(enum aarch64_insn_hint_op op); u32 aarch64_insn_gen_nop(void); +u32 aarch64_insn_gen_branch_reg(enum aarch64_insn_register reg, + enum aarch64_insn_branch_type type); bool aarch64_insn_hotpatch_safe(u32 old_insn, u32 new_insn); diff --git a/arch/arm64/kernel/insn.c b/arch/arm64/kernel/insn.c index d9f7827..6797936 100644 --- a/arch/arm64/kernel/insn.c +++ b/arch/arm64/kernel/insn.c @@ -283,6 +283,9 @@ static u32 aarch64_insn_encode_register(enum aarch64_insn_register_type type, case AARCH64_INSN_REGTYPE_RT: shift = 0; break; + case AARCH64_INSN_REGTYPE_RN: + shift = 5; + break; default: pr_err("%s: unknown register type encoding %d\n", __func__, 
type); @@ -325,10 +328,16 @@ u32 __kprobes aarch64_insn_gen_branch_imm(unsigned long pc, unsigned long addr, */ offset = branch_imm_common(pc, addr, SZ_128M); - if (type == AARCH64_INSN_BRANCH_LINK) + switch (type) { + case AARCH64_INSN_BRANCH_LINK: insn = aarch64_insn_get_bl_value(); - else + break; + case AARCH64_INSN_BRANCH_NOLINK: insn = aarch64_insn_get_b_value(); + break; + default: + BUG_ON(1); + } return aarch64_insn_encode_immediate(AARCH64_INSN_IMM_26, insn, offset >> 2); @@ -380,3 +389,25 @@ u32 __kprobes aarch64_insn_gen_nop(void) { return aarch64_insn_gen_hint(AARCH64_INSN_HINT_NOP); } + +u32 aarch64_insn_gen_branch_reg(enum aarch64_insn_register reg, + enum aarch64_insn_branch_type type) +{ + u32 insn; + + switch (type) { + case AARCH64_INSN_BRANCH_NOLINK: + insn = aarch64_insn_get_br_value(); + break; + case AARCH64_INSN_BRANCH_LINK: + insn = aarch64_insn_get_blr_value(); + break; + case AARCH64_INSN_BRANCH_RETURN: + insn = aarch64_insn_get_ret_value(); + break; + default: + BUG_ON(1); + } + + return aarch64_insn_encode_register(AARCH64_INSN_REGTYPE_RN, insn, reg); +} -- 1.9.1 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] mmc: dw_mmc: Pass back errors from mmc_of_parse()
Looks good to me. Acked-by: Jaehoon Chung Best Regards, Jaehoon Chung On 08/26/2014 03:19 AM, Doug Anderson wrote: > It's possible that mmc_of_parse() could return errors (possibly in > some future version it might return -EPROBE_DEFER even). Let's pass > those errors back. > > Signed-off-by: Doug Anderson > --- > drivers/mmc/host/dw_mmc.c | 10 ++ > 1 file changed, 6 insertions(+), 4 deletions(-) > > diff --git a/drivers/mmc/host/dw_mmc.c b/drivers/mmc/host/dw_mmc.c > index 7f227e9..9ef4df0 100644 > --- a/drivers/mmc/host/dw_mmc.c > +++ b/drivers/mmc/host/dw_mmc.c > @@ -2131,7 +2131,9 @@ static int dw_mci_init_slot(struct dw_mci *host, > unsigned int id) > if (host->pdata->caps2) > mmc->caps2 = host->pdata->caps2; > > - mmc_of_parse(mmc); > + ret = mmc_of_parse(mmc); > + if (ret) > + goto err_host_allocated; > > if (host->pdata->blk_settings) { > mmc->max_segs = host->pdata->blk_settings->max_segs; > @@ -2163,7 +2165,7 @@ static int dw_mci_init_slot(struct dw_mci *host, > unsigned int id) > > ret = mmc_add_host(mmc); > if (ret) > - goto err_setup_bus; > + goto err_host_allocated; > > #if defined(CONFIG_DEBUG_FS) > dw_mci_init_debugfs(slot); > @@ -2174,9 +2176,9 @@ static int dw_mci_init_slot(struct dw_mci *host, > unsigned int id) > > return 0; > > -err_setup_bus: > +err_host_allocated: > mmc_free_host(mmc); > - return -EINVAL; > + return ret; > } > > static void dw_mci_cleanup_slot(struct dw_mci_slot *slot, unsigned int id)
Re: [Xen-devel] [PATCH 1/3] x86: Make page cache mode a real type
On 08/26/2014 09:44 PM, Toshi Kani wrote: On Tue, 2014-08-26 at 08:16 +0200, Juergen Gross wrote: At the moment there are a lot of places that handle setting or getting the page cache mode by treating the pgprot bits equal to the cache mode. This is only true because there are a lot of assumptions about the setup of the PAT MSR. Otherwise the cache type needs to get translated into pgprot bits and vice versa. This patch tries to prepare for that by introducing a separate type for the cache mode and adding functions to translate between those and pgprot values. To avoid too much performance penalty the translation between cache mode and pgprot values is done via tables which contain the relevant information. Write-back cache mode is hard-wired to be 0, all other modes are configurable via those tables. For large pages there are translation functions as the PAT bit is located at different positions in the ptes of 4k and large pages. Signed-off-by: Stefan Bader Signed-off-by: Juergen Gross Hi Juergen, Thanks for the updates! A few comments below... @@ -73,6 +73,9 @@ void *kmap_atomic_prot_pfn(unsigned long pfn, pgprot_t prot) /* * Map 'pfn' using protections 'prot' */ +#define __PAGE_KERNEL_WC (__PAGE_KERNEL | \ +cachemode2protval(_PAGE_CACHE_MODE_WC)) + void __iomem * iomap_atomic_prot_pfn(unsigned long pfn, pgprot_t prot) { @@ -82,12 +85,14 @@ iomap_atomic_prot_pfn(unsigned long pfn, pgprot_t prot) * MTRR is UC or WC. UC_MINUS gets the real intention, of the * user, which is "WC if the MTRR is WC, UC if you can't do that." */ - if (!pat_enabled && pgprot_val(prot) == pgprot_val(PAGE_KERNEL_WC)) - prot = PAGE_KERNEL_UC_MINUS; + if (!pat_enabled && pgprot_val(prot) == __PAGE_KERNEL_WC) + prot = __pgprot(__PAGE_KERNEL | + protval_pagemode(_PAGE_CACHE_MODE_UC_MINUS)); protval_pagemode() should be cachemode2protval(). Obviously, yes.
/* diff --git a/drivers/video/fbdev/vermilion/vermilion.c b/drivers/video/fbdev/vermilion/vermilion.c index 048a666..6bbc559 100644 --- a/drivers/video/fbdev/vermilion/vermilion.c +++ b/drivers/video/fbdev/vermilion/vermilion.c @@ -1004,13 +1004,15 @@ static int vmlfb_mmap(struct fb_info *info, struct vm_area_struct *vma) struct vml_info *vinfo = container_of(info, struct vml_info, info); unsigned long offset = vma->vm_pgoff << PAGE_SHIFT; int ret; + unsigned long prot; ret = vmlfb_vram_offset(vinfo, offset); if (ret) return -EINVAL; - pgprot_val(vma->vm_page_prot) |= _PAGE_PCD; - pgprot_val(vma->vm_page_prot) &= ~_PAGE_PWT; + prot = pgprot_val(vma->vm_page_prot) & ~_PAGE_CACHE_MASK; + pgprot_val(vma->vm_page_prot) = + prot | cachemode2protval(_PAGE_CACHE_MODE_UC); This cache mode should be _PAGE_CACHE_MODE_UC_MINUS as the original code only sets the PCD bit. I'll change it. Thanks, Juergen -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 1/3] Adding Skyworks SKY81452 MFD driver
On Tue, Aug 26, 2014 at 09:22:58AM +0100, Lee Jones wrote: > On Mon, 25 Aug 2014, Gyungoh Yoo wrote: > > On Thu, Aug 21, 2014 at 10:45:02AM +0100, Lee Jones wrote: > > > When you send patch-sets, you should send them connected to one > > > another AKA threaded. That way, when we're reviewing we can look at > > > the other patches in the set for reference. See the man page for `git > > > send-email` for details. > > > > > > > > > > > > > Signed-off-by: Gyungoh Yoo > > > > --- > > [...] > > > > > +static int sky81452_register_devices(struct device *dev, > > > > + const struct sky81452_platform_data *pdata) > > > > +{ > > > > + struct mfd_cell cells[] = { > > > > + { > > > > + .name = "sky81452-bl", > > > > + .platform_data = pdata->bl_pdata, > > > > + .pdata_size = sizeof(*pdata->bl_pdata), > > > > > > Have you tested this with DT? > > > > > > You're not passing the compatible string and not using > > > of_platform_populate() so I'm struggling to see how it would work > > > properly. > > > > sky81452-bl and regulator-sky81452 is parsing the information > > in regulator node of its parent node. So I thought these 2 drivers > > don't need compatible attribute. That is what it didn't have > > compatible string. > > Is is mandatory that all drivers should have compatible attribute? > > How do they obtain their DT nodes? The backlight driver which is one of the child driver is obtain its DT node like this np = of_get_child_by_name(dev->parent->of_node, "backlight"); > > [...] > > > > > + return mfd_add_devices(dev, -1, cells, ARRAY_SIZE(cells), > > > > + NULL, 0, NULL); > > > > > > This doesn't really need to be in a function of its own. Please put > > > it in .probe(). Also check for the return value and present the user > > > with an error message if it fails. > > > > I think this need to be, in case of !CONFIG_OF. > > Can you please explain more in details? > > Then how to you obtain the shared register map you created? regmap is stored in driver data in MFD. 
i2c_set_clientdata(client, regmap); The child drivers obtain the regmap from the parent. struct regmap *regmap = dev_get_drvdata(dev->parent); > > [...] > > -- > Lee Jones > Linaro STMicroelectronics Landing Team Lead > Linaro.org │ Open source software for ARM SoCs > Follow Linaro: Facebook | Twitter | Blog
Re: [PATCH RFC v7 net-next 00/28] BPF syscall
On Aug 26, 2014 7:29 PM, "Alexei Starovoitov" wrote: > > Hi Ingo, David, > > posting whole thing again as RFC to get feedback on syscall only. > If syscall bpf(int cmd, union bpf_attr *attr, unsigned int size) is ok, > I'll split them into small chunks as requested and will repost without RFC. IMO it's much easier to review a syscall if we just look at a specification of what it does. The code is, in some sense, secondary. --Andy
Re: [PATCH net v3 4/4] tg3: Fix tx_pending checks for tg3_tso_bug
> static inline bool tg3_maybe_stop_txq(struct tg3_napi *tnapi, > struct netdev_queue *txq, > @@ -7841,14 +7847,16 @@ static inline bool tg3_maybe_stop_txq(struct tg3_napi > *tnapi, > if (!netif_tx_queue_stopped(txq)) { > stopped = true; > netif_tx_stop_queue(txq); > - BUG_ON(wakeup_thresh >= tnapi->tx_pending); > + tnapi->wakeup_thresh = wakeup_thresh; > + BUG_ON(tnapi->wakeup_thresh >= tnapi->tx_pending); > } > /* netif_tx_stop_queue() must be done before checking tx index >* in tg3_tx_avail(), because in tg3_tx(), we update tx index > - * before checking for netif_tx_queue_stopped(). > + * before checking for netif_tx_queue_stopped(). The memory > + * barrier also synchronizes wakeup_thresh changes. >*/ > smp_mb(); > - if (tg3_tx_avail(tnapi) > wakeup_thresh) > + if (tg3_tx_avail(tnapi) > tnapi->wakeup_thresh) > netif_tx_wake_queue(txq); you can add a comment here... stopped is not set to false even if queue wakes up, to log the netdev_err "BUG! TX Ring.." message. > } > return stopped; > @@ -7861,10 +7869,10 @@ static int tg3_tso_bug(struct tg3 *tp, struct > tg3_napi *tnapi, > struct netdev_queue *txq, struct sk_buff *skb) > @@ -12318,9 +12354,7 @@ static int tg3_set_ringparam(struct net_device *dev, > struct ethtool_ringparam *e > if ((ering->rx_pending > tp->rx_std_ring_mask) || > (ering->rx_jumbo_pending > tp->rx_jmb_ring_mask) || > (ering->tx_pending > TG3_TX_RING_SIZE - 1) || > - (ering->tx_pending <= MAX_SKB_FRAGS + 1) || > - (tg3_flag(tp, TSO_BUG) && > - (ering->tx_pending <= (MAX_SKB_FRAGS * 3 > + (ering->tx_pending <= MAX_SKB_FRAGS + 1)) > return -EINVAL; > > if (netif_running(dev)) { > @@ -12340,6 +12374,7 @@ static int tg3_set_ringparam(struct net_device *dev, > struct ethtool_ringparam *e > if (tg3_flag(tp, JUMBO_RING_ENABLE)) > tp->rx_jumbo_pending = ering->rx_jumbo_pending; > > + dev->gso_max_segs = TG3_TX_SEG_PER_DESC(ering->tx_pending - 1); Assuming a LSO skb of 64k size takes the tg3_tso_bug() code path, if the available TX descriptors is <= 
135 assuming gso_segs is 45 for this skb based on the estimate 45 * 3 driver would stop this TX queue and set the tnapi->wakeup_thresh to 135 and return NETDEV_TX_BUSY. This skb will be queued to be resent when the queue wakes up. Meanwhile if the user changes the TX ring size tx_pending=135, dev->gso_max_segs is modified accordingly to 44, the LSO skb which was queued will now be GSO'ed (in net/dev.c) before calling tg3_start_xmit(). To note tg3_tx() cannot wake the queue as it is expecting to be woken up when available free TX descriptors is 136. So we end up with HW TX ring empty and not able to send any pkts. > for (i = 0; i < tp->irq_max; i++) > tp->napi[i].tx_pending = ering->tx_pending; > > @@ -17816,6 +17851,7 @@ static int tg3_init_one(struct pci_dev *pdev, > else -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: kernel: signal: NULL ptr deref when killing process
On 08/21/2014 01:17 PM, Oleg Nesterov wrote: >> Is there a race between kill() and exit() brought on by the kill path only >> > using the RCU read lock? This doesn't prevent ->real_cred from being >> > modified, but it looks like this should, in combination with >> > delayed_put_task_struct(), prevent it from being cleared. > Yes, rcu should protect us from both delayed_put_pid() and delayed_put_task(). > Everything looks correct... And there are a lot of other similar users of > find_vpid/find_task_by_vpid/pid_task/etc under rcu, I can't recall any bug > in this area. > > I am puzzled. Note also that ->signal == NULL. Will try to think more, > but so far I have no any idea. I've hit something similar earlier today, and it might be related: [ 973.452840] BUG: unable to handle kernel NULL pointer dereference at 02b0 [ 973.455347] IP: flush_sigqueue_mask (include/linux/signal.h:118 kernel/signal.c:715) [ 973.457526] PGD 4dfdc7067 PUD 5f77d9067 PMD 0 [ 973.459216] Oops: [#1] PREEMPT SMP DEBUG_PAGEALLOC [ 973.460086] Dumping ftrace buffer: [ 973.460086](ftrace buffer empty) [ 973.460086] Modules linked in: [ 973.460086] CPU: 4 PID: 13145 Comm: trinity-c767 Not tainted 3.17.0-rc2-next-20140826-sasha-00031-gc48c9ac-dirty #1079 [ 973.460086] task: 88060480 ti: 880586648000 task.ti: 880586648000 [ 973.460086] RIP: flush_sigqueue_mask (include/linux/signal.h:118 kernel/signal.c:715) [ 973.460086] RSP: 0018:88058664bec8 EFLAGS: 00010046 [ 973.460086] RAX: RBX: f730 RCX: 0001 [ 973.460086] RDX: RSI: 02a0 RDI: 88058664bed8 [ 973.460086] RBP: 88058664bf10 R08: 0001 R09: 0001 [ 973.460086] R10: 0002d201 R11: 0254 R12: [ 973.460086] R13: 88058664bf40 R14: 88060480 R15: 0010 [ 973.460086] FS: 7fe3a3045700() GS:880277c0() knlGS: [ 973.460086] CS: 0010 DS: ES: CR0: 8005003b [ 973.460086] CR2: 02b0 CR3: 0004e23d5000 CR4: 06a0 [ 973.460086] Stack: [ 973.460086] ac183690 01017fffb3247180 0001 [ 973.460086] 7fffb3247180 7fffb3247220 0011 [ 973.460086] 88058664bf78 ac183ef5 [ 973.460086] 
Call Trace: [ 973.460086] ? do_sigaction (kernel/signal.c:3124 (discriminator 17)) [ 973.460086] SyS_rt_sigaction (kernel/signal.c:3360 kernel/signal.c:3341) [ 973.460086] tracesys (arch/x86/kernel/entry_64.S:542) [ 973.460086] Code: b7 49 09 d5 4d 89 6e 10 48 83 c4 08 5b 41 5c 41 5d 41 5e 41 5f 5d c3 66 2e 0f 1f 84 00 00 00 00 00 66 66 66 66 90 48 8b 0f 31 c0 <48> 8b 56 10 48 85 ca 74 7b 55 48 f7 d1 48 89 e5 41 56 48 21 ca All code 0: b7 49 mov$0x49,%bh 2: 09 d5 or %edx,%ebp 4: 4d 89 6e 10 mov%r13,0x10(%r14) 8: 48 83 c4 08 add$0x8,%rsp c: 5b pop%rbx d: 41 5c pop%r12 f: 41 5d pop%r13 11: 41 5e pop%r14 13: 41 5f pop%r15 15: 5d pop%rbp 16: c3 retq 17: 66 2e 0f 1f 84 00 00nopw %cs:0x0(%rax,%rax,1) 1e: 00 00 00 21: 66 66 66 66 90 data32 data32 data32 xchg %ax,%ax 26: 48 8b 0fmov(%rdi),%rcx 29: 31 c0 xor%eax,%eax 2b:* 48 8b 56 10 mov0x10(%rsi),%rdx <-- trapping instruction 2f: 48 85 catest %rcx,%rdx 32: 74 7b je 0xaf 34: 55 push %rbp 35: 48 f7 d1not%rcx 38: 48 89 e5mov%rsp,%rbp 3b: 41 56 push %r14 3d: 48 21 caand%rcx,%rdx ... Code starting with the faulting instruction === 0: 48 8b 56 10 mov0x10(%rsi),%rdx 4: 48 85 catest %rcx,%rdx 7: 74 7b je 0x84 9: 55 push %rbp a: 48 f7 d1not%rcx d: 48 89 e5mov%rsp,%rbp 10: 41 56 push %r14 12: 48 21 caand%rcx,%rdx ... [ 973.460086] RIP flush_sigqueue_mask (include/linux/signal.h:118 kernel/signal.c:715) [ 973.460086] RSP [ 973.460086] CR2: 02b0 Thanks, Sasha -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] memory-hotplug: fix not enough check of valid_zones
(2014/08/27 10:55), Zhang Zhen wrote: On 2014/8/26 18:23, Yasuaki Ishimatsu wrote: (2014/08/26 18:57), Zhang Zhen wrote: As Yasuaki Ishimatsu described the check here is not enough if memory has hole as follows: PFN 0x00 0xd0 0xe0 0xf0 +-+-+-+ zone type | Normal| hole| Normal| +-+-+-+ In this case, the check can't guarantee that this is "the last block of memory". The check of ZONE_MOVABLE has the same problem. Change the interface name to valid_zones according to most pepole's suggestion. Sample output of the sysfs files: memory0/valid_zones: none memory1/valid_zones: DMA32 memory2/valid_zones: DMA32 memory3/valid_zones: DMA32 memory4/valid_zones: Normal memory5/valid_zones: Normal memory6/valid_zones: Normal Movable memory7/valid_zones: Movable Normal memory8/valid_zones: Movable The patch has two changes: - change sysfs interface name - change check of ZONE_MOVABLE So please separate them. Ok, i will separate them. Thanks! Signed-off-by: Zhang Zhen --- Documentation/ABI/testing/sysfs-devices-memory | 8 ++--- Documentation/memory-hotplug.txt | 4 +-- drivers/base/memory.c | 42 ++ 3 files changed, 15 insertions(+), 39 deletions(-) diff --git a/Documentation/ABI/testing/sysfs-devices-memory b/Documentation/ABI/testing/sysfs-devices-memory index 2b2a1d7..deef3b5 100644 --- a/Documentation/ABI/testing/sysfs-devices-memory +++ b/Documentation/ABI/testing/sysfs-devices-memory @@ -61,13 +61,13 @@ Users:hotplug memory remove tools http://www.ibm.com/developerworks/wikis/display/LinuxP/powerpc-utils -What: /sys/devices/system/memory/memoryX/zones_online_to +What: /sys/devices/system/memory/memoryX/valid_zones Date: July 2014 Contact:Zhang Zhen Description: -The file /sys/devices/system/memory/memoryX/zones_online_to -is read-only and is designed to show which zone this memory block can -be onlined to. +The file /sys/devices/system/memory/memoryX/valid_zonesis +read-only and is designed to show which zone this memory +block can be onlined to. 
What:/sys/devices/system/memoryX/nodeY Date:October 2009 diff --git a/Documentation/memory-hotplug.txt b/Documentation/memory-hotplug.txt index 5b34e33..947229c 100644 --- a/Documentation/memory-hotplug.txt +++ b/Documentation/memory-hotplug.txt @@ -155,7 +155,7 @@ Under each memory block, you can see 4 files: /sys/devices/system/memory/memoryXXX/phys_device /sys/devices/system/memory/memoryXXX/state /sys/devices/system/memory/memoryXXX/removable -/sys/devices/system/memory/memoryXXX/zones_online_to +/sys/devices/system/memory/memoryXXX/valid_zones 'phys_index' : read-only and contains memory block id, same as XXX. 'state' : read-write @@ -171,7 +171,7 @@ Under each memory block, you can see 4 files: block is removable and a value of 0 indicates that it is not removable. A memory block is removable only if every section in the block is removable. -'zones_online_to' : read-only: designed to show which zone this memory block +'valid_zones' : read-only: designed to show which zone this memory block can be onlined to. NOTE: diff --git a/drivers/base/memory.c b/drivers/base/memory.c index ccaf37c..efd456c 100644 --- a/drivers/base/memory.c +++ b/drivers/base/memory.c @@ -374,21 +374,7 @@ static ssize_t show_phys_device(struct device *dev, } #ifdef CONFIG_MEMORY_HOTREMOVE -static int __zones_online_to(unsigned long end_pfn, -struct page *first_page, unsigned long nr_pages) -{ -struct zone *zone_next; - -/* The mem block is the last block of memory. 
*/ -if (!pfn_valid(end_pfn + 1)) -return 1; -zone_next = page_zone(first_page + nr_pages); -if (zone_idx(zone_next) == ZONE_MOVABLE) -return 1; -return 0; -} - -static ssize_t show_zones_online_to(struct device *dev, +static ssize_t show_valid_zones(struct device *dev, struct device_attribute *attr, char *buf) { struct memory_block *mem = to_memory_block(dev); @@ -407,33 +393,23 @@ static ssize_t show_zones_online_to(struct device *dev, zone = page_zone(first_page); -#ifdef CONFIG_HIGHMEM -if (zone_idx(zone) == ZONE_HIGHMEM) { -if (__zones_online_to(end_pfn, first_page, nr_pages)) +if (zone_idx(zone) == ZONE_MOVABLE - 1) { +/*The mem block is the last memoryblock of this zone.*/ +if (end_pfn == zone_end_pfn(zone)) return sprintf(buf, "%s %s\n", zone->name, (zone + 1)->name); } -#else -if (zone_idx(zone) ==
Re: [PATCH] random: add and use memzero_explicit() for clearing data
On Tue, Aug 26, 2014 at 01:11:30AM +0200, Hannes Frederic Sowa wrote: > On Mo, 2014-08-25 at 22:01 +0200, Daniel Borkmann wrote: > > zatimend has reported that in his environment (3.16/gcc4.8.3/corei7) > > memset() calls which clear out sensitive data in extract_{buf,entropy, > > entropy_user}() in random driver are being optimized away by gcc. > > > > Add a helper memzero_explicit() (similarly as explicit_bzero() variants) > > that can be used in such cases where a variable with sensitive data is > > being cleared out in the end. Other use cases might also be in crypto > > code. [ I have put this into lib/string.c though, as it's always built-in > > and doesn't need any dependencies then. ] > > > > Fixes kernel bugzilla: 82041 > > > > Reported-by: zatim...@hotmail.co.uk > > Signed-off-by: Daniel Borkmann > > Cc: Hannes Frederic Sowa > > Cc: Alexey Dobriyan > > Acked-by: Hannes Frederic Sowa Applied to the random tree, thanks. - Ted
Re: [PATCH 00/16] rcu: Some minor fixes and cleanups
On Tue, Aug 26, 2014 at 09:10:10PM -0400, Pranith Kumar wrote: > On Wed, Jul 23, 2014 at 10:45 AM, Paul E. McKenney > wrote: > > On Wed, Jul 23, 2014 at 01:09:37AM -0400, Pranith Kumar wrote: > >> Hi Paul, > >> > >> This is a series of minor fixes and cleanup patches which I found while > >> studying > >> the code. All my previous pending (but not rejected ;) patches are > >> superseded by > >> this series, expect the rcutorture snprintf changes. I am still waiting > >> for you > >> to decide on that one :) > >> > >> These changes have been tested by the kvm rcutorture test setup. Some > >> tests give > >> me stall warnings, but otherwise have SUCCESS messages in the logs. > > > > For patches 1, 3, 5, 8, 12, and 13, once you get a Reviewed-by from one > > of the co-maintainers or designated reviewers, I will queue them. > > The other patches I have responded to. > > Hi Paul, just a reminder so that these don't get forgotten :) Hello, Pranith, haven't forgotten them, but also haven't seen any reviews. 
Thanx, Paul > >> Pranith Kumar (16): > >> rcu: Use rcu_num_nodes instead of NUM_RCU_NODES > >> rcu: Check return value for cpumask allocation > >> rcu: Fix comment for gp_state field values > >> rcu: Remove redundant check for an online CPU > >> rcu: Add noreturn attribute to boost kthread > >> rcu: Clear gp_flags only when actually starting new gp > >> rcu: Save and restore irq flags in rcu_gp_cleanup() > >> rcu: Clean up rcu_spawn_one_boost_kthread() > >> rcu: Remove redundant check for online cpu > >> rcu: Check for RCU_FLAG_GP_INIT bit in gp_flags for spurious wakeup > >> rcu: Check for spurious wakeup using return value > >> rcu: Rename rcu_spawn_gp_kthread() to rcu_spawn_kthreads() > >> rcu: Spawn nocb kthreads from rcu_prepare_kthreads() > >> rcu: Remove redundant checks for rcu_scheduler_fully_active > >> rcu: Check for a nocb cpu before trying to spawn nocb threads > >> rcu: kvm.sh: Fix error when you pass --cpus argument > >> > >> kernel/rcu/tree.c | 42 > >> ++- > >> kernel/rcu/tree.h | 4 +-- > >> kernel/rcu/tree_plugin.h | 40 > >> + > >> tools/testing/selftests/rcutorture/bin/kvm.sh | 4 +-- > >> 4 files changed, 47 insertions(+), 43 deletions(-) > >> > >> -- > >> 2.0.0.rc2 > >> > > > > > > -- > Pranith > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: mm: BUG in unmap_page_range
On 08/11/2014 11:28 PM, Sasha Levin wrote: > On 08/05/2014 09:04 PM, Sasha Levin wrote: >> > Thanks Hugh, Mel. I've added both patches to my local tree and will update >> > tomorrow >> > with the weather. >> > >> > Also: >> > >> > On 08/05/2014 08:42 PM, Hugh Dickins wrote: >>> >> One thing I did wonder, though: at first I was reassured by the >>> >> VM_BUG_ON(!pte_present(pte)) you add to pte_mknuma(); but then thought >>> >> it would be better as VM_BUG_ON(!(val & _PAGE_PRESENT)), being stronger >>> >> - asserting that indeed we do not put NUMA hints on PROT_NONE areas. >>> >> (But I have not tested, perhaps such a VM_BUG_ON would actually fire.) >> > >> > I've added VM_BUG_ON(!(val & _PAGE_PRESENT)) in just as a curiosity, I'll >> > update how that one looks as well. > Sorry for the rather long delay. > > The patch looks fine, the issue didn't reproduce. > > The added VM_BUG_ON didn't trigger either, so maybe we should consider adding > it in. It took a while, but I've managed to hit that VM_BUG_ON: [ 707.975456] kernel BUG at include/asm-generic/pgtable.h:724! 
[ 707.977147] invalid opcode: [#1] PREEMPT SMP DEBUG_PAGEALLOC [ 707.978974] Dumping ftrace buffer: [ 707.980110](ftrace buffer empty) [ 707.981221] Modules linked in: [ 707.982312] CPU: 18 PID: 9488 Comm: trinity-c538 Not tainted 3.17.0-rc2-next-20140826-sasha-00031-gc48c9ac-dirty #1079 [ 707.982801] task: 880165e28000 ti: 880165e3 task.ti: 880165e3 [ 707.982801] RIP: 0010:[] [] change_protection_range+0x94a/0x970 [ 707.982801] RSP: 0018:880165e33d98 EFLAGS: 00010246 [ 707.982801] RAX: 9d340902 RBX: 880511204a08 RCX: 0100 [ 707.982801] RDX: 9d340902 RSI: 41741000 RDI: 9d340902 [ 707.982801] RBP: 880165e33e88 R08: 880708a23c00 R09: 00b52000 [ 707.982801] R10: 1e01 R11: 0008 R12: 41751000 [ 707.982801] R13: 00f7 R14: 9d340902 R15: 41741000 [ 707.982801] FS: 7f358a9aa700() GS:88071c60() knlGS: [ 707.982801] CS: 0010 DS: ES: CR0: 8005003b [ 707.982801] CR2: 7f3586b69490 CR3: 000165d88000 CR4: 06a0 [ 707.982801] Stack: [ 707.982801] 8804db88d058 88070fb17cf0 [ 707.982801] 880165d88000 8801686a5000 4163e000 [ 707.982801] 8801686a5000 0001 0025 41750fff [ 707.982801] Call Trace: [ 707.982801] [] change_protection+0x14/0x30 [ 707.982801] [] change_prot_numa+0x1b/0x40 [ 707.982801] [] task_numa_work+0x1f6/0x330 [ 707.982801] [] task_work_run+0xc4/0xf0 [ 707.982801] [] do_notify_resume+0x97/0xb0 [ 707.982801] [] int_signal+0x12/0x17 [ 707.982801] Code: e8 2c 84 21 03 e9 72 ff ff ff 0f 1f 80 00 00 00 00 0f 0b 48 8b 7d a8 4c 89 f2 4c 89 fe e8 9f 7b 03 00 e9 47 f9 ff ff 0f 0b 0f 0b <0f> 0b 0f 0b 48 8b b5 70 ff ff ff 4c 89 ea 48 89 c7 e8 10 d5 01 [ 707.982801] RIP [] change_protection_range+0x94a/0x970 [ 707.982801] RSP Thanks, Sasha -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH v4] zram: add num_{discard_req, discarded} for discard stat
Since we have supported handling discard request in this commit f4659d8e620d08bd1a84a8aec5d2f5294a242764 (zram: support REQ_DISCARD), zram got one more chance to free unused memory whenever received discard request. But without stating for discard request, there is no method for user to know whether discard request has been handled by zram or how many blocks were discarded by zram when user wants to know the effect of discard. In this patch, we add num_discard_req to stat discard request and add num_discarded to stat real discarded blocks, and export them to sysfs for users. * From v1 * Update zram document to show num_discards in statistics list. * From v2 * Update description of this patch with clear goal. * From v3 * Stat discard request and discarded pages separately as "previous stat indicates lots of free page discarded without real freeing, so the stat makes our user's misunderstanding" pointed out by Minchan Kim. Signed-off-by: Chao Yu --- Documentation/ABI/testing/sysfs-block-zram | 17 + Documentation/blockdev/zram.txt| 2 ++ drivers/block/zram/zram_drv.c | 17 ++--- drivers/block/zram/zram_drv.h | 2 ++ 4 files changed, 35 insertions(+), 3 deletions(-) diff --git a/Documentation/ABI/testing/sysfs-block-zram b/Documentation/ABI/testing/sysfs-block-zram index 70ec992..805fb11 100644 --- a/Documentation/ABI/testing/sysfs-block-zram +++ b/Documentation/ABI/testing/sysfs-block-zram @@ -57,6 +57,23 @@ Description: The failed_writes file is read-only and specifies the number of failed writes happened on this device. +What: /sys/block/zram/num_discard_req +Date: August 2014 +Contact: Chao Yu +Description: + The num_discard_req file is read-only and specifies the number + of requests received by this device. These requests are sent by + swap layer or filesystem when they want to free blocks which are + no longer used. 
+ +What: /sys/block/zram/num_discarded +Date: August 2014 +Contact: Chao Yu +Description: + The num_discarded file is read-only and specifies the number of + real discarded blocks (pages which are really freed) in this + device after discard request is sent to this device. + What: /sys/block/zram/max_comp_streams Date: February 2014 Contact: Sergey Senozhatsky diff --git a/Documentation/blockdev/zram.txt b/Documentation/blockdev/zram.txt index 0595c3f..f9c1e41 100644 --- a/Documentation/blockdev/zram.txt +++ b/Documentation/blockdev/zram.txt @@ -89,6 +89,8 @@ size of the disk when not in use so a huge zram is wasteful. num_writes failed_reads failed_writes + num_discard_req + num_discarded invalid_io notify_free zero_pages diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c index d00831c..1d012e8 100644 --- a/drivers/block/zram/zram_drv.c +++ b/drivers/block/zram/zram_drv.c @@ -322,7 +322,7 @@ static void handle_zero_page(struct bio_vec *bvec) * caller should hold this table index entry's bit_spinlock to * indicate this index entry is accessing. 
*/ -static void zram_free_page(struct zram *zram, size_t index) +static bool zram_free_page(struct zram *zram, size_t index) { struct zram_meta *meta = zram->meta; unsigned long handle = meta->table[index].handle; @@ -336,7 +336,7 @@ static void zram_free_page(struct zram *zram, size_t index) zram_clear_flag(meta, index, ZRAM_ZERO); atomic64_dec(&zram->stats.zero_pages); } - return; + return false; } zs_free(meta->mem_pool, handle); @@ -347,6 +347,7 @@ static void zram_free_page(struct zram *zram, size_t index) meta->table[index].handle = 0; zram_set_obj_size(meta, index, 0); + return true; } static int zram_decompress_page(struct zram *zram, char *mem, u32 index) @@ -603,12 +604,18 @@ static void zram_bio_discard(struct zram *zram, u32 index, } while (n >= PAGE_SIZE) { + bool discarded; + bit_spin_lock(ZRAM_ACCESS, &meta->table[index].value); - zram_free_page(zram, index); + discarded = zram_free_page(zram, index); bit_spin_unlock(ZRAM_ACCESS, &meta->table[index].value); + if (discarded) + atomic64_inc(&zram->stats.num_discarded); index++; n -= PAGE_SIZE; } + + atomic64_inc(&zram->stats.num_discard_req); } static void zram_reset_device(struct zram *zram, bool reset_capacity) @@ -866,6 +873,8 @@ ZRAM_ATTR_RO(num_reads); ZRAM_ATTR_RO(num_writes); ZRAM_ATTR_RO(failed_reads); ZRAM_ATTR_RO(failed_writes);
Re: [PATCH v3 3/4] thermal: add more description for thermal-zones
On 08/26/2014 08:12 PM, Eduardo Valentin wrote: > On Tue, Aug 26, 2014 at 10:17:29AM +0800, Wei Ni wrote: >> On 08/25/2014 07:07 PM, Eduardo Valentin wrote: >>> Hello Wei Ni, >>> >>> On Mon, Aug 25, 2014 at 02:29:47PM +0800, Wei Ni wrote: Add more description for the "polling-delay" property. Set "trips" and "cooling maps" as optional property, because if missing these two sub-nodes, the thermal zone device still work properly. Signed-off-by: Wei Ni --- Documentation/devicetree/bindings/thermal/thermal.txt | 10 ++ 1 file changed, 6 insertions(+), 4 deletions(-) diff --git a/Documentation/devicetree/bindings/thermal/thermal.txt b/Documentation/devicetree/bindings/thermal/thermal.txt index f5db6b7..e3d3ed9 100644 --- a/Documentation/devicetree/bindings/thermal/thermal.txt +++ b/Documentation/devicetree/bindings/thermal/thermal.txt @@ -136,8 +136,8 @@ containing trip nodes and one sub-node containing all the zone cooling maps. Required properties: - polling-delay: The maximum number of milliseconds to wait between polls - Type: unsigned when checking this thermal zone. - Size: one cell + Type: unsigned when checking this thermal zone. If this value is 0, the + Size: one cell driver will not run polling queue, but just cancel it. >>> >>> The description above is specific to Linux kernel implementation >>> nomenclature. DT description needs to be OS agnostic. >>> - polling-delay-passive: The maximum number of milliseconds to wait Type: unsigned between polls when performing passive cooling. @@ -148,14 +148,16 @@ Required properties: phandles + sensor specifier +Optional property: - trips: A sub-node which is a container of only trip point nodes Type: sub-node required to describe the thermal zone. - cooling-maps: A sub-node which is a container of only cooling device Type: sub-node map nodes, used to describe the relation between trips - and cooling devices. + and cooling devices. 
If the "trips" property is missing, + This sub-node will not be parsed, because no trips can + be bound to cooling devices. >>> >>> Do you mean if the thermal zone misses the "trips" property? Actually, >>> the binding describes both, cooling-maps and trips, as required >>> properties. Thus, both need to be in place to consider the thermal zone >>> a properly described zone. >> >> I moved the "trips" and "cooling-maps" to optional properties, because if >> these two properties are missing, the thermal zone devices can still be >> registered and the driver works properly; it has the basic function of >> reading temperature from thermal sysfs, although it doesn't have trips >> and isn't bound to cooling devices. > > > If a thermal zone is used only for monitoring, then I believe it lost > its purpose. Maybe a different framework shall be used, such as hwmon, > for instance? Yes, if we only use it for monitoring, we can use hwmon. But we have more functions based on these two thermal zone devices. We have a skin-temperature driver, which uses nct1008's remote and local temperatures to estimate the skin temperature. As you know, the thermal framework is more powerful; the remote/local sensors can be registered as thermal zones, then the skin-temp driver can use thermal_zone_get_temp() to read their temperatures and then estimate the skin's temp. We also will set trips and cooling devices for this skin-temp. Wei. > > The purpose of a thermal zone is to describe the thermal behavior of the > hardware. As it is mentioned in the thermal.txt file. > > >> >> Thanks. >> Wei. >> >>> -Optional property: - coefficients: An array of integers (one signed cell) containing Type: array coefficients to compose a linear relation between Elem size: one cell the sensors listed in the thermal-sensors property.
-- 1.8.1.5 >> > -- > To unsubscribe from this list: send the line "unsubscribe linux-tegra" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v5 3/4] zram: zram memory size limitation
Hey Joonsoo, On Wed, Aug 27, 2014 at 10:26:11AM +0900, Joonsoo Kim wrote: > Hello, Minchan and David. > > On Tue, Aug 26, 2014 at 08:22:29AM -0400, David Horner wrote: > > On Tue, Aug 26, 2014 at 3:55 AM, Minchan Kim wrote: > > > Hey Joonsoo, > > > > > > On Tue, Aug 26, 2014 at 04:37:30PM +0900, Joonsoo Kim wrote: > > >> On Mon, Aug 25, 2014 at 09:05:55AM +0900, Minchan Kim wrote: > > >> > @@ -513,6 +540,14 @@ static int zram_bvec_write(struct zram *zram, > > >> > struct bio_vec *bvec, u32 index, > > >> > ret = -ENOMEM; > > >> > goto out; > > >> > } > > >> > + > > >> > + if (zram->limit_pages && > > >> > + zs_get_total_pages(meta->mem_pool) > zram->limit_pages) { > > >> > + zs_free(meta->mem_pool, handle); > > >> > + ret = -ENOMEM; > > >> > + goto out; > > >> > + } > > >> > + > > >> > cmem = zs_map_object(meta->mem_pool, handle, ZS_MM_WO); > > >> > > >> Hello, > > >> > > >> I don't follow up previous discussion, so I could be wrong. > > >> Why this enforcement should be here? > > >> > > >> I think that this has two problems. > > >> 1) alloc/free happens unnecessarilly if we have used memory over the > > >> limitation. > > > > > > True but firstly, I implemented the logic in zsmalloc, not zram but > > > as I described in cover-letter, it's not a requirement of zsmalloc > > > but zram so it should be in there. If every user want it in future, > > > then we could move the function into zsmalloc. That's what we > > > concluded in previous discussion. > > Hmm... > Problem is that we can't avoid these unnecessary overhead in this > implementation. If we can implement this feature in zram efficiently, > it's okay. But, I think that current form isn't. If we can add it in zsmalloc, it would be more clean and efficient for zram but as I said, at the moment, I didn't want to put zram's requirement into zsmalloc because to me, it's weird to enforce max limit to allocator. It's client's role, I think. 
If the current implementation were expensive and rather hard to follow, that would be one reason to move the feature into zsmalloc, but I don't think it causes critical trouble in the zram usecase. See below. But I am still open and will wait for others' opinions. If other guys think zsmalloc is the better place, I am willing to move it into zsmalloc. > > > > > > > Another idea is we could call zs_get_total_pages right before zs_malloc > > > but the problem is we cannot know how many pages are allocated > > > by zsmalloc in advance. > > > IOW, zram should be blind to zsmalloc's internals. > > > > > > > We did however suggest that we could check beforehand to see if > > max was already exceeded as an optimization. > > (possibly with a guess on usage but at least using the minimum of 1 page) > > In the contested case, the max may already be exceeded transiently and > > therefore we know this one _could_ fail (it could also pass, but odds > > aren't good). > > As Minchan mentions this was discussed before - but not in great detail. > > Testing should be done to determine possible benefit. And as he also > > mentions, the better place for it may be in zsmalloc, but that > > requires an ABI change. > > Why do we hesitate to change the zsmalloc API? It is an in-kernel API and there > are just two users now, zswap and zram. We can change it easily. > I think that we just need the following simple API change in zsmalloc.c. > > zs_zpool_create(gfp_t gfp, struct zpool_ops *zpool_op) > => > zs_zpool_create(unsigned long limit, gfp_t gfp, struct zpool_ops > *zpool_op) > > It's a pool allocator, so there is no obstacle to limiting maximum > memory usage in zsmalloc. It's a natural idea to limit memory usage > for a pool allocator. > > > Certainly a detailed suggestion could happen on this thread and I'm > > also interested > > in your thoughts, but this patchset should be able to go in as is. > > Memory exhaustion avoidance probably trumps the possible thrashing at > > threshold.
> > > > > About the alloc/free cost when it is over the limit, > > > I don't think it's important to consider. > > > Do you have any scenario in your mind to consider alloc/free cost > > > when the limit is over? > > > > > >> 2) Even if this request doesn't do a new allocation, it could fail > > >> due to another's allocation. There is a time gap between allocation and > > >> free, so a legitimate user who wants to use preallocated zsmalloc memory > > >> could also see this condition as true and will then fail. > > > > > > Yeb, we already discussed that. :) > > > Such a false positive shouldn't be a severe problem if we can keep a > > > promise that the zram user cannot exceed mem_limit. > > > > > If we can keep such a promise, why do we need to limit memory usage? > I guess that this limit feature is useful for users who can't keep such > a promise. > So, we should assume that this false positive happens frequently. The goal is to limit memory usage within some threshold, so a false positive shouldn't be
[PATCH RFC v7 net-next 02/28] net: filter: split filter.h and expose eBPF to user space
eBPF can be used from user space. uapi/linux/bpf.h: eBPF instruction set definition linux/filter.h: the rest This patch only moves macro definitions, but practically it freezes existing eBPF instruction set, though new instructions can still be added in the future. These eBPF definitions cannot go into uapi/linux/filter.h, since the names may conflict with existing applications. Signed-off-by: Alexei Starovoitov --- include/linux/filter.h| 312 +-- include/uapi/linux/Kbuild |1 + include/uapi/linux/bpf.h | 321 + 3 files changed, 323 insertions(+), 311 deletions(-) create mode 100644 include/uapi/linux/bpf.h diff --git a/include/linux/filter.h b/include/linux/filter.h index f3262b598262..f04793474d16 100644 --- a/include/linux/filter.h +++ b/include/linux/filter.h @@ -9,322 +9,12 @@ #include #include #include - -/* Internally used and optimized filter representation with extended - * instruction set based on top of classic BPF. - */ - -/* instruction classes */ -#define BPF_ALU64 0x07/* alu mode in double word width */ - -/* ld/ldx fields */ -#define BPF_DW 0x18/* double word */ -#define BPF_XADD 0xc0/* exclusive add */ - -/* alu/jmp fields */ -#define BPF_MOV0xb0/* mov reg to reg */ -#define BPF_ARSH 0xc0/* sign extending arithmetic shift right */ - -/* change endianness of a register */ -#define BPF_END0xd0/* flags for endianness conversion: */ -#define BPF_TO_LE 0x00/* convert to little-endian */ -#define BPF_TO_BE 0x08/* convert to big-endian */ -#define BPF_FROM_LEBPF_TO_LE -#define BPF_FROM_BEBPF_TO_BE - -#define BPF_JNE0x50/* jump != */ -#define BPF_JSGT 0x60/* SGT is signed '>', GT in x86 */ -#define BPF_JSGE 0x70/* SGE is signed '>=', GE in x86 */ -#define BPF_CALL 0x80/* function call */ -#define BPF_EXIT 0x90/* function return */ - -/* Register numbers */ -enum { - BPF_REG_0 = 0, - BPF_REG_1, - BPF_REG_2, - BPF_REG_3, - BPF_REG_4, - BPF_REG_5, - BPF_REG_6, - BPF_REG_7, - BPF_REG_8, - BPF_REG_9, - BPF_REG_10, - __MAX_BPF_REG, -}; - -/* BPF has 10 general 
purpose 64-bit registers and stack frame. */ -#define MAX_BPF_REG__MAX_BPF_REG - -/* ArgX, context and stack frame pointer register positions. Note, - * Arg1, Arg2, Arg3, etc are used as argument mappings of function - * calls in BPF_CALL instruction. - */ -#define BPF_REG_ARG1 BPF_REG_1 -#define BPF_REG_ARG2 BPF_REG_2 -#define BPF_REG_ARG3 BPF_REG_3 -#define BPF_REG_ARG4 BPF_REG_4 -#define BPF_REG_ARG5 BPF_REG_5 -#define BPF_REG_CTXBPF_REG_6 -#define BPF_REG_FP BPF_REG_10 - -/* Additional register mappings for converted user programs. */ -#define BPF_REG_A BPF_REG_0 -#define BPF_REG_X BPF_REG_7 -#define BPF_REG_TMPBPF_REG_8 - -/* BPF program can access up to 512 bytes of stack space. */ -#define MAX_BPF_STACK 512 - -/* Helper macros for filter block array initializers. */ - -/* ALU ops on registers, bpf_add|sub|...: dst_reg += src_reg */ - -#define BPF_ALU64_REG(OP, DST, SRC)\ - ((struct bpf_insn) {\ - .code = BPF_ALU64 | BPF_OP(OP) | BPF_X,\ - .dst_reg = DST, \ - .src_reg = SRC, \ - .off = 0, \ - .imm = 0 }) - -#define BPF_ALU32_REG(OP, DST, SRC)\ - ((struct bpf_insn) {\ - .code = BPF_ALU | BPF_OP(OP) | BPF_X, \ - .dst_reg = DST, \ - .src_reg = SRC, \ - .off = 0, \ - .imm = 0 }) - -/* ALU ops on immediates, bpf_add|sub|...: dst_reg += imm32 */ - -#define BPF_ALU64_IMM(OP, DST, IMM)\ - ((struct bpf_insn) {\ - .code = BPF_ALU64 | BPF_OP(OP) | BPF_K,\ - .dst_reg = DST, \ - .src_reg = 0, \ - .off = 0, \ - .imm = IMM }) - -#define BPF_ALU32_IMM(OP, DST, IMM)\ - ((struct bpf_insn) {\ - .code = BPF_ALU | BPF_OP(OP) | BPF_K, \ - .dst_reg = DST, \ - .src_reg = 0, \ - .off = 0, \ -
[PATCH V2 2/3] perf tools: parse the pmu event prefix and suffix
From: Kan Liang There are two types of event formats for PMU events. E.g. el-abort OR cpu/el-abort/. However, the lexer mistakenly recognizes the simple style format as two events. The parse_events_pmu_check function uses bsearch to search for the name in the known pmu event list. It can tell the lexer whether the name is a PE_NAME, a PMU event name prefix, or a PMU event name suffix. All this information will be used to accurately parse kernel PMU events. The pmu events list will be read from sysfs at runtime. Signed-off-by: Kan Liang --- V2: Read kernel PMU events from sysfs at runtime tools/perf/util/parse-events.c | 103 + tools/perf/util/parse-events.h | 15 ++ tools/perf/util/pmu.c | 10 tools/perf/util/pmu.h | 10 4 files changed, 128 insertions(+), 10 deletions(-) diff --git a/tools/perf/util/parse-events.c b/tools/perf/util/parse-events.c index 7a0aa75..5e69e65 100644 --- a/tools/perf/util/parse-events.c +++ b/tools/perf/util/parse-events.c @@ -29,6 +29,9 @@ extern int parse_events_debug; #endif int parse_events_parse(void *data, void *scanner); +static struct kernel_pmu_event_symbol *kernel_pmu_events_list; +static size_t kernel_pmu_events_list_num; + static struct event_symbol event_symbols_hw[PERF_COUNT_HW_MAX] = { [PERF_COUNT_HW_CPU_CYCLES] = { .symbol = "cpu-cycles", @@ -852,6 +855,103 @@ int parse_events_name(struct list_head *list, char *name) return 0; } +static int +comp_pmu(const void *p1, const void *p2) +{ + struct kernel_pmu_event_symbol *pmu1 = + (struct kernel_pmu_event_symbol *) p1; + struct kernel_pmu_event_symbol *pmu2 = + (struct kernel_pmu_event_symbol *) p2; + + return strcmp(pmu1->symbol, pmu2->symbol); +} + +enum kernel_pmu_event_type +parse_events_pmu_check(const char *name) +{ + struct kernel_pmu_event_symbol p, *r; + + /* +* name "cpu" could be prefix of cpu-cycles or cpu// events.
+* cpu-cycles has been handled by hardcode. +* So it must be cpu// events, not kernel pmu event. +*/ + if (!kernel_pmu_events_list_num || !strcmp(name, "cpu")) + return NONE_KERNEL_PMU_EVENT; + + strcpy(p.symbol, name); + r = bsearch(&p, kernel_pmu_events_list, + kernel_pmu_events_list_num, + sizeof(struct kernel_pmu_event_symbol), comp_pmu); + if (r == NULL) + return NONE_KERNEL_PMU_EVENT; + return r->type; +} + +/* + * Read the pmu events list from sysfs + * Save it into kernel_pmu_events_list + */ +static void scan_kernel_pmu_events_list(void) +{ + + struct perf_pmu *pmu = NULL; + struct perf_pmu_alias *alias; + int len = 0; + + while ((pmu = perf_pmu__scan(pmu)) != NULL) + list_for_each_entry(alias, &pmu->aliases, list) { + if (!strcmp(pmu->name, "cpu")) { + if (strchr(alias->name, '-')) + len++; + len++; + } + } + if (len == 0) + return; + kernel_pmu_events_list = + malloc(sizeof(struct kernel_pmu_event_symbol) * len); + kernel_pmu_events_list_num = len; + + pmu = NULL; + len = 0; + while ((pmu = perf_pmu__scan(pmu)) != NULL) + list_for_each_entry(alias, &pmu->aliases, list) { + if (!strcmp(pmu->name, "cpu")) { + struct kernel_pmu_event_symbol *p = + kernel_pmu_events_list + len; + char *tmp = strchr(alias->name, '-'); + + if (tmp != NULL) { + strncpy(p->symbol, alias->name, + tmp - alias->name); + p->type = KERNEL_PMU_EVENT_PREFIX; + tmp++; + p++; + strcpy(p->symbol, tmp); + p->type = KERNEL_PMU_EVENT_SUFFIX; + len += 2; + } else { + strcpy(p->symbol, alias->name); + p->type = KERNEL_PMU_EVENT; + len++; + } + } + } + qsort(kernel_pmu_events_list, len, + sizeof(struct kernel_pmu_event_symbol), comp_pmu); + +} + +static void release_kernel_pmu_events_list(void) +{ + if (kernel_pmu_events_list) { +
[PATCH V2 1/3] Revert "perf tools: Default to cpu// for events v5"
From: Kan Liang This reverts commit 50e200f07948 ("perf tools: Default to cpu// for events v5") The fixup cannot handle the case where the new style format (which is without //) is mixed with other different formats. For example, group events with new style format: {mem-stores,mem-loads} some hardware event + new style event: cycles,mem-loads Cache event + new style event: LLC-loads,mem-loads Raw event + new style event: cpu/event=0xc8,umask=0x08/,mem-loads old style event and new style mixture: mem-stores,cpu/mem-loads/ Signed-off-by: Kan Liang --- tools/perf/util/include/linux/string.h | 1 - tools/perf/util/parse-events.c | 30 +- tools/perf/util/string.c | 24 3 files changed, 1 insertion(+), 54 deletions(-) diff --git a/tools/perf/util/include/linux/string.h b/tools/perf/util/include/linux/string.h index 97a8007..6f19c54 100644 --- a/tools/perf/util/include/linux/string.h +++ b/tools/perf/util/include/linux/string.h @@ -1,4 +1,3 @@ #include <string.h> void *memdup(const void *src, size_t len); -int str_append(char **s, int *len, const char *a); diff --git a/tools/perf/util/parse-events.c b/tools/perf/util/parse-events.c index 1e15df1..7a0aa75 100644 --- a/tools/perf/util/parse-events.c +++ b/tools/perf/util/parse-events.c @@ -6,7 +6,7 @@ #include "parse-options.h" #include "parse-events.h" #include "exec_cmd.h" -#include "linux/string.h" +#include "string.h" #include "symbol.h" #include "cache.h" #include "header.h" @@ -852,32 +852,6 @@ int parse_events_name(struct list_head *list, char *name) return 0; } -static int parse_events__scanner(const char *str, void *data, int start_token); - -static int parse_events_fixup(int ret, const char *str, void *data, - int start_token) -{ - char *o = strdup(str); - char *s = NULL; - char *t = o; - char *p; - int len = 0; - - if (!o) - return ret; - while ((p = strsep(&t, ",")) != NULL) { - if (s) - str_append(&s, &len, ","); - str_append(&s, &len, "cpu/"); - str_append(&s, &len, p); - str_append(&s, &len, "/"); - } - free(o); - if (!s) - return -ENOMEM; - return
parse_events__scanner(s, data, start_token); -} - static int parse_events__scanner(const char *str, void *data, int start_token) { YY_BUFFER_STATE buffer; @@ -898,8 +872,6 @@ static int parse_events__scanner(const char *str, void *data, int start_token) parse_events__flush_buffer(buffer, scanner); parse_events__delete_buffer(buffer, scanner); parse_events_lex_destroy(scanner); - if (ret && !strchr(str, '/')) - ret = parse_events_fixup(ret, str, data, start_token); return ret; } diff --git a/tools/perf/util/string.c b/tools/perf/util/string.c index 2553e5b..4b0ff22 100644 --- a/tools/perf/util/string.c +++ b/tools/perf/util/string.c @@ -387,27 +387,3 @@ void *memdup(const void *src, size_t len) return p; } - -/** - * str_append - reallocate string and append another - * @s: pointer to string pointer - * @len: pointer to len (initialized) - * @a: string to append. - */ -int str_append(char **s, int *len, const char *a) -{ - int olen = *s ? strlen(*s) : 0; - int nlen = olen + strlen(a) + 1; - if (*len < nlen) { - *len = *len * 2; - if (*len < nlen) - *len = nlen; - *s = realloc(*s, *len); - if (!*s) - return -ENOMEM; - if (olen == 0) - **s = 0; - } - strcat(*s, a); - return 0; -} -- 1.8.3.2 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH V2 3/3] perf tools: Add support to new style format of kernel PMU event
From: Kan Liang Add new rules for kernel PMU event. event_pmu: PE_KERNEL_PMU_EVENT | PE_PMU_EVENT_PRE '-' PE_PMU_EVENT_SUF PE_KERNEL_PMU_EVENT token is for cycles-ct/cycles-t/mem-loads/mem-stores. The prefix cycles is mixed up with cpu-cycles. loads and stores are mixed up with cache event So they have to be hardcode in lex. PE_PMU_EVENT_PRE and PE_PMU_EVENT_SUF tokens are for other PMU events. The lex looks generic identifier up in the table and return the matched token. If there is no match, generic PE_NAME token will be return. Using the rules, kernel PMU event could use new style format without // so you can use perf record -e mem-loads ... instead of perf record -e cpu/mem-loads/ Signed-off-by: Kan Liang --- tools/perf/util/parse-events.l | 30 +- tools/perf/util/parse-events.y | 42 ++ 2 files changed, 71 insertions(+), 1 deletion(-) diff --git a/tools/perf/util/parse-events.l b/tools/perf/util/parse-events.l index 3432995..4dd7f04 100644 --- a/tools/perf/util/parse-events.l +++ b/tools/perf/util/parse-events.l @@ -51,6 +51,24 @@ static int str(yyscan_t scanner, int token) return token; } +static int pmu_str_check(yyscan_t scanner) +{ + YYSTYPE *yylval = parse_events_get_lval(scanner); + char *text = parse_events_get_text(scanner); + + yylval->str = strdup(text); + switch (parse_events_pmu_check(text)) { + case KERNEL_PMU_EVENT_PREFIX: + return PE_PMU_EVENT_PRE; + case KERNEL_PMU_EVENT_SUFFIX: + return PE_PMU_EVENT_SUF; + case KERNEL_PMU_EVENT: + return PE_KERNEL_PMU_EVENT; + default: + return PE_NAME; + } +} + static int sym(yyscan_t scanner, int type, int config) { YYSTYPE *yylval = parse_events_get_lval(scanner); @@ -178,6 +196,16 @@ alignment-faults { return sym(yyscanner, PERF_TYPE_SOFTWARE, PERF_COUNT_SW_AL emulation-faults { return sym(yyscanner, PERF_TYPE_SOFTWARE, PERF_COUNT_SW_EMULATION_FAULTS); } dummy { return sym(yyscanner, PERF_TYPE_SOFTWARE, PERF_COUNT_SW_DUMMY); } + /* +* We have to handle the kernel PMU event 
cycles-ct/cycles-t/mem-loads/mem-stores separately. +* Because the prefix cycles is mixed up with cpu-cycles. +* loads and stores are mixed up with cache event +*/ +cycles-ct { return str(yyscanner, PE_KERNEL_PMU_EVENT); } +cycles-t { return str(yyscanner, PE_KERNEL_PMU_EVENT); } +mem-loads { return str(yyscanner, PE_KERNEL_PMU_EVENT); } +mem-stores { return str(yyscanner, PE_KERNEL_PMU_EVENT); } + L1-dcache|l1-d|l1d|L1-data | L1-icache|l1-i|l1i|L1-instruction | LLC|L2 | @@ -199,7 +227,7 @@ r{num_raw_hex} { return raw(yyscanner); } {num_hex} { return value(yyscanner, 16); } {modifier_event} { return str(yyscanner, PE_MODIFIER_EVENT); } -{name} { return str(yyscanner, PE_NAME); } +{name} { return pmu_str_check(yyscanner); } "/"{ BEGIN(config); return '/'; } - { return '-'; } , { BEGIN(event); return ','; } diff --git a/tools/perf/util/parse-events.y b/tools/perf/util/parse-events.y index 0bc87ba..77e01e5 100644 --- a/tools/perf/util/parse-events.y +++ b/tools/perf/util/parse-events.y @@ -47,6 +47,7 @@ static inc_group_count(struct list_head *list, %token PE_NAME_CACHE_TYPE PE_NAME_CACHE_OP_RESULT %token PE_PREFIX_MEM PE_PREFIX_RAW PE_PREFIX_GROUP %token PE_ERROR +%token PE_PMU_EVENT_PRE PE_PMU_EVENT_SUF PE_KERNEL_PMU_EVENT %type PE_VALUE %type PE_VALUE_SYM_HW %type PE_VALUE_SYM_SW @@ -58,6 +59,7 @@ static inc_group_count(struct list_head *list, %type PE_MODIFIER_EVENT %type PE_MODIFIER_BP %type PE_EVENT_NAME +%type PE_PMU_EVENT_PRE PE_PMU_EVENT_SUF PE_KERNEL_PMU_EVENT %type value_sym %type event_config %type event_term @@ -210,6 +212,46 @@ PE_NAME '/' event_config '/' parse_events__free_terms($3); $$ = list; } +| +PE_KERNEL_PMU_EVENT +{ + struct parse_events_evlist *data = _data; + struct list_head *head = malloc(sizeof(*head)); + struct parse_events_term *term; + struct list_head *list; + + ABORT_ON(parse_events_term__num(, PARSE_EVENTS__TERM_TYPE_USER, + $1, 1)); + ABORT_ON(!head); + INIT_LIST_HEAD(head); + list_add_tail(>list, head); + + ALLOC_LIST(list); + 
ABORT_ON(parse_events_add_pmu(list, &data->idx, "cpu", head)); + parse_events__free_terms(head); + $$ = list; +} +| +PE_PMU_EVENT_PRE '-'
Re: [PATCH v1 1/1] power: Add simple gpio-restart driver
On 08/26/2014 04:45 PM, David Riley wrote: This driver registers a restart handler to set a GPIO line high/low to reset a board based on devicetree bindings. Signed-off-by: David Riley --- .../devicetree/bindings/gpio/gpio-restart.txt | 48 +++ drivers/power/reset/Kconfig| 8 ++ drivers/power/reset/Makefile | 1 + drivers/power/reset/gpio-restart.c | 142 + 4 files changed, 199 insertions(+) create mode 100644 Documentation/devicetree/bindings/gpio/gpio-restart.txt create mode 100644 drivers/power/reset/gpio-restart.c diff --git a/Documentation/devicetree/bindings/gpio/gpio-restart.txt b/Documentation/devicetree/bindings/gpio/gpio-restart.txt new file mode 100644 index 000..7cd58788 --- /dev/null +++ b/Documentation/devicetree/bindings/gpio/gpio-restart.txt @@ -0,0 +1,48 @@ +Drive a GPIO line that can be used to restart the system as a +restart handler. + +The driver supports both level triggered and edge triggered power off. +At driver load time, the driver will request the given gpio line and +install a restart handler. If the optional property 'input' is +not found, the GPIO line will be driven in the inactive state. +Otherwise it is configured as an input. + +When do_kernel_restart is called the various restart handlers will be tried +in order. The gpio is configured as an output, and driven active, so +triggering a level triggered power off condition. This will also cause an +inactive->active edge condition, so triggering positive edge triggered +power off. After a delay of 100ms, the GPIO is set to inactive, thus +causing an active->inactive edge, triggering negative edge triggered power +off. After another 100ms delay the GPIO is driven active again. If the +power is still on and the CPU still running after a 3000ms delay, a +WARN_ON(1) is emitted. + +Required properties: +- compatible : should be "gpio-restart". +- gpios : The GPIO to set high/low, see "gpios property" in + Documentation/devicetree/bindings/gpio/gpio.txt.
If the pin should be + low to power down the board set it to "Active Low", otherwise set + gpio to "Active High". + +Optional properties: +- input : Initially configure the GPIO line as an input. Only reconfigure + it to an output when the machine_restart function is called. If this optional + property is not specified, the GPIO is initialized as an output in its + inactive state. Maybe describe this as open source ? +- priority : A priority ranging from 0 to 255 (default 128) according to + the following guidelines: + 0: Restart handler of last resort, with limited restart + capabilities + 128:Default restart handler; use if no other restart handler is + expected to be available, and/or if restart functionality is + sufficient to restart the entire system + 255:Highest priority restart handler, will preempt all other + restart handlers + +Examples: + +gpio-restart { + compatible = "gpio-restart"; + gpios = < 4 0>; + priority = /bits/ 8 <200>; +}; diff --git a/drivers/power/reset/Kconfig b/drivers/power/reset/Kconfig index ca41523..f07e26c 100644 --- a/drivers/power/reset/Kconfig +++ b/drivers/power/reset/Kconfig @@ -39,6 +39,14 @@ config POWER_RESET_GPIO If your board needs a GPIO high/low to power down, say Y and create a binding in your devicetree. +config POWER_RESET_GPIO_RESTART + bool "GPIO restart driver" + depends on OF_GPIO && POWER_RESET + help + This driver supports restarting your board via a GPIO line. + If your board needs a GPIO high/low to restart, say Y and + create a binding in your devicetree. 
+ config POWER_RESET_HISI bool "Hisilicon power-off driver" depends on POWER_RESET && ARCH_HISI diff --git a/drivers/power/reset/Makefile b/drivers/power/reset/Makefile index a42e70e..199cb6e 100644 --- a/drivers/power/reset/Makefile +++ b/drivers/power/reset/Makefile @@ -2,6 +2,7 @@ obj-$(CONFIG_POWER_RESET_AS3722) += as3722-poweroff.o obj-$(CONFIG_POWER_RESET_AXXIA) += axxia-reset.o obj-$(CONFIG_POWER_RESET_BRCMSTB) += brcmstb-reboot.o obj-$(CONFIG_POWER_RESET_GPIO) += gpio-poweroff.o +obj-$(CONFIG_POWER_RESET_GPIO_RESTART) += gpio-restart.o obj-$(CONFIG_POWER_RESET_HISI) += hisi-reboot.o obj-$(CONFIG_POWER_RESET_MSM) += msm-poweroff.o obj-$(CONFIG_POWER_RESET_QNAP) += qnap-poweroff.o diff --git a/drivers/power/reset/gpio-restart.c b/drivers/power/reset/gpio-restart.c new file mode 100644 index 000..2cbff64 --- /dev/null +++ b/drivers/power/reset/gpio-restart.c @@ -0,0 +1,142 @@ +/* + * Toggles a GPIO pin to restart a device + * + * Copyright (C) 2014 Google, Inc. + * + * This software is licensed under the terms of the GNU General Public + * License version 2, as published by the Free Software Foundation, and + * may be copied, distributed, and
[PATCH RFC v7 net-next 01/28] net: filter: add "load 64-bit immediate" eBPF instruction
add BPF_LD_IMM64 instruction to load 64-bit immediate value into a register. All previous instructions were 8-byte. This is first 16-byte instruction. Two consecutive 'struct bpf_insn' blocks are interpreted as single instruction: insn[0].code = BPF_LD | BPF_DW | BPF_IMM insn[0].dst_reg = destination register insn[0].imm = lower 32-bit insn[1].code = 0 insn[1].imm = upper 32-bit All unused fields must be zero. Classic BPF has similar instruction: BPF_LD | BPF_W | BPF_IMM which loads 32-bit immediate value into a register. x64 JITs it as single 'movabsq %rax, imm64' arm64 may JIT as sequence of four 'movk x0, #imm16, lsl #shift' insn Note that old eBPF programs are binary compatible with new interpreter. In the following patches this instruction is used to store eBPF map pointers: BPF_LD_IMM64(R1, const_imm_map_ptr) BPF_CALL(BPF_FUNC_map_lookup_elem) and verifier is introduced to check validity of the programs. Later LLVM compiler is using this insn as generic load of 64-bit immediate constant and as a load of map pointer with relocation. Signed-off-by: Alexei Starovoitov --- Documentation/networking/filter.txt |8 +++- arch/x86/net/bpf_jit_comp.c | 17 + include/linux/filter.h | 18 ++ kernel/bpf/core.c |5 + lib/test_bpf.c | 21 + 5 files changed, 68 insertions(+), 1 deletion(-) diff --git a/Documentation/networking/filter.txt b/Documentation/networking/filter.txt index c48a9704bda8..81916ab5d96f 100644 --- a/Documentation/networking/filter.txt +++ b/Documentation/networking/filter.txt @@ -951,7 +951,7 @@ Size modifier is one of ... Mode modifier is one of: - BPF_IMM 0x00 /* classic BPF only, reserved in eBPF */ + BPF_IMM 0x00 /* used for 32-bit mov in classic BPF and 64-bit in eBPF */ BPF_ABS 0x20 BPF_IND 0x40 BPF_MEM 0x60 @@ -995,6 +995,12 @@ BPF_XADD | BPF_DW | BPF_STX: lock xadd *(u64 *)(dst_reg + off16) += src_reg Where size is one of: BPF_B or BPF_H or BPF_W or BPF_DW. Note that 1 and 2 byte atomic increments are not supported. 
+eBPF has one 16-byte instruction: BPF_LD | BPF_DW | BPF_IMM which consists +of two consecutive 'struct bpf_insn' 8-byte blocks and interpreted as single +instruction that loads 64-bit immediate value into a dst_reg. +Classic BPF has similar instruction: BPF_LD | BPF_W | BPF_IMM which loads +32-bit immediate value into a register. + Testing --- diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c index b08a98c59530..98837147ee57 100644 --- a/arch/x86/net/bpf_jit_comp.c +++ b/arch/x86/net/bpf_jit_comp.c @@ -393,6 +393,23 @@ static int do_jit(struct bpf_prog *bpf_prog, int *addrs, u8 *image, EMIT1_off32(add_1reg(0xB8, dst_reg), imm32); break; + case BPF_LD | BPF_IMM | BPF_DW: + if (insn[1].code != 0 || insn[1].src_reg != 0 || + insn[1].dst_reg != 0 || insn[1].off != 0) { + /* verifier must catch invalid insns */ + pr_err("invalid BPF_LD_IMM64 insn\n"); + return -EINVAL; + } + + /* movabsq %rax, imm64 */ + EMIT2(add_1mod(0x48, dst_reg), add_1reg(0xB8, dst_reg)); + EMIT(insn[0].imm, 4); + EMIT(insn[1].imm, 4); + + insn++; + i++; + break; + /* dst %= src, dst /= src, dst %= imm32, dst /= imm32 */ case BPF_ALU | BPF_MOD | BPF_X: case BPF_ALU | BPF_DIV | BPF_X: diff --git a/include/linux/filter.h b/include/linux/filter.h index a5227ab8ccb1..f3262b598262 100644 --- a/include/linux/filter.h +++ b/include/linux/filter.h @@ -161,6 +161,24 @@ enum { .off = 0, \ .imm = IMM }) +/* BPF_LD_IMM64 macro encodes single 'load 64-bit immediate' insn */ +#define BPF_LD_IMM64(DST, IMM) \ + BPF_LD_IMM64_RAW(DST, 0, IMM) + +#define BPF_LD_IMM64_RAW(DST, SRC, IMM)\ + ((struct bpf_insn) {\ + .code = BPF_LD | BPF_DW | BPF_IMM, \ + .dst_reg = DST, \ + .src_reg = SRC, \ + .off = 0, \ + .imm = (__u32) (IMM) }), \ + ((struct bpf_insn) {\ + .code = 0, /* zero is reserved opcode */ \ + .dst_reg = 0, \ + .src_reg = 0, \ + .off =
[PATCH RFC v7 net-next 06/28] bpf: add hashtable type of BPF maps
add new map type BPF_MAP_TYPE_HASH and its implementation - key/value are opaque range of bytes - user space provides 3 configuration attributes via BPF syscall: key_size, value_size, max_entries - if value_size == 0, the map is used as a set - map_update_elem() must fail to insert new element when max_entries limit is reached - map takes care of allocating/freeing key/value pairs - update/lookup/delete methods may be called from eBPF program attached to kprobes, so use spin_lock_irqsave() mechanism for concurrent updates - optimized for speed of lookup() which can be called multiple times from eBPF program which itself is triggered by high volume of events - in the future JIT compiler may recognize lookup() call and optimize it further, since key_size is constant for life of eBPF program Signed-off-by: Alexei Starovoitov --- include/uapi/linux/bpf.h |1 + kernel/bpf/Makefile |2 +- kernel/bpf/hashtab.c | 365 ++ 3 files changed, 367 insertions(+), 1 deletion(-) create mode 100644 kernel/bpf/hashtab.c diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index 1602de6423b5..ad0a5a495ec3 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -359,6 +359,7 @@ enum bpf_cmd { enum bpf_map_type { BPF_MAP_TYPE_UNSPEC, + BPF_MAP_TYPE_HASH, }; union bpf_attr { diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile index e9f7334ed07a..558e12712ebc 100644 --- a/kernel/bpf/Makefile +++ b/kernel/bpf/Makefile @@ -1 +1 @@ -obj-y := core.o syscall.o +obj-y := core.o syscall.o hashtab.o diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c new file mode 100644 index ..4d131c86821c --- /dev/null +++ b/kernel/bpf/hashtab.c @@ -0,0 +1,365 @@ +/* Copyright (c) 2011-2014 PLUMgrid, http://plumgrid.com + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of version 2 of the GNU General Public + * License as published by the Free Software Foundation. 
+ * + * This program is distributed in the hope that it will be useful, but + * WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + * General Public License for more details. + */ +#include +#include +#include + +struct bpf_htab { + struct bpf_map map; + struct hlist_head *buckets; + struct kmem_cache *elem_cache; + spinlock_t lock; + u32 count; /* number of elements in this hashtable */ + u32 n_buckets; /* number of hash buckets */ + u32 elem_size; /* size of each element in bytes */ +}; + +/* each htab element is struct htab_elem + key + value */ +struct htab_elem { + struct hlist_node hash_node; + struct rcu_head rcu; + struct bpf_htab *htab; + u32 hash; + u32 pad; + char key[0]; +}; + +#define BPF_MAP_MAX_KEY_SIZE 256 +static struct bpf_map *htab_map_alloc(union bpf_attr *attr) +{ + struct bpf_htab *htab; + int err, i; + + htab = kzalloc(sizeof(*htab), GFP_USER); + if (!htab) + return ERR_PTR(-ENOMEM); + + /* mandatory map attributes */ + htab->map.key_size = attr->key_size; + htab->map.value_size = attr->value_size; + htab->map.max_entries = attr->max_entries; + + /* check sanity of attributes. 
+* value_size == 0 is allowed, in this case map is used as a set +*/ + err = -EINVAL; + if (htab->map.max_entries == 0 || htab->map.key_size == 0) + goto free_htab; + + /* hash table size must be power of 2 */ + htab->n_buckets = roundup_pow_of_two(htab->map.max_entries); + + err = -E2BIG; + if (htab->map.key_size > BPF_MAP_MAX_KEY_SIZE) + goto free_htab; + + err = -ENOMEM; + htab->buckets = kmalloc_array(htab->n_buckets, + sizeof(struct hlist_head), GFP_USER); + + if (!htab->buckets) + goto free_htab; + + for (i = 0; i < htab->n_buckets; i++) + INIT_HLIST_HEAD(&htab->buckets[i]); + + spin_lock_init(&htab->lock); + htab->count = 0; + + htab->elem_size = sizeof(struct htab_elem) + + round_up(htab->map.key_size, 8) + + htab->map.value_size; + + htab->elem_cache = kmem_cache_create("bpf_htab", htab->elem_size, 0, 0, +NULL); + if (!htab->elem_cache) + goto free_buckets; + + return &htab->map; + +free_buckets: + kfree(htab->buckets); +free_htab: + kfree(htab); + return ERR_PTR(err); +} + +static inline u32 htab_map_hash(const void *key, u32 key_len) +{ + return jhash(key, key_len, 0); +} + +static inline struct hlist_head *select_bucket(struct bpf_htab *htab, u32 hash) +{ + return &htab->buckets[hash & (htab->n_buckets - 1)]; +} + +static struct htab_elem
[PATCH RFC v7 net-next 04/28] bpf: enable bpf syscall on x64 and i386
done as separate commit to ease conflict resolution Signed-off-by: Alexei Starovoitov --- arch/x86/syscalls/syscall_32.tbl |1 + arch/x86/syscalls/syscall_64.tbl |1 + include/linux/syscalls.h |3 ++- include/uapi/asm-generic/unistd.h |4 +++- kernel/sys_ni.c |3 +++ 5 files changed, 10 insertions(+), 2 deletions(-) diff --git a/arch/x86/syscalls/syscall_32.tbl b/arch/x86/syscalls/syscall_32.tbl index 028b78168d85..9fe1b5d002f0 100644 --- a/arch/x86/syscalls/syscall_32.tbl +++ b/arch/x86/syscalls/syscall_32.tbl @@ -363,3 +363,4 @@ 354i386seccomp sys_seccomp 355i386getrandom sys_getrandom 356i386memfd_createsys_memfd_create +357i386bpf sys_bpf diff --git a/arch/x86/syscalls/syscall_64.tbl b/arch/x86/syscalls/syscall_64.tbl index 35dd922727b9..281150b539a2 100644 --- a/arch/x86/syscalls/syscall_64.tbl +++ b/arch/x86/syscalls/syscall_64.tbl @@ -327,6 +327,7 @@ 318common getrandom sys_getrandom 319common memfd_createsys_memfd_create 320common kexec_file_load sys_kexec_file_load +321common bpf sys_bpf # # x32-specific system call numbers start at 512 to avoid cache impact diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h index 0f86d85a9ce4..bda9b81357cc 100644 --- a/include/linux/syscalls.h +++ b/include/linux/syscalls.h @@ -65,6 +65,7 @@ struct old_linux_dirent; struct perf_event_attr; struct file_handle; struct sigaltstack; +union bpf_attr; #include #include @@ -875,5 +876,5 @@ asmlinkage long sys_seccomp(unsigned int op, unsigned int flags, const char __user *uargs); asmlinkage long sys_getrandom(char __user *buf, size_t count, unsigned int flags); - +asmlinkage long sys_bpf(int cmd, union bpf_attr *attr, unsigned int size); #endif diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h index 11d11bc5c78f..22749c134117 100644 --- a/include/uapi/asm-generic/unistd.h +++ b/include/uapi/asm-generic/unistd.h @@ -705,9 +705,11 @@ __SYSCALL(__NR_seccomp, sys_seccomp) __SYSCALL(__NR_getrandom, sys_getrandom) #define __NR_memfd_create 279 
__SYSCALL(__NR_memfd_create, sys_memfd_create) +#define __NR_bpf 280 +__SYSCALL(__NR_bpf, sys_bpf) #undef __NR_syscalls -#define __NR_syscalls 280 +#define __NR_syscalls 281 /* * All syscalls below here should go away really, diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c index 391d4ddb6f4b..b4b5083f5f5e 100644 --- a/kernel/sys_ni.c +++ b/kernel/sys_ni.c @@ -218,3 +218,6 @@ cond_syscall(sys_kcmp); /* operate on Secure Computing state */ cond_syscall(sys_seccomp); + +/* access BPF programs and maps */ +cond_syscall(sys_bpf); -- 1.7.9.5
[PATCH RFC v7 net-next 10/28] bpf: verifier (add ability to receive verification log)
add optional attributes for BPF_PROG_LOAD syscall: struct { ... __u32 log_level;/* verbosity level of eBPF verifier */ __u32 log_size; /* size of user buffer */ void __user *log_buf; /* user supplied buffer */ }; In such case the verifier will return its verification log in the user supplied buffer which can be used by humans to analyze why verifier rejected given program Signed-off-by: Alexei Starovoitov --- include/uapi/linux/bpf.h |5 +- kernel/bpf/verifier.c| 235 ++ 2 files changed, 239 insertions(+), 1 deletion(-) diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index ac272bd7a884..a6fa0416f2bd 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -400,7 +400,10 @@ union bpf_attr { __u32 insn_cnt; const struct bpf_insn __user *insns; const char __user *license; -#defineBPF_PROG_LOAD_LAST_FIELD license + __u32 log_level;/* verbosity level of eBPF verifier */ + __u32 log_size; /* size of user buffer */ + void __user *log_buf; /* user supplied buffer */ +#defineBPF_PROG_LOAD_LAST_FIELD log_buf }; }; diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index 3d22b19c5fe0..81a64a50e48d 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -144,9 +144,244 @@ * load/store to bpf_context are checked against known fields */ +/* single container for all structs + * one verifier_env per bpf_check() call + */ +struct verifier_env { +}; + +/* verbose verifier prints what it's seeing + * bpf_check() is called under lock, so no race to access these global vars + */ +static u32 log_level, log_size, log_len; +static void *log_buf; + +static DEFINE_MUTEX(bpf_verifier_lock); + +/* log_level controls verbosity level of eBPF verifier. + * verbose() is used to dump the verification trace to the log, so the user + * can figure out what's wrong with the program + */ +static void verbose(const char *fmt, ...) 
+{ + va_list args; + + if (log_level == 0 || log_len >= log_size - 1) + return; + + va_start(args, fmt); + log_len += vscnprintf(log_buf + log_len, log_size - log_len, fmt, args); + va_end(args); +} + +static const char *const bpf_class_string[] = { + [BPF_LD]= "ld", + [BPF_LDX] = "ldx", + [BPF_ST]= "st", + [BPF_STX] = "stx", + [BPF_ALU] = "alu", + [BPF_JMP] = "jmp", + [BPF_RET] = "BUG", + [BPF_ALU64] = "alu64", +}; + +static const char *const bpf_alu_string[] = { + [BPF_ADD >> 4] = "+=", + [BPF_SUB >> 4] = "-=", + [BPF_MUL >> 4] = "*=", + [BPF_DIV >> 4] = "/=", + [BPF_OR >> 4] = "|=", + [BPF_AND >> 4] = "&=", + [BPF_LSH >> 4] = "<<=", + [BPF_RSH >> 4] = ">>=", + [BPF_NEG >> 4] = "neg", + [BPF_MOD >> 4] = "%=", + [BPF_XOR >> 4] = "^=", + [BPF_MOV >> 4] = "=", + [BPF_ARSH >> 4] = "s>>=", + [BPF_END >> 4] = "endian", +}; + +static const char *const bpf_ldst_string[] = { + [BPF_W >> 3] = "u32", + [BPF_H >> 3] = "u16", + [BPF_B >> 3] = "u8", + [BPF_DW >> 3] = "u64", +}; + +static const char *const bpf_jmp_string[] = { + [BPF_JA >> 4] = "jmp", + [BPF_JEQ >> 4] = "==", + [BPF_JGT >> 4] = ">", + [BPF_JGE >> 4] = ">=", + [BPF_JSET >> 4] = "&", + [BPF_JNE >> 4] = "!=", + [BPF_JSGT >> 4] = "s>", + [BPF_JSGE >> 4] = "s>=", + [BPF_CALL >> 4] = "call", + [BPF_EXIT >> 4] = "exit", +}; + +static void print_bpf_insn(struct bpf_insn *insn) +{ + u8 class = BPF_CLASS(insn->code); + + if (class == BPF_ALU || class == BPF_ALU64) { + if (BPF_SRC(insn->code) == BPF_X) + verbose("(%02x) %sr%d %s %sr%d\n", + insn->code, class == BPF_ALU ? "(u32) " : "", + insn->dst_reg, + bpf_alu_string[BPF_OP(insn->code) >> 4], + class == BPF_ALU ? "(u32) " : "", + insn->src_reg); + else + verbose("(%02x) %sr%d %s %s%d\n", + insn->code, class == BPF_ALU ? "(u32) " : "", + insn->dst_reg, + bpf_alu_string[BPF_OP(insn->code) >> 4], + class == BPF_ALU ? 
"(u32) " : "", + insn->imm); + } else if (class == BPF_STX) { + if (BPF_MODE(insn->code) == BPF_MEM) + verbose("(%02x) *(%s *)(r%d %+d) = r%d\n", + insn->code, + bpf_ldst_string[BPF_SIZE(insn->code) >> 3], +
[PATCH RFC v7 net-next 08/28] bpf: handle pseudo BPF_CALL insn
in native eBPF programs userspace is using pseudo BPF_CALL instructions which encode one of 'enum bpf_func_id' inside insn->imm field. Verifier checks that program using correct function arguments to given func_id. If all checks passed, kernel needs to fixup BPF_CALL->imm fields by replacing func_id with in-kernel function pointer. eBPF interpreter just calls the function. In-kernel eBPF users continue to use generic BPF_CALL. Signed-off-by: Alexei Starovoitov --- kernel/bpf/syscall.c | 37 + 1 file changed, 37 insertions(+) diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c index c316f7c28895..9dbf7bd42ccf 100644 --- a/kernel/bpf/syscall.c +++ b/kernel/bpf/syscall.c @@ -339,6 +339,40 @@ void bpf_register_prog_type(struct bpf_prog_type_list *tl) list_add(&tl->list_node, &bpf_prog_types); } +/* fixup insn->imm field of bpf_call instructions: + * if (insn->imm == BPF_FUNC_map_lookup_elem) + * insn->imm = bpf_map_lookup_elem - __bpf_call_base; + * else if (insn->imm == BPF_FUNC_map_update_elem) + * insn->imm = bpf_map_update_elem - __bpf_call_base; + * else ...
+ * + * this function is called after eBPF program passed verification + */ +static void fixup_bpf_calls(struct bpf_prog *prog) +{ + const struct bpf_func_proto *fn; + int i; + + for (i = 0; i < prog->len; i++) { + struct bpf_insn *insn = &prog->insnsi[i]; + + if (insn->code == (BPF_JMP | BPF_CALL)) { + /* we reach here when program has bpf_call instructions +* and it passed bpf_check(), means that +* ops->get_func_proto must have been supplied, check it +*/ + BUG_ON(!prog->info->ops->get_func_proto); + + fn = prog->info->ops->get_func_proto(insn->imm); + /* all functions that have prototype and verifier allowed +* programs to call them, must be real in-kernel functions +*/ + BUG_ON(!fn->func); + insn->imm = fn->func - __bpf_call_base; + } + } +} + /* drop refcnt on maps used by eBPF program and free auxilary data */ static void free_bpf_prog_info(struct bpf_prog_info *info) { @@ -465,6 +499,9 @@ static int bpf_prog_load(union bpf_attr *attr) if (err < 0) goto free_prog_info; + /* fixup BPF_CALL->imm field */ + fixup_bpf_calls(prog); + /* eBPF program is ready to be JITed */ bpf_prog_select_runtime(prog); -- 1.7.9.5
[PATCH RFC v7 net-next 09/28] bpf: verifier (add docs)
this patch adds all of eBPF verifier documentation and empty bpf_check() The end goal for the verifier is to statically check safety of the program. Verifier will catch: - loops - out of range jumps - unreachable instructions - invalid instructions - uninitialized register access - uninitialized stack access - misaligned stack access - out of range stack access - invalid calling convention More details in Documentation/networking/filter.txt Signed-off-by: Alexei Starovoitov --- Documentation/networking/filter.txt | 230 +++ include/linux/bpf.h |2 + kernel/bpf/Makefile |2 +- kernel/bpf/syscall.c|2 +- kernel/bpf/verifier.c | 152 +++ 5 files changed, 386 insertions(+), 2 deletions(-) create mode 100644 kernel/bpf/verifier.c diff --git a/Documentation/networking/filter.txt b/Documentation/networking/filter.txt index 30c142b58936..713e71f9f5dd 100644 --- a/Documentation/networking/filter.txt +++ b/Documentation/networking/filter.txt @@ -1001,6 +1001,105 @@ instruction that loads 64-bit immediate value into a dst_reg. Classic BPF has similar instruction: BPF_LD | BPF_W | BPF_IMM which loads 32-bit immediate value into a register. +eBPF verifier +- +The safety of the eBPF program is determined in two steps. + +First step does DAG check to disallow loops and other CFG validation. +In particular it will detect programs that have unreachable instructions. +(though classic BPF checker allows them) + +Second step starts from the first insn and descends all possible paths. +It simulates execution of every insn and observes the state change of +registers and stack. + +At the start of the program the register R1 contains a pointer to context +and has type PTR_TO_CTX. +If verifier sees an insn that does R2=R1, then R2 has now type +PTR_TO_CTX as well and can be used on the right hand side of expression. +If R1=PTR_TO_CTX and insn is R2=R1+R1, then R2=UNKNOWN_VALUE, +since addition of two valid pointers makes invalid pointer.
+(In 'secure' mode verifier will reject any type of pointer arithmetic to make +sure that kernel addresses don't leak to unprivileged users) + +If register was never written to, it's not readable: + bpf_mov R0 = R2 + bpf_exit +will be rejected, since R2 is unreadable at the start of the program. + +After kernel function call, R1-R5 are reset to unreadable and +R0 has a return type of the function. + +Since R6-R9 are callee saved, their state is preserved across the call. + bpf_mov R6 = 1 + bpf_call foo + bpf_mov R0 = R6 + bpf_exit +is a correct program. If there was R1 instead of R6, it would have +been rejected. + +Classic BPF register X is mapped to eBPF register R7 inside sk_convert_filter(), +so that its state is preserved across calls. + +load/store instructions are allowed only with registers of valid types, which +are PTR_TO_CTX, PTR_TO_MAP, FRAME_PTR. They are bounds and alignment checked. +For example: + bpf_mov R1 = 1 + bpf_mov R2 = 2 + bpf_xadd *(u32 *)(R1 + 3) += R2 + bpf_exit +will be rejected, since R1 doesn't have a valid pointer type at the time of +execution of instruction bpf_xadd. + +At the start R1 contains pointer to ctx and R1 type is PTR_TO_CTX. +ctx is generic. verifier is configured to know what context is for particular +class of bpf programs. For example, context == skb (for socket filters) and +ctx == seccomp_data for seccomp filters. +A callback is used to customize verifier to restrict eBPF program access to only +certain fields within ctx structure with specified size and alignment. + +For example, the following insn: + bpf_ld R0 = *(u32 *)(R6 + 8) +intends to load a word from address R6 + 8 and store it into R0 +If R6=PTR_TO_CTX, via is_valid_access() callback the verifier will know +that offset 8 of size 4 bytes can be accessed for reading, otherwise +the verifier will reject the program. +If R6=FRAME_PTR, then access should be aligned and be within +stack bounds, which are [-MAX_BPF_STACK, 0).
In this example offset is 8, +so it will fail verification, since it's out of bounds. + +The verifier will allow eBPF program to read data from stack only after +it wrote into it. +Classic BPF verifier does similar check with M[0-15] memory slots. +For example: + bpf_ld R0 = *(u32 *)(R10 - 4) + bpf_exit +is invalid program. +Though R10 is correct read-only register and has type FRAME_PTR +and R10 - 4 is within stack bounds, there were no stores into that location. + +Pointer register spill/fill is tracked as well, since four (R6-R9) +callee saved registers may not be enough for some programs. + +Allowed function calls are customized with bpf_verifier_ops->get_func_proto() +The eBPF verifier will check that registers match argument constraints. +After the call register R0 will be set to return type of the function. + +Function calls are the main mechanism to extend functionality of eBPF programs. +Socket filters may
[PATCH RFC v7 net-next 07/28] bpf: expand BPF syscall with program load/unload
eBPF programs are safe run-to-completion functions with load/unload methods from userspace similar to kernel modules. User space API: - load eBPF program fd = bpf(BPF_PROG_LOAD, union bpf_attr *attr, u32 size) where 'attr' is struct { enum bpf_prog_type prog_type; __u32 insn_cnt; struct bpf_insn __user *insns; const char __user *license; }; insns - array of eBPF instructions license - must be GPL compatible to call helper functions marked gpl_only - unload eBPF program close(fd) User space tests and examples follow in the later patches Signed-off-by: Alexei Starovoitov --- include/linux/bpf.h | 36 ++ include/linux/filter.h |9 ++- include/uapi/linux/bpf.h | 27 kernel/bpf/syscall.c | 170 ++ net/core/filter.c|2 + 5 files changed, 242 insertions(+), 2 deletions(-) diff --git a/include/linux/bpf.h b/include/linux/bpf.h index 2887f3f9da59..8ea6f9923ff2 100644 --- a/include/linux/bpf.h +++ b/include/linux/bpf.h @@ -46,4 +46,40 @@ void bpf_register_map_type(struct bpf_map_type_list *tl); void bpf_map_put(struct bpf_map *map); struct bpf_map *bpf_map_get(struct fd f); +/* eBPF function prototype used by verifier to allow BPF_CALLs from eBPF programs + * to in-kernel helper functions and for adjusting imm32 field in BPF_CALL + * instructions after verifying + */ +struct bpf_func_proto { + u64 (*func)(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5); + bool gpl_only; +}; + +struct bpf_verifier_ops { + /* return eBPF function prototype for verification */ + const struct bpf_func_proto *(*get_func_proto)(enum bpf_func_id func_id); +}; + +struct bpf_prog_type_list { + struct list_head list_node; + struct bpf_verifier_ops *ops; + enum bpf_prog_type type; +}; + +void bpf_register_prog_type(struct bpf_prog_type_list *tl); + +struct bpf_prog_info { + atomic_t refcnt; + bool is_gpl_compatible; + enum bpf_prog_type prog_type; + struct bpf_verifier_ops *ops; + struct bpf_map **used_maps; + u32 used_map_cnt; +}; + +struct bpf_prog; + +void bpf_prog_put(struct bpf_prog *prog); +struct bpf_prog 
*bpf_prog_get(u32 ufd); + #endif /* _LINUX_BPF_H */ diff --git a/include/linux/filter.h b/include/linux/filter.h index f04793474d16..f06913b29861 100644 --- a/include/linux/filter.h +++ b/include/linux/filter.h @@ -31,11 +31,16 @@ struct sock_fprog_kern { struct sk_buff; struct sock; struct seccomp_data; +struct bpf_prog_info; struct bpf_prog { u32 jited:1,/* Is our filter JIT'ed? */ - len:31; /* Number of filter blocks */ - struct sock_fprog_kern *orig_prog; /* Original BPF program */ + has_info:1, /* whether 'info' is valid */ + len:30; /* Number of filter blocks */ + union { + struct sock_fprog_kern *orig_prog; /* Original BPF program */ + struct bpf_prog_info*info; + }; unsigned int(*bpf_func)(const struct sk_buff *skb, const struct bpf_insn *filter); union { diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index ad0a5a495ec3..ac272bd7a884 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -355,6 +355,13 @@ enum bpf_cmd { * returns zero and stores next key or negative error */ BPF_MAP_GET_NEXT_KEY, + + /* verify and load eBPF program +* prog_fd = bpf(BPF_PROG_LOAD, union bpf_attr *attr, u32 size) +* Using attr->prog_type, attr->insns, attr->license +* returns fd or negative error +*/ + BPF_PROG_LOAD, }; enum bpf_map_type { @@ -362,6 +369,10 @@ enum bpf_map_type { BPF_MAP_TYPE_HASH, }; +enum bpf_prog_type { + BPF_PROG_TYPE_UNSPEC, +}; + union bpf_attr { struct { /* anonymous struct used by BPF_MAP_CREATE command */ enum bpf_map_type map_type; @@ -383,6 +394,22 @@ union bpf_attr { #define BPF_MAP_DELETE_ELEM_LAST_FIELD key #define BPF_MAP_GET_NEXT_KEY_LAST_FIELD next_key }; + + struct { /* anonymous struct used by BPF_PROG_LOAD command */ + enum bpf_prog_type prog_type; + __u32 insn_cnt; + const struct bpf_insn __user *insns; + const char __user *license; +#defineBPF_PROG_LOAD_LAST_FIELD license + }; +}; + +/* integer value in 'imm' field of BPF_CALL instruction selects which helper + * function eBPF program intends to call + 
*/ +enum bpf_func_id { + BPF_FUNC_unspec, + __BPF_FUNC_MAX_ID, }; #endif /* _UAPI__LINUX_BPF_H__ */ diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c index b863976741d4..c316f7c28895 100644 --- a/kernel/bpf/syscall.c +++ b/kernel/bpf/syscall.c @@ -14,6 +14,8 @@
[PATCH RFC v7 net-next 13/28] bpf: verifier (add verifier core)
This patch adds verifier core which simulates execution of every insn and records the state of registers and program stack. Every branch instruction seen during simulation is pushed into state stack. When verifier reaches BPF_EXIT, it pops the state from the stack and continues until it reaches BPF_EXIT again. For program: 1: bpf_mov r1, xxx 2: if (r1 == 0) goto 5 3: bpf_mov r0, 1 4: goto 6 5: bpf_mov r0, 2 6: bpf_exit The verifier will walk insns: 1, 2, 3, 4, 6 then it will pop the state recorded at insn#2 and will continue: 5, 6 This way it walks all possible paths through the program and checks all possible values of registers. While doing so, it checks for: - invalid instructions - uninitialized register access - uninitialized stack access - misaligned stack access - out of range stack access - invalid calling convention - BPF_LD_ABS|IND instructions are only used in socket filters - instruction encoding is not using reserved fields Kernel subsystem configures the verifier with two callbacks: - bool (*is_valid_access)(int off, int size, enum bpf_access_type type); that provides information to the verifier which fields of 'ctx' are accessible (remember 'ctx' is the first argument to eBPF program) - const struct bpf_func_proto *(*get_func_proto)(enum bpf_func_id func_id); returns argument constraints of kernel helper functions that eBPF program may call, so that verifier can check that R1-R5 types match the prototype More details in Documentation/networking/filter.txt and in kernel/bpf/verifier.c Signed-off-by: Alexei Starovoitov --- include/linux/bpf.h | 47 ++ include/uapi/linux/bpf.h |1 + kernel/bpf/verifier.c| 1061 +- 3 files changed, 1108 insertions(+), 1 deletion(-) diff --git a/include/linux/bpf.h b/include/linux/bpf.h index 490551e17c15..ad1bda7ece35 100644 --- a/include/linux/bpf.h +++ b/include/linux/bpf.h @@ -46,6 +46,31 @@ void bpf_register_map_type(struct bpf_map_type_list *tl); void bpf_map_put(struct bpf_map *map); struct bpf_map *bpf_map_get(struct
fd f); +/* function argument constraints */ +enum bpf_arg_type { + ARG_ANYTHING = 0, /* any argument is ok */ + + /* the following constraints used to prototype +* bpf_map_lookup/update/delete_elem() functions +*/ + ARG_CONST_MAP_PTR, /* const argument used as pointer to bpf_map */ + ARG_PTR_TO_MAP_KEY, /* pointer to stack used as map key */ + ARG_PTR_TO_MAP_VALUE, /* pointer to stack used as map value */ + + /* the following constraints used to prototype bpf_memcmp() and other +* functions that access data on eBPF program stack +*/ + ARG_PTR_TO_STACK, /* any pointer to eBPF program stack */ + ARG_CONST_STACK_SIZE, /* number of bytes accessed from stack */ +}; + +/* type of values returned from helper functions */ +enum bpf_return_type { + RET_INTEGER,/* function returns integer */ + RET_VOID, /* function doesn't return anything */ + RET_PTR_TO_MAP_VALUE_OR_NULL, /* returns a pointer to map elem value or NULL */ +}; + /* eBPF function prototype used by verifier to allow BPF_CALLs from eBPF programs * to in-kernel helper functions and for adjusting imm32 field in BPF_CALL * instructions after verifying @@ -53,11 +78,33 @@ struct bpf_map *bpf_map_get(struct fd f); struct bpf_func_proto { u64 (*func)(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5); bool gpl_only; + enum bpf_return_type ret_type; + enum bpf_arg_type arg1_type; + enum bpf_arg_type arg2_type; + enum bpf_arg_type arg3_type; + enum bpf_arg_type arg4_type; + enum bpf_arg_type arg5_type; +}; + +/* bpf_context is intentionally undefined structure. Pointer to bpf_context is + * the first argument to eBPF programs. 
+ * For socket filters: 'struct bpf_context *' == 'struct sk_buff *' + */ +struct bpf_context; + +enum bpf_access_type { + BPF_READ = 1, + BPF_WRITE = 2 }; struct bpf_verifier_ops { /* return eBPF function prototype for verification */ const struct bpf_func_proto *(*get_func_proto)(enum bpf_func_id func_id); + + /* return true if 'size' wide access at offset 'off' within bpf_context +* with 'type' (read or write) is allowed +*/ + bool (*is_valid_access)(int off, int size, enum bpf_access_type type); }; struct bpf_prog_type_list { diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index 04aaaef0daa7..dee7a2469b8d 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -377,6 +377,7 @@ enum bpf_map_type { enum bpf_prog_type { BPF_PROG_TYPE_UNSPEC, + BPF_PROG_TYPE_SOCKET_FILTER, }; union bpf_attr { diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index 7365a190cbd6..eb19f753d4d7 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -144,6 +144,72 @@
[PATCH RFC v7 net-next 12/28] bpf: verifier (add branch/goto checks)
check that control flow graph of eBPF program is a directed acyclic graph check_cfg() does: - detect loops - detect unreachable instructions - check that program terminates with BPF_EXIT insn - check that all branches are within program boundary Signed-off-by: Alexei Starovoitov --- kernel/bpf/verifier.c | 183 + 1 file changed, 183 insertions(+) diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index 73811d69e7be..7365a190cbd6 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -332,6 +332,185 @@ static struct bpf_map *ld_imm64_to_map_ptr(struct bpf_insn *insn) return (struct bpf_map *) (unsigned long) imm64; } +/* non-recursive DFS pseudo code + * 1 procedure DFS-iterative(G,v): + * 2 label v as discovered + * 3 let S be a stack + * 4 S.push(v) + * 5 while S is not empty + * 6t <- S.pop() + * 7if t is what we're looking for: + * 8return t + * 9for all edges e in G.adjacentEdges(t) do + * 10 if edge e is already labelled + * 11 continue with the next edge + * 12 w <- G.adjacentVertex(t,e) + * 13 if vertex w is not discovered and not explored + * 14 label e as tree-edge + * 15 label w as discovered + * 16 S.push(w) + * 17 continue at 5 + * 18 else if vertex w is discovered + * 19 label e as back-edge + * 20 else + * 21 // vertex w is explored + * 22 label e as forward- or cross-edge + * 23 label t as explored + * 24 S.pop() + * + * convention: + * 0x10 - discovered + * 0x11 - discovered and fall-through edge labelled + * 0x12 - discovered and fall-through and branch edges labelled + * 0x20 - explored + */ + +enum { + DISCOVERED = 0x10, + EXPLORED = 0x20, + FALLTHROUGH = 1, + BRANCH = 2, +}; + +#define PUSH_INT(I) \ + do { \ + if (cur_stack >= insn_cnt) { \ + ret = -E2BIG; \ + goto free_st; \ + } \ + stack[cur_stack++] = I; \ + } while (0) + +#define PEEK_INT() \ + ({ \ + int _ret; \ + if (cur_stack == 0) \ + _ret = -1; \ + else \ + _ret = stack[cur_stack - 1]; \ + _ret; \ +}) + +#define POP_INT() \ + ({ \ + int _ret; \ + if (cur_stack == 0) \ 
+ _ret = -1; \ + else \ + _ret = stack[--cur_stack]; \ + _ret; \ +}) + +#define PUSH_INSN(T, W, E) \ + do { \ + int w = W; \ + if (E == FALLTHROUGH && st[T] >= (DISCOVERED | FALLTHROUGH)) \ + break; \ + if (E == BRANCH && st[T] >= (DISCOVERED | BRANCH)) \ + break; \ + if (w < 0 || w >= insn_cnt) { \ + verbose("jump out of range from insn %d to %d\n", T, w); \ + ret = -EINVAL; \ + goto free_st; \ + } \ + if (st[w] == 0) { \ + /* tree-edge */ \ + st[T] = DISCOVERED | E; \ + st[w] = DISCOVERED; \ + PUSH_INT(w); \ + goto peek_stack; \ + } else if ((st[w] & 0xF0) == DISCOVERED) { \ + verbose("back-edge from insn %d to %d\n", T, w); \ + ret = -EINVAL; \ + goto free_st; \ + } else if (st[w] == EXPLORED) { \ + /* forward- or cross-edge */ \ + st[T] = DISCOVERED | E; \ + } else { \ + verbose("insn state internal bug\n"); \ + ret = -EFAULT; \ + goto free_st; \ + } \ + } while (0) + +/* non-recursive depth-first-search to detect loops in BPF program + * loop == back-edge in directed graph + */ +static int check_cfg(struct verifier_env *env) +{ + struct bpf_insn *insns = env->prog->insnsi; + int insn_cnt = env->prog->len; + int cur_stack = 0; + int *stack; + int ret = 0; + int *st; + int i, t; + + st = kzalloc(sizeof(int) * insn_cnt, GFP_KERNEL); + if (!st) + return -ENOMEM; + + stack = kzalloc(sizeof(int) * insn_cnt, GFP_KERNEL); + if (!stack) { + kfree(st); + return -ENOMEM; + } + + st[0] = DISCOVERED; /* mark 1st insn as discovered */ + PUSH_INT(0); + +peek_stack: + while ((t = PEEK_INT()) != -1) { + if (BPF_CLASS(insns[t].code) == BPF_JMP) { +
[PATCH RFC v7 net-next 15/28] bpf: allow eBPF programs to use maps
expose bpf_map_lookup_elem(), bpf_map_update_elem(), bpf_map_delete_elem() map accessors to eBPF programs Signed-off-by: Alexei Starovoitov --- include/linux/bpf.h |5 include/uapi/linux/bpf.h |3 ++ kernel/bpf/syscall.c | 68 ++ 3 files changed, 76 insertions(+) diff --git a/include/linux/bpf.h b/include/linux/bpf.h index ad1bda7ece35..14e23bb10b2d 100644 --- a/include/linux/bpf.h +++ b/include/linux/bpf.h @@ -131,4 +131,9 @@ struct bpf_prog *bpf_prog_get(u32 ufd); /* verify correctness of eBPF program */ int bpf_check(struct bpf_prog *fp, union bpf_attr *attr); +/* in-kernel helper functions called from eBPF programs */ +u64 bpf_map_lookup_elem(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5); +u64 bpf_map_update_elem(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5); +u64 bpf_map_delete_elem(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5); + #endif /* _LINUX_BPF_H */ diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index dee7a2469b8d..f87b501b2e1b 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -419,6 +419,9 @@ union bpf_attr { */ enum bpf_func_id { BPF_FUNC_unspec, + BPF_FUNC_map_lookup_elem, /* void *map_lookup_elem(&map, &key) */ + BPF_FUNC_map_update_elem, /* int map_update_elem(&map, &key, &value) */ + BPF_FUNC_map_delete_elem, /* int map_delete_elem(&map, &key) */ __BPF_FUNC_MAX_ID, }; diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c index 8f11d1549cfc..641bb9e6709c 100644 --- a/kernel/bpf/syscall.c +++ b/kernel/bpf/syscall.c @@ -575,3 +575,71 @@ free_attr: kfree(attr); return err; } + +/* called from eBPF program under rcu lock + * + * if kernel subsystem is allowing eBPF programs to call this function, + * inside its own verifier_ops->get_func_proto() callback it should return + * (struct bpf_func_proto) { + *.ret_type = RET_PTR_TO_MAP_VALUE_OR_NULL, + *.arg1_type = ARG_CONST_MAP_PTR, + *.arg2_type = ARG_PTR_TO_MAP_KEY, + * } + * so that eBPF verifier properly checks the arguments + */ +u64 bpf_map_lookup_elem(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5) +{ + struct
bpf_map *map = (struct bpf_map *) (unsigned long) r1; + void *key = (void *) (unsigned long) r2; + void *value; + + WARN_ON_ONCE(!rcu_read_lock_held()); + + value = map->ops->map_lookup_elem(map, key); + + return (unsigned long) value; +} + +/* called from eBPF program under rcu lock + * + * if kernel subsystem is allowing eBPF programs to call this function, + * inside its own verifier_ops->get_func_proto() callback it should return + * (struct bpf_func_proto) { + *.ret_type = RET_INTEGER, + *.arg1_type = ARG_CONST_MAP_PTR, + *.arg2_type = ARG_PTR_TO_MAP_KEY, + *.arg3_type = ARG_PTR_TO_MAP_VALUE, + * } + * so that eBPF verifier properly checks the arguments + */ +u64 bpf_map_update_elem(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5) +{ + struct bpf_map *map = (struct bpf_map *) (unsigned long) r1; + void *key = (void *) (unsigned long) r2; + void *value = (void *) (unsigned long) r3; + + WARN_ON_ONCE(!rcu_read_lock_held()); + + return map->ops->map_update_elem(map, key, value); +} + +/* called from eBPF program under rcu lock + * + * if kernel subsystem is allowing eBPF programs to call this function, + * inside its own verifier_ops->get_func_proto() callback it should return + * (struct bpf_func_proto) { + *.ret_type = RET_INTEGER, + *.arg1_type = ARG_CONST_MAP_PTR, + *.arg2_type = ARG_PTR_TO_MAP_KEY, + * } + * so that eBPF verifier properly checks the arguments + */ +u64 bpf_map_delete_elem(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5) +{ + struct bpf_map *map = (struct bpf_map *) (unsigned long) r1; + void *key = (void *) (unsigned long) r2; + + WARN_ON_ONCE(!rcu_read_lock_held()); + + return map->ops->map_delete_elem(map, key); +} -- 1.7.9.5 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
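The helpers above all share the uniform five-u64-register signature, so pointers travel through u64 casts in r1/r2/r3. A self-contained sketch of that calling convention (toy_map, toy_lookup_elem and helper_lookup are made-up names; the kernel of course calls map->ops->map_lookup_elem() on a real bpf_map):

```c
#include <assert.h>
#include <stddef.h>

struct toy_map { int key[4]; int val[4]; int cnt; };

static void *toy_lookup_elem(struct toy_map *map, const int *key)
{
	int i;

	for (i = 0; i < map->cnt; i++)
		if (map->key[i] == *key)
			return &map->val[i];
	/* misses return NULL, hence RET_PTR_TO_MAP_VALUE_OR_NULL */
	return NULL;
}

/* shape of bpf_map_lookup_elem(): unpack r1/r2, return pointer as u64 */
static unsigned long long helper_lookup(unsigned long long r1,
					unsigned long long r2,
					unsigned long long r3,
					unsigned long long r4,
					unsigned long long r5)
{
	struct toy_map *map = (struct toy_map *)(unsigned long)r1;
	const int *key = (const int *)(unsigned long)r2;

	(void)r3; (void)r4; (void)r5;	/* unused, as in the kernel helpers */
	return (unsigned long)toy_lookup_elem(map, key);
}
```

The verifier-facing proto quoted in the comments is what makes these casts safe: ARG_CONST_MAP_PTR guarantees r1 really is a map pointer and ARG_PTR_TO_MAP_KEY that r2 points to key_size initialized bytes.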
[PATCH RFC v7 net-next 11/28] bpf: handle pseudo BPF_LD_IMM64 insn
eBPF programs passed from userspace are using pseudo BPF_LD_IMM64 instructions to refer to process-local map_fd. Scan the program for such instructions and if FDs are valid, convert them to 'struct bpf_map' pointers which will be used by verifier to check access to maps in bpf_map_lookup/update() calls. If program passes verifier, convert pseudo BPF_LD_IMM64 into generic by dropping BPF_PSEUDO_MAP_FD flag. Note that eBPF interpreter is generic and knows nothing about pseudo insns. Signed-off-by: Alexei Starovoitov --- include/uapi/linux/bpf.h |6 ++ kernel/bpf/verifier.c| 147 ++ 2 files changed, 153 insertions(+) diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index a6fa0416f2bd..04aaaef0daa7 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -176,6 +176,12 @@ enum { .off = 0, \ .imm = ((__u64) (IMM)) >> 32 }) +#define BPF_PSEUDO_MAP_FD 1 + +/* pseudo BPF_LD_IMM64 insn used to refer to process-local map_fd */ +#define BPF_LD_MAP_FD(DST, MAP_FD) \ + BPF_LD_IMM64_RAW(DST, BPF_PSEUDO_MAP_FD, MAP_FD) + /* Short form of mov based on type, BPF_X: dst_reg = src_reg, BPF_K: dst_reg = imm32 */ #define BPF_MOV64_RAW(TYPE, DST, SRC, IMM) \ diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index 81a64a50e48d..73811d69e7be 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -144,10 +144,15 @@ * load/store to bpf_context are checked against known fields */ +#define MAX_USED_MAPS 64 /* max number of maps accessed by one eBPF program */ + /* single container for all structs * one verifier_env per bpf_check() call */ struct verifier_env { + struct bpf_prog *prog; /* eBPF program being verified */ + struct bpf_map *used_maps[MAX_USED_MAPS]; /* array of map's used by eBPF program */ + u32 used_map_cnt; /* number of used maps */ }; /* verbose verifier prints what it's seeing @@ -319,6 +324,115 @@ static void print_bpf_insn(struct bpf_insn *insn) } } +/* return the map pointer stored inside BPF_LD_IMM64 instruction */ +static 
struct bpf_map *ld_imm64_to_map_ptr(struct bpf_insn *insn) +{ + u64 imm64 = ((u64) (u32) insn[0].imm) | ((u64) (u32) insn[1].imm) << 32; + + return (struct bpf_map *) (unsigned long) imm64; +} + +/* look for pseudo eBPF instructions that access map FDs and + * replace them with actual map pointers + */ +static int replace_map_fd_with_map_ptr(struct verifier_env *env) +{ + struct bpf_insn *insn = env->prog->insnsi; + int insn_cnt = env->prog->len; + int i, j; + + for (i = 0; i < insn_cnt; i++, insn++) { + if (insn[0].code == (BPF_LD | BPF_IMM | BPF_DW)) { + struct bpf_map *map; + struct fd f; + + if (i == insn_cnt - 1 || insn[1].code != 0 || + insn[1].dst_reg != 0 || insn[1].src_reg != 0 || + insn[1].off != 0) { + verbose("invalid bpf_ld_imm64 insn\n"); + return -EINVAL; + } + + if (insn->src_reg == 0) + /* valid generic load 64-bit imm */ + goto next_insn; + + if (insn->src_reg != BPF_PSEUDO_MAP_FD) { + verbose("unrecognized bpf_ld_imm64 insn\n"); + return -EINVAL; + } + + f = fdget(insn->imm); + + map = bpf_map_get(f); + if (IS_ERR(map)) { + verbose("fd %d is not pointing to valid bpf_map\n", + insn->imm); + fdput(f); + return PTR_ERR(map); + } + + /* store map pointer inside BPF_LD_IMM64 instruction */ + insn[0].imm = (u32) (unsigned long) map; + insn[1].imm = ((u64) (unsigned long) map) >> 32; + + /* check whether we recorded this map already */ + for (j = 0; j < env->used_map_cnt; j++) + if (env->used_maps[j] == map) { + fdput(f); + goto next_insn; + } + + if (env->used_map_cnt >= MAX_USED_MAPS) { + fdput(f); + return -E2BIG; + } + + /* remember this map */ + env->used_maps[env->used_map_cnt++] = map; + + /* hold the map. If the
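The insn[0].imm/insn[1].imm stores above work because BPF_LD_IMM64 occupies two 8-byte instructions, each with a 32-bit imm field. A minimal sketch of the split and the ld_imm64_to_map_ptr()-style reassembly (toy_insn, store_map_ptr and load_map_ptr are made-up names):

```c
#include <assert.h>
#include <stdint.h>

struct toy_insn { int32_t imm; };

/* split a 64-bit host pointer across the two imm fields, as the
 * verifier does after resolving a map fd */
static void store_map_ptr(struct toy_insn insn[2], void *map)
{
	uint64_t v = (uint64_t)(unsigned long)map;

	insn[0].imm = (int32_t)(uint32_t)v;		/* low 32 bits */
	insn[1].imm = (int32_t)(uint32_t)(v >> 32);	/* high 32 bits */
}

/* reassemble it, mirroring ld_imm64_to_map_ptr() */
static void *load_map_ptr(const struct toy_insn insn[2])
{
	uint64_t v = ((uint64_t)(uint32_t)insn[0].imm) |
		     ((uint64_t)(uint32_t)insn[1].imm) << 32;

	return (void *)(unsigned long)v;
}
```

The (uint32_t) casts before widening matter: imm is signed, and sign-extending the low half would corrupt the upper bits of the reassembled pointer.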
Re: [PATCH] net: stmmac: add dcrs parameter
On Tue, Aug 26, 2014 at 9:20 PM, Giuseppe CAVALLARO wrote: > On 8/26/2014 2:35 PM, Vince Bridgers wrote: >> >> Hi Peppe, >> In the Synopsys EMAC case, carrier sense is used to stop transmitting if no carrier is sensed during a transmission. This is only useful if the media in use is true half duplex media (like obsolete 10Base2 or 10Base5). If no one is using true half duplex media, then is it possible to disable this by default? If we're not sure, then having an option feels like the right thing to do. >>> >>> Indeed this is what I had done in the patch. >>> >>> http://git.stlinux.com/?p=stm/linux-sh4-2.6.32.y.git;a=commit;h=b0b863bf65c36dc593f6b7b4b418394fd880dae2 >>> >>> Also in case of carrier sense the frame will be dropped in any case later. >>> >>> Let me know if you Acked this patch so I will rebase it on >>> net.git and send it soon >>> >>> peppe >>> >> >> Yes, this looks good to me. I don't expect anyone is using 10Base2 or >> 10Base5 anymore, so it's ok to disable DCRS by default. >> >> ack >> >> All the best, > > > thx so much, I will send this patch (with your Acked-by) ported to > net.git soon. > > Chen-Yu, Ley Foon, pls let me know if it is ok for you as well Looks good. Thanks! Cheers ChenYu
[PATCH RFC v7 net-next 18/28] tracing: allow eBPF programs call printk()
limited printk() with %d %u %x %p modifiers only Signed-off-by: Alexei Starovoitov --- include/uapi/linux/bpf.h |1 + kernel/trace/bpf_trace.c | 61 ++ 2 files changed, 62 insertions(+) diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index 55adff33083e..1ec3d293d14e 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -430,6 +430,7 @@ enum bpf_func_id { BPF_FUNC_fetch_u8,/* u8 bpf_fetch_u8(void *unsafe_ptr) */ BPF_FUNC_memcmp, /* int bpf_memcmp(void *unsafe_ptr, void *safe_ptr, int size) */ BPF_FUNC_dump_stack, /* void bpf_dump_stack(void) */ + BPF_FUNC_printk, /* int bpf_printk(const char *fmt, int fmt_size, ...) */ __BPF_FUNC_MAX_ID, }; diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c index b4751e2c0d52..ff98be5a24d6 100644 --- a/kernel/trace/bpf_trace.c +++ b/kernel/trace/bpf_trace.c @@ -60,6 +60,60 @@ static u64 bpf_dump_stack(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5) return 0; } +/* limited printk() + * only %d %u %x %ld %lu %lx %lld %llu %llx %p conversion specifiers allowed + */ +static u64 bpf_printk(u64 r1, u64 fmt_size, u64 r3, u64 r4, u64 r5) +{ + char *fmt = (char *) (long) r1; + int fmt_cnt = 0; + bool mod_l[3] = {}; + int i; + + /* bpf_check() guarantees that fmt points to bpf program stack and +* fmt_size bytes of it were initialized by bpf program +*/ + if (fmt[fmt_size - 1] != 0) + return -EINVAL; + + /* check format string for allowed specifiers */ + for (i = 0; i < fmt_size; i++) + if (fmt[i] == '%') { + if (fmt_cnt >= 3) + return -EINVAL; + i++; + if (i >= fmt_size) + return -EINVAL; + + if (fmt[i] == 'l') { + mod_l[fmt_cnt] = true; + i++; + if (i >= fmt_size) + return -EINVAL; + } else if (fmt[i] == 'p') { + mod_l[fmt_cnt] = true; + fmt_cnt++; + continue; + } + + if (fmt[i] == 'l') { + mod_l[fmt_cnt] = true; + i++; + if (i >= fmt_size) + return -EINVAL; + } + + if (fmt[i] != 'd' && fmt[i] != 'u' && fmt[i] != 'x') + return -EINVAL; + fmt_cnt++; + } + + return __trace_printk((unsigned long) 
__builtin_return_address(3), fmt, + mod_l[0] ? r3 : (u32) r3, + mod_l[1] ? r4 : (u32) r4, + mod_l[2] ? r5 : (u32) r5); +} + static struct bpf_func_proto tracing_filter_funcs[] = { #define FETCH(SIZE)\ [BPF_FUNC_fetch_##SIZE] = { \ @@ -86,6 +140,13 @@ static struct bpf_func_proto tracing_filter_funcs[] = { .gpl_only = false, .ret_type = RET_VOID, }, + [BPF_FUNC_printk] = { + .func = bpf_printk, + .gpl_only = true, + .ret_type = RET_INTEGER, + .arg1_type = ARG_PTR_TO_STACK, + .arg2_type = ARG_CONST_STACK_SIZE, + }, [BPF_FUNC_map_lookup_elem] = { .func = bpf_map_lookup_elem, .gpl_only = false, -- 1.7.9.5
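The format-string scan above is easy to get subtly wrong (the doubled 'l' check is what admits %ll), so here is a standalone sketch of the same validator that can be tested in userspace. check_fmt() is an illustrative name; it returns -1 where the kernel returns -EINVAL:

```c
#include <assert.h>
#include <stddef.h>

/* at most three conversions; only %d %u %x with optional l/ll, plus %p */
static int check_fmt(const char *fmt, size_t fmt_size)
{
	int fmt_cnt = 0;

	/* the kernel relies on the verifier to prove fmt_size bytes of
	 * program stack are initialized; here we only demand NUL termination */
	if (fmt_size == 0 || fmt[fmt_size - 1] != '\0')
		return -1;

	for (size_t i = 0; i < fmt_size; i++) {
		if (fmt[i] != '%')
			continue;
		if (fmt_cnt >= 3)	/* only r3/r4/r5 are available */
			return -1;
		if (++i >= fmt_size)
			return -1;
		if (fmt[i] == 'p') {	/* %p consumes the full 64-bit arg */
			fmt_cnt++;
			continue;
		}
		if (fmt[i] == 'l' && ++i >= fmt_size)	/* %l.. */
			return -1;
		if (fmt[i] == 'l' && ++i >= fmt_size)	/* %ll.. */
			return -1;
		if (fmt[i] != 'd' && fmt[i] != 'u' && fmt[i] != 'x')
			return -1;
		fmt_cnt++;
	}
	return 0;
}
```

The three-conversion cap follows directly from the helper ABI: after fmt (r1) and fmt_size (r2), only r3, r4 and r5 remain to carry values.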
[PATCH RFC v7 net-next 20/28] tracing: allow eBPF programs to call ktime_get_ns() and get_current()
Signed-off-by: Alexei Starovoitov --- include/uapi/linux/bpf.h |2 ++ kernel/trace/bpf_trace.c | 20 2 files changed, 22 insertions(+) diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index 1ec3d293d14e..e14e147c8899 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -431,6 +431,8 @@ enum bpf_func_id { BPF_FUNC_memcmp, /* int bpf_memcmp(void *unsafe_ptr, void *safe_ptr, int size) */ BPF_FUNC_dump_stack, /* void bpf_dump_stack(void) */ BPF_FUNC_printk, /* int bpf_printk(const char *fmt, int fmt_size, ...) */ + BPF_FUNC_ktime_get_ns,/* u64 bpf_ktime_get_ns(void) */ + BPF_FUNC_get_current, /* struct task_struct *bpf_get_current(void) */ __BPF_FUNC_MAX_ID, }; diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c index ff98be5a24d6..a98e13e1131b 100644 --- a/kernel/trace/bpf_trace.c +++ b/kernel/trace/bpf_trace.c @@ -114,6 +114,16 @@ static u64 bpf_printk(u64 r1, u64 fmt_size, u64 r3, u64 r4, u64 r5) mod_l[2] ? r5 : (u32) r5); } +static u64 bpf_ktime_get_ns(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5) +{ + return ktime_get_ns(); +} + +static u64 bpf_get_current(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5) +{ + return (u64) (long) current; +} + static struct bpf_func_proto tracing_filter_funcs[] = { #define FETCH(SIZE)\ [BPF_FUNC_fetch_##SIZE] = { \ @@ -169,6 +179,16 @@ static struct bpf_func_proto tracing_filter_funcs[] = { .arg1_type = ARG_CONST_MAP_PTR, .arg2_type = ARG_PTR_TO_MAP_KEY, }, + [BPF_FUNC_ktime_get_ns] = { + .func = bpf_ktime_get_ns, + .gpl_only = true, + .ret_type = RET_INTEGER, + }, + [BPF_FUNC_get_current] = { + .func = bpf_get_current, + .gpl_only = true, + .ret_type = RET_INTEGER, + }, }; static const struct bpf_func_proto *tracing_filter_func_proto(enum bpf_func_id func_id) -- 1.7.9.5
[PATCH RFC v7 net-next 16/28] bpf: split eBPF out of NET
let eBPF have its own CONFIG_BPF, so that tracing and other subsystems don't need to depend on all of NET Signed-off-by: Alexei Starovoitov --- arch/Kconfig |3 +++ kernel/Makefile |2 +- kernel/bpf/core.c | 12 net/Kconfig |1 + 4 files changed, 17 insertions(+), 1 deletion(-) diff --git a/arch/Kconfig b/arch/Kconfig index 0eae9df35b88..80a72f6f6b60 100644 --- a/arch/Kconfig +++ b/arch/Kconfig @@ -333,6 +333,9 @@ config SECCOMP_FILTER See Documentation/prctl/seccomp_filter.txt for details. +config BPF + boolean + config HAVE_CC_STACKPROTECTOR bool help diff --git a/kernel/Makefile b/kernel/Makefile index dc5c77544fd6..17ea6d4a9a24 100644 --- a/kernel/Makefile +++ b/kernel/Makefile @@ -86,7 +86,7 @@ obj-$(CONFIG_RING_BUFFER) += trace/ obj-$(CONFIG_TRACEPOINTS) += trace/ obj-$(CONFIG_IRQ_WORK) += irq_work.o obj-$(CONFIG_CPU_PM) += cpu_pm.o -obj-$(CONFIG_NET) += bpf/ +obj-$(CONFIG_BPF) += bpf/ obj-$(CONFIG_PERF_EVENTS) += events/ diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c index 0434c2170f2b..c17ba0ef3dcf 100644 --- a/kernel/bpf/core.c +++ b/kernel/bpf/core.c @@ -537,3 +537,15 @@ void bpf_prog_free(struct bpf_prog *fp) bpf_jit_free(fp); } EXPORT_SYMBOL_GPL(bpf_prog_free); + +/* To emulate LD_ABS/LD_IND instructions __sk_run_filter() may call + * skb_copy_bits(), so provide a weak definition for it in NET-less config. + * seccomp_check_filter() verifies that seccomp filters are not using + * LD_ABS/LD_IND instructions. Other BPF users (like tracing filters) + * must not use these instructions unless ctx==skb + */ +int __weak skb_copy_bits(const struct sk_buff *skb, int offset, void *to, +int len) +{ + return -EFAULT; +} diff --git a/net/Kconfig b/net/Kconfig index 4051fdfa4367..9a99e16d6f28 100644 --- a/net/Kconfig +++ b/net/Kconfig @@ -6,6 +6,7 @@ menuconfig NET bool "Networking support" select NLATTR select GENERIC_NET_UTILS + select BPF ---help--- Unless you really know what you are doing, you should say Y here. 
The reason is that some programs need kernel networking support even -- 1.7.9.5
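The __weak skb_copy_bits() definition in the patch relies on ELF weak-symbol semantics: the weak body is linked in only when no strong definition of the same symbol exists. A minimal userspace sketch of that trick (nei­ther the function name nor the -14 constant is from the patch; net_stub_copy is made up and -14 merely stands in for -EFAULT):

```c
#include <assert.h>

/* weak fallback: used only because no strong net_stub_copy() is linked,
 * exactly how a NET-less kernel build gets the -EFAULT stub */
int __attribute__((weak)) net_stub_copy(const void *from, int offset,
					void *to, int len)
{
	(void)from; (void)offset; (void)to; (void)len;
	return -14;	/* stands in for -EFAULT */
}
```

In the kernel, CONFIG_NET builds provide the strong skb_copy_bits() from net/core/skbuff.c, which silently overrides the weak stub; only tracing-only configs ever reach the -EFAULT body (and, per the comment, the verifier keeps LD_ABS/LD_IND programs away from it anyway). Note this attribute is a GCC/Clang extension, not standard C.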
[PATCH RFC v7 net-next 19/28] tracing: allow eBPF programs to be attached to kprobe/kretprobe
Signed-off-by: Alexei Starovoitov --- kernel/trace/trace_kprobe.c | 28 1 file changed, 28 insertions(+) diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c index 282f6e4e5539..b6db92207c99 100644 --- a/kernel/trace/trace_kprobe.c +++ b/kernel/trace/trace_kprobe.c @@ -19,6 +19,7 @@ #include #include +#include #include "trace_probe.h" @@ -930,6 +931,22 @@ __kprobe_trace_func(struct trace_kprobe *tk, struct pt_regs *regs, if (ftrace_trigger_soft_disabled(ftrace_file)) return; + if (call->flags & TRACE_EVENT_FL_BPF) { + struct bpf_context ctx = {}; + unsigned long args[3]; + /* get first 3 arguments of the function. x64 syscall ABI uses +* the same 3 registers as x64 calling convention. +* todo: implement it cleanly via arch specific +* regs_get_argument_nth() helper +*/ + syscall_get_arguments(current, regs, 0, 3, args); + ctx.arg1 = args[0]; + ctx.arg2 = args[1]; + ctx.arg3 = args[2]; + trace_filter_call_bpf(ftrace_file->filter, &ctx); + return; + } + local_save_flags(irq_flags); pc = preempt_count(); @@ -978,6 +995,17 @@ __kretprobe_trace_func(struct trace_kprobe *tk, struct kretprobe_instance *ri, if (ftrace_trigger_soft_disabled(ftrace_file)) return; + if (call->flags & TRACE_EVENT_FL_BPF) { + struct bpf_context ctx = {}; + /* assume that register used to return a value from syscall is +* the same as register used to return a value from a function +* todo: provide arch specific helper +*/ + ctx.ret = syscall_get_return_value(current, regs); + trace_filter_call_bpf(ftrace_file->filter, &ctx); + return; + } + local_save_flags(irq_flags); pc = preempt_count(); -- 1.7.9.5
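The two hooks above only marshal register state into a bpf_context before handing it to the filter: the kprobe path copies the first three argument registers, the kretprobe path the return-value register. A toy sketch of that marshalling (toy_regs, toy_ctx and both fill functions are made-up names; the x86-64-ish register names are for illustration only):

```c
#include <assert.h>

struct toy_regs { unsigned long di, si, dx, ax; };	/* x86-64-ish */
struct toy_ctx { unsigned long long arg1, arg2, arg3, ret; };

/* kprobe entry: expose the first three argument registers, as the
 * syscall_get_arguments() call in __kprobe_trace_func() does */
static void fill_entry_ctx(struct toy_ctx *ctx, const struct toy_regs *regs)
{
	ctx->arg1 = regs->di;
	ctx->arg2 = regs->si;
	ctx->arg3 = regs->dx;
}

/* kretprobe: expose only the return-value register, as
 * syscall_get_return_value() does in __kretprobe_trace_func() */
static void fill_ret_ctx(struct toy_ctx *ctx, const struct toy_regs *regs)
{
	ctx->ret = regs->ax;
}
```

This is also the whole attack surface the eBPF program gets in this patch: arg1..arg3 on entry, ret on return, nothing else from pt_regs, pending the arch-specific helpers the todo comments mention.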