Re: [PATCH 3.18 00/29] 3.18.122-stable review

2018-09-07 Thread Harsh Shandilya
On 8 September 2018 2:40:21 AM IST, Greg Kroah-Hartman wrote:
>This is the start of the stable review cycle for the 3.18.122 release.
>There are 29 patches in this series, all will be posted as a response
>to this one.  If anyone has any issues with these being applied, please
>let me know.
>
>Responses should be made by Sun Sep  9 21:08:52 UTC 2018.
>Anything received after that time might be too late.
>
>The whole patch series can be found in one patch at:
>   
> https://www.kernel.org/pub/linux/kernel/v3.x/stable-review/patch-3.18.122-rc1.gz
>or in the git tree and branch at:
>   git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git
>linux-3.18.y
>and the diffstat can be found below.
>
>thanks,
>
>greg k-h
Cleanly merges and builds with -Werror in my oneplus3 tree. No immediate
regressions noticed. Thanks for the update.
--
Harsh Shandilya, PRJKT Development LLC


Re: [PATCH v8 1/2] leds: core: Introduce LED pattern trigger

2018-09-07 Thread Bjorn Andersson
On Tue 04 Sep 04:01 PDT 2018, Baolin Wang wrote:

> diff --git a/Documentation/ABI/testing/sysfs-class-led-trigger-pattern 
> b/Documentation/ABI/testing/sysfs-class-led-trigger-pattern
[..]
> +What:/sys/class/leds/<led>/hw_pattern
> +Date:September 2018
> +KernelVersion:   4.20
> +Description:
> + Specify a hardware pattern for the LED, for LED hardware that
> + supports autonomously controlling brightness over time, according
> + to some preprogrammed hardware patterns.
> +
> + Since different LED hardware can have different semantics of
> + hardware patterns, each driver is expected to provide its own
> + description for the hardware patterns in their ABI documentation
> + file.
> +

So, after coming full circle, we're back at this: drivers with support for
hardware patterns should have their own ABI for setting that pattern.

The controls for my hardware are:
* a list of brightness values
* the rate of the pattern
* a flag to indicate that the pattern should be played from start
  to end, end to start or start to end to start
* a boolean indicating if the pattern should be played once or repeated
  indefinitely.

Given that the interface now is hw specific, what benefit is there to
attempt to cram these 4 knobs into "hw_pattern"? Or am I allowed to
create additional files for the latter three?

> +What:/sys/class/leds/<led>/repeat
> +Date:September 2018
> +KernelVersion:   4.20
> +Description:
> + Specify a pattern repeat number. 0 means repeat indefinitely.
> +
> + This file will always return the originally written repeat
> + number.

I'm still convinced that this will confuse our users. To me it would be
more logical if this denoted the number of times the pattern should be
repeated, with e.g. negative numbers denoting infinite repetition.

In particular I expect to have to explain why my driver expects that you
write 0 in the file named "repeat" to make it repeat and 1 to make it
not repeat.

Regards,
Bjorn
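A minimal userspace sketch of the semantics under discussion, assuming a
hypothetical LED named "led0" and the quoted ABI (writing 0 to "repeat"
means repeat indefinitely):

/* Hedged sketch only: exercises the proposed "repeat" ABI from userspace.
 * The LED name "led0" is hypothetical. */
#include <stdio.h>

int main(void)
{
	FILE *f = fopen("/sys/class/leds/led0/repeat", "w");

	if (!f)
		return 1;
	fputs("0", f);	/* 0 = repeat forever; 1 = play the pattern once */
	fclose(f);
	return 0;
}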


[PATCH v2] asm-generic: bug: unify hints for BUG_ON()

2018-09-07 Thread Igor Stoppa
If BUG_ON() is used instead of BUG(), it means that the preferred
outcome is probably not to BUG(), and therefore the condition tested
should be unlikely().

However, when CONFIG_PROFILE_ANNOTATED_BRANCHES is enabled, the hint is
disabled, to avoid generating false-positive warnings caused by
-Wmaybe-uninitialized.

Signed-off-by: Igor Stoppa 
Cc: Arnd Bergmann 
Cc: linux-a...@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
---
 include/asm-generic/bug.h | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/include/asm-generic/bug.h b/include/asm-generic/bug.h
index 20561a60db9c..b91273c36baf 100644
--- a/include/asm-generic/bug.h
+++ b/include/asm-generic/bug.h
@@ -58,8 +58,12 @@ struct bug_entry {
 #endif
 
 #ifndef HAVE_ARCH_BUG_ON
+#ifdef CONFIG_PROFILE_ANNOTATED_BRANCHES
+#define BUG_ON(condition) do { if (condition) BUG(); } while (0)
+#else
 #define BUG_ON(condition) do { if (unlikely(condition)) BUG(); } while (0)
 #endif
+#endif
 
 #ifdef __WARN_FLAGS
 #define __WARN_TAINT(taint)__WARN_FLAGS(BUGFLAG_TAINT(taint))
@@ -183,7 +187,11 @@ void __warn(const char *file, int line, void *caller, 
unsigned taint,
 #endif
 
 #ifndef HAVE_ARCH_BUG_ON
+#ifdef CONFIG_PROFILE_ANNOTATED_BRANCHES
 #define BUG_ON(condition) do { if (condition) BUG(); } while (0)
+#else
+#define BUG_ON(condition) do { if (unlikely(condition)) BUG(); } while (0)
+#endif
 #endif
 
 #ifndef HAVE_ARCH_WARN_ON
-- 
2.17.1
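For reference, a simplified standalone sketch of the hint this patch
toggles; this is not the kernel's exact macro (the real definition lives in
include/linux/compiler.h, where CONFIG_PROFILE_ANNOTATED_BRANCHES
additionally instruments each annotation to record branch statistics,
which is why the hint is dropped above):

#include <stdlib.h>

/* __builtin_expect tells GCC the expected truth value, so the abort()
 * path can be laid out as the cold, out-of-line branch. */
#define my_unlikely(x)	__builtin_expect(!!(x), 0)
#define MY_BUG_ON(cond)	do { if (my_unlikely(cond)) abort(); } while (0)

int main(void)
{
	int ret = 0;

	MY_BUG_ON(ret != 0);	/* expected false: stays off the hot path */
	return ret;
}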



Re: [PATCH v2 3/3] x86/pti/64: Remove the SYSCALL64 entry trampoline

2018-09-07 Thread Andy Lutomirski
On Fri, Sep 7, 2018 at 9:40 AM, Josh Poimboeuf  wrote:
> On Mon, Sep 03, 2018 at 03:59:44PM -0700, Andy Lutomirski wrote:
>> The SYSCALL64 trampoline has a couple of nice properties:
>>
>>  - The usual sequence of SWAPGS followed by two GS-relative accesses to
>>set up RSP is somewhat slow because the GS-relative accesses need
>>to wait for SWAPGS to finish.  The trampoline approach allows
>>RIP-relative accesses to set up RSP, which avoids the stall.
>>
>>  - The trampoline avoids any percpu access before CR3 is set up,
>>which means that no percpu memory needs to be mapped in the user
>>page tables.  This prevents using Meltdown to read any percpu memory
>>outside the cpu_entry_area and prevents using timing leaks
>>to directly locate the percpu areas.
>>
>> The downsides of using a trampoline may outweigh the upsides, however.
>> It adds an extra non-contiguous I$ cache line to system calls, and it
>> forces an indirect jump to transfer control back to the normal kernel
>> text after CR3 is set up.  The latter is because x86 lacks a 64-bit
>> direct jump instruction that could jump from the trampoline to the entry
>> text.  With retpolines enabled, the indirect jump is extremely slow.
>>
>> This patch changes the code to map the percpu TSS into the user page
>> tables to allow the non-trampoline SYSCALL64 path to work under PTI.
>> This does not add a new direct information leak, since the TSS is
>> readable by Meltdown from the cpu_entry_area alias regardless.  It
>> does allow a timing attack to locate the percpu area, but KASLR is
>> more or less a lost cause against local attack on CPUs vulnerable to
>> Meltdown regardless.  As far as I'm concerned, on current hardware,
>> KASLR is only useful to mitigate remote attacks that try to attack
>> the kernel without first gaining RCE against a vulnerable user
>> process.
>>
>> On Skylake, with CONFIG_RETPOLINE=y and KPTI on, this reduces
>> syscall overhead from ~237ns to ~228ns.
>>
>> There is a possible alternative approach: we could instead move the
>> trampoline within 2G of the entry text and make a separate copy for
>> each CPU.  Then we could use a direct jump to rejoin the normal
>> entry path.
>>
>> Signed-off-by: Andy Lutomirski 
>
> The following commit should also be reverted:
>
>   4d99e4136580 ("perf machine: Workaround missing maps for x86 PTI entry 
> trampolines")

I think we shouldn't, since the perf folks want new perf versions to
be fully functional on older kernels.  My plan was to let the perf
crew do whatever reversions they felt appropriate.


Re: [PATCH v2 3/3] x86/pti/64: Remove the SYSCALL64 entry trampoline

2018-09-07 Thread Andy Lutomirski
On Fri, Sep 7, 2018 at 5:04 PM, Linus Torvalds wrote:
> On Fri, Sep 7, 2018 at 12:54 PM Thomas Gleixner  wrote:
>>
>> >  - We execute from an extra page and read from another extra page
>> > during the syscall.  (The latter is because we need to use a relative
>> > addressing mode to find sp1 -- it's the same *cacheline* we'd use
>> > anyway, but we're accessing it using an alias, so it's an extra TLB
>> > entry.)
>>
>> Ok, but is this really an issue with PTI?
>
> I'd expect it to be *more* of an issue with PTI, since you're already
> wasting TLB entries due to the whole "two different page tables".
>
> Sure, address space ID's save you from reloading them all the time,
> but don't help with capacity.
>
> But yeah, in the sense of "with PTI, all kernel entries are slow
> anyway, so none of this matters" is probably correct in a very real
> sense.
>
> That said, the real reason I like Andy's patch series is that I think
> it's simpler than the alternatives (including the current setup). No
> subtle mappings, no nothing. It removes a lot more lines than it adds,
> and half the lines that it *does* add are comments.
>
> Virtual mapping tricks may be cool, but in the end, not having to use
> them is better still, I think.
>

If (and this is a *big* if) all the percpu data is within 2GB of the
entry text, then we could avoid this extra TLB reference by accessing
it directly instead of using an alias.

I suppose the summary is that the retpoline-free trampoline variant is
even more complicated than the code I'm removing in this series, and
that it would be at best a teeny tiny win.  Once all the Spectre dust
settles and we get fixed CPUs, we could consider re-optimizing this.


Re: [PATCH] asm-generic: bug: add unlikely() to BUG_ON()

2018-09-07 Thread Igor Stoppa
On 07/09/18 23:41, Arnd Bergmann wrote:

> Could you add a comment about -Wmaybe-uninitialized next to the definition?
> Otherwise that is easily lost.
>
> Also, I see that the file has two separate definitions of BUG_ON(), and the
> other one does have an unlikely() in it already. Can you change both
> to do the same thing with unlikely() or not unlikely()?


ack

--
igor


Re: Plumbers 2018 - Performance and Scalability Microconference

2018-09-07 Thread John Hubbard
On 9/4/18 2:28 PM, Daniel Jordan wrote:
> Pavel Tatashin, Ying Huang, and I are excited to be organizing a performance 
> and scalability microconference this year at Plumbers[*], which is happening 
> in Vancouver this year.  The microconference is scheduled for the morning of 
> the second day (Wed, Nov 14).
> 
> We have a preliminary agenda and a list of confirmed and interested attendees 
> (cc'ed), and are seeking more of both!
> 
> Some of the items on the agenda as it stands now are:
> 
>  - Promoting huge page usage:  With memory sizes becoming ever larger, huge 
> pages are becoming more and more important to reduce TLB misses and the 
> overhead of memory management itself--that is, to make the system scalable 
> with the memory size.  But there are still some remaining gaps that prevent 
> huge pages from being deployed in some situations, such as huge page 
> allocation latency and memory fragmentation.
> 
>  - Reducing the number of users of mmap_sem:  This semaphore is frequently 
> used throughout the kernel.  In order to facilitate scaling this longstanding 
> bottleneck, these uses should be documented and unnecessary users should be 
> fixed.
> 
>  - Parallelizing cpu-intensive kernel work:  Resolve problems of past 
> approaches including extra threads interfering with other processes, playing 
> well with power management, and proper cgroup accounting for the extra 
> threads.  Bonus topic: proper accounting of workqueue threads running on 
> behalf of cgroups.
> 
>  - Preserving userland during kexec with a hibernation-like mechanism.
> 
> These center around our interests, but having lots of topics to choose from 
> ensures we cover what's most important to the community, so we would like to 
> hear about additional topics and extensions to those listed here.  This 
> includes, but is certainly not limited to, work in progress that would 
> benefit from in-person discussion, real-world performance problems, and 
> experimental and academic work.
> 
> If you haven't already done so, please let us know if you are interested in 
> attending, or have suggestions for other attendees.

Hi Daniel and all,

I'm interested in the first 3 of those 4 topics, so if it doesn't conflict
with HMM topics or fix-gup-with-dma topics, I'd like to attend. GPUs
generally need to access large chunks of memory, and that includes
migrating (dma-copying) pages around.

So for example a multi-threaded migration of huge pages between normal RAM
and GPU memory is an intriguing direction (and I realize that it's a
well-known topic already). Doing that properly (how many threads to use?)
seems like it requires scheduler interaction.

It's also interesting that there are two main huge page systems (THP and
Hugetlbfs), and I sometimes wonder the obvious thing to wonder: are these
sufficiently different to warrant remaining separate, long-term? Yes, I
realize they're quite different in some ways, but still, one wonders. :)


thanks,
-- 
John Hubbard
NVIDIA


Re: linux-next: build warnings from the build of Linus' tree

2018-09-07 Thread Masami Hiramatsu
On Fri, 7 Sep 2018 14:50:59 +0200
Peter Oberparleiter  wrote:

> On 06.09.2018 18:42, Masami Hiramatsu wrote:
> > Peter Oberparleiter  wrote:
> >> I've attached a quick fix that should address both problems. I'd
> >> appreciate if this patch could get some testing before I post proper fix
> >> patches.
> > 
> > Hmm, I'm still not able to reproduce it on powerpc cross build even with
> > CONFIG_LD_DEAD_CODE_DATA_ELIMINATION=y ... So, sorry I couldn't test this 
> > patch.
> 
> Maybe this is related to the compiler/binutils versions used. I'm using
> Fedora 28's gcc-powerpc64 and binutils-powerpc64 packages:
> 
> powerpc64-linux-gnu-gcc (GCC) 7.1.1 20170622 (Red Hat Cross 7.1.1-3)
> GNU ld version 2.29.1-4.fc28
> 
> For reference I'm also attaching the config that I used to reproduce the
> problem on kernel v4.18.
> 
> >> diff --git a/arch/arm/kernel/vmlinux.lds.h b/arch/arm/kernel/vmlinux.lds.h
> >> index ae5fdff18406..2ca33277a28b 100644
> >> --- a/arch/arm/kernel/vmlinux.lds.h
> >> +++ b/arch/arm/kernel/vmlinux.lds.h
> >> @@ -48,6 +48,7 @@
> >>
> >>  #define ARM_DISCARD \
> >>*(.ARM.exidx.exit.text) \
> >> +  *(.ARM.exidx.text.exit) \
> > 
> > BTW, why would we need this?
> 
> That's necessary to fix one of the two ARM linker failures reported via
> https://lkml.org/lkml/2018/8/24/345
> 
> >>> `.text.exit' referenced in section `.ARM.exidx.text.exit' of
> >>> kernel/trace/trace_clock.o: defined in discarded section `.text.exit'
> >>> of kernel/trace/trace_clock.o
> 
> Section ".ARM.exidx.text.exit" refers to ".text.exit" which was
> discarded. With the change above, the extraneous section
> ".ARM.exidx.text.exit" is also discarded, resolving the linker failure.

OK, so does your patch fix the following issue too?

https://patchwork.kernel.org/patch/10584685/

In that case, we should drop the above patch.

Thank you,


-- 
Masami Hiramatsu 
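For readers unfamiliar with the mechanism, a minimal sketch of how a
/DISCARD/ rule in a linker script drops sections (the kernel wraps this in
the ARM_DISCARD macro quoted above; section names taken from the report):

SECTIONS {
	/DISCARD/ : {
		*(.text.exit)		/* exit-time code dropped from vmlinux */
		*(.ARM.exidx.text.exit)	/* its unwind entries must be dropped
					 * too, or the linker sees a reference
					 * to a discarded section and fails
					 * as reported above */
	}
}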


Re: [PATCH V3 06/26] csky: Cache and TLB routines

2018-09-07 Thread Guo Ren
On Fri, Sep 07, 2018 at 04:13:35PM +0200, Arnd Bergmann wrote:
> On Fri, Sep 7, 2018 at 2:55 PM Guo Ren  wrote:
> >
> > On Fri, Sep 07, 2018 at 10:14:38AM +0200, Arnd Bergmann wrote:
> > > On Fri, Sep 7, 2018 at 5:04 AM Guo Ren  wrote:
> > > > On Thu, Sep 06, 2018 at 04:31:16PM +0200, Arnd Bergmann wrote:
> > > Similarly, an MMIO read may be used to see if a DMA has completed
> > > and the device register tells you that the DMA has left the device,
> > > but without a barrier, the CPU may have prefetched the DMA
> > > data while waiting for the MMIO-read to complete. The __io_ar()
> > > barrier() in asm-generic/io.h prevents the compiler from reordering
> > > the two reads, but if a weakly ordered read (in a coherent DMA buffer)
> > > can bypass a strongly ordered read (MMIO), then it's still
> > > broken.
__io_ar() is barrier()? Not rmb()?! I've defined rmb in asm/barrier.h, so
I get rmb() here, not barrier().
> >
> > Only __io_br() is barrier().
> 
> Ah right, I misremembered the defaults. It's probably ok then.
Thx for the review and comments. They made me reconsider the MMIO
issues and will help improve the csky asm/io.h in the future.
> 
> > > > > - How does endianess work? Are there any buses that flip bytes around
> > > > >   when running big-endian, or do you always do that in software?
> > > > Currently we only support little-endian and soc will follow it.
> > >
> > > Ok, that makes it easier. If you think that you won't even need big-endian
> > > support in the long run, you could also remove your asm/byteorder.h
> > > header. If you're not sure, it doesn't hurt to keep it of course.
> > Em... I'm not sure, so let me keep it for a while.
> 
> Ok. I think overall the trend is to be little-endian only for most
> architectures: powerpc64 moved from big-endian only to little-endian
> by default, ARM rarely uses big-endian (basically only for legacy
> applications ported from BE MIPS or ppc), and all new architectures
> we added in the last years are little-endian (OpenRISC being the
> main exception).
Good news; I really don't want to support big-endian, and it would
double the CI load.

Best Regards
 Guo Ren
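For context, a hedged sketch of the ordering discussed above, loosely
modeled on asm-generic/io.h's readl(); the stand-in names are illustrative
(kernel context assumed), and the barrier choices mirror the defaults the
thread settled on, with __io_br() a compiler barrier and __io_ar() an rmb():

#define sketch_io_br()	barrier()	/* compiler-only reordering fence */
#define sketch_io_ar()	rmb()		/* orders this read vs. later reads */

static inline u32 sketch_readl(const volatile u32 __iomem *addr)
{
	u32 val;

	sketch_io_br();
	val = __raw_readl(addr);	/* the strongly ordered MMIO read */
	sketch_io_ar();			/* a later DMA-buffer read must not
					 * pass the MMIO read above */
	return val;
}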


[Patch v3 02/16] CIFS: Use offset when reading pages

2018-09-07 Thread Long Li
From: Long Li 

With offset defined in rdata, transport functions need to look at this
offset when reading data into the correct places in pages.

Signed-off-by: Long Li 
---
 fs/cifs/cifsproto.h |  4 +++-
 fs/cifs/cifssmb.c   |  1 +
 fs/cifs/connect.c   |  5 +++--
 fs/cifs/file.c  | 52 +---
 fs/cifs/smb2ops.c   |  2 +-
 fs/cifs/smb2pdu.c   |  1 +
 6 files changed, 46 insertions(+), 19 deletions(-)

diff --git a/fs/cifs/cifsproto.h b/fs/cifs/cifsproto.h
index dc80f84..1f27d8e 100644
--- a/fs/cifs/cifsproto.h
+++ b/fs/cifs/cifsproto.h
@@ -203,7 +203,9 @@ extern void dequeue_mid(struct mid_q_entry *mid, bool 
malformed);
 extern int cifs_read_from_socket(struct TCP_Server_Info *server, char *buf,
 unsigned int to_read);
 extern int cifs_read_page_from_socket(struct TCP_Server_Info *server,
- struct page *page, unsigned int to_read);
+   struct page *page,
+   unsigned int page_offset,
+   unsigned int to_read);
 extern int cifs_setup_cifs_sb(struct smb_vol *pvolume_info,
   struct cifs_sb_info *cifs_sb);
 extern int cifs_match_super(struct super_block *, void *);
diff --git a/fs/cifs/cifssmb.c b/fs/cifs/cifssmb.c
index c8e4278..a1af258 100644
--- a/fs/cifs/cifssmb.c
+++ b/fs/cifs/cifssmb.c
@@ -1596,6 +1596,7 @@ cifs_readv_callback(struct mid_q_entry *mid)
struct smb_rqst rqst = { .rq_iov = rdata->iov,
 .rq_nvec = 2,
 .rq_pages = rdata->pages,
+.rq_offset = rdata->page_offset,
 .rq_npages = rdata->nr_pages,
 .rq_pagesz = rdata->pagesz,
 .rq_tailsz = rdata->tailsz };
diff --git a/fs/cifs/connect.c b/fs/cifs/connect.c
index 83b0234..8501da0 100644
--- a/fs/cifs/connect.c
+++ b/fs/cifs/connect.c
@@ -594,10 +594,11 @@ cifs_read_from_socket(struct TCP_Server_Info *server, 
char *buf,
 
 int
 cifs_read_page_from_socket(struct TCP_Server_Info *server, struct page *page,
- unsigned int to_read)
+   unsigned int page_offset, unsigned int to_read)
 {
struct msghdr smb_msg;
-   struct bio_vec bv = {.bv_page = page, .bv_len = to_read};
+   struct bio_vec bv = {
+   .bv_page = page, .bv_len = to_read, .bv_offset = page_offset};
iov_iter_bvec(&smb_msg.msg_iter, READ | ITER_BVEC, &bv, 1, to_read);
	return cifs_readv_from_socket(server, &smb_msg);
 }
diff --git a/fs/cifs/file.c b/fs/cifs/file.c
index 1c98293..87eece6 100644
--- a/fs/cifs/file.c
+++ b/fs/cifs/file.c
@@ -3026,12 +3026,20 @@ uncached_fill_pages(struct TCP_Server_Info *server,
int result = 0;
unsigned int i;
unsigned int nr_pages = rdata->nr_pages;
+   unsigned int page_offset = rdata->page_offset;
 
rdata->got_bytes = 0;
rdata->tailsz = PAGE_SIZE;
for (i = 0; i < nr_pages; i++) {
struct page *page = rdata->pages[i];
size_t n;
+   unsigned int segment_size = rdata->pagesz;
+
+   if (i == 0)
+   segment_size -= page_offset;
+   else
+   page_offset = 0;
+
 
if (len <= 0) {
/* no need to hold page hostage */
@@ -3040,24 +3048,25 @@ uncached_fill_pages(struct TCP_Server_Info *server,
put_page(page);
continue;
}
+
n = len;
-   if (len >= PAGE_SIZE) {
+   if (len >= segment_size)
/* enough data to fill the page */
-   n = PAGE_SIZE;
-   len -= n;
-   } else {
-   zero_user(page, len, PAGE_SIZE - len);
+   n = segment_size;
+   else
rdata->tailsz = len;
-   len = 0;
-   }
+   len -= n;
+
if (iter)
-   result = copy_page_from_iter(page, 0, n, iter);
+   result = copy_page_from_iter(
+   page, page_offset, n, iter);
 #ifdef CONFIG_CIFS_SMB_DIRECT
else if (rdata->mr)
result = n;
 #endif
else
-   result = cifs_read_page_from_socket(server, page, n);
+   result = cifs_read_page_from_socket(
+   server, page, page_offset, n);
if (result < 0)
break;
 
@@ -3130,6 +3139,7 @@ cifs_send_async_read(loff_t offset, size_t len, struct 
cifsFileInfo *open_file,
rdata->bytes = cur_len;
rdata->pid = 
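The quoted patch is truncated at this point in the digest. As a hedged
illustration of the loop above (the helper and values are hypothetical,
not from the patch): with rdata->pagesz = 4096 and rdata->page_offset =
1024, only the first page is filled at an offset.

/* Bytes of page i that belong to a buffer starting page_offset bytes
 * into page 0; mirrors the segment_size logic in the hunk above. */
static unsigned int segment_len(unsigned int i, unsigned int pagesz,
				unsigned int page_offset)
{
	/* i == 0: 4096 - 1024 = 3072 bytes; every later page: 4096 */
	return (i == 0) ? pagesz - page_offset : pagesz;
}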

[Patch v3 05/16] CIFS: Calculate the correct request length based on page offset and tail size

2018-09-07 Thread Long Li
From: Long Li 

It's possible that the page offset is non-zero in the pages in a request,
change the function to calculate the correct data buffer length.

Signed-off-by: Long Li 
---
 fs/cifs/transport.c | 20 +---
 1 file changed, 17 insertions(+), 3 deletions(-)

diff --git a/fs/cifs/transport.c b/fs/cifs/transport.c
index 927226a..d6b5523 100644
--- a/fs/cifs/transport.c
+++ b/fs/cifs/transport.c
@@ -212,10 +212,24 @@ rqst_len(struct smb_rqst *rqst)
for (i = 0; i < rqst->rq_nvec; i++)
buflen += iov[i].iov_len;
 
-   /* add in the page array if there is one */
+   /*
+* Add in the page array if there is one. The caller needs to make
+* sure rq_offset and rq_tailsz are set correctly. If a buffer of
+* multiple pages ends at page boundary, rq_tailsz needs to be set to
+* PAGE_SIZE.
+*/
if (rqst->rq_npages) {
-   buflen += rqst->rq_pagesz * (rqst->rq_npages - 1);
-   buflen += rqst->rq_tailsz;
+   if (rqst->rq_npages == 1)
+   buflen += rqst->rq_tailsz;
+   else {
+   /*
+* If there is more than one page, calculate the
+* buffer length based on rq_offset and rq_tailsz
+*/
+   buflen += rqst->rq_pagesz * (rqst->rq_npages - 1) -
+   rqst->rq_offset;
+   buflen += rqst->rq_tailsz;
+   }
}
 
return buflen;
-- 
2.7.4
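A hedged arithmetic check of the formula above (the helper and values are
illustrative only): with rq_npages = 3, rq_pagesz = 4096, rq_offset = 1024
and rq_tailsz = 512, the buffer spans (4096 - 1024) + 4096 + 512 = 7680
bytes, which matches 4096 * 2 - 1024 + 512.

/* Illustrative restatement of the page-array length computed above. */
static unsigned int page_array_len(unsigned int npages, unsigned int pagesz,
				   unsigned int offset, unsigned int tailsz)
{
	if (npages == 1)
		return tailsz;		/* single page: tail only */
	return pagesz * (npages - 1) - offset + tailsz;
}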



[Patch v3 03/16] CIFS: Add support for direct pages in wdata

2018-09-07 Thread Long Li
From: Long Li 

Add a function to allocate wdata without allocating pages for data
transfer. This gives the caller an option to pass in an array of pages
that point to the data buffer to write.

wdata is responsible for freeing those pages after it's done.

Signed-off-by: Long Li 
---
 fs/cifs/cifsglob.h  |  3 +--
 fs/cifs/cifsproto.h |  2 ++
 fs/cifs/cifssmb.c   | 17 ++---
 3 files changed, 17 insertions(+), 5 deletions(-)

diff --git a/fs/cifs/cifsglob.h b/fs/cifs/cifsglob.h
index 166e140..7f62c98 100644
--- a/fs/cifs/cifsglob.h
+++ b/fs/cifs/cifsglob.h
@@ -1199,14 +1199,13 @@ struct cifs_writedata {
int result;
 #ifdef CONFIG_CIFS_SMB_DIRECT
struct smbd_mr  *mr;
-   struct page **direct_pages;
 #endif
unsigned intpagesz;
unsigned intpage_offset;
unsigned inttailsz;
unsigned intcredits;
unsigned intnr_pages;
-   struct page *pages[];
+   struct page **pages;
 };
 
 /*
diff --git a/fs/cifs/cifsproto.h b/fs/cifs/cifsproto.h
index 1f27d8e..7933c5f 100644
--- a/fs/cifs/cifsproto.h
+++ b/fs/cifs/cifsproto.h
@@ -533,6 +533,8 @@ int cifs_async_writev(struct cifs_writedata *wdata,
 void cifs_writev_complete(struct work_struct *work);
 struct cifs_writedata *cifs_writedata_alloc(unsigned int nr_pages,
work_func_t complete);
+struct cifs_writedata *cifs_writedata_direct_alloc(struct page **pages,
+   work_func_t complete);
 void cifs_writedata_release(struct kref *refcount);
 int cifs_query_mf_symlink(unsigned int xid, struct cifs_tcon *tcon,
  struct cifs_sb_info *cifs_sb,
diff --git a/fs/cifs/cifssmb.c b/fs/cifs/cifssmb.c
index a1af258..503e0ed 100644
--- a/fs/cifs/cifssmb.c
+++ b/fs/cifs/cifssmb.c
@@ -1953,6 +1953,7 @@ cifs_writedata_release(struct kref *refcount)
if (wdata->cfile)
cifsFileInfo_put(wdata->cfile);
 
+   kvfree(wdata->pages);
kfree(wdata);
 }
 
@@ -2076,12 +2077,22 @@ cifs_writev_complete(struct work_struct *work)
 struct cifs_writedata *
 cifs_writedata_alloc(unsigned int nr_pages, work_func_t complete)
 {
+   struct page **pages =
+   kzalloc(sizeof(struct page *) * nr_pages, GFP_NOFS);
+   if (pages)
+   return cifs_writedata_direct_alloc(pages, complete);
+
+   return NULL;
+}
+
+struct cifs_writedata *
+cifs_writedata_direct_alloc(struct page **pages, work_func_t complete)
+{
struct cifs_writedata *wdata;
 
-   /* writedata + number of page pointers */
-   wdata = kzalloc(sizeof(*wdata) +
-   sizeof(struct page *) * nr_pages, GFP_NOFS);
+   wdata = kzalloc(sizeof(*wdata), GFP_NOFS);
if (wdata != NULL) {
+   wdata->pages = pages;
kref_init(&wdata->refcount);
INIT_LIST_HEAD(&wdata->list);
init_completion(&wdata->done);
-- 
2.7.4
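A hedged usage sketch of the split allocators after this change (error
handling elided; the wrapper function is hypothetical):

static struct cifs_writedata *sketch_direct_wdata(unsigned int nr_pages)
{
	/* The caller builds the page array; wdata owns it afterwards and
	 * frees it in cifs_writedata_release() when the write completes. */
	struct page **pages = kzalloc(nr_pages * sizeof(struct page *),
				      GFP_NOFS);

	if (!pages)
		return NULL;
	return cifs_writedata_direct_alloc(pages, cifs_writev_complete);
}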



[Patch v3 16/16] CIFS: Add direct I/O functions to file_operations

2018-09-07 Thread Long Li
From: Long Li 

With direct read/write functions implemented, add them to file_operations.

Direct I/O is used under two conditions:
1. When mounting with "cache=none", CIFS uses direct I/O for all user file
data transfer.
2. When opening a file with O_DIRECT, CIFS uses direct I/O for all data
transfer on this file.

Signed-off-by: Long Li 
---
 fs/cifs/cifsfs.c | 10 --
 1 file changed, 4 insertions(+), 6 deletions(-)

diff --git a/fs/cifs/cifsfs.c b/fs/cifs/cifsfs.c
index 62f1662..f18091b 100644
--- a/fs/cifs/cifsfs.c
+++ b/fs/cifs/cifsfs.c
@@ -1113,9 +1113,8 @@ const struct file_operations cifs_file_strict_ops = {
 };
 
 const struct file_operations cifs_file_direct_ops = {
-   /* BB reevaluate whether they can be done with directio, no cache */
-   .read_iter = cifs_user_readv,
-   .write_iter = cifs_user_writev,
+   .read_iter = cifs_direct_readv,
+   .write_iter = cifs_direct_writev,
.open = cifs_open,
.release = cifs_close,
.lock = cifs_lock,
@@ -1169,9 +1168,8 @@ const struct file_operations cifs_file_strict_nobrl_ops = 
{
 };
 
 const struct file_operations cifs_file_direct_nobrl_ops = {
-   /* BB reevaluate whether they can be done with directio, no cache */
-   .read_iter = cifs_user_readv,
-   .write_iter = cifs_user_writev,
+   .read_iter = cifs_direct_readv,
+   .write_iter = cifs_direct_writev,
.open = cifs_open,
.release = cifs_close,
.fsync = cifs_fsync,
-- 
2.7.4



[Patch v3 14/16] CIFS: Add support for direct I/O read

2018-09-07 Thread Long Li
From: Long Li 

With direct I/O read, we transfer the data directly from transport layer to
the user data buffer.

Change in v3: added support for kernel AIO

Signed-off-by: Long Li 
---
 fs/cifs/cifsfs.h   |   1 +
 fs/cifs/cifsglob.h |   5 ++
 fs/cifs/file.c | 209 +
 3 files changed, 186 insertions(+), 29 deletions(-)

diff --git a/fs/cifs/cifsfs.h b/fs/cifs/cifsfs.h
index 5f02318..7fba9aa 100644
--- a/fs/cifs/cifsfs.h
+++ b/fs/cifs/cifsfs.h
@@ -102,6 +102,7 @@ extern int cifs_open(struct inode *inode, struct file 
*file);
 extern int cifs_close(struct inode *inode, struct file *file);
 extern int cifs_closedir(struct inode *inode, struct file *file);
 extern ssize_t cifs_user_readv(struct kiocb *iocb, struct iov_iter *to);
+extern ssize_t cifs_direct_readv(struct kiocb *iocb, struct iov_iter *to);
 extern ssize_t cifs_strict_readv(struct kiocb *iocb, struct iov_iter *to);
 extern ssize_t cifs_user_writev(struct kiocb *iocb, struct iov_iter *from);
 extern ssize_t cifs_strict_writev(struct kiocb *iocb, struct iov_iter *from);
diff --git a/fs/cifs/cifsglob.h b/fs/cifs/cifsglob.h
index 7f62c98..52248dd 100644
--- a/fs/cifs/cifsglob.h
+++ b/fs/cifs/cifsglob.h
@@ -1146,6 +1146,11 @@ struct cifs_aio_ctx {
unsigned intlen;
unsigned inttotal_len;
boolshould_dirty;
+   /*
+* Indicates if this aio_ctx is for direct_io,
+* If yes, iter is a copy of the user passed iov_iter
+*/
+   booldirect_io;
 };
 
 struct cifs_readdata;
diff --git a/fs/cifs/file.c b/fs/cifs/file.c
index 87eece6..476b2a1 100644
--- a/fs/cifs/file.c
+++ b/fs/cifs/file.c
@@ -2965,7 +2965,6 @@ cifs_uncached_readdata_release(struct kref *refcount)
kref_put(&rdata->ctx->refcount, cifs_aio_ctx_release);
for (i = 0; i < rdata->nr_pages; i++) {
put_page(rdata->pages[i]);
-   rdata->pages[i] = NULL;
}
cifs_readdata_release(refcount);
 }
@@ -3004,7 +3003,7 @@ cifs_readdata_to_iov(struct cifs_readdata *rdata, struct 
iov_iter *iter)
return remaining ? -EFAULT : 0;
 }
 
-static void collect_uncached_read_data(struct cifs_aio_ctx *ctx);
+static void collect_uncached_read_data(struct cifs_readdata *rdata, struct 
cifs_aio_ctx *ctx);
 
 static void
 cifs_uncached_readv_complete(struct work_struct *work)
@@ -3013,7 +3012,7 @@ cifs_uncached_readv_complete(struct work_struct *work)
struct cifs_readdata, work);
 
complete(&rdata->done);
-   collect_uncached_read_data(rdata->ctx);
+   collect_uncached_read_data(rdata, rdata->ctx);
/* the below call can possibly free the last ref to aio ctx */
kref_put(&rdata->refcount, cifs_uncached_readdata_release);
 }
@@ -3103,6 +3102,9 @@ cifs_send_async_read(loff_t offset, size_t len, struct 
cifsFileInfo *open_file,
int rc;
pid_t pid;
struct TCP_Server_Info *server;
+   struct page **pagevec;
+   size_t start;
+   struct iov_iter direct_iov = ctx->iter;
 
server = tlink_tcon(open_file->tlink)->ses->server;
 
@@ -3111,6 +3113,9 @@ cifs_send_async_read(loff_t offset, size_t len, struct 
cifsFileInfo *open_file,
else
pid = current->tgid;
 
+   if (ctx->direct_io)
+   iov_iter_advance(&direct_iov, offset - ctx->pos);
+
do {
rc = server->ops->wait_mtu_credits(server, cifs_sb->rsize,
   &rsize, &credits);
@@ -3118,20 +3123,56 @@ cifs_send_async_read(loff_t offset, size_t len, struct 
cifsFileInfo *open_file,
break;
 
cur_len = min_t(const size_t, len, rsize);
-   npages = DIV_ROUND_UP(cur_len, PAGE_SIZE);
 
-   /* allocate a readdata struct */
-   rdata = cifs_readdata_alloc(npages,
+   if (ctx->direct_io) {
+
+   cur_len = iov_iter_get_pages_alloc(
+   &direct_iov, &pagevec,
+   cur_len, &start);
+   if (cur_len < 0) {
+   cifs_dbg(VFS,
+   "couldn't get user pages (cur_len=%zd)"
+   " iter type %d"
+   " iov_offset %lu count %lu\n",
+   cur_len, direct_iov.type, 
direct_iov.iov_offset,
+   direct_iov.count);
+   dump_stack();
+   break;
+   }
+   iov_iter_advance(&direct_iov, cur_len);
+
+   rdata = cifs_readdata_direct_alloc(
+   pagevec, cifs_uncached_readv_complete);
+   if (!rdata) {
+   add_credits_and_wake_if(server, 
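The quoted patch is truncated here in the digest. For orientation, a
hedged sketch of the page-pinning step it builds on (the wrapper is
hypothetical; iov_iter_get_pages_alloc() is the real API of this era):

/* Pin the user buffer's pages for a direct read instead of allocating
 * page-cache pages. "start" receives the offset into the first pinned
 * page, which is why the earlier patches taught rdata/rqst about
 * page offsets. */
static ssize_t sketch_pin_user_pages(struct iov_iter *iter, size_t want,
				     struct page ***pagevec, size_t *start)
{
	return iov_iter_get_pages_alloc(iter, pagevec, want, start);
}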

[Patch v3 04/16] CIFS: pass page offset when issuing SMB write

2018-09-07 Thread Long Li
From: Long Li 

When issuing SMB writes, pass along the write data page offset to transport.

Signed-off-by: Long Li 
---
 fs/cifs/cifssmb.c | 1 +
 fs/cifs/smb2pdu.c | 1 +
 2 files changed, 2 insertions(+)

diff --git a/fs/cifs/cifssmb.c b/fs/cifs/cifssmb.c
index 503e0ed..0a57c61 100644
--- a/fs/cifs/cifssmb.c
+++ b/fs/cifs/cifssmb.c
@@ -2200,6 +2200,7 @@ cifs_async_writev(struct cifs_writedata *wdata,
rqst.rq_iov = iov;
rqst.rq_nvec = 2;
rqst.rq_pages = wdata->pages;
+   rqst.rq_offset = wdata->page_offset;
rqst.rq_npages = wdata->nr_pages;
rqst.rq_pagesz = wdata->pagesz;
rqst.rq_tailsz = wdata->tailsz;
diff --git a/fs/cifs/smb2pdu.c b/fs/cifs/smb2pdu.c
index 6c22da8..f603fbe 100644
--- a/fs/cifs/smb2pdu.c
+++ b/fs/cifs/smb2pdu.c
@@ -3046,6 +3046,7 @@ smb2_async_writev(struct cifs_writedata *wdata,
rqst.rq_iov = iov;
rqst.rq_nvec = 2;
rqst.rq_pages = wdata->pages;
+   rqst.rq_offset = wdata->page_offset;
rqst.rq_npages = wdata->nr_pages;
rqst.rq_pagesz = wdata->pagesz;
rqst.rq_tailsz = wdata->tailsz;
-- 
2.7.4



[Patch v3 08/16] CIFS: SMBD: Support page offset in RDMA send

2018-09-07 Thread Long Li
From: Long Li 

The RDMA send function needs to look at the offset into the request pages
and send data starting from there.

Signed-off-by: Long Li 
---
 fs/cifs/smbdirect.c | 27 +++
 1 file changed, 19 insertions(+), 8 deletions(-)

diff --git a/fs/cifs/smbdirect.c b/fs/cifs/smbdirect.c
index c62f7c9..6141e3c 100644
--- a/fs/cifs/smbdirect.c
+++ b/fs/cifs/smbdirect.c
@@ -17,6 +17,7 @@
 #include 
 #include "smbdirect.h"
 #include "cifs_debug.h"
+#include "cifsproto.h"
 
 static struct smbd_response *get_empty_queue_buffer(
struct smbd_connection *info);
@@ -2082,7 +2083,7 @@ int smbd_send(struct smbd_connection *info, struct 
smb_rqst *rqst)
struct kvec vec;
int nvecs;
int size;
-   int buflen = 0, remaining_data_length;
+   unsigned int buflen = 0, remaining_data_length;
int start, i, j;
int max_iov_size =
info->max_send_size - sizeof(struct smbd_data_transfer);
@@ -2113,10 +2114,17 @@ int smbd_send(struct smbd_connection *info, struct 
smb_rqst *rqst)
buflen += iov[i].iov_len;
}
 
-   /* add in the page array if there is one */
+   /*
+* Add in the page array if there is one. The caller needs to set
+* rq_tailsz to PAGE_SIZE when the buffer has multiple pages and
+* ends at page boundary
+*/
if (rqst->rq_npages) {
-   buflen += rqst->rq_pagesz * (rqst->rq_npages - 1);
-   buflen += rqst->rq_tailsz;
+   if (rqst->rq_npages == 1)
+   buflen += rqst->rq_tailsz;
+   else
+   buflen += rqst->rq_pagesz * (rqst->rq_npages - 1) -
+   rqst->rq_offset + rqst->rq_tailsz;
}
 
if (buflen + sizeof(struct smbd_data_transfer) >
@@ -2213,8 +2221,9 @@ int smbd_send(struct smbd_connection *info, struct 
smb_rqst *rqst)
 
/* now sending pages if there are any */
for (i = 0; i < rqst->rq_npages; i++) {
-   buflen = (i == rqst->rq_npages-1) ?
-   rqst->rq_tailsz : rqst->rq_pagesz;
+   unsigned int offset;
+
+   rqst_page_get_length(rqst, i, &buflen, &offset);
nvecs = (buflen + max_iov_size - 1) / max_iov_size;
log_write(INFO, "sending pages buflen=%d nvecs=%d\n",
buflen, nvecs);
@@ -2225,9 +2234,11 @@ int smbd_send(struct smbd_connection *info, struct 
smb_rqst *rqst)
remaining_data_length -= size;
log_write(INFO, "sending pages i=%d offset=%d size=%d"
" remaining_data_length=%d\n",
-   i, j*max_iov_size, size, remaining_data_length);
+   i, j*max_iov_size+offset, size,
+   remaining_data_length);
rc = smbd_post_send_page(
-   info, rqst->rq_pages[i], j*max_iov_size,
+   info, rqst->rq_pages[i],
+   j*max_iov_size + offset,
size, remaining_data_length);
if (rc)
goto done;
-- 
2.7.4
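
For illustration, the multi-page length computation above can be checked in
isolation. Below is a minimal userspace sketch with made-up values; PAGE_SIZE
and the smb_rqst fields are modelled as plain parameters, and the function
name is invented for the example:

#include <stdio.h>

/* Mirrors the smbd_send() buflen computation: with an offset into the
 * first page, that page contributes (pagesz - offset) bytes, the middle
 * pages are full, and the last page contributes tailsz bytes. */
static unsigned int rqst_data_length(unsigned int npages, unsigned int pagesz,
                                     unsigned int offset, unsigned int tailsz)
{
        if (npages == 0)
                return 0;
        if (npages == 1)
                return tailsz;
        return pagesz * (npages - 1) - offset + tailsz;
}

int main(void)
{
        /* 3 pages of 4096 bytes, data starting 256 bytes into page 0 and
         * ending 512 bytes into the last page: 4096*2 - 256 + 512 = 8448 */
        printf("%u\n", rqst_data_length(3, 4096, 256, 512));
        return 0;
}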



[Patch v3 14/16] CIFS: Add support for direct I/O read

2018-09-07 Thread Long Li
From: Long Li 

With direct I/O read, we transfer the data directly from the transport layer
to the user data buffer.

Change in v3: added support for kernel AIO

Signed-off-by: Long Li 
---
 fs/cifs/cifsfs.h   |   1 +
 fs/cifs/cifsglob.h |   5 ++
 fs/cifs/file.c | 209 +
 3 files changed, 186 insertions(+), 29 deletions(-)

diff --git a/fs/cifs/cifsfs.h b/fs/cifs/cifsfs.h
index 5f02318..7fba9aa 100644
--- a/fs/cifs/cifsfs.h
+++ b/fs/cifs/cifsfs.h
@@ -102,6 +102,7 @@ extern int cifs_open(struct inode *inode, struct file 
*file);
 extern int cifs_close(struct inode *inode, struct file *file);
 extern int cifs_closedir(struct inode *inode, struct file *file);
 extern ssize_t cifs_user_readv(struct kiocb *iocb, struct iov_iter *to);
+extern ssize_t cifs_direct_readv(struct kiocb *iocb, struct iov_iter *to);
 extern ssize_t cifs_strict_readv(struct kiocb *iocb, struct iov_iter *to);
 extern ssize_t cifs_user_writev(struct kiocb *iocb, struct iov_iter *from);
 extern ssize_t cifs_strict_writev(struct kiocb *iocb, struct iov_iter *from);
diff --git a/fs/cifs/cifsglob.h b/fs/cifs/cifsglob.h
index 7f62c98..52248dd 100644
--- a/fs/cifs/cifsglob.h
+++ b/fs/cifs/cifsglob.h
@@ -1146,6 +1146,11 @@ struct cifs_aio_ctx {
unsigned int len;
unsigned int total_len;
bool should_dirty;
+   /*
+    * Indicates if this aio_ctx is for direct_io,
+    * If yes, iter is a copy of the user passed iov_iter
+    */
+   bool direct_io;
 };
 
 struct cifs_readdata;
diff --git a/fs/cifs/file.c b/fs/cifs/file.c
index 87eece6..476b2a1 100644
--- a/fs/cifs/file.c
+++ b/fs/cifs/file.c
@@ -2965,7 +2965,6 @@ cifs_uncached_readdata_release(struct kref *refcount)
kref_put(&rdata->ctx->refcount, cifs_aio_ctx_release);
for (i = 0; i < rdata->nr_pages; i++) {
put_page(rdata->pages[i]);
-   rdata->pages[i] = NULL;
}
cifs_readdata_release(refcount);
 }
@@ -3004,7 +3003,7 @@ cifs_readdata_to_iov(struct cifs_readdata *rdata, struct 
iov_iter *iter)
return remaining ? -EFAULT : 0;
 }
 
-static void collect_uncached_read_data(struct cifs_aio_ctx *ctx);
+static void collect_uncached_read_data(struct cifs_readdata *rdata, struct 
cifs_aio_ctx *ctx);
 
 static void
 cifs_uncached_readv_complete(struct work_struct *work)
@@ -3013,7 +3012,7 @@ cifs_uncached_readv_complete(struct work_struct *work)
struct cifs_readdata, work);
 
complete(&rdata->done);
-   collect_uncached_read_data(rdata->ctx);
+   collect_uncached_read_data(rdata, rdata->ctx);
/* the below call can possibly free the last ref to aio ctx */
kref_put(&rdata->refcount, cifs_uncached_readdata_release);
 }
@@ -3103,6 +3102,9 @@ cifs_send_async_read(loff_t offset, size_t len, struct 
cifsFileInfo *open_file,
int rc;
pid_t pid;
struct TCP_Server_Info *server;
+   struct page **pagevec;
+   size_t start;
+   struct iov_iter direct_iov = ctx->iter;
 
server = tlink_tcon(open_file->tlink)->ses->server;
 
@@ -3111,6 +3113,9 @@ cifs_send_async_read(loff_t offset, size_t len, struct 
cifsFileInfo *open_file,
else
pid = current->tgid;
 
+   if (ctx->direct_io)
+   iov_iter_advance(&direct_iov, offset - ctx->pos);
+
do {
rc = server->ops->wait_mtu_credits(server, cifs_sb->rsize,
   &rsize, &credits);
@@ -3118,20 +3123,56 @@ cifs_send_async_read(loff_t offset, size_t len, struct 
cifsFileInfo *open_file,
break;
 
cur_len = min_t(const size_t, len, rsize);
-   npages = DIV_ROUND_UP(cur_len, PAGE_SIZE);
 
-   /* allocate a readdata struct */
-   rdata = cifs_readdata_alloc(npages,
+   if (ctx->direct_io) {
+
+   cur_len = iov_iter_get_pages_alloc(
+   &direct_iov, &pagevec,
+   cur_len, &start);
+   if (cur_len < 0) {
+   cifs_dbg(VFS,
+   "couldn't get user pages (cur_len=%zd)"
+   " iter type %d"
+   " iov_offset %lu count %lu\n",
+   cur_len, direct_iov.type, 
direct_iov.iov_offset,
+   direct_iov.count);
+   dump_stack();
+   break;
+   }
+   iov_iter_advance(&direct_iov, cur_len);
+
+   rdata = cifs_readdata_direct_alloc(
+   pagevec, cifs_uncached_readv_complete);
+   if (!rdata) {
+   add_credits_and_wake_if(server, 

[Patch v3 12/16] CIFS: Pass page offset for calculating signature

2018-09-07 Thread Long Li
From: Long Li 

When calculating the signature for a packet, the code needs to read from the
correct page offset for the data.

Signed-off-by: Long Li 
---
 fs/cifs/cifsencrypt.c | 9 +
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/fs/cifs/cifsencrypt.c b/fs/cifs/cifsencrypt.c
index a6ef088..e88303c 100644
--- a/fs/cifs/cifsencrypt.c
+++ b/fs/cifs/cifsencrypt.c
@@ -68,11 +68,12 @@ int __cifs_calc_signature(struct smb_rqst *rqst,
 
/* now hash over the rq_pages array */
for (i = 0; i < rqst->rq_npages; i++) {
-   void *kaddr = kmap(rqst->rq_pages[i]);
-   size_t len = rqst->rq_pagesz;
+   void *kaddr;
+   unsigned int len, offset;
 
-   if (i == rqst->rq_npages - 1)
-   len = rqst->rq_tailsz;
+   rqst_page_get_length(rqst, i, &len, &offset);
+
+   kaddr = (char *) kmap(rqst->rq_pages[i]) + offset;
 
crypto_shash_update(shash, kaddr, len);
 
-- 
2.7.4



[Patch v3 13/16] CIFS: Pass page offset for encrypting

2018-09-07 Thread Long Li
From: Long Li 

The encryption function needs to read data starting at the page offset in the
input buffer.

This doesn't affect the decryption path since it allocates its own page
buffers.

Signed-off-by: Long Li 
---
 fs/cifs/smb2ops.c | 20 +---
 1 file changed, 13 insertions(+), 7 deletions(-)

diff --git a/fs/cifs/smb2ops.c b/fs/cifs/smb2ops.c
index 1fa1c29..38d19b6 100644
--- a/fs/cifs/smb2ops.c
+++ b/fs/cifs/smb2ops.c
@@ -2189,9 +2189,10 @@ init_sg(struct smb_rqst *rqst, u8 *sign)
smb2_sg_set_buf(&sg[i], rqst->rq_iov[i].iov_base,
rqst->rq_iov[i].iov_len);
for (j = 0; i < sg_len - 1; i++, j++) {
-   unsigned int len = (j < rqst->rq_npages - 1) ? rqst->rq_pagesz
-   : rqst->rq_tailsz;
-   sg_set_page(&sg[i], rqst->rq_pages[j], len, 0);
+   unsigned int len, offset;
+
+   rqst_page_get_length(rqst, j, &len, &offset);
+   sg_set_page(&sg[i], rqst->rq_pages[j], len, offset);
}
smb2_sg_set_buf(&sg[sg_len - 1], sign, SMB2_SIGNATURE_SIZE);
return sg;
@@ -2332,6 +2333,7 @@ smb3_init_transform_rq(struct TCP_Server_Info *server, 
struct smb_rqst *new_rq,
return rc;
 
new_rq->rq_pages = pages;
+   new_rq->rq_offset = old_rq->rq_offset;
new_rq->rq_npages = old_rq->rq_npages;
new_rq->rq_pagesz = old_rq->rq_pagesz;
new_rq->rq_tailsz = old_rq->rq_tailsz;
@@ -2363,10 +2365,14 @@ smb3_init_transform_rq(struct TCP_Server_Info *server, 
struct smb_rqst *new_rq,
 
/* copy pages form the old */
for (i = 0; i < npages; i++) {
-   char *dst = kmap(new_rq->rq_pages[i]);
-   char *src = kmap(old_rq->rq_pages[i]);
-   unsigned int len = (i < npages - 1) ? new_rq->rq_pagesz :
-   new_rq->rq_tailsz;
+   char *dst, *src;
+   unsigned int offset, len;
+
+   rqst_page_get_length(new_rq, i, &len, &offset);
+
+   dst = (char *) kmap(new_rq->rq_pages[i]) + offset;
+   src = (char *) kmap(old_rq->rq_pages[i]) + offset;
+
memcpy(dst, src, len);
kunmap(new_rq->rq_pages[i]);
kunmap(old_rq->rq_pages[i]);
-- 
2.7.4



[Patch v3 01/16] CIFS: Add support for direct pages in rdata

2018-09-07 Thread Long Li
From: Long Li 

Add a function to allocate rdata without allocating pages for data
transfer. This gives the caller an option to pass a number of pages
that point to the data buffer.

rdata is responsible for freeing those pages after it's done.

Signed-off-by: Long Li 
---
 fs/cifs/cifsglob.h |  3 +--
 fs/cifs/file.c | 23 ---
 2 files changed, 21 insertions(+), 5 deletions(-)

diff --git a/fs/cifs/cifsglob.h b/fs/cifs/cifsglob.h
index 80a34ce..166e140 100644
--- a/fs/cifs/cifsglob.h
+++ b/fs/cifs/cifsglob.h
@@ -1173,14 +1173,13 @@ struct cifs_readdata {
struct kvec iov[2];
 #ifdef CONFIG_CIFS_SMB_DIRECT
struct smbd_mr  *mr;
-   struct page **direct_pages;
 #endif
unsigned int pagesz;
unsigned int page_offset;
unsigned int tailsz;
unsigned int credits;
unsigned int nr_pages;
-   struct page *pages[];
+   struct page **pages;
 };
 
 struct cifs_writedata;
diff --git a/fs/cifs/file.c b/fs/cifs/file.c
index 23fd430..1c98293 100644
--- a/fs/cifs/file.c
+++ b/fs/cifs/file.c
@@ -2880,13 +2880,13 @@ cifs_strict_writev(struct kiocb *iocb, struct iov_iter 
*from)
 }
 
 static struct cifs_readdata *
-cifs_readdata_alloc(unsigned int nr_pages, work_func_t complete)
+cifs_readdata_direct_alloc(struct page **pages, work_func_t complete)
 {
struct cifs_readdata *rdata;
 
-   rdata = kzalloc(sizeof(*rdata) + (sizeof(struct page *) * nr_pages),
-   GFP_KERNEL);
+   rdata = kzalloc(sizeof(*rdata), GFP_KERNEL);
if (rdata != NULL) {
+   rdata->pages = pages;
kref_init(&rdata->refcount);
INIT_LIST_HEAD(&rdata->list);
init_completion(&rdata->done);
@@ -2896,6 +2896,22 @@ cifs_readdata_alloc(unsigned int nr_pages, work_func_t 
complete)
return rdata;
 }
 
+static struct cifs_readdata *
+cifs_readdata_alloc(unsigned int nr_pages, work_func_t complete)
+{
+   struct page **pages =
+   kzalloc(sizeof(struct page *) * nr_pages, GFP_KERNEL);
+   struct cifs_readdata *ret = NULL;
+
+   if (pages) {
+   ret = cifs_readdata_direct_alloc(pages, complete);
+   if (!ret)
+   kfree(pages);
+   }
+
+   return ret;
+}
+
 void
 cifs_readdata_release(struct kref *refcount)
 {
@@ -2910,6 +2926,7 @@ cifs_readdata_release(struct kref *refcount)
if (rdata->cfile)
cifsFileInfo_put(rdata->cfile);
 
+   kvfree(rdata->pages);
kfree(rdata);
 }
 
-- 
2.7.4
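
The allocator split above follows a common pattern: the direct variant adopts
a caller-provided vector, and the old entry point becomes a thin wrapper that
allocates the vector itself. A minimal userspace analogue with invented names
(kzalloc/kvfree replaced by calloc/free):

#include <stdlib.h>

struct rdata_demo {
        void **pages;           /* now a pointer, not a flexible array */
        /* other fields elided */
};

/* Adopt a caller-provided page vector; it is freed with the rdata. */
static struct rdata_demo *rdata_direct_alloc(void **pages)
{
        struct rdata_demo *rd = calloc(1, sizeof(*rd));

        if (rd)
                rd->pages = pages;
        return rd;
}

/* Original behaviour, layered on top of the direct variant. */
static struct rdata_demo *rdata_alloc(unsigned int nr_pages)
{
        void **pages = calloc(nr_pages, sizeof(*pages));
        struct rdata_demo *rd = NULL;

        if (pages) {
                rd = rdata_direct_alloc(pages);
                if (!rd)
                        free(pages);
        }
        return rd;
}

static void rdata_release(struct rdata_demo *rd)
{
        free(rd->pages);        /* works for both allocation paths */
        free(rd);
}

int main(void)
{
        struct rdata_demo *rd = rdata_alloc(4);

        if (rd)
                rdata_release(rd);
        return 0;
}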



[Patch v3 09/16] CIFS: SMBD: Support page offset in RDMA recv

2018-09-07 Thread Long Li
From: Long Li 

The RDMA recv function needs to place data at the correct location, starting
at the page offset.

Signed-off-by: Long Li 
---
 fs/cifs/smbdirect.c | 18 +++---
 1 file changed, 11 insertions(+), 7 deletions(-)

diff --git a/fs/cifs/smbdirect.c b/fs/cifs/smbdirect.c
index 6141e3c..ba53c52 100644
--- a/fs/cifs/smbdirect.c
+++ b/fs/cifs/smbdirect.c
@@ -2004,10 +2004,12 @@ static int smbd_recv_buf(struct smbd_connection *info, 
char *buf,
  * return value: actual data read
  */
 static int smbd_recv_page(struct smbd_connection *info,
-   struct page *page, unsigned int to_read)
+   struct page *page, unsigned int page_offset,
+   unsigned int to_read)
 {
int ret;
char *to_address;
+   void *page_address;
 
/* make sure we have the page ready for read */
ret = wait_event_interruptible(
@@ -2015,16 +2017,17 @@ static int smbd_recv_page(struct smbd_connection *info,
info->reassembly_data_length >= to_read ||
info->transport_status != SMBD_CONNECTED);
if (ret)
-   return 0;
+   return ret;
 
/* now we can read from reassembly queue and not sleep */
-   to_address = kmap_atomic(page);
+   page_address = kmap_atomic(page);
+   to_address = (char *) page_address + page_offset;
 
log_read(INFO, "reading from page=%p address=%p to_read=%d\n",
page, to_address, to_read);
 
ret = smbd_recv_buf(info, to_address, to_read);
-   kunmap_atomic(to_address);
+   kunmap_atomic(page_address);
 
return ret;
 }
@@ -2038,7 +2041,7 @@ int smbd_recv(struct smbd_connection *info, struct msghdr 
*msg)
 {
char *buf;
struct page *page;
-   unsigned int to_read;
+   unsigned int to_read, page_offset;
int rc;
 
info->smbd_recv_pending++;
@@ -2052,15 +2055,16 @@ int smbd_recv(struct smbd_connection *info, struct 
msghdr *msg)
 
case READ | ITER_BVEC:
page = msg->msg_iter.bvec->bv_page;
+   page_offset = msg->msg_iter.bvec->bv_offset;
to_read = msg->msg_iter.bvec->bv_len;
-   rc = smbd_recv_page(info, page, to_read);
+   rc = smbd_recv_page(info, page, page_offset, to_read);
break;
 
default:
/* It's a bug in upper layer to get there */
cifs_dbg(VFS, "CIFS: invalid msg type %d\n",
msg->msg_iter.type);
-   rc = -EIO;
+   rc = -EINVAL;
}
 
info->smbd_recv_pending--;
-- 
2.7.4



[Patch v3 07/16] CIFS: When sending data on socket, pass the correct page offset

2018-09-07 Thread Long Li
From: Long Li 

It's possible that the offset is non-zero in the page to send; change the
function to pass this offset to the socket.

Signed-off-by: Long Li 
---
 fs/cifs/transport.c | 14 ++
 1 file changed, 6 insertions(+), 8 deletions(-)

diff --git a/fs/cifs/transport.c b/fs/cifs/transport.c
index d6b5523..5c96ee8 100644
--- a/fs/cifs/transport.c
+++ b/fs/cifs/transport.c
@@ -288,15 +288,13 @@ __smb_send_rqst(struct TCP_Server_Info *server, struct 
smb_rqst *rqst)
 
/* now walk the page array and send each page in it */
for (i = 0; i < rqst->rq_npages; i++) {
-   size_t len = i == rqst->rq_npages - 1
-   ? rqst->rq_tailsz
-   : rqst->rq_pagesz;
-   struct bio_vec bvec = {
-   .bv_page = rqst->rq_pages[i],
-   .bv_len = len
-   };
+   struct bio_vec bvec;
+
+   bvec.bv_page = rqst->rq_pages[i];
+   rqst_page_get_length(rqst, i, &bvec.bv_len, &bvec.bv_offset);
+
iov_iter_bvec(&smb_msg.msg_iter, WRITE | ITER_BVEC,
- &bvec, 1, len);
+ &bvec, 1, bvec.bv_len);
rc = smb_send_kvec(server, &smb_msg, &sent);
if (rc < 0)
break;
-- 
2.7.4



[Patch v3 10/16] CIFS: SMBD: Do not call ib_dereg_mr on invalidated memory registration

2018-09-07 Thread Long Li
From: Long Li 

It is not necessary to deregister a memory registration after it has been
successfully invalidated.

Signed-off-by: Long Li 
---
 fs/cifs/smbdirect.c | 82 ++---
 1 file changed, 41 insertions(+), 41 deletions(-)

diff --git a/fs/cifs/smbdirect.c b/fs/cifs/smbdirect.c
index ba53c52..b470cd0 100644
--- a/fs/cifs/smbdirect.c
+++ b/fs/cifs/smbdirect.c
@@ -2296,50 +2296,50 @@ static void smbd_mr_recovery_work(struct work_struct 
*work)
int rc;
 
list_for_each_entry(smbdirect_mr, &info->mr_list, list) {
-   if (smbdirect_mr->state == MR_INVALIDATED ||
-   smbdirect_mr->state == MR_ERROR) {
-
-   if (smbdirect_mr->state == MR_INVALIDATED) {
-   ib_dma_unmap_sg(
-   info->id->device, smbdirect_mr->sgl,
-   smbdirect_mr->sgl_count,
-   smbdirect_mr->dir);
-   smbdirect_mr->state = MR_READY;
-   } else if (smbdirect_mr->state == MR_ERROR) {
-
-   /* recover this MR entry */
-   rc = ib_dereg_mr(smbdirect_mr->mr);
-   if (rc) {
-   log_rdma_mr(ERR,
-   "ib_dereg_mr failed rc=%x\n",
-   rc);
-   smbd_disconnect_rdma_connection(info);
-   }
+   if (smbdirect_mr->state == MR_INVALIDATED)
+   ib_dma_unmap_sg(
+   info->id->device, smbdirect_mr->sgl,
+   smbdirect_mr->sgl_count,
+   smbdirect_mr->dir);
+   else if (smbdirect_mr->state == MR_ERROR) {
+
+   /* recover this MR entry */
+   rc = ib_dereg_mr(smbdirect_mr->mr);
+   if (rc) {
+   log_rdma_mr(ERR,
+   "ib_dereg_mr failed rc=%x\n",
+   rc);
+   smbd_disconnect_rdma_connection(info);
+   continue;
+   }
 
-   smbdirect_mr->mr = ib_alloc_mr(
-   info->pd, info->mr_type,
+   smbdirect_mr->mr = ib_alloc_mr(
+   info->pd, info->mr_type,
+   info->max_frmr_depth);
+   if (IS_ERR(smbdirect_mr->mr)) {
+   log_rdma_mr(ERR,
+   "ib_alloc_mr failed mr_type=%x "
+   "max_frmr_depth=%x\n",
+   info->mr_type,
info->max_frmr_depth);
-   if (IS_ERR(smbdirect_mr->mr)) {
-   log_rdma_mr(ERR,
-   "ib_alloc_mr failed mr_type=%x "
-   "max_frmr_depth=%x\n",
-   info->mr_type,
-   info->max_frmr_depth);
-   smbd_disconnect_rdma_connection(info);
-   }
-
-   smbdirect_mr->state = MR_READY;
+   smbd_disconnect_rdma_connection(info);
+   continue;
}
-   /* smbdirect_mr->state is updated by this function
-* and is read and updated by I/O issuing CPUs trying
-* to get a MR, the call to atomic_inc_return
-* implicates a memory barrier and guarantees this
-* value is updated before waking up any calls to
-* get_mr() from the I/O issuing CPUs
-*/
-   if (atomic_inc_return(&info->mr_ready_count) == 1)
-   wake_up_interruptible(&info->wait_mr);
-   }
+   } else
+   /* This MR is being used, don't recover it */
+   continue;
+
+   smbdirect_mr->state = MR_READY;
+
+   /* smbdirect_mr->state is updated by this function
+* and is read and updated by I/O issuing CPUs trying
+* to get a MR, the call to atomic_inc_return
+* implicates a memory barrier and guarantees this
+* value is updated before waking up any calls to
+* get_mr() from the I/O issuing CPUs
+*/
+ 

[Patch v3 06/16] CIFS: Introduce helper function to get page offset and length in smb_rqst

2018-09-07 Thread Long Li
From: Long Li 

Introduce a function rqst_page_get_length to return the page offset and
length for a given page in smb_rqst. This function is to be used by the
following patches.

Signed-off-by: Long Li 
---
 fs/cifs/cifsproto.h |  3 +++
 fs/cifs/misc.c  | 17 +
 2 files changed, 20 insertions(+)

diff --git a/fs/cifs/cifsproto.h b/fs/cifs/cifsproto.h
index 7933c5f..89dda14 100644
--- a/fs/cifs/cifsproto.h
+++ b/fs/cifs/cifsproto.h
@@ -557,4 +557,7 @@ int cifs_alloc_hash(const char *name, struct crypto_shash 
**shash,
struct sdesc **sdesc);
 void cifs_free_hash(struct crypto_shash **shash, struct sdesc **sdesc);
 
+extern void rqst_page_get_length(struct smb_rqst *rqst, unsigned int page,
+   unsigned int *len, unsigned int *offset);
+
 #endif /* _CIFSPROTO_H */
diff --git a/fs/cifs/misc.c b/fs/cifs/misc.c
index 96849b5..e951417 100644
--- a/fs/cifs/misc.c
+++ b/fs/cifs/misc.c
@@ -905,3 +905,20 @@ cifs_free_hash(struct crypto_shash **shash, struct sdesc 
**sdesc)
crypto_free_shash(*shash);
*shash = NULL;
 }
+
+/**
+ * rqst_page_get_length - obtain the length and offset for a page in smb_rqst
+ * Input: rqst - a smb_rqst, page - a page index for rqst
+ * Output: *len - the length for this page, *offset - the offset for this page
+ */
+void rqst_page_get_length(struct smb_rqst *rqst, unsigned int page,
+   unsigned int *len, unsigned int *offset)
+{
+   *len = rqst->rq_pagesz;
+   *offset = (page == 0) ? rqst->rq_offset : 0;
+
+   if (rqst->rq_npages == 1 || page == rqst->rq_npages-1)
+   *len = rqst->rq_tailsz;
+   else if (page == 0)
+   *len = rqst->rq_pagesz - rqst->rq_offset;
+}
-- 
2.7.4
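
The first/middle/last-page behaviour of this helper is easier to see with
concrete numbers. Below is a userspace mirror of the same logic on a stand-in
struct, with made-up values:

#include <stdio.h>

struct fake_rqst {
        unsigned int rq_npages, rq_pagesz, rq_offset, rq_tailsz;
};

/* Same logic as rqst_page_get_length(), on a stand-in struct. */
static void page_get_length(const struct fake_rqst *r, unsigned int page,
                            unsigned int *len, unsigned int *offset)
{
        *len = r->rq_pagesz;
        *offset = (page == 0) ? r->rq_offset : 0;

        if (r->rq_npages == 1 || page == r->rq_npages - 1)
                *len = r->rq_tailsz;
        else if (page == 0)
                *len = r->rq_pagesz - r->rq_offset;
}

int main(void)
{
        struct fake_rqst r = { .rq_npages = 3, .rq_pagesz = 4096,
                               .rq_offset = 256, .rq_tailsz = 512 };
        unsigned int i, len, off;

        /* prints: page 0: offset=256 len=3840
         *         page 1: offset=0   len=4096
         *         page 2: offset=0   len=512  */
        for (i = 0; i < r.rq_npages; i++) {
                page_get_length(&r, i, &len, &off);
                printf("page %u: offset=%u len=%u\n", i, off, len);
        }
        return 0;
}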



[Patch v3 00/16] CIFS: add support for direct I/O

2018-09-07 Thread Long Li
From: Long Li 

This patch set implements direct I/O.

In the normal code path (even with cache=none), CIFS copies I/O data from
user space to kernel space because the protocol may require signing and
encryption of user data.

With this patch set, CIFS passes the I/O data directly from the user-space
buffer to the transport layer when the file system is mounted with
"cache=none".

Patch v2 addressed comments from Christoph Hellwig  and
Tom Talpey  to implement direct I/O for both
socket and RDMA.

Patch v3 added support for kernel AIO.


Long Li (16):
  CIFS: Add support for direct pages in rdata
  CIFS: Use offset when reading pages
  CIFS: Add support for direct pages in wdata
  CIFS: pass page offset when issuing SMB write
  CIFS: Calculate the correct request length based on page offset and
tail size
  CIFS: Introduce helper function to get page offset and length in
smb_rqst
  CIFS: When sending data on socket, pass the correct page offset
  CIFS: SMBD: Support page offset in RDMA send
  CIFS: SMBD: Support page offset in RDMA recv
  CIFS: SMBD: Do not call ib_dereg_mr on invalidated memory registration
  CIFS: SMBD: Support page offset in memory registration
  CIFS: Pass page offset for calculating signature
  CIFS: Pass page offset for encrypting
  CIFS: Add support for direct I/O read
  CIFS: Add support for direct I/O write
  CIFS: Add direct I/O functions to file_operations

 fs/cifs/cifsencrypt.c |   9 +-
 fs/cifs/cifsfs.c  |  10 +-
 fs/cifs/cifsfs.h  |   2 +
 fs/cifs/cifsglob.h|  11 +-
 fs/cifs/cifsproto.h   |   9 +-
 fs/cifs/cifssmb.c |  19 +-
 fs/cifs/connect.c |   5 +-
 fs/cifs/file.c| 477 ++
 fs/cifs/misc.c|  17 ++
 fs/cifs/smb2ops.c |  22 ++-
 fs/cifs/smb2pdu.c |  20 ++-
 fs/cifs/smbdirect.c   | 156 ++---
 fs/cifs/smbdirect.h   |   2 +-
 fs/cifs/transport.c   |  34 ++--
 14 files changed, 606 insertions(+), 187 deletions(-)

-- 
2.7.4



[Patch v3 11/16] CIFS: SMBD: Support page offset in memory registration

2018-09-07 Thread Long Li
From: Long Li 

Change the code to pass the correct page offset during memory registration for
RDMA read/write.

Signed-off-by: Long Li 
---
 fs/cifs/smb2pdu.c   | 18 --
 fs/cifs/smbdirect.c | 29 +
 fs/cifs/smbdirect.h |  2 +-
 3 files changed, 34 insertions(+), 15 deletions(-)

diff --git a/fs/cifs/smb2pdu.c b/fs/cifs/smb2pdu.c
index f603fbe..fc30774 100644
--- a/fs/cifs/smb2pdu.c
+++ b/fs/cifs/smb2pdu.c
@@ -2623,8 +2623,8 @@ smb2_new_read_req(void **buf, unsigned int *total_len,
 
rdata->mr = smbd_register_mr(
server->smbd_conn, rdata->pages,
-   rdata->nr_pages, rdata->tailsz,
-   true, need_invalidate);
+   rdata->nr_pages, rdata->page_offset,
+   rdata->tailsz, true, need_invalidate);
if (!rdata->mr)
return -ENOBUFS;
 
@@ -3013,16 +3013,22 @@ smb2_async_writev(struct cifs_writedata *wdata,
 
wdata->mr = smbd_register_mr(
server->smbd_conn, wdata->pages,
-   wdata->nr_pages, wdata->tailsz,
-   false, need_invalidate);
+   wdata->nr_pages, wdata->page_offset,
+   wdata->tailsz, false, need_invalidate);
if (!wdata->mr) {
rc = -ENOBUFS;
goto async_writev_out;
}
req->Length = 0;
req->DataOffset = 0;
-   req->RemainingBytes =
-   cpu_to_le32((wdata->nr_pages-1)*PAGE_SIZE + 
wdata->tailsz);
+   if (wdata->nr_pages > 1)
+   req->RemainingBytes =
+   cpu_to_le32(
+   (wdata->nr_pages - 1) * wdata->pagesz -
+   wdata->page_offset + wdata->tailsz
+   );
+   else
+   req->RemainingBytes = cpu_to_le32(wdata->tailsz);
req->Channel = SMB2_CHANNEL_RDMA_V1_INVALIDATE;
if (need_invalidate)
req->Channel = SMB2_CHANNEL_RDMA_V1;
diff --git a/fs/cifs/smbdirect.c b/fs/cifs/smbdirect.c
index b470cd0..f61daa9 100644
--- a/fs/cifs/smbdirect.c
+++ b/fs/cifs/smbdirect.c
@@ -2475,7 +2475,7 @@ static struct smbd_mr *get_mr(struct smbd_connection 
*info)
  */
 struct smbd_mr *smbd_register_mr(
struct smbd_connection *info, struct page *pages[], int num_pages,
-   int tailsz, bool writing, bool need_invalidate)
+   int offset, int tailsz, bool writing, bool need_invalidate)
 {
struct smbd_mr *smbdirect_mr;
int rc, i;
@@ -2498,17 +2498,30 @@ struct smbd_mr *smbd_register_mr(
smbdirect_mr->sgl_count = num_pages;
sg_init_table(smbdirect_mr->sgl, num_pages);
 
-   for (i = 0; i < num_pages - 1; i++)
-   sg_set_page(&smbdirect_mr->sgl[i], pages[i], PAGE_SIZE, 0);
+   log_rdma_mr(INFO, "num_pages=0x%x offset=0x%x tailsz=0x%x\n",
+   num_pages, offset, tailsz);
+
+   if (num_pages == 1) {
+   sg_set_page(&smbdirect_mr->sgl[0], pages[0], tailsz, offset);
+   goto skip_multiple_pages;
+   }
 
-   sg_set_page(&smbdirect_mr->sgl[i], pages[i],
-   tailsz ? tailsz : PAGE_SIZE, 0);
+   /* We have at least two pages to register */
+   sg_set_page(
+   &smbdirect_mr->sgl[0], pages[0], PAGE_SIZE - offset, offset);
+   i = 1;
+   while (i < num_pages - 1) {
+   sg_set_page(&smbdirect_mr->sgl[i], pages[i], PAGE_SIZE, 0);
+   i++;
+   }
+   sg_set_page(&smbdirect_mr->sgl[i], pages[i], tailsz, 0);
 
+skip_multiple_pages:
dir = writing ? DMA_FROM_DEVICE : DMA_TO_DEVICE;
smbdirect_mr->dir = dir;
rc = ib_dma_map_sg(info->id->device, smbdirect_mr->sgl, num_pages, dir);
if (!rc) {
-   log_rdma_mr(INFO, "ib_dma_map_sg num_pages=%x dir=%x rc=%x\n",
+   log_rdma_mr(ERR, "ib_dma_map_sg num_pages=%x dir=%x rc=%x\n",
num_pages, dir, rc);
goto dma_map_error;
}
@@ -2516,8 +2529,8 @@ struct smbd_mr *smbd_register_mr(
rc = ib_map_mr_sg(smbdirect_mr->mr, smbdirect_mr->sgl, num_pages,
NULL, PAGE_SIZE);
if (rc != num_pages) {
-   log_rdma_mr(INFO,
-   "ib_map_mr_sg failed rc = %x num_pages = %x\n",
+   log_rdma_mr(ERR,
+   "ib_map_mr_sg failed rc = %d num_pages = %x\n",
rc, num_pages);
goto map_mr_error;
}
diff --git a/fs/cifs/smbdirect.h b/fs/cifs/smbdirect.h
index f9038da..1e419c2 100644
--- a/fs/cifs/smbdirect.h
+++ b/fs/cifs/smbdirect.h
@@ -321,7 +321,7 @@ struct smbd_mr {
 /* Interfaces to register and 


[Patch v3 15/16] CIFS: Add support for direct I/O write

2018-09-07 Thread Long Li
From: Long Li 

With direct I/O write, user-supplied buffers are pinned in memory and data
is transferred directly from the user buffers to the transport layer.

Change in v3: added support for kernel AIO

Signed-off-by: Long Li 
---
 fs/cifs/cifsfs.h |   1 +
 fs/cifs/file.c   | 195 ++-
 2 files changed, 165 insertions(+), 31 deletions(-)

diff --git a/fs/cifs/cifsfs.h b/fs/cifs/cifsfs.h
index 7fba9aa..e9c5103 100644
--- a/fs/cifs/cifsfs.h
+++ b/fs/cifs/cifsfs.h
@@ -105,6 +105,7 @@ extern ssize_t cifs_user_readv(struct kiocb *iocb, struct 
iov_iter *to);
 extern ssize_t cifs_direct_readv(struct kiocb *iocb, struct iov_iter *to);
 extern ssize_t cifs_strict_readv(struct kiocb *iocb, struct iov_iter *to);
 extern ssize_t cifs_user_writev(struct kiocb *iocb, struct iov_iter *from);
+extern ssize_t cifs_direct_writev(struct kiocb *iocb, struct iov_iter *from);
 extern ssize_t cifs_strict_writev(struct kiocb *iocb, struct iov_iter *from);
 extern int cifs_lock(struct file *, int, struct file_lock *);
 extern int cifs_fsync(struct file *, loff_t, loff_t, int);
diff --git a/fs/cifs/file.c b/fs/cifs/file.c
index 476b2a1..76e0266 100644
--- a/fs/cifs/file.c
+++ b/fs/cifs/file.c
@@ -2537,6 +2537,8 @@ cifs_write_from_iter(loff_t offset, size_t len, struct 
iov_iter *from,
loff_t saved_offset = offset;
pid_t pid;
struct TCP_Server_Info *server;
+   struct page **pagevec;
+   size_t start;
 
if (cifs_sb->mnt_cifs_flags & CIFS_MOUNT_RWPIDFORWARD)
pid = open_file->pid;
@@ -2553,38 +2555,74 @@ cifs_write_from_iter(loff_t offset, size_t len, struct 
iov_iter *from,
if (rc)
break;
 
-   nr_pages = get_numpages(wsize, len, _len);
-   wdata = cifs_writedata_alloc(nr_pages,
+   if (ctx->direct_io) {
+   cur_len = iov_iter_get_pages_alloc(
+   from, &pagevec, wsize, &start);
+   if (cur_len < 0) {
+   cifs_dbg(VFS,
+   "direct_writev couldn't get user pages "
+   "(rc=%zd) iter type %d iov_offset %lu 
count"
+   " %lu\n",
+   cur_len, from->type,
+   from->iov_offset, from->count);
+   dump_stack();
+   break;
+   }
+   iov_iter_advance(from, cur_len);
+
+   nr_pages = (cur_len + start + PAGE_SIZE - 1) / 
PAGE_SIZE;
+
+   wdata = cifs_writedata_direct_alloc(pagevec,
 cifs_uncached_writev_complete);
-   if (!wdata) {
-   rc = -ENOMEM;
-   add_credits_and_wake_if(server, credits, 0);
-   break;
-   }
+   if (!wdata) {
+   rc = -ENOMEM;
+   add_credits_and_wake_if(server, credits, 0);
+   break;
+   }
 
-   rc = cifs_write_allocate_pages(wdata->pages, nr_pages);
-   if (rc) {
-   kfree(wdata);
-   add_credits_and_wake_if(server, credits, 0);
-   break;
-   }
 
-   num_pages = nr_pages;
rc = wdata_fill_from_iovec(wdata, from, &cur_len, &num_pages);
-   if (rc) {
-   for (i = 0; i < nr_pages; i++)
-   put_page(wdata->pages[i]);
-   kfree(wdata);
-   add_credits_and_wake_if(server, credits, 0);
-   break;
-   }
+   wdata->page_offset = start;
+   wdata->tailsz =
+   nr_pages > 1 ?
+   cur_len - (PAGE_SIZE - start) -
+   (nr_pages - 2) * PAGE_SIZE :
+   cur_len;
+   } else {
+   nr_pages = get_numpages(wsize, len, _len);
+   wdata = cifs_writedata_alloc(nr_pages,
+cifs_uncached_writev_complete);
+   if (!wdata) {
+   rc = -ENOMEM;
+   add_credits_and_wake_if(server, credits, 0);
+   break;
+   }
 
-   /*
-* Bring nr_pages down to the number of pages we actually used,
-* and free any pages that we didn't use.
-*/
-   for ( ; nr_pages > num_pages; nr_pages--)
-   put_page(wdata->pages[nr_pages - 
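
The nr_pages/tailsz arithmetic in the direct-write path above is worth a
worked example. A small userspace check with made-up values, assuming
4096-byte pages:

#include <assert.h>
#include <stdio.h>

#define PG 4096u                /* stand-in for PAGE_SIZE */

int main(void)
{
        /* e.g. 10000 bytes of user data starting 100 bytes into page 0 */
        unsigned int cur_len = 10000, start = 100;
        unsigned int nr_pages = (cur_len + start + PG - 1) / PG;  /* 3 */
        unsigned int tailsz = nr_pages > 1
                ? cur_len - (PG - start) - (nr_pages - 2) * PG
                : cur_len;                                        /* 1908 */

        /* first partial page + full middle pages + tail cover cur_len */
        assert(nr_pages > 1);
        assert((PG - start) + (nr_pages - 2) * PG + tailsz == cur_len);
        printf("nr_pages=%u tailsz=%u\n", nr_pages, tailsz);
        return 0;
}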


Re: [PATCH V3 21/26] dt-bindings: interrupt-controller: C-SKY APB intc

2018-09-07 Thread Guo Ren
On Fri, Sep 07, 2018 at 10:13:13AM -0500, Rob Herring wrote:
> On Thu, Sep 6, 2018 at 8:05 AM Arnd Bergmann  wrote:
> >
> > On Thu, Sep 6, 2018 at 4:13 AM Guo Ren  wrote:
> > >
> > > On Wed, Sep 05, 2018 at 07:43:10PM -0500, Rob Herring wrote:
> > > > On Wed, Sep 5, 2018 at 7:10 AM Guo Ren  wrote:
> > > > >
> > > > > Signed-off-by: Guo Ren 
> > > > > +
> > > > > +   intc: interrupt-controller {
> > > >
> > > > Needs a unit-address.
> > > Ok, change it to:
> > > intc: interrupt-controller@0x0050 {
> >
> > The unit address has no leading 0x or leading zeroes, so
> > interrupt-controller@50
> 
> Please build your dts files with W=12 and it will tell you this and
> other errors. And then update any examples based on that.
Got it, thx for the tip.

 Guo Ren



Re: [PATCH] Input: reserve 2 events code because of HID

2018-09-07 Thread Peter Hutterer
On Fri, Sep 07, 2018 at 10:51:15AM +0200, Benjamin Tissoires wrote:
> From: Benjamin Tissoires 
> 
> Prior to commit 190d7f02ce8e ("HID: input: do not increment usages when
> a duplicate is found") from the v4.18 kernel, HID used to shift the
> event codes if a duplicate usage was found. This ended up in a situation
> where a device would export a ton of ABS_MISC+n event codes, or a ton
> of REL_MISC+n event codes.
> 
> This is now fixed, however userspace needs to detect those situations.
> Fortunately, ABS_MISC+1 was never assigned a code, and so libinput
> can detect fake multitouch devices from genuine ones by checking if
> ABS_MISC+1 is set.

sorry, this isn't quite correct. we use ABS_MT_SLOT - 1 (0x2e) for the
detection of fake MT devices, i.e.
  if (ABS_MT_SLOT and not ABS_MT_SLOT-1) then multitouch

That gives you up to ABS_MISC + 6 for legitimate usage. This is handled by
libevdev, not libinput directly. libevdev adjusts the various bits on "is
this device an MT device" based on whether ABS_MT_SLOT-1 is set.
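
(For reference, that check amounts to something like the following - my
sketch against the EVIOCGBIT(EV_ABS, ...) bitmask, not a quote from
libevdev:)

#include <linux/input.h>

static int bit_is_set(const unsigned long *bits, unsigned int bit)
{
        return !!(bits[bit / (8 * sizeof(long))] &
                  (1UL << (bit % (8 * sizeof(long)))));
}

/* the heuristic above: ABS_MT_SLOT set, ABS_MT_SLOT-1 (0x2e) clear */
static int looks_like_real_mt(const unsigned long *absbits)
{
        return bit_is_set(absbits, ABS_MT_SLOT) &&
               !bit_is_set(absbits, ABS_MT_SLOT - 1);
}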

I can change this to also check for (ABS_MISC and not ABS_MISC+1) but that
obviously will depend on updated libraries then. Though I guess it won't
really be an issue until we fill up the other codes up to and including 0x2e
with real values and expect userspace to handle those.

None of the bits I maintain have special code for REL_MISC+n so that bit
works fine, IMO.

One request though: instead of just having the value reserved, can we make
it REL_RESERVED and ABS_RESERVED please? Or ABS_CANARY :) Much easier than
hardcoding the numeric value.
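
(i.e. something along these lines - the names are my suggestion above, the
values are the ones this patch reserves:)

#define REL_RESERVED            0x0a    /* was emitted by HID as REL_MISC+1 */
#define REL_WHEEL_HI_RES        0x0b

#define ABS_RESERVED            0x29    /* was emitted by HID as ABS_MISC+1 */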

Cheers,
   Peter


> Now that we have REL_WHEEL_HI_RES, libinput won't be able to differentiate
> true high res mice from some other device in a pre-v4.18 kernel.
> 
> Set in stone that the ABS_MISC+1 and REL_MISC+1 are reserved and should not
> be used so userspace can properly work around those old kernels.
> 
> Signed-off-by: Benjamin Tissoires 
> ---
> 
> Hi,
> 
> while reviewing my local tree, I realize that we might want to be able
> to differentiate older kernels from new ones that export REL_WHEEL_HI_RES.
> 
> I know Dmitry was against adding several REL_MISC, so I hope just moving
> REL_WHEEL_HI_RES by one and reserving the faulty event codes would be good
> this time.
> 
> This patch applies on top of the branch for-4.20/logitech-highres from
> Jiri's tree. It should go through Jiri's tree as well.
> 
> Cheers,
> Benjamin
> 
>  include/uapi/linux/input-event-codes.h | 13 -
>  1 file changed, 12 insertions(+), 1 deletion(-)
> 
> diff --git a/include/uapi/linux/input-event-codes.h b/include/uapi/linux/input-event-codes.h
> index 29fb891ea337..30149939249a 100644
> --- a/include/uapi/linux/input-event-codes.h
> +++ b/include/uapi/linux/input-event-codes.h
> @@ -708,7 +708,12 @@
>  #define REL_DIAL 0x07
>  #define REL_WHEEL0x08
>  #define REL_MISC 0x09
> -#define REL_WHEEL_HI_RES 0x0a
> +/*
> + * 0x0a is reserved and should not be used.
> + * It was used by HID as REL_MISC+1 and userspace needs to detect if
> + * the next REL_* event is correct or is just REL_MISC + n.
> + */
> +#define REL_WHEEL_HI_RES 0x0b
>  #define REL_MAX  0x0f
>  #define REL_CNT  (REL_MAX+1)
>  
> @@ -745,6 +750,12 @@
>  
>  #define ABS_MISC 0x28
>  
> +/*
> + * 0x29 is reserved and should not be used.
> + * It was used by HID as ABS_MISC+1 and userspace needs to detect if
> + * the next ABS_* event is correct or is just ABS_MISC + n.
> + */
> +
>  #define ABS_MT_SLOT  0x2f/* MT slot being modified */
>  #define ABS_MT_TOUCH_MAJOR   0x30/* Major axis of touching ellipse */
>  #define ABS_MT_TOUCH_MINOR   0x31/* Minor axis (omit if circular) */
> -- 
> 2.14.3
> 


RE: [PATCH v2 6/8] perf/x86/intel/lbr: guest requesting KVM for lbr stack save/restore

2018-09-07 Thread Wang, Wei W
On Saturday, September 8, 2018 4:05 AM, Andi Kleen wrote:
> > How would you realize the function of saving/restoring the lbr stack on the
> host?
> >
> > Here, we create a perf event on the host (please see
> guest_lbr_event_create on patch 7), which essentially satisfies all the
> conditions (e.g. increases cpuc->lbr_users) that are required to have the lbr
> stack saved/restored on the vCPU switching.
> >
> > If we want to stop the host side lbr stack save/restore for the vCPU, we
> need accordingly to call guest_lbr_event_release (in patch 7) to destroy that
> perf event (the host doesn't automatically stop saving the lbr stack for the
> vCPU if that perf event is still there).
> >
> > When would you call that release function? (we all know that the lbr
> > doesn't need to be saved when the guest is not using it, but we need
> > to destroy that perf event to achieve "doesn't need to be saved")
> 
> Maybe set a timer on DEBUGCTL LBR=0 ? A timer would provide hysteresis,
> so that quick toggles (like in a PMI handler) wouldn't do anything expensive.

I'm not sure if this would solve the key problem. How frequently would the 
timer fire?

- Let's say every 10ms, for example. The guest is disturbed by a timer 
interrupt (causing a VMExit) every 10ms, even though the guest doesn't use the 
lbr at all after the first use. The problem then becomes when to call the 
release function to cancel the timer, if we want to avoid that unnecessary 
disturbance to the guest.

- When the timer fires and finds "DEBUGCTL LBR=0", it destroys the host 
side perf event, and then the lbr stack isn't saved when the vCPU is scheduled 
out. As also mentioned in the commit log, perf_pmu_sched_task in the guest 
disables that bit before reading out the lbr stack (a pmi is another example). 
Now DEBUGCTL LBR=0, and before the guest reads out the lbr stack, the vCPU may 
happen to be scheduled out, and another thread on the host is scheduled in and 
will overwrite the lbr stack. So, before the guest reads out the lbr stack, 
the stack has already been polluted in this case.
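
(The interleaving described above, spelled out as a sketch - this is my
summary of the sequence, not code from the series:)

/*
 *   guest: wrmsr(MSR_IA32_DEBUGCTLMSR, 0)  - clears LBR before the read
 *   host:  timer fires, sees LBR=0, destroys the guest's perf event
 *   host:  vCPU is scheduled out, so the lbr stack is no longer saved
 *   host:  another task runs and overwrites the LBR MSRs
 *   guest: resumes and reads an already polluted lbr stack
 */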

Best,
Wei


Re: [RFC][PATCH 7/8] x86/mm/vsyscall: consider vsyscall page part of user address space

2018-09-07 Thread Jann Horn
On Sat, Sep 8, 2018 at 2:28 AM Dave Hansen  wrote:
> The vsyscall page is weird.  It is in what is traditionally part of the
> kernel address space.  But, it has user permissions and we handle faults
> on it like we would on a user page: interrupts on.
>
> Right now, we handle vsyscall emulation in the "bad_area" code, which
> is used for both user-address-space and kernel-address-space faults.  Move
> the handling to the user-address-space code *only* and ensure we get there
> by "excluding" the vsyscall page from the kernel address space via a check
> in fault_in_kernel_space().
[...]
>  static int fault_in_kernel_space(unsigned long address)
>  {
> +   /*
> +* The vsyscall page is at an address above TASK_SIZE_MAX,
> +* but is not considered part of the kernel address space.
> +*/
> +   if (is_vsyscall_vaddr(address))
> +   return false;

I think something should check for "#ifdef CONFIG_X86_64"? 32-bit
doesn't have a vsyscall page, right? And this code probably shouldn't
veer off into the userspace-area fault handling code for addresses in
the range 0xff600000-0xff600fff... what is in that region on 32-bit?
Modules or something like that?
Maybe change is_vsyscall_vaddr() so that it always returns false on
32-bit, or put both the definition of is_vsyscall_vaddr() and this
code behind #ifdef guards.
And, in a separate patch, maybe also #ifdef-guard the definition of
VSYSCALL_ADDR in vsyscall.h? Nothing good is going to result from
making a garbage VSYSCALL_ADDR available to 32-bit code.
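
(Concretely, the guard could take a shape like this - a sketch of the
suggestion, not the actual patch:)

static bool is_vsyscall_vaddr(unsigned long vaddr)
{
#ifdef CONFIG_X86_64
        return (vaddr & PAGE_MASK) == VSYSCALL_ADDR;
#else
        return false;   /* 32-bit has no vsyscall page */
#endif
}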

> +#ifdef CONFIG_X86_64
> +   /*
> +* Instruction fetch faults in the vsyscall page might need
> +* emulation.  The vsyscall page is at a high address
> +* (>PAGE_OFFSET), but is considered to be part of the user
> +* address space.
> +*
> +* The vsyscall page does not have a "real" VMA, so do this
> +* emulation before we go searching for VMAse

"VMAse"? Is that a typo?


Re: [RFC][PATCH 6/8] x86/mm: add vsyscall address helper

2018-09-07 Thread Jann Horn
On Sat, Sep 8, 2018 at 2:25 AM Dave Hansen  wrote:
> We will shortly be using this check in two locations.  Put it in
> a helper before we do so.
[...]
> +/*
> + * The (legacy) vsyscall page is the long page in the kernel portion
> + * of the address space that has user-accessible permissions.
> + */
> +static bool is_vsyscall_vaddr(unsigned long vaddr)
> +{
> +   return (vaddr & ~0xfff) == VSYSCALL_ADDR;
> +}


Since you're touching this code anyway: Would it make sense to change
that constant to a more explicit "~0xfffUL" (or alternatively
PAGE_MASK)? I tend to end up staring at code like this forever, trying
to figure out whether the upper 32 bits of the constant end up being
set or clear. As a reader, looking at the current code, it's quite
annoying to see what actually happens - first there's a signed 32-bit
literal 0xfff, then a 32-bit negation happens, and then the number is
converted to 64 bits with sign extension.
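
(A quick userspace illustration of the sign extension in question - my
example, assuming a 64-bit build where long is 64 bits:)

#include <stdio.h>

int main(void)
{
        unsigned long vaddr = 0xffffffffff600800UL;

        /* 0xfff is a signed int, so ~0xfff is the int 0xfffff000, which
         * is sign-extended to 0xfffffffffffff000 when widened to long. */
        printf("%#lx\n", vaddr & ~0xfff);       /* 0xffffffffff600000 */

        /* the UL suffix makes that widening explicit at the source level */
        printf("%#lx\n", vaddr & ~0xfffUL);     /* 0xffffffffff600000 */
        return 0;
}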



Re: [RFC][PATCH 5/8] x86/mm: fix exception table comments

2018-09-07 Thread Jann Horn
On Sat, Sep 8, 2018 at 2:22 AM Dave Hansen  wrote:
> +* Kernel-mode access to the user address space should only occur
> +* inside well-defined areas of code listed in the exception

Actually, not areas, but single whitelisted instructions. It would
probably be nice to say that more clearly.

From arch/x86/include/asm/extable.h:

/*
 * The exception table consists of triples of addresses relative to the
 * exception table entry itself. The first address is of an instruction
 * that is allowed to fault, the second is the target at which the program
 * should continue. The third is a handler function to deal with the fault
 * caused by the instruction in the first field.
 *
 * All the routines below use bits of fixup code that are out of line
 * with the main instruction path.  This means when everything is well,
 * we don't even have to jump over them.  Further, they do not intrude
 * on our cache or tlb entries.
 */

struct exception_table_entry {
int insn, fixup, handler;
};

> +* tables.  But, an erroneous kernel fault occurring outside one of
> +* those areas which also holds mmap_sem might deadlock attempting
> +* to validate the fault against the address space.
>  *
> -* As the vast majority of faults will be valid we will only perform
> -* the source reference check when there is a possibility of a
> -* deadlock. Attempt to lock the address space, if we cannot we then
> -* validate the source. If this is invalid we can skip the address
> -* space check, thus avoiding the deadlock:
> +* Only do the expensive exception table search when we might be at
> +* risk of a deadlock:
> +* 1. We failed to acquire mmap_sem, and
> +* 2. The access was an explicit kernel-mode access
> +*(X86_PF_USER=0).
>  */
> if (unlikely(!down_read_trylock(&mm->mmap_sem))) {
> if (!(sw_error_code & X86_PF_USER) &&
> !search_exception_tables(regs->ip)) {
> +   /*
> +* Fault from code in kernel from
> +* which we do not expect faults.
> +*/
> bad_area_nosemaphore(regs, sw_error_code, address, 
> NULL);
> return;
> }
> _
>


Re: [PATCH RFC LKMM 1/7] tools/memory-model: Add extra ordering for locks and remove it for ordinary release/acquire

2018-09-07 Thread Daniel Lustig
On 9/7/2018 10:38 AM, Alan Stern wrote:
> On Fri, 7 Sep 2018, Daniel Lustig wrote:
> 
>> On 9/7/2018 9:09 AM, Will Deacon wrote:
>>> On Fri, Sep 07, 2018 at 12:00:19PM -0400, Alan Stern wrote:
 On Thu, 6 Sep 2018, Andrea Parri wrote:

>> Have you noticed any part of the generic code that relies on ordinary 
>> acquire-release (rather than atomic RMW acquire-release) in order to 
>> implement locking constructs?
>
> There are several places in code where the "lock-acquire" seems to be
> provided by an atomic_cond_read_acquire/smp_cond_load_acquire: I have
> mentioned one in qspinlock in this thread; qrwlock and mcs_spinlock
> provide other examples (grep for the primitives...).
>
> As long as we don't consider these primitive as RMW (which would seem
> odd...) or as acquire for which "most people expect strong ordering"
> (see above), these provides other examples for the _gap_ I mentioned.

 Okay, now I understand your objection.  It does appear that on RISC-V,
 if nowhere else, the current implementations of qspinlock, qrwlock,
 etc. will not provide "RCtso" ordering.

 The discussions surrounding this topic have been so lengthy and 
 confusing that I have lost track of any comments Palmer or Daniel may 
 have made concerning this potential problem.

 One possible resolution would be to define smp_cond_load_acquire() 
 specially on RISC-V so that it provided the same ordering guarantees as 
 RMW-acquire.  (Plus adding a comment in the asm-generic/barrier.h 
 pointing out the necessity for the stronger guarantee on all 
 architectures.)

 Another would be to replace the usages of atomic/smp_cond_load_acquire 
 in the locking constructs with a new function that would otherwise be 
 the same but would provide the ordering guarantee we want.

 Do you think either of these would be an adequate fix?
>>>
>>> I didn't think RISC-V used qspinlock or qrwlock, so I'm not sure there's
>>> actually anything to fix, is there?
>>>
>>> Will
>>
>> I've also lost track of whether the current preference is or is not for
>> RCtso, or in which subset of cases RCtso is currently preferred.  For
>> whichever cases do in fact need to be RCtso, the RISC-V approach would
>> still be the same as what I've written in the past, as far as I can
>> tell [1].
> 
> The patch which Paul plans to send in for the next merge window makes 
> the LKMM require RCtso ordering for spinlocks, and by extension, for 
> all locking operations.  As I understand it, the current RISC-V 
> implementation of spinlocks does provide this ordering.
> 
> We have discussed creating another patch for the LKMM which would
> require RMW-acquire/ordinary-release also to have RCtso ordering.  
> Nobody has written the patch yet, but it would be straightforward.  The
> rationale is that many types of locks are implemented in terms of
> RMW-acquire, so if the locks are required to be RCtso then so should
> the lower-level operations they are built from.
> 
> Will feels strongly (and Linus agrees) that the LKMM should not require
> ordinary acquire and release to be any stronger than RCpc.
> 
> The issue that Andrea raised has to do with qspinlock, qrwlock, and
> mcs_spinlock, which are implemented using smp_cond_load_acquire()  
> instead of RMW-acquire.  This provides only the ordering properties of
> smp_load_acquire(), namely RCpc, which means that qspinlocks etc. might
> not be RCtso.
> 
> Since we do want locks to be RCtso, the question is how to resolve this 
> discrepancy.

Thanks for the summary Alan!

I think RISC-V might actually get RCtso with smp_cond_load_acquire()
implemented using fence r,rw, believe it or not :)

The read->read and read->write requirements are covered by the fence r,rw, so
what we need to add on is the write->write ordering requirement.  On RISC-V,
we can get release semantics in three ways: fence rw,w, AMO.rl, and SC.rl.

If we use fence rw,w for release, then the "w,w" part covers it.

If we use AMO.rl for release, then the prior stores are ordered before the
AMO, and the fence r,rw orders the AMO before subsequent stores.

If we use SC.rl, then the prior stores are ordered before the SC, and the
branch to check whether the SC succeeded induces a control dependency that
keeps subsequent stores ordered after the SC.

So, it seems to work anyway.  I did a quick check of this property against
my Alloy model and it seems to agree as well.
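
(For concreteness, the write->write obligation being discussed is the one in
a litmus sketch like this - mine, not from the LKMM sources:)

/*
 * P0:                    P1:                   P2:
 *   WRITE_ONCE(x, 1);      spin_lock(&s);        r1 = READ_ONCE(y);
 *   spin_unlock(&s);       WRITE_ONCE(y, 1);     smp_rmb();
 *                          spin_unlock(&s);      r2 = READ_ONCE(x);
 *
 * RCtso locks forbid the outcome (r1 == 1 && r2 == 0).
 */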

The one combination that doesn't quite get you RCtso on RISC-V is pairing a
fence r,rw with an LR.aq.  I think everything else works, including pairing
fence r,rw with AMO.aq.  So it's really this one case we have to look out for.

Does that seem plausible to you all?

Dan

>> In a nutshell, if a data structure uses only atomics with .aq/.rl,
>> RISC-V provides RCtso already anyway.  If a data structure uses fences,
>> or mixes fences and atomics, we can replace a 

Re: [PATCH v2 3/3] x86/pti/64: Remove the SYSCALL64 entry trampoline

2018-09-07 Thread Linus Torvalds
On Fri, Sep 7, 2018 at 12:54 PM Thomas Gleixner  wrote:
>
> >  - We execute from an extra page and read from another extra page
> > during the syscall.  (The latter is because we need to use a relative
> > addressing mode to find sp1 -- it's the same *cacheline* we'd use
> > anyway, but we're accessing it using an alias, so it's an extra TLB
> > entry.)
>
> Ok, but is this really an issue with PTI?

I'd expect it to be *more* of an issue with PTI, since you're already
wasting TLB entries due to the whole "two different page tables".

Sure, address space ID's save you from reloading them all the time,
but don't help with capacity.

But yeah, in the sense of "with PTI, all kernel entries are slow
anyway, so none of this matters" is probably correct in a very real
sense.

That said, the real reason I like Andy's patch series is that I think
it's simpler than the alternatives (including the current setup). No
subtle mappings, no nothing. It removes a lot more lines than it adds,
and half the lines that it *does* add are comments.

Virtual mapping tricks may be cool, but in the end, not having to use
them is better still, I think.

  Linus


Re: [PATCH] apparmor: Fix network performance issue in aa_label_sk_perm

2018-09-07 Thread Tony Jones
On 09/07/2018 09:37 AM, John Johansen wrote:

> hey Tony,
> 
> thanks for the patch, I am curious did your investigation look
> into what parts of DEFINE_AUDIT_SK are causing the issue?

Hi JJ.

Attached are the perf annotations for DEFINE_AUDIT_SK (percentages are relative 
to the fn).   
Our kernel performance testing is carried out with default installs which means 
AppArmor 
is enabled but the performance tests are unconfined. It was obvious that the 
overhead of 
DEFINE_AUDIT_SK was significant for smaller packet sizes (typical of synthetic 
benchmarks) 
and that it didn't need to execute for the unconfined case,  hence the patch.  
I didn't 
spend any time looking at the performance of confined tasks.  It may be worth 
your time to 
look at this.

Comparing my current tip (2601dd392dd1) to tip+patch I'm seeing an increase of 
3-6% in netperf
throughput for packet sizes 64-1024.

HTH

Tony

 Percent |  Source code & Disassembly of vmlinux for cycles:ppp (117 
samples)
-
 :
 :
 :
 :  Disassembly of section .text:
 :
 :  813fbec0 :
 :  aa_label_sk_perm():
 : 
type));
 :  }
 :
 :  static int aa_label_sk_perm(struct aa_label 
*label, const char *op, u32 request,
 :  struct sock *sk)
 :  {
0.00 :   813fbec0:   callq  81a017f0 <__fentry__>
2.56 :   813fbec5:   push   %r14
0.00 :   813fbec7:   mov%rcx,%r14
 :  struct aa_profile *profile;
 :  DEFINE_AUDIT_SK(sa, op, sk);
0.00 :   813fbeca:   mov$0x7,%ecx
 :  {
0.00 :   813fbecf:   push   %r13
3.42 :   813fbed1:   mov%edx,%r13d
0.00 :   813fbed4:   push   %r12
0.00 :   813fbed6:   push   %rbp
0.00 :   813fbed7:   mov%rdi,%rbp
5.13 :   813fbeda:   push   %rbx
0.00 :   813fbedb:   sub$0xb8,%rsp
 :  DEFINE_AUDIT_SK(sa, op, sk);
0.00 :   813fbee2:   movzwl 0x10(%r14),%r9d
 :  {
1.71 :   813fbee7:   mov%gs:0x28,%rax
0.00 :   813fbef0:   mov%rax,0xb0(%rsp)
0.00 :   813fbef8:   xor%eax,%eax
 :  DEFINE_AUDIT_SK(sa, op, sk);
0.00 :   813fbefa:   lea0x78(%rsp),%rdx
1.71 :   813fbeff:   lea0x20(%rsp),%r8
0.00 :   813fbf04:   movq   $0x0,(%rsp)
0.00 :   813fbf0c:   movq   $0x0,0x10(%rsp)
0.00 :   813fbf15:   mov%rdx,%rdi
   14.53 :   813fbf18:   rep stos %rax,%es:(%rdi)
1.71 :   813fbf1b:   mov$0xb,%ecx
0.00 :   813fbf20:   mov%r8,%rdi
0.00 :   813fbf23:   mov%r14,0x80(%rsp)
   18.80 :   813fbf2b:   rep stos %rax,%es:(%rdi)
0.00 :   813fbf2e:   mov%rsi,0x28(%rsp)
1.71 :   813fbf33:   mov%r9w,0x88(%rsp)
0.00 :   813fbf3c:   cmp$0x1,%r9w
0.00 :   813fbf41:   je 813fbfa1 

0.00 :   813fbf43:   mov$0x2,%eax
0.00 :   813fbf48:   test   %r14,%r14
0.00 :   813fbf4b:   je 813fbfa1 

   14.53 :   813fbf4d:   mov%al,(%rsp)
0.00 :   813fbf50:   movzwl 0x1ea(%r14),%eax
 :  AA_BUG(!sk);
 :
 :  if (unconfined(label))
 :  return 0;
 :
 :  return fn_for_each_confined(label, 
profile,
0.00 :   813fbf58:   xor%r12d,%r12d
 :  DEFINE_AUDIT_SK(sa, op, sk);
0.00 :   813fbf5b:   mov%r8,0x18(%rsp)
8.55 :   813fbf60:   mov%eax,0x58(%rsp)
0.00 :   813fbf64:   movzbl 0x1e9(%r14),%eax
0.00 :   813fbf6c:   mov%rdx,0x8(%rsp)
0.00 :   813fbf71:   mov%eax,0x5c(%rsp)
 :  if (unconfined(label))
8.55 :   813fbf75:   testb  $0x2,0x40(%rbp)
0.00 :   813fbf79:   je 813fbfa8 

 :  
aa_profile_af_sk_perm(profile, &sa, request, sk));
 :  }
0.00 :   813fbf7b:   mov0xb0(%rsp),%rdx

Re: [Patch] acpi_power_meter: remove 'ignoring unsafe software power cap' message

2018-09-07 Thread Darrick J. Wong
On Fri, Sep 07, 2018 at 10:07:39PM +, Max Asbock wrote:
> At boot time the acpi_power_meter driver greets users of non-IBM systems with 
> the message: 
>   
> "Ignoring unsafe software power cap". 
> 
> This message is generally interpreted as meaning: The system is
> operating under an unsafe power  cap and Linux is ignoring this fact,
> thus living dangerously. It, or its presumed origin, has been blamed
> for system crashes. People spent time writing knowledge base articles
> which explain that the message doesn't mean what users think. I now
> have to write another such article telling people to ignore this
> message. To avoid future confusion and unnecessary work, I think the
> best solution is to remove the message. 
> 
> Here is my translation of the 'ignoring unsafe power cap' message:
>
> 'The acpi_power_meter driver discovered an ACPI object that would in
> theory allow software to set power caps. The driver could now create
> files in sysfs to expose this interface to user space, but it chooses
> not to do so'.

TBH that driver safelists the (single) system that is known to have a
working power capping mechanism.  At the time (ten years ago) the AEM
group worried that other vendors would produce ACPI power meters which
advertised the capping ability but did not enforce the limit, hence the
safelist.  I asked "Well can't we just trust vendor firmware?" and peals
of laughter erupted on the phone. ;)

If the user confusion is that Lenovo systems have ACPI power meters with
working power capping, you could consider adding Lenovo to the safelist.
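
(The gate in question is DMI-based; extending it would look roughly like
this - a sketch, not the driver's actual table, and the LENOVO entry is
hypothetical:)

#include <linux/dmi.h>

static const struct dmi_system_id cap_verified_systems[] = {
        { .matches = { DMI_MATCH(DMI_SYS_VENDOR, "IBM") } },
        { .matches = { DMI_MATCH(DMI_SYS_VENDOR, "LENOVO") } }, /* hypothetical */
        { }
};

static bool can_cap_in_hardware(void)
{
        return dmi_check_system(cap_verified_systems) != 0;
}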

That said, the message is misleading.  It probably should have read:

"Power capping has not been verified to work on this platform.
Please ask the platform vendor email X to have it added to the list."

Alternately, you could remove the safelisting entirely and assume that
if the firmware is malicious there are far more evil things it will try.
Seeing as IBM AEM is a decade old I doubt it's protecting much now.

But /me shrugs and says, "eh, good enough".  Also, nice to hear from you
& lcm again. :)

Reviewed-by: Darrick J. Wong 

--D

> 
> Patch: Remove error message because it is generally misinterpreted. A 
> replacement
> for the message is not necessary.
> 
> Signed-off-by: Max Asbock 
> 
> diff --git a/drivers/hwmon/acpi_power_meter.c b/drivers/hwmon/acpi_power_meter.c
> index 34e45b9..578e886 100644
> --- a/drivers/hwmon/acpi_power_meter.c
> +++ b/drivers/hwmon/acpi_power_meter.c
> @@ -693,11 +693,8 @@ static int setup_attrs(struct acpi_power_meter_resource *resource)
>   }
>  
>   if (resource->caps.flags & POWER_METER_CAN_CAP) {
> - if (!can_cap_in_hardware()) {
> - dev_err(&resource->acpi_dev->dev,
> - "Ignoring unsafe software power cap!\n");
> + if (!can_cap_in_hardware())
>   goto skip_unsafe_cap;
> - }
>  
>   if (resource->caps.configurable_cap)
>   res = register_attrs(resource, rw_cap_attrs);
> 
> 
> 



Re: [RFC][PATCH 7/8] x86/mm/vsyscall: consider vsyscall page part of user address space

2018-09-07 Thread Andy Lutomirski



> On Sep 7, 2018, at 12:49 PM, Dave Hansen  wrote:
> 
> 
> From: Dave Hansen 
> 
> The vsyscall page is weird.  It is in what is traditionally part of the
> kernel address space.  But, it has user permissions and we handle faults
> on it like we would on a user page: interrupts on.
> 
> Right now, we handle vsyscall emulation in the "bad_area" code, which
> is used for both user-address-space and kernel-address-space faults.  Move
> the handling to the user-address-space code *only* and ensure we get there
> by "excluding" the vsyscall page from the kernel address space via a check
> in fault_in_kernel_space().

I assume the motivation is that you want to simplify the kernel error path. If 
so, can you mention this?

The patch itself is Reviewed-by: Andy Lutomirski , although 
adding an unlikely() somewhere might be nice. 

> 
> Signed-off-by: Dave Hansen 
> Cc: Sean Christopherson 
> Cc: "Peter Zijlstra (Intel)" 
> Cc: Thomas Gleixner 
> Cc: x...@kernel.org
> Cc: Andy Lutomirski 
> ---
> 
> b/arch/x86/mm/fault.c |   36 
> 1 file changed, 24 insertions(+), 12 deletions(-)
> 
> diff -puN arch/x86/mm/fault.c~vsyscall-is-user-address-space 
> arch/x86/mm/fault.c
> --- a/arch/x86/mm/fault.c~vsyscall-is-user-address-space2018-09-07 
> 11:21:48.720751896 -0700
> +++ b/arch/x86/mm/fault.c2018-09-07 11:21:48.724751896 -0700
> @@ -873,18 +873,6 @@ __bad_area_nosemaphore(struct pt_regs *r
>if (is_errata100(regs, address))
>return;
> 
> -#ifdef CONFIG_X86_64
> -/*
> - * Instruction fetch faults in the vsyscall page might need
> - * emulation.
> - */
> -if (unlikely((error_code & X86_PF_INSTR) &&
> - is_vsyscall_vaddr(address))) {
> -if (emulate_vsyscall(regs, address))
> -return;
> -}
> -#endif
> -
>/*
> * To avoid leaking information about the kernel page table
> * layout, pretend that user-mode accesses to kernel addresses
> @@ -1192,6 +1180,13 @@ access_error(unsigned long error_code, s
> 
> static int fault_in_kernel_space(unsigned long address)
> {
> +/*
> + * The vsyscall page is at an address above TASK_SIZE_MAX,
> + * but is not considered part of the kernel address space.
> + */
> +if (is_vsyscall_vaddr(address))
> +return false;
> +
>return address >= TASK_SIZE_MAX;
> }
> 
> @@ -1357,6 +1352,23 @@ void do_user_addr_space_fault(struct pt_
>if (sw_error_code & X86_PF_INSTR)
>flags |= FAULT_FLAG_INSTRUCTION;
> 
> +#ifdef CONFIG_X86_64
> +/*
> + * Instruction fetch faults in the vsyscall page might need
> + * emulation.  The vsyscall page is at a high address
> + * (>PAGE_OFFSET), but is considered to be part of the user
> + * address space.
> + *
> + * The vsyscall page does not have a "real" VMA, so do this
> + * emulation before we go searching for VMAse
> + */
> +if (unlikely((sw_error_code & X86_PF_INSTR) &&
> + is_vsyscall_vaddr(address))) {
> +if (emulate_vsyscall(regs, address))
> +return;
> +}
> +#endif
> +
>/*
> * Kernel-mode access to the user address space should only occur
> * inside well-defined areas of code listed in the exception
> _


[PATCH] afs: Fix cell specification to permit an empty address list

2018-09-07 Thread David Howells
Fix the cell specification mechanism to allow cells to be pre-created
without having to specify at least one address (the addresses will be
upcalled for).

This allows the cell information preload service to avoid the need to issue
loads of DNS lookups during boot to get the addresses for each cell (500+
lookups for the 'standard' cell list[*]).  The lookups can be done later as
each cell is accessed through the filesystem.

Also remove the print statement that prints a line every time a new cell is
added.

[*] There are 144 cells in the list.  Each cell is first looked up for an
SRV record, and if that fails, for an AFSDB record.  These get a list
of server names, each of which then has to be looked up to get the
addresses for that server.  E.g.:

dig srv _afs3-vlserver._udp.grand.central.org

Signed-off-by: David Howells 
---

 fs/afs/proc.c |   15 +++
 1 file changed, 7 insertions(+), 8 deletions(-)

diff --git a/fs/afs/proc.c b/fs/afs/proc.c
index 0c3285c8db95..476dcbb79713 100644
--- a/fs/afs/proc.c
+++ b/fs/afs/proc.c
@@ -98,13 +98,13 @@ static int afs_proc_cells_write(struct file *file, char *buf, size_t size)
goto inval;
 
args = strchr(name, ' ');
-   if (!args)
-   goto inval;
-   do {
-   *args++ = 0;
-   } while(*args == ' ');
-   if (!*args)
-   goto inval;
+   if (args) {
+   do {
+   *args++ = 0;
+   } while(*args == ' ');
+   if (!*args)
+   goto inval;
+   }
 
/* determine command to perform */
_debug("cmd=%s name=%s args=%s", buf, name, args);
@@ -120,7 +120,6 @@ static int afs_proc_cells_write(struct file *file, char *buf, size_t size)
 
if (test_and_set_bit(AFS_CELL_FL_NO_GC, &cell->flags))
afs_put_cell(net, cell);
-   printk("kAFS: Added new cell '%s'\n", name);
} else {
goto inval;
}



Re: [PATCH 2/2] f2fs: fix to avoid quota inode leak in ->put_super

2018-09-07 Thread Chao Yu
I can see it in dev, thanks for merging. ;)

On 2018/9/8 6:38, Jaegeuk Kim wrote:
> I merged as one. Please check dev. :)
> 
> On 09/06, Chao Yu wrote:
>> generic/019 reports below error:
>>
>>  __quota_error: 1160 callbacks suppressed
>>  Quota error (device zram1): write_blk: dquota write failed
>>  Quota error (device zram1): qtree_write_dquot: Error -28 occurred while 
>> creating quota
>>  Quota error (device zram1): write_blk: dquota write failed
>>  Quota error (device zram1): qtree_write_dquot: Error -28 occurred while 
>> creating quota
>>  Quota error (device zram1): write_blk: dquota write failed
>>  Quota error (device zram1): qtree_write_dquot: Error -28 occurred while 
>> creating quota
>>  Quota error (device zram1): write_blk: dquota write failed
>>  Quota error (device zram1): qtree_write_dquot: Error -28 occurred while 
>> creating quota
>>  Quota error (device zram1): write_blk: dquota write failed
>>  Quota error (device zram1): qtree_write_dquot: Error -28 occurred while 
>> creating quota
>>  VFS: Busy inodes after unmount of zram1. Self-destruct in 5 seconds.  Have 
>> a nice day...
>>
>> If we fail in the path below due to a failure to write the dquot block,
>> we will miss releasing the quota inode; fix it.
>>
>> - f2fs_put_super
>>  - f2fs_quota_off_umount
>>   - f2fs_quota_off
>>- f2fs_quota_sync   <-- failed
>>- dquot_quota_off   <-- missed to call
>>
>> Signed-off-by: Chao Yu 
>> ---
>>  fs/f2fs/super.c | 6 --
>>  1 file changed, 4 insertions(+), 2 deletions(-)
>>
>> diff --git a/fs/f2fs/super.c b/fs/f2fs/super.c
>> index c026aaccf218..328f58647f4c 100644
>> --- a/fs/f2fs/super.c
>> +++ b/fs/f2fs/super.c
>> @@ -1900,10 +1900,12 @@ void f2fs_quota_off_umount(struct super_block *sb)
>>  for (type = 0; type < MAXQUOTAS; type++) {
>>  err = f2fs_quota_off(sb, type);
>>  if (err) {
>> +int ret = dquot_quota_off(sb, type);
>> +
>>  f2fs_msg(sb, KERN_ERR,
>>  "Fail to turn off disk quota "
>> -"(type: %d, err: %d), Please "
>> -"run fsck to fix it.", type, err);
>> +"(type: %d, err: %d, ret:%d), Please "
>> +"run fsck to fix it.", type, err, ret);
>>  set_sbi_flag(F2FS_SB(sb), SBI_QUOTA_NEED_REPAIR);
>>  }
>>  }
>> -- 
>> 2.18.0.rc1


Re: kselftests for memory.oom.group

2018-09-07 Thread Shuah Khan
On 09/07/2018 03:34 PM, jgka...@fb.com wrote:
> Here is the third version of the patchset.
> 
> Changes since the last patchset:
> - Updated commit message of first patch to clarify fixes
> - Add ack from Roman
> 
> There should be no code changes since the last patchset.
> 
> Let me know if any improvements can be made, and thanks for your time!
> -Jay
> 
> 

Hi Jay,

Thanks for these patches. Applied to linux-kselftest next.

-- Shuah


Re: [RFC][PATCH 1/8] x86/mm: clarify hardware vs. software "error_code"

2018-09-07 Thread Andy Lutomirski



> On Sep 7, 2018, at 12:48 PM, Dave Hansen  wrote:
> 
> 
> From: Dave Hansen 
> 
> We pass around a variable called "error_code" all around the page
> fault code.  Sounds simple enough, especially since "error_code" looks
> like it exactly matches the values that the hardware gives us on the
> stack to report the page fault error code (PFEC in SDM parlance).
> 
> But, that's not how it works.
> 
> For part of the page fault handler, "error_code" does exactly match
> PFEC.  But, during later parts, it diverges and starts to mean
> something a bit different.
> 
> Give it two names for its two jobs.

How hard would it be to just remove sw_error_code instead?  It seems like it 
adds little value and much confusion.

I’m also unconvinced that the warning is terribly useful. We’re going to oops 
when this happens anyway.

I’ll rebase my diagnostic patch on top of this series after it settles.


Re: [PATCH 3.18 00/29] 3.18.122-stable review

2018-09-07 Thread Nathan Chancellor
On Fri, Sep 07, 2018 at 11:10:21PM +0200, Greg Kroah-Hartman wrote:
> This is the start of the stable review cycle for the 3.18.122 release.
> There are 29 patches in this series, all will be posted as a response
> to this one.  If anyone has any issues with these being applied, please
> let me know.
> 
> Responses should be made by Sun Sep  9 21:08:52 UTC 2018.
> Anything received after that time might be too late.
> 
> The whole patch series can be found in one patch at:
>   
> https://www.kernel.org/pub/linux/kernel/v3.x/stable-review/patch-3.18.122-rc1.gz
> or in the git tree and branch at:
>   
> git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git 
> linux-3.18.y
> and the diffstat can be found below.
> 
> thanks,
> 
> greg k-h
> 

Merged, compiled with -Werror, and installed onto my Pixel XL.

No initial issues noticed in dmesg or general usage.

Thanks!
Nathan


Re: [PATCH 4.4 00/47] 4.4.155-stable review

2018-09-07 Thread Nathan Chancellor
On Fri, Sep 07, 2018 at 11:09:56PM +0200, Greg Kroah-Hartman wrote:
> This is the start of the stable review cycle for the 4.4.155 release.
> There are 47 patches in this series, all will be posted as a response
> to this one.  If anyone has any issues with these being applied, please
> let me know.
> 
> Responses should be made by Sun Sep  9 21:08:44 UTC 2018.
> Anything received after that time might be too late.
> 
> The whole patch series can be found in one patch at:
>   
> https://www.kernel.org/pub/linux/kernel/v4.x/stable-review/patch-4.4.155-rc1.gz
> or in the git tree and branch at:
>   
> git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git 
> linux-4.4.y
> and the diffstat can be found below.
> 
> thanks,
> 
> greg k-h
> 

Merged, compiled with -Werror, and installed onto my Pixel 2 XL.

No initial issues noticed in dmesg or general usage.

Thanks!
Nathan


Re: [RFC][PATCH 2/8] x86/mm: break out kernel address space handling

2018-09-07 Thread Dave Hansen
On 09/07/2018 03:21 PM, Andy Lutomirski wrote:
>> +static void
>> +do_kern_addr_space_fault(struct pt_regs *regs, unsigned long hw_error_code,
>> + unsigned long address)
>> +{
>
> Can you add a comment above this documenting *when* it’s called? Is
> it all faults, !user_mode faults, or !PF_USER?

Yep, can do.

>> +/*
>> + * This is a "bad" fault in the kernel address space.  There
>> + * is no reasonable explanation for it.  We will either kill
>> + * the process for making a bad access, or oops the kernel.
>> + */
> 
> Or call an extable handler?
> 
> Maybe the wording should be less scary, e.g. “this fault is a genuine
> error. Send a signal, call an exception handler, or oops, as
> appropriate.”

Yeah, the real behavior is quite a bit more subtle than I'm letting on.
I'll tone it down.


Re: [PATCH 4.14 00/89] 4.14.69-stable review

2018-09-07 Thread Nathan Chancellor
On Fri, Sep 07, 2018 at 11:08:54PM +0200, Greg Kroah-Hartman wrote:
> This is the start of the stable review cycle for the 4.14.69 release.
> There are 89 patches in this series, all will be posted as a response
> to this one.  If anyone has any issues with these being applied, please
> let me know.
> 
> Responses should be made by Sun Sep  9 21:08:28 UTC 2018.
> Anything received after that time might be too late.
> 
> The whole patch series can be found in one patch at:
>   
> https://www.kernel.org/pub/linux/kernel/v4.x/stable-review/patch-4.14.69-rc1.gz
> or in the git tree and branch at:
>   
> git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git 
> linux-4.14.y
> and the diffstat can be found below.
> 
> thanks,
> 
> greg k-h
> 

Merged, compiled, and installed onto my Raspberry Pi.

No initial issues noticed in dmesg or general usage.

Thanks!
Nathan


[PATCH] ring-buffer: Allow for rescheduling when removing pages

2018-09-07 Thread Vaibhav Nagarnaik
When reducing the ring buffer size, pages are removed by scheduling a work
item on each CPU for the corresponding CPU ring buffer. After the pages
are removed from the ring buffer linked list, the pages are free()d in a
tight loop. The loop does not give up the CPU until all pages are removed.
In the worst case, when a lot of pages are to be freed, this can cause
the system to stall.

After the pages are removed from the list, the free() can happen while
the work is rescheduled. Call cond_resched() in the loop to prevent a
system hangup.

Reported-by: Jason Behmer 
Signed-off-by: Vaibhav Nagarnaik 
---
 kernel/trace/ring_buffer.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 1d92d4a982fd..65bd4616220d 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -1546,6 +1546,8 @@ rb_remove_pages(struct ring_buffer_per_cpu *cpu_buffer, 
unsigned long nr_pages)
tmp_iter_page = first_page;
 
do {
+   cond_resched();
+
to_remove_page = tmp_iter_page;
rb_inc_page(cpu_buffer, &tmp_iter_page);
 
-- 
2.19.0.rc2.392.g5ba43deb5a-goog
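
The fix is an instance of a common kernel pattern: a potentially long loop
running in process context should periodically call cond_resched() so other
tasks get a chance to run. A minimal sketch of the pattern, with a
hypothetical item type (not the actual ring-buffer code):

#include <linux/list.h>
#include <linux/sched.h>
#include <linux/slab.h>

struct page_item {                      /* hypothetical element type */
        struct list_head list;
};

static void free_page_list(struct list_head *pages)
{
        struct page_item *item, *tmp;

        list_for_each_entry_safe(item, tmp, pages, list) {
                cond_resched();         /* yield the CPU if a reschedule is due */
                list_del(&item->list);
                kfree(item);
        }
}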



Re: [PATCH] arm64: add NUMA emulation support

2018-09-07 Thread Shuah Khan
On 09/07/2018 02:34 AM, Michal Hocko wrote:
> On Thu 06-09-18 15:53:34, Shuah Khan wrote:
> [...]
>> A few critical allocations could be satisfied and root cgroup prevails. It 
>> is not the
>> intent to have exclusivity at the expense of the kernel.
> 
> Well, it is not "few critical allocations". It can be a lot of
> memory. Basically any GFP_KERNEL allocation. So how exactly you expect
> this to work when you cannot estimate how much
> memory will kernel eat?
> 
>>
>> This feature will allow a way to configure cpusets on non-NUMA for workloads 
>> that can
>> benefit from the reservation and isolation that is available within the 
>> constraints of
>> exclusive cpuset policies.
> 
> AFAIR this was the first approach Google took for the memory isolation
> and they moved over to memory cgroups. 

In addition to isolation, being able to reserve a block instead is one of the
issues I am looking to address. Unfortunately memory cgroups won't address that
issue.

> I would recommend to talk to
> those guys before you introduce potentially a lot of code that will not
> really work for the workload you intend it for.
> 

Will you be able to point me to a good contact at Google and/or some pointers
on finding the discussion on memory isolation? My searches on lkml came up
short.

thanks,
-- Shuah


Re: [RFC][PATCH 2/8] x86/mm: break out kernel address space handling

2018-09-07 Thread Andy Lutomirski



> On Sep 7, 2018, at 12:48 PM, Dave Hansen  wrote:
> 
> 
> From: Dave Hansen 
> 
> The page fault handler (__do_page_fault()) basically has two sections:
> one for handling faults in the kernel portion of the address space
> and another for faults in the user portion of the address space.
> 
> But, these two parts don't stick out that well.  Let's make that more
> clear from code separation and naming.  Pull kernel fault
> handling into its own helper, and reflect that naming by renaming
> spurious_fault() -> spurious_kernel_fault().
> 
> Also, rewrite the vmalloc handling comment a bit.  It was a bit
> stale and also glossed over the reserved bit handling.
> 
> Signed-off-by: Dave Hansen 
> Cc: Sean Christopherson 
> Cc: "Peter Zijlstra (Intel)" 
> Cc: Thomas Gleixner 
> Cc: x...@kernel.org
> Cc: Andy Lutomirski 
> ---
> 
> b/arch/x86/mm/fault.c |   98 
> ++
> 1 file changed, 59 insertions(+), 39 deletions(-)
> 
> diff -puN arch/x86/mm/fault.c~pkeys-fault-warnings-00 arch/x86/mm/fault.c
> --- a/arch/x86/mm/fault.c~pkeys-fault-warnings-00	2018-09-07 11:21:46.145751902 -0700
> +++ b/arch/x86/mm/fault.c	2018-09-07 11:23:37.643751624 -0700
> @@ -1033,7 +1033,7 @@ mm_fault_error(struct pt_regs *regs, uns
>}
> }
> 
> -static int spurious_fault_check(unsigned long error_code, pte_t *pte)
> +static int spurious_kernel_fault_check(unsigned long error_code, pte_t *pte)
> {
>if ((error_code & X86_PF_WRITE) && !pte_write(*pte))
>return 0;
> @@ -1072,7 +1072,7 @@ static int spurious_fault_check(unsigned
>  * (Optional Invalidation).
>  */
> static noinline int
> -spurious_fault(unsigned long error_code, unsigned long address)
> +spurious_kernel_fault(unsigned long error_code, unsigned long address)
> {
>pgd_t *pgd;
>p4d_t *p4d;
> @@ -1103,27 +1103,27 @@ spurious_fault(unsigned long error_code,
>return 0;
> 
>if (p4d_large(*p4d))
> -return spurious_fault_check(error_code, (pte_t *) p4d);
> +return spurious_kernel_fault_check(error_code, (pte_t *) p4d);
> 
>pud = pud_offset(p4d, address);
>if (!pud_present(*pud))
>return 0;
> 
>if (pud_large(*pud))
> -return spurious_fault_check(error_code, (pte_t *) pud);
> +return spurious_kernel_fault_check(error_code, (pte_t *) pud);
> 
>pmd = pmd_offset(pud, address);
>if (!pmd_present(*pmd))
>return 0;
> 
>if (pmd_large(*pmd))
> -return spurious_fault_check(error_code, (pte_t *) pmd);
> +return spurious_kernel_fault_check(error_code, (pte_t *) pmd);
> 
>pte = pte_offset_kernel(pmd, address);
>if (!pte_present(*pte))
>return 0;
> 
> -ret = spurious_fault_check(error_code, pte);
> +ret = spurious_kernel_fault_check(error_code, pte);
>if (!ret)
>return 0;
> 
> @@ -1131,12 +1131,12 @@ spurious_fault(unsigned long error_code,
> * Make sure we have permissions in PMD.
> * If not, then there's a bug in the page tables:
> */
> -ret = spurious_fault_check(error_code, (pte_t *) pmd);
> +ret = spurious_kernel_fault_check(error_code, (pte_t *) pmd);
>WARN_ONCE(!ret, "PMD has incorrect permission bits\n");
> 
>return ret;
> }
> -NOKPROBE_SYMBOL(spurious_fault);
> +NOKPROBE_SYMBOL(spurious_kernel_fault);
> 
> int show_unhandled_signals = 1;
> 
> @@ -1203,6 +1203,55 @@ static inline bool smap_violation(int er
>return true;
> }
> 
> +static void
> +do_kern_addr_space_fault(struct pt_regs *regs, unsigned long hw_error_code,
> + unsigned long address)
> +{

Can you add a comment above this documenting *when* it’s called?  Is it all 
faults, !user_mode faults, or !PF_USER?

> +/*
> + * We can fault-in kernel-space virtual memory on-demand. The
> + * 'reference' page table is init_mm.pgd.
> + *
> + * NOTE! We MUST NOT take any locks for this case. We may
> + * be in an interrupt or a critical region, and should
> + * only copy the information from the master page table,
> + * nothing more.
> + *
> + * Before doing this on-demand faulting, ensure that the
> + * fault is not any of the following:
> + * 1. A fault on a PTE with a reserved bit set.
> + * 2. A fault caused by a user-mode access.  (Do not demand-
> + *fault kernel memory due to user-mode accesses).
> + * 3. A fault caused by a page-level protection violation.
> + *(A demand fault would be on a non-present page which
> + * would have X86_PF_PROT==0).
> + */
> +if (!(hw_error_code & (X86_PF_RSVD | X86_PF_USER | X86_PF_PROT))) {
> +if (vmalloc_fault(address) >= 0)
> +return;
> +}
> +
> +/* Was the fault spurious, caused by lazy TLB invalidation? */
> +if (spurious_kernel_fault(hw_error_code, address))
> +return;
> +
> +/* kprobes don't want to hook the spurious faults: */
> +if (kprobes_fault(regs))
> +return;
> +
> +/*
> + * This 

[PATCH] include/linux/compiler*.h: add version detection to asm_volatile_goto

2018-09-07 Thread Nick Desaulniers
The comment above asm_volatile_goto mentions working around a GCC bug,
and links to a bug report that claims this has been fixed in newer
versions of GCC.  Testing shows that this was resolved in GCC 4.8.2.
asm_volatile_goto should also be defined for other compilers that
support asm goto.

Signed-off-by: Nick Desaulniers 
---
 include/linux/compiler-gcc.h   | 7 ++-
 include/linux/compiler_types.h | 4 
 2 files changed, 10 insertions(+), 1 deletion(-)

diff --git a/include/linux/compiler-gcc.h b/include/linux/compiler-gcc.h
index 763bbad1e258..149f411b4366 100644
--- a/include/linux/compiler-gcc.h
+++ b/include/linux/compiler-gcc.h
@@ -171,7 +171,7 @@
 #endif
 
 /*
- * GCC 'asm goto' miscompiles certain code sequences:
+ * GCC < 4.8.2 'asm goto' miscompiles certain code sequences:
  *
  *   http://gcc.gnu.org/bugzilla/show_bug.cgi?id=58670
  *
@@ -179,7 +179,12 @@
  *
  * (asm goto is automatically volatile - the naming reflects this.)
  */
+#if GCC_VERSION < 40802
 #define asm_volatile_goto(x...) do { asm goto(x); asm (""); } while (0)
+#else
+#define asm_volatile_goto(x...) asm goto(x)
+#endif
+
 
 /*
  * sparse (__CHECKER__) pretends to be gcc, but can't do constant
diff --git a/include/linux/compiler_types.h b/include/linux/compiler_types.h
index 3525c179698c..61449dbf30d8 100644
--- a/include/linux/compiler_types.h
+++ b/include/linux/compiler_types.h
@@ -151,6 +151,10 @@ struct ftrace_likely_data {
 #define __assume_aligned(a, ...)
 #endif
 
+#ifndef asm_volatile_goto
+#define asm_volatile_goto(x...) asm goto(x)
+#endif
+
 /* Are two types/vars the same type (ignoring qualifiers)? */
 #define __same_type(a, b) __builtin_types_compatible_p(typeof(a), typeof(b))
 
-- 
2.19.0.rc2.392.g5ba43deb5a-goog
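
For readers unfamiliar with the construct, 'asm goto' lets inline assembly
branch to a C label, which is what makes jump labels/static keys possible.
A minimal x86 sketch of a user (illustrative only; some_flag and
fast_path_enabled() are hypothetical, not code from this patch):

static int some_flag;

static __always_inline bool fast_path_enabled(void)
{
        /* Branch straight from inline assembly to a C label. */
        asm_volatile_goto("cmpl $0, %0\n\t"
                          "jne %l[l_yes]"
                          : /* no outputs allowed with asm goto */
                          : "m" (some_flag) : "cc" : l_yes);
        return false;
l_yes:
        return true;
}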



Re: [PATCH 1/2] of/fdt: Allow architectures to override CONFIG_CMDLINE logic

2018-09-07 Thread Rob Herring
On Fri, Sep 7, 2018 at 4:01 PM Paul Burton  wrote:
>
> Hi Rob,
>
> On Fri, Sep 07, 2018 at 03:29:03PM -0500, Rob Herring wrote:
> > On Fri, Sep 7, 2018 at 1:55 PM Paul Burton  wrote:
> > > The CONFIG_CMDLINE-related logic in early_init_dt_scan_chosen() falls
> > > back to copying CONFIG_CMDLINE into boot_command_line/data if the DT has
> > > a /chosen node but that node has no bootargs property or a bootargs
> > > property of length zero.
> >
> > The Risc-V guys found a similar issue if chosen is missing[1]. I
> > started a patch[2] to address that, but then looking at the different
> > arches wasn't sure if I'd break something. I don't recall for sure,
> > but it may have been MIPS that worried me.
> >
> > > This is problematic for the MIPS architecture because we support
> > > concatenating arguments from either the DT or the bootloader with those
> > > from CONFIG_CMDLINE, but the behaviour of early_init_dt_scan_chosen()
> > > gives us no way of knowing whether boot_command_line contains arguments
> > > from DT or already contains CONFIG_CMDLINE. This can lead to us
> > > concatenating CONFIG_CMDLINE with itself, duplicating command line
> > > arguments which can be problematic (eg. for earlycon which will attempt
> > > to register the same console twice & warn about it).
> >
> > If CONFIG_CMDLINE_EXTEND is set, you know it contains CONFIG_CMDLINE.
> > But I guess part of the problem is MIPS using its own kconfig options.
>
> Yes, that's part of the problem but there's more:
>
>   - We can also take arguments from the bootloader/prom, so it's not
> just DT or CONFIG_CMDLINE as taken into account by
> early_init_dt_scan_chosen(). MIPS has options to concatenate various
> combinations of DT, bootloader & CONFIG_CMDLINE arguments & there's
> no mapping to move all of them to just CONFIG_CMDLINE_EXTEND &
> CONFIG_CMDLINE_OVERRIDE.
>
>   - Some MIPS systems don't use DT, so don't run
> early_init_dt_scan_chosen() at all. Thus the architecture code still
> needs to deal with the bootloader/prom & CONFIG_CMDLINE cases
> anyway. In a perfect world we'd migrate all systems to use DT but in
> this world I don't see that happening until we kill off support for
> some of the older systems, which always seems contentious. Even then
> there'd be a lot of work for some of the remaining systems. I guess
> we could enable DT for these systems & only use it for the command
> line, it just feels like overkill.

We have the same thing with old systems on ARM. I think the big
difference is the code paths are well separated. We have support for
old bootloaders which populates old boot parameters (atags) into DT
(and can replace or extend DT bootargs). That's all in the zImage
decompressor wrapper. Then in the kernel, there are 2 separate paths
depending whether you have atags or DT (setup_machine_tags and
setup_machine_fdt). I think the difference is we've imposed rules such
as the bootloader can only pass parameters one of 2 ways and if DT is
used then the bootargs must come from DT.

> > > Move the CONFIG_CMDLINE-related logic to a weak function that
> > > architectures can provide their own version of, such that we continue to
> > > use the existing logic for architectures where it's suitable but also
> > > allow MIPS to override this behaviour such that the architecture code
> > > knows when CONFIG_CMDLINE is used.
> >
> > More arch specific functions is not what I want. Really, all the
> > cmdline manipulating logic doesn't belong in DT code, but it shouldn't
> > be in the arch specific code either IMO. Really it should be some
> > common kernel function which calls into the DT code to retrieve the DT
> > bootargs and that's it. Then you can skip calling that kernel function
> > if you really need non-standard handling.
>
> That would make sense.
>
> > Perhaps you should consider filling DT bootargs with the cmdline from
> > bootloader. IOW, make the legacy case look like the non-legacy case
> > early, and then the kernel doesn't have to deal with both cases later
> > on.
>
> That could work, but would force us to use DT universally & is a larger
> change than this, which I was hoping to get in 4.19 to fix the
> regression described in patch 2 that happened back in 4.16. But if
> you're unhappy with this perhaps we just have to live with it a little
> longer...

You should probably at least do just the revert as that was what
worked for some platforms. But I guess you enabled some platforms and
just reverting would break them after working for a couple of cycles.

> An alternative workaround to this that would contain the regression fix
> within arch/mips/ would be to initialize boot_command_line to a single
> space character prior to early_init_dt_scan_chosen(), which would
> prevent early_init_dt_scan_chosen() from copying CONFIG_CMDLINE there.
> That smells much more like a hack to me than this patch though, so I'd
> rather not.

That doesn't seem that bad to me. 
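
For context, the generic logic under discussion sits at the end of
early_init_dt_scan_chosen(); roughly, paraphrased from drivers/of/fdt.c of
this era ('data' points at boot_command_line):

        p = of_get_flat_dt_prop(node, "bootargs", &l);
        if (p != NULL && l > 0)
                strlcpy(data, p, min((int)l, COMMAND_LINE_SIZE));

#ifdef CONFIG_CMDLINE
#if defined(CONFIG_CMDLINE_EXTEND)
        strlcat(data, " ", COMMAND_LINE_SIZE);
        strlcat(data, CONFIG_CMDLINE, COMMAND_LINE_SIZE);
#elif defined(CONFIG_CMDLINE_FORCE)
        strlcpy(data, CONFIG_CMDLINE, COMMAND_LINE_SIZE);
#else
        /* No arguments from boot loader, use kernel's cmdline */
        if (!((char *)data)[0])
                strlcpy(data, CONFIG_CMDLINE, COMMAND_LINE_SIZE);
#endif
#endif

It is that last fallback branch which copies CONFIG_CMDLINE into
boot_command_line when the DT bootargs property is absent or empty, and
which the MIPS concatenation logic cannot distinguish from DT-provided
arguments.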

[Patch] acpi_power_meter: remove 'ignoring unsafe software power cap' message

2018-09-07 Thread Max Asbock
At boot time the acpi_power_meter driver greets users of non-IBM systems with 
the message: 

"Ignoring unsafe software power cap". 

This message is generally interpreted as meaning: The system is operating under 
an unsafe power cap and Linux is ignoring this fact, thus living dangerously. 
It, or its presumed origin, has been blamed for system crashes. People spent 
time writing knowledge base articles which explain that the message doesn't 
mean what users think. I now have to write another such article telling people 
to ignore this message. To avoid future confusion and unnecessary work, I think 
the best solution is to remove the message. 

Here is my translation of the 'ignoring unsafe power cap' message:
'The acpi_power_meter driver discovered an ACPI object that would in theory 
allow software to set power caps. The driver could now create files in sysfs to 
expose this interface to user space, but it chooses not to do so'.

Patch: Remove the error message because it is generally misinterpreted. A
replacement for the message is not necessary.

Signed-off-by: Max Asbock 

diff --git a/drivers/hwmon/acpi_power_meter.c b/drivers/hwmon/acpi_power_meter.c
index 34e45b9..578e886 100644
--- a/drivers/hwmon/acpi_power_meter.c
+++ b/drivers/hwmon/acpi_power_meter.c
@@ -693,11 +693,8 @@ static int setup_attrs(struct acpi_power_meter_resource 
*resource)
}
 
if (resource->caps.flags & POWER_METER_CAN_CAP) {
-   if (!can_cap_in_hardware()) {
-   dev_err(&resource->acpi_dev->dev,
-   "Ignoring unsafe software power cap!\n");
+   if (!can_cap_in_hardware())
goto skip_unsafe_cap;
-   }
 
if (resource->caps.configurable_cap)
res = register_attrs(resource, rw_cap_attrs);
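
For context, the guard tested above looks roughly like this (paraphrased
from the driver; cap_in_hardware is set from a DMI match and force_cap_on
is a module parameter, so users who really want the software cap can still
opt in):

static bool force_cap_on;       /* module parameter, defaults to false */
static bool cap_in_hardware;    /* set when DMI identifies a known system */

static int can_cap_in_hardware(void)
{
        return force_cap_on || cap_in_hardware;
}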




