date:20180316

RE: [PATCH v5 0/2] Remove false-positive VLAs when using max()

2018-03-16 Thread David Laight

From: Linus Torvalds
> Sent: 16 March 2018 17:29
> On Fri, Mar 16, 2018 at 4:47 AM, Florian Weimer  wrote:
> >
> > If you want to catch stack frames which have unbounded size,
> > -Werror=stack-usage=1000 or -Werror=vla-larger-than=1000 (with the constant
> > adjusted as needed) might be the better approach.
> 
> No, we want to catch *variable* stack sizes.
> 
> Does "-Werror=vla-larger-than=0" perhaps work for that? No, because
> the stupid compiler says that is "meaningless".
> 
> And no, using "-Werror=vla-larger-than=1" doesn't work either, because
> the moronic compiler continues to think that "vla" is about the
> _type_, not the code:
> 
>t.c: In function ‘test’:
>t.c:6:6: error: argument to variable-length array is too large
> [-Werror=vla-larger-than=]
>  int array[(1,100)];
> 
> Gcc people are crazy.
> 
> Is there really no way to just say "shut up about the stupid _syntax_
> issue that is entirely irrelevant, and give us the _code_ issue".

I looked at the generated code for one of the constant-sized VLAs that
the compiler barfed at.
It seemed to subtract constants from %sp separately for the VLA.
So it looks like the compiler treats them as VLAs even though it
knows the size.
That is probably a missing optimisation.

David

Re: [PATCH 8/9] x86/dumpstack: Save first regs set for the executive summary

2018-03-16 Thread Borislav Petkov

On Fri, Mar 16, 2018 at 10:22:29AM -0700, Linus Torvalds wrote:
> The reason we do that
> 
> printk(KERN_DEFAULT "CR2: %016lx\n", address);
> 
> is because WE ARE NOT PRINTING OUT THE CURRENT CR2 REGISTER!

Whoopsie!

Doh, __show_regs() reads CR2 again and there's a big fat window
in-between...

> This is really damn important.
> 
> The "address" register contains the CR2 value as it was read *very*
> early in the page fault case, before we enabled interrupts, and before
> we did various random things that can cause further page faults and
> change CR2!
> 
> So the executive summary that does __show_regs() may end up showing
> something completely different than the actual faulting address,
> because we might have taken a vmalloc-space exception in the meantime,
> for example.
> 
> Do *NOT* get rid of that thing.

Reverted.

> You're better off getting rid of the CR2 line from __show_regs(),
> because it can be dangerously confusing. It's not actually part of the
> saved register state at all, it's something entirely different. It's
> like showing the current eflags rather than the eflags saved on the
> faulting stack.

Yeah, __show_regs() goes and gets a bunch of registers at the time
__show_regs() runs. Which is ok for those which don't change in between
but CR2 is special.

We probably could improve that situation by having a struct fault_regs
or so wrapping pt_regs and adding a bunch of fields like CR2 etc. Fault
handlers would then populate fault_regs at fault time while we're atomic
and then hand this struct down to the printing path.

The printing path would fill out the rest and this way we won't have any
of that monkey business anymore.

Thoughts?

-- 
Regards/Gruss,
Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.
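
The fault_regs idea floated above can be modelled in user space. This is
only a minimal sketch: the hw_cr2 variable stands in for the live CR2
register, pt_regs is a trimmed-down placeholder, and struct fault_regs,
capture_fault_regs() and reported_cr2() are hypothetical names, not
existing kernel API.

```c
#include <assert.h>

/* Stand-in for the live CR2 register; any later fault overwrites it. */
static unsigned long hw_cr2;

/* Trimmed-down placeholder for the saved register frame. */
struct pt_regs {
	unsigned long ip;
	unsigned long sp;
};

/* Hypothetical wrapper: the saved frame plus state valid only at fault time. */
struct fault_regs {
	struct pt_regs regs;
	unsigned long cr2;
};

/* Meant to run first thing in the fault handler, while still atomic. */
static void capture_fault_regs(struct fault_regs *fr, const struct pt_regs *regs)
{
	assert(fr && regs);
	fr->regs = *regs;
	fr->cr2 = hw_cr2;	/* snapshot now; the register may change later */
}

/* The printing path reads the snapshot and never the live register. */
static unsigned long reported_cr2(const struct fault_regs *fr)
{
	assert(fr);
	return fr->cr2;
}
```

With this shape, a second fault taken between capture and printing changes
hw_cr2 but not the value that gets reported.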

Re: [PATCH] staging: typec: rt1711h typec chip driver

2018-03-16 Thread 李書帆

Hi Heikki,

2018-03-16 23:05 GMT+08:00 Heikki Krogerus :
> Hi ShuFan,
>
> On Fri, Mar 16, 2018 at 05:12:49PM +0800, ShuFan Lee wrote:
>> +static int rt1711h_init_gpio(struct rt1711h_chip *chip)
>> +{
>> + int ret;
>> + struct device_node *np = chip->dev->of_node;
>> +
>> + ret = of_get_named_gpio(np, "rt,intr_gpio", 0);
>> + if (ret < 0) {
>> + dev_err(chip->dev, "%s get int gpio fail(%d)\n", __func__, ret);
>> + return ret;
>> + }
>> + chip->irq_gpio = ret;
>> +
>> + ret = devm_gpio_request_one(chip->dev, chip->irq_gpio, GPIOF_IN,
>> + dev_name(chip->dev));
>> + if (ret < 0) {
>> + dev_err(chip->dev, "%s request gpio fail(%d)\n", __func__, ret);
>> + return ret;
>> + }
>> +
>> + chip->irq = gpio_to_irq(chip->irq_gpio);
>> + if (chip->irq <= 0) {
>> + dev_err(chip->dev, "%s gpio2irq fail(%d)\n", __func__,
>> + chip->irq);
>> + return -EINVAL;
>> + }
>> + return 0;
>
> "rt,intr_gpio" should probably be "rt,intr-gpio". Then this function
> can be prepared for all types of platforms:
>
> static int rt1711h_init_gpio(struct rt1711h_chip *chip)
> {
> struct gpio_desc *gpio;
>
> gpio = devm_gpiod_get(chip->dev, "rt,intr", GPIOD_IN);
> if (IS_ERR(gpio))
> return PTR_ERR(gpio);
>
> chip->irq = gpiod_to_irq(gpio);
> if (chip->irq < 0)
> return chip->irq;
>
> return 0;
> }
>
>
> Thanks,
>
> --
> heikki

  Thank you, I've changed it in PATCH v2.

  May I add you to Suggested-by list?

-- 
Best Regards,
書帆

[PATCH] tpm: TPM 2.0 selftest performance improvement

2018-03-16 Thread Nayna Jain

For selftest being run in the background, the TCG 2.0 Specification
provides the command TPM2_GetTestResult to check the status of selftest
completion.

When the partial selftest command is sent just after TPM initialization,
it is observed that it returns RC_COMMAND_CODE error, which as per TPM 2.0
Specification, indicates "the response code that is returned if the TPM is
unmarshalling a value that it expects to be a TPM_CC and the input value is
not in the table." This doesn't indicate the exact status of selftest
command on TPM. But, it can be verified by sending the TPM2_GetTestResult.

This patch implements the TPM2_GetTestResult command and uses it to check
the selftest status, before sending the full selftest command after partial
selftest returns RC_COMMAND_CODE.

With this change, dmesg shows the TPM selftest completed at 1.243864
compared with the previous 1.939667 time.

Signed-off-by: Nayna Jain 
Tested-by: Mimi Zohar  (on Pi with TPM 2.0)
Signed-off-by: Mimi Zohar 
---
 drivers/char/tpm/tpm.h  |  2 ++
 drivers/char/tpm/tpm2-cmd.c | 59 +
 2 files changed, 61 insertions(+)

diff --git a/drivers/char/tpm/tpm.h b/drivers/char/tpm/tpm.h
index 82ae7b722161..d95eeb7c002a 100644
--- a/drivers/char/tpm/tpm.h
+++ b/drivers/char/tpm/tpm.h
@@ -107,6 +107,7 @@ enum tpm2_return_codes {
TPM2_RC_FAILURE = 0x0101,
TPM2_RC_DISABLED= 0x0120,
TPM2_RC_COMMAND_CODE= 0x0143,
+   TPM2_RC_NEEDS_TEST  = 0x0153,
TPM2_RC_TESTING = 0x090A, /* RC_WARN */
TPM2_RC_REFERENCE_H0= 0x0910,
 };
@@ -135,6 +136,7 @@ enum tpm2_command_codes {
TPM2_CC_FLUSH_CONTEXT   = 0x0165,
TPM2_CC_GET_CAPABILITY  = 0x017A,
TPM2_CC_GET_RANDOM  = 0x017B,
+   TPM2_CC_GET_TEST_RESULT = 0x017C,
TPM2_CC_PCR_READ= 0x017E,
TPM2_CC_PCR_EXTEND  = 0x0182,
TPM2_CC_LAST= 0x018F,
diff --git a/drivers/char/tpm/tpm2-cmd.c b/drivers/char/tpm/tpm2-cmd.c
index 89a5397b18d2..494f6dfbc65d 100644
--- a/drivers/char/tpm/tpm2-cmd.c
+++ b/drivers/char/tpm/tpm2-cmd.c
@@ -823,6 +823,50 @@ unsigned long tpm2_calc_ordinal_duration(struct tpm_chip *chip, u32 ordinal)
 EXPORT_SYMBOL_GPL(tpm2_calc_ordinal_duration);
 
 /**
+ * tpm2_get_selftest_result() - get the status of self tests
+ *
+ * @chip: TPM chip to use
+ *
+ * Return: If error return rc, else return the result of the self tests.
+ * TPM_RC_NEEDS_TEST: No self tests have been run yet. Needs testing.
+ * TPM_RC_TESTING: Self tests are in progress.
+ * TPM_RC_SUCCESS: Self tests completed successfully.
+ * TPM_RC_FAILURE: Self tests completed with a failure.
+ *
+ * This function can be used to check the status of self tests on the TPM.
+ */
+static int tpm2_get_selftest_result(struct tpm_chip *chip)
+{
+   struct tpm_buf buf;
+   int rc;
+   int test_result;
+   uint16_t data_size;
+   int len;
+   const struct tpm_output_header *header;
+
+   rc = tpm_buf_init(&buf, TPM2_ST_NO_SESSIONS, TPM2_CC_GET_TEST_RESULT);
+   if (rc)
+   return rc;
+
+   len = tpm_transmit(chip, NULL, buf.data, PAGE_SIZE, 0);
+   if (len <  0)
+   return len;
+
+   header = (struct tpm_output_header *)buf.data;
+
+   rc = be32_to_cpu(header->return_code);
+   if (rc)
+   return rc;
+
+   data_size = be16_to_cpup((__be16 *)&buf.data[TPM_HEADER_SIZE]);
+
+   test_result = be32_to_cpup((__be32 *)
+   (&buf.data[TPM_HEADER_SIZE + 2 + data_size]));
+
+   return test_result;
+}
+
+/**
  * tpm2_do_selftest() - ensure that all self tests have passed
  *
  * @chip: TPM chip to use
@@ -851,10 +895,25 @@ static int tpm2_do_selftest(struct tpm_chip *chip)
  "attempting the self test");
tpm_buf_destroy(&buf);
 
+   dev_dbg(&chip->dev, "tpm selftest command returned %04x\n", rc);
if (rc == TPM2_RC_TESTING)
rc = TPM2_RC_SUCCESS;
if (rc == TPM2_RC_INITIALIZE || rc == TPM2_RC_SUCCESS)
return rc;
+
+   if (rc == TPM2_RC_COMMAND_CODE) {
+
+   dev_info(&chip->dev, "Check TPM Test Results\n");
+   rc = tpm2_get_selftest_result(chip);
+
+   dev_info(&chip->dev, "tpm self test result is %04x\n",
+   rc);
+   if (rc == TPM2_RC_TESTING)
+   rc = TPM2_RC_SUCCESS;
+   if (rc == TPM2_RC_INITIALIZE || rc == TPM2_RC_SUCCESS
+   || rc == TPM2_RC_FAILURE)
+   return rc;
+   }
}
 
return rc;
-- 
2.13.6
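
The decision flow described in the changelog above can be sketched in plain
C. This is an illustration only: the return code values are the ones listed
in the tpm.h hunk (TPM2_RC_SUCCESS and TPM2_RC_INITIALIZE are assumed to be
0 and 0x0100), while selftest_settled() and the two fake_result_*() helpers
are hypothetical stand-ins for the driver code and for
tpm2_get_selftest_result().

```c
#include <assert.h>

/* Return codes as listed in the tpm.h hunk; SUCCESS/INITIALIZE are assumed. */
enum tpm2_rc {
	TPM2_RC_SUCCESS      = 0x0000,
	TPM2_RC_INITIALIZE   = 0x0100,
	TPM2_RC_FAILURE      = 0x0101,
	TPM2_RC_COMMAND_CODE = 0x0143,
	TPM2_RC_NEEDS_TEST   = 0x0153,
	TPM2_RC_TESTING      = 0x090A,
};

/* Hypothetical callback standing in for tpm2_get_selftest_result(). */
typedef int (*get_result_fn)(void);

/* Fake TPM answers used to exercise the logic. */
static int fake_result_testing(void)    { return TPM2_RC_TESTING; }
static int fake_result_needs_test(void) { return TPM2_RC_NEEDS_TEST; }

/*
 * Interpret the return code of the partial self test.
 * Returns 1 when *rc is final, 0 when the caller should go on and
 * send the full self test command.
 */
static int selftest_settled(int *rc, get_result_fn get_result)
{
	assert(rc && get_result);

	if (*rc == TPM2_RC_TESTING)	/* running in the background */
		*rc = TPM2_RC_SUCCESS;
	if (*rc == TPM2_RC_INITIALIZE || *rc == TPM2_RC_SUCCESS)
		return 1;

	if (*rc == TPM2_RC_COMMAND_CODE) {
		/* The code says nothing about test status, so ask the TPM. */
		*rc = get_result();
		if (*rc == TPM2_RC_TESTING)
			*rc = TPM2_RC_SUCCESS;
		/*
		 * Each alternative needs its own comparison against *rc;
		 * a bare non-zero constant in an || chain is always true.
		 */
		if (*rc == TPM2_RC_INITIALIZE || *rc == TPM2_RC_SUCCESS ||
		    *rc == TPM2_RC_FAILURE)
			return 1;
	}
	return 0;
}
```

A caller would loop over the partial and full self test commands and stop
as soon as selftest_settled() returns 1.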

Re: [PATCH 6/8] trace_uprobe/sdt: Fix multiple update of same reference counter

2018-03-16 Thread Oleg Nesterov

On 03/16, Ravi Bangoria wrote:
>
> On 03/15/2018 08:19 PM, Oleg Nesterov wrote:
> > On 03/13, Ravi Bangoria wrote:
> >> For tiny binaries/libraries, different mmap regions points to the
> >> same file portion. In such cases, we may increment reference counter
> >> multiple times.
> > Yes,
> >
> >> But while de-registration, reference counter will get
> >> decremented only by once
> > could you explain why this happens? sdt_increment_ref_ctr() and
> > sdt_decrement_ref_ctr() look symmetrical, _decrement_ should see
> > the same mappings?

...

>     # strace -o out python
>       mmap(NULL, 2738968, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7fff9246
>       mmap(0x7fff926a, 327680, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x23) = 0x7fff926a
>   mprotect(0x7fff926a, 65536, PROT_READ) = 0

Ah, in this case everything is clear, thanks.

I was confused by the changelog, I misinterpreted it as if inc/dec are not
balanced in case of multiple mappings even if the application doesn't play
with mmap/mprotect/etc.

And it seems that you are trying to confuse yourself, not only me ;) Just
suppose that an application does mmap+munmap in a loop and the mapped region
contains uprobe but not the counter.

And this all makes me think that we should do something else. Ideally,
install_breakpoint() and remove_breakpoint() should inc/dec the counter
if they do not fail...

Btw, why do we need a counter, not a boolean? Who else can modify it?
Or different uprobes can share the same counter?

Oleg.
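
The "inc/dec only if install/remove did not fail" idea, and the counter
versus boolean question, can be illustrated with a small user-space model.
It is a sketch under assumptions: struct ref_ctr, install_probe(),
remove_probe() and tracing_enabled() are hypothetical names, and the
arm_ok/disarm_ok flags stand in for the result of install_breakpoint() and
remove_breakpoint(). Whether several uprobes really share one counter is
the open question above; the model simply assumes they can.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the SDT reference counter. */
struct ref_ctr {
	unsigned int count;
};

/* Bump the counter only when the breakpoint was really installed. */
static bool install_probe(struct ref_ctr *ctr, bool arm_ok)
{
	assert(ctr);
	if (!arm_ok)
		return false;
	ctr->count++;
	return true;
}

/* Drop the counter only when the breakpoint was really removed. */
static bool remove_probe(struct ref_ctr *ctr, bool disarm_ok)
{
	assert(ctr);
	if (!disarm_ok || ctr->count == 0)
		return false;
	ctr->count--;
	return true;
}

/* The traced program treats any non-zero value as "enabled". */
static bool tracing_enabled(const struct ref_ctr *ctr)
{
	assert(ctr);
	return ctr->count > 0;
}
```

With two probes on one counter, removing the first leaves tracing enabled,
which a plain boolean could not express.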

Re: [PATCHv2 5/5] arm64: allwinner: a64: Add support for TERES-I laptop

2018-03-16 Thread afzal mohammed

Hi,

On Fri, Mar 16, 2018 at 12:07:53PM +0530, afzal mohammed wrote:

> Received only patch 4 & 5 in my inbox, receive path was via
> linux-kernel rather than linux-arm-kernel, but in both archives all
> patches are seen (though threading seems not right), probably missing
> patches are due to issue gmail have with LKML,

The cover letter plus patches 1-3 were swallowed by the spam filter, and so
was your reply to me on the v1 cover letter subthread; dunno whether it has
something to do with your mail header contents.

afzal

Re: [PATCH 2/2] kprobe: fix: Add ftrace_ops_assist_func to kprobe blacklist

2018-03-16 Thread Mathieu Desnoyers

- On Mar 16, 2018, at 12:48 PM, rostedt rost...@goodmis.org wrote:

> On Fri, 16 Mar 2018 12:41:34 -0400
> Steven Rostedt  wrote:
> 
>> Yes, kprobes are dangerous. I'm not saying it shouldn't be fixed, I'm
>> saying that I don't have time to fix it now, but would be happy to
>> accept patches if someone else does so.
> 
> And looking at what I replied before for the original patch. It would
> probably be a good idea to blacklist directories. Like we do with
> function tracing. We probably should black list both kernel/tracing and
> kernel/events from being probed.
> 
> Did this come up at plumbers? You were there too, I don't remember
> discussing it there.

I don't remember this coming up at the last Plumbers or KS either, given
that we were focused on other topics.

Would the general approach you envision be based on emitting all code
generated by compilation of all objects under kernel/tracing and
kernel/events into a specific "nokprobes" text section of the kernel ?
Perhaps we could create a specific linker script for those directories,
or do you have in mind a neater way to do this ?

Thanks,

Mathieu

> 
> -- Steve

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
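
Blacklisting whole regions of text, as discussed above, boils down to a
range check at probe registration time. The sketch below is a user-space
illustration only: the address ranges are made up, and struct text_range,
addr_is_blacklisted() and register_probe() are hypothetical names rather
than the kprobes API. In a real implementation the bounds would come from
linker-provided section start and end symbols.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Half-open address range [start, end). */
struct text_range {
	uintptr_t start;
	uintptr_t end;
};

/* Made-up bounds standing in for the "nokprobes" text regions. */
static const struct text_range noprobe_ranges[] = {
	{ 0x1000, 0x2000 },	/* e.g. objects built from kernel/tracing */
	{ 0x8000, 0x9000 },	/* e.g. objects built from kernel/events */
};

static bool addr_is_blacklisted(uintptr_t addr)
{
	size_t i;

	for (i = 0; i < sizeof(noprobe_ranges) / sizeof(noprobe_ranges[0]); i++) {
		const struct text_range *r = &noprobe_ranges[i];

		assert(r->start < r->end);	/* table sanity check */
		if (addr >= r->start && addr < r->end)
			return true;
	}
	return false;
}

/* Hypothetical registration hook: refuse probes inside a blacklisted range. */
static int register_probe(uintptr_t addr)
{
	return addr_is_blacklisted(addr) ? -1 : 0;
}
```

The point of the design is that new functions added under those directories
are covered automatically, without annotating each one.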

Re: arc_usr_cmpxchg and preemption

2018-03-16 Thread Vineet Gupta


On 03/16/2018 10:33 AM, Alexey Brodkin wrote:
> Hi Peter, Vineet,
>
> On Wed, 2018-03-14 at 18:53 +0100, Peter Zijlstra wrote:
> > On Wed, Mar 14, 2018 at 09:58:19AM -0700, Vineet Gupta wrote:
> > > Well it is broken wrt the semantics the syscall is supposed to provide.
> > > Preemption disabling is what prevents a concurrent thread from coming in and
> > > modifying the same location (Imagine a variable which is being cmpxchg
> > > concurrently by 2 threads).
> > >
> > > One approach is to do it the MIPS way, emulate the llsc flag - set it under
> > > preemption disabled section and clear it in switch_to
> >
> > *shudder*... just catch the -EFAULT, force the write fault and retry.
>
> More I look at this initially quite simple thing more it looks like
> a can of worms...

I'd say just bite the bullet, write the patch and we can refine it there !

-Vineet
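
For reference, the semantics the emulated syscall has to provide are those
of an ordinary compare-and-exchange. The sketch below shows them in user
space with C11 atomics, which is an assumption made for illustration: the
kernel emulation under discussion relies on disabling preemption rather
than on an atomic instruction, and emulated_cmpxchg() is a hypothetical
name.

```c
#include <assert.h>
#include <stdatomic.h>

/*
 * Store new_val into *uaddr only if it currently holds expected.
 * Returns the value that was found at *uaddr, so the caller can tell
 * success (return value == expected) from failure.
 */
static int emulated_cmpxchg(_Atomic int *uaddr, int expected, int new_val)
{
	int old = expected;

	assert(uaddr);
	/* On failure the observed value is written back into old. */
	atomic_compare_exchange_strong(uaddr, &old, new_val);
	return old;
}
```

If a second thread modifies the word between the compare and the store,
one of the two updates is silently lost, which is the breakage described
above.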

[PATCH 1/7] dt-bindings: add compatible string for the A64 DE2 CCU

2018-03-16 Thread Icenowy Zheng

The Allwinner A64 SoC has a DE2 CCU like the one in the DE2 of Allwinner
H5 SoC.

Add a compatible string for it.

Signed-off-by: Icenowy Zheng 
---
 Documentation/devicetree/bindings/clock/sun8i-de2.txt | 1 +
 1 file changed, 1 insertion(+)

diff --git a/Documentation/devicetree/bindings/clock/sun8i-de2.txt b/Documentation/devicetree/bindings/clock/sun8i-de2.txt
index f2fa87c4765c..e94582e8b8a9 100644
--- a/Documentation/devicetree/bindings/clock/sun8i-de2.txt
+++ b/Documentation/devicetree/bindings/clock/sun8i-de2.txt
@@ -6,6 +6,7 @@ Required properties :
- "allwinner,sun8i-a83t-de2-clk"
- "allwinner,sun8i-h3-de2-clk"
- "allwinner,sun8i-v3s-de2-clk"
+   - "allwinner,sun50i-a64-de2-clk"
- "allwinner,sun50i-h5-de2-clk"
 
 - reg: Must contain the registers base address and length
-- 
2.15.1

[PATCH 0/7] Allwinner A64 DE2 CCU support with dedicated DE2 bus driver

2018-03-16 Thread Icenowy Zheng

This patchset tries to implement the Allwinner A64 DE2 as a bus driver,
in order to model the fact that the SRAM claim controls the access to
the whole DE2 memory space.

PATCH 1 and PATCH 4 are for the CCU part.

PATCH 2 is the device tree binding for the A64 DE2 bus, and PATCH 3
implements the bus driver.

PATCH 5 is a modified version of A64 DE2 CCU patch, which uses the A64
DE2 bus.

PATCH 6 and 7 are just the simplefb patches for A64.

Icenowy Zheng (7):
  dt-bindings: add compatible string for the A64 DE2 CCU
  dt-bindings: add binding for the Allwinner A64 DE2 bus
  bus: add bus driver for accessing Allwinner A64 DE2
  clk: sunxi-ng: add A64 compatible string
  arm64: allwinner: a64: add DE2 CCU related device tree nodes
  arm64: allwinner: a64: add simplefb for A64 SoC
  arm64: allwinner: a64: add HDMI regulator to all DTs' simplefb_hdmi

 .../devicetree/bindings/bus/sun50i-de2-bus.txt | 37 
 .../devicetree/bindings/clock/sun8i-de2.txt|  1 +
 .../boot/dts/allwinner/sun50i-a64-bananapi-m64.dts |  4 ++
 .../boot/dts/allwinner/sun50i-a64-nanopi-a64.dts   |  4 ++
 .../boot/dts/allwinner/sun50i-a64-olinuxino.dts|  4 ++
 .../boot/dts/allwinner/sun50i-a64-orangepi-win.dts |  4 ++
 .../arm64/boot/dts/allwinner/sun50i-a64-pine64.dts |  4 ++
 .../dts/allwinner/sun50i-a64-sopine-baseboard.dts  |  4 ++
 arch/arm64/boot/dts/allwinner/sun50i-a64.dtsi  | 68 ++
 drivers/bus/Kconfig| 10 
 drivers/bus/Makefile   |  1 +
 drivers/bus/sun50i-de2.c   | 49 
 drivers/clk/sunxi-ng/ccu-sun8i-de2.c   | 11 ++--
 13 files changed, 194 insertions(+), 7 deletions(-)
 create mode 100644 Documentation/devicetree/bindings/bus/sun50i-de2-bus.txt
 create mode 100644 drivers/bus/sun50i-de2.c

-- 
2.15.1

Re: [PATCH v7 10/42] clk: davinci: New driver for davinci PSC clocks

2018-03-16 Thread Stephen Boyd

Quoting Bartosz Golaszewski (2018-02-28 04:38:38)
> 2018-02-19 21:21 GMT+01:00 David Lechner :
> 
> I believe there to be two issues: one is with v7 - we need to increase
> the clock reference count in davinci_psc_genpd_attach_dev().
> 
> Second is the error path in the clock framework - we should remove the
> destroyed clk_core from the debug list, which is not being done now.
> 
> Why we even need to track the refcount of clk_core is a mystery for me
> though. Stephen, Mike?
> 

Which part of the code are we talking about? I see that
__clk_core_init() calls clk_debug_register() when ret == 0 and that
looks fine. I do wonder why clk_debug_register() even returns a value
though because we ignore it.

Re: [PATCH v5 0/2] Remove false-positive VLAs when using max()

2018-03-16 Thread Al Viro

On Fri, Mar 16, 2018 at 10:29:16AM -0700, Linus Torvalds wrote:
>t.c: In function ‘test’:
>t.c:6:6: error: argument to variable-length array is too large
> [-Werror=vla-larger-than=]
>  int array[(1,100)];
> 
> Gcc people are crazy.

That's not them, that's C standard regarding ICE.  1,100 is *not* a
constant expression as far as the standard is concerned, and that
type is actually a VLA with the size that can be optimized into
a compiler-calculated value.

Would you argue that in
void foo(char c)
{
int a[(c<<1) + 10 - c + 2 - c];

a is not a VLA?  Sure, compiler probably would be able to reduce
that expression to 12, but demanding that to be recognized means
that compiler must do a bunch of optimizations in the middle of
typechecking.

expr, constant_expression is not a constant_expression.  And in
this particular case the standard is not insane - the only reason
for using that is typechecking and _that_ can be achieved without
violating 6.6p6:
sizeof(expr,0) * 0 + ICE
*is* an integer constant expression, and it gives you exact same
typechecking.  So if somebody wants to play odd games, they can
do that just fine, without complicating the logics for compilers...

[PATCH 2/7] dt-bindings: add binding for the Allwinner A64 DE2 bus

2018-03-16 Thread Icenowy Zheng

All the sub-blocks of the Allwinner A64 DE2 need the SRAM C on the A64 SoC
to be claimed, otherwise the whole DE2 space is inaccessible.

Add a device tree binding of the DE2 part as a sub-bus.

Signed-off-by: Icenowy Zheng 
---
 .../devicetree/bindings/bus/sun50i-de2-bus.txt | 37 ++
 1 file changed, 37 insertions(+)
 create mode 100644 Documentation/devicetree/bindings/bus/sun50i-de2-bus.txt

diff --git a/Documentation/devicetree/bindings/bus/sun50i-de2-bus.txt b/Documentation/devicetree/bindings/bus/sun50i-de2-bus.txt
new file mode 100644
index ..87dfb33fb3be
--- /dev/null
+++ b/Documentation/devicetree/bindings/bus/sun50i-de2-bus.txt
@@ -0,0 +1,37 @@
+Device tree bindings for Allwinner A64 DE2 bus
+
+The Allwinner A64 DE2 is on a special bus, which needs a SRAM region (SRAM C)
+to be claimed for enabling the access.
+
+Required properties:
+
+ - compatible: Should contain "allwinner,sun50i-a64-de2"
+ - reg:A resource specifier for the register space
+ - #address-cells: Must be set to 1
+ - #size-cells:Must be set to 1
+ - ranges: Must be set up to map the address space inside the
+   DE2, for the sub-blocks of DE2.
+ - allwinner,sram: the SRAM that needs to be claimed
+
+Example:
+
+   de2@100 {
+   compatible = "allwinner,sun50i-a64-de2";
+   reg = <0x100 0x40>;
+   allwinner,sram = <&de2_sram 1>;
+   #address-cells = <1>;
+   #size-cells = <1>;
+   ranges = <0 0x100 0x40>;
+
+   display_clocks: clock@0 {
+   compatible = "allwinner,sun50i-a64-de2-clk";
+   reg = <0x0 0x10>;
+   clocks = <&ccu CLK_DE>,
+<&ccu CLK_BUS_DE>;
+   clock-names = "mod",
+ "bus";
+   resets = <&ccu RST_BUS_DE>;
+   #clock-cells = <1>;
+   #reset-cells = <1>;
+   };
+   };
-- 
2.15.1

[PATCH 4/7] clk: sunxi-ng: add A64 compatible string

2018-03-16 Thread Icenowy Zheng

As claiming Allwinner A64 SRAM C is a prerequisite for all sub-blocks of
the A64 DE2, not only the CCU sub-block, a bus driver is then written for
enabling the access to the whole DE2 part by claiming the SRAM.

In this situation, the A64 compatible string is simply added with no
other requirements, as they're processed by the parent bus driver.

Signed-off-by: Icenowy Zheng 
---
 drivers/clk/sunxi-ng/ccu-sun8i-de2.c | 11 ---
 1 file changed, 4 insertions(+), 7 deletions(-)

diff --git a/drivers/clk/sunxi-ng/ccu-sun8i-de2.c b/drivers/clk/sunxi-ng/ccu-sun8i-de2.c
index 468d1abaf0ee..8df7cd93453e 100644
--- a/drivers/clk/sunxi-ng/ccu-sun8i-de2.c
+++ b/drivers/clk/sunxi-ng/ccu-sun8i-de2.c
@@ -292,13 +292,10 @@ static const struct of_device_id sunxi_de2_clk_ids[] = {
.compatible = "allwinner,sun50i-h5-de2-clk",
.data = &sun50i_a64_de2_clk_desc,
},
-   /*
-* The Allwinner A64 SoC needs some bit to be poke in syscon to make
-* DE2 really working.
-* So there's currently no A64 compatible here.
-* H5 shares the same reset line with A64, so here H5 is using the
-* clock description of A64.
-*/
+   {
+   .compatible = "allwinner,sun50i-a64-de2-clk",
+   .data = &sun50i_a64_de2_clk_desc,
+   },
{ }
 };
 
-- 
2.15.1
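
The effect of the match table change above, with the H5 and A64 compatible
strings pointing at the same clock description, can be modelled with a
plain lookup table. This is a user-space sketch: struct clk_desc, struct
match_entry and match_compatible() are hypothetical simplifications of
of_device_id and the OF matching code, not the kernel implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Placeholder for the real clock description structure. */
struct clk_desc {
	const char *name;
};

static const struct clk_desc sun50i_a64_de2_clk_desc = { "sun50i-a64-de2" };

/* Simplified stand-in for struct of_device_id. */
struct match_entry {
	const char *compatible;
	const struct clk_desc *data;
};

static const struct match_entry de2_clk_ids[] = {
	{ "allwinner,sun50i-h5-de2-clk",  &sun50i_a64_de2_clk_desc },
	{ "allwinner,sun50i-a64-de2-clk", &sun50i_a64_de2_clk_desc },
	{ NULL, NULL }	/* sentinel */
};

/* Walk the table up to the sentinel and return the matching description. */
static const struct clk_desc *match_compatible(const char *compatible)
{
	const struct match_entry *m;

	assert(compatible);
	for (m = de2_clk_ids; m->compatible; m++)
		if (strcmp(m->compatible, compatible) == 0)
			return m->data;
	return NULL;
}
```

Because the SRAM claim is handled by the parent bus, the A64 entry needs
nothing beyond its own compatible string.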

[PATCH 3/7] bus: add bus driver for accessing Allwinner A64 DE2

2018-03-16 Thread Icenowy Zheng

The "Display Engine 2.0" (usually called DE2) on the Allwinner A64 SoC
is different from the ones on other Allwinner SoCs. It requires a SRAM
region to be claimed, otherwise all DE2 subblocks won't be accessible.

Add a bus driver for the Allwinner A64 DE2 part which claims the SRAM
region when probing.

Signed-off-by: Icenowy Zheng 
---
 drivers/bus/Kconfig  | 10 ++
 drivers/bus/Makefile |  1 +
 drivers/bus/sun50i-de2.c | 49 
 3 files changed, 60 insertions(+)
 create mode 100644 drivers/bus/sun50i-de2.c

diff --git a/drivers/bus/Kconfig b/drivers/bus/Kconfig
index ff70850031c5..cc8e4b4b6b59 100644
--- a/drivers/bus/Kconfig
+++ b/drivers/bus/Kconfig
@@ -95,6 +95,16 @@ config SIMPLE_PM_BUS
  Controller (BSC, sometimes called "LBSC within Bus Bridge", or
  "External Bus Interface") as found on several Renesas ARM SoCs.
 
+config SUN50I_DE2_BUS
+   bool "Allwinner A64 DE2 Bus Driver"
+ default ARM64
+ depends on ARCH_SUNXI
+ select SUNXI_SRAM
+ help
+ Say y here to enable support for Allwinner A64 DE2 bus driver. It's
+ mostly transparent, but a SRAM region needs to be claimed in the SRAM
+ controller to make all the blocks in the DE2 part accessible.
+
 config SUNXI_RSB
tristate "Allwinner sunXi Reduced Serial Bus Driver"
  default MACH_SUN8I || MACH_SUN9I || ARM64
diff --git a/drivers/bus/Makefile b/drivers/bus/Makefile
index 3d473b8adeac..746ff0cebe10 100644
--- a/drivers/bus/Makefile
+++ b/drivers/bus/Makefile
@@ -19,6 +19,7 @@ obj-$(CONFIG_OMAP_INTERCONNECT)   += omap_l3_smx.o omap_l3_noc.o
 
 obj-$(CONFIG_OMAP_OCP2SCP) += omap-ocp2scp.o
 obj-$(CONFIG_QCOM_EBI2)+= qcom-ebi2.o
+obj-$(CONFIG_SUN50I_DE2_BUS)   += sun50i-de2.o
 obj-$(CONFIG_SUNXI_RSB)+= sunxi-rsb.o
 obj-$(CONFIG_SIMPLE_PM_BUS)+= simple-pm-bus.o
 obj-$(CONFIG_TEGRA_ACONNECT)   += tegra-aconnect.o
diff --git a/drivers/bus/sun50i-de2.c b/drivers/bus/sun50i-de2.c
new file mode 100644
index ..836828ef96d5
--- /dev/null
+++ b/drivers/bus/sun50i-de2.c
@@ -0,0 +1,49 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Allwinner A64 Display Engine 2.0 Bus Driver
+ *
+ * Copyright (C) 2018 Icenowy Zheng 
+ */
+
+#include <linux/of_platform.h>
+#include <linux/platform_device.h>
+#include <linux/soc/sunxi/sunxi_sram.h>
+
+static int sun50i_de2_bus_probe(struct platform_device *pdev)
+{
+   struct device_node *np = pdev->dev.of_node;
+   int ret;
+
+   ret = sunxi_sram_claim(&pdev->dev);
+   if (ret) {
+   dev_err(&pdev->dev, "Error couldn't map SRAM to device\n");
+   return ret;
+   }
+
+   if (np)
+   of_platform_populate(np, NULL, NULL, &pdev->dev);
+
+   return 0;
+}
+
+static int sun50i_de2_bus_remove(struct platform_device *pdev)
+{
+   sunxi_sram_release(&pdev->dev);
+   return 0;
+}
+
+static const struct of_device_id sun50i_de2_bus_of_match[] = {
+   { .compatible = "allwinner,sun50i-a64-de2", },
+   { /* sentinel */ }
+};
+
+static struct platform_driver sun50i_de2_bus_driver = {
+   .probe = sun50i_de2_bus_probe,
+   .remove = sun50i_de2_bus_remove,
+   .driver = {
+   .name = "sun50i-de2-bus",
+   .of_match_table = sun50i_de2_bus_of_match,
+   },
+};
+
+builtin_platform_driver(sun50i_de2_bus_driver);
-- 
2.15.1

[PATCH 7/7] arm64: allwinner: a64: add HDMI regulator to all DTs' simplefb_hdmi

2018-03-16 Thread Icenowy Zheng

On the usual A64 board design, the HDMI controller is powered from
DLDO1 of the AXP803 PMIC. If this regulator is shut down, the HDMI
output will be blank. Therefore the simplefb driver should keep this
regulator on.

Add the regulator to all currently available A64 boards' simplefb_hdmi
device node.

Signed-off-by: Icenowy Zheng 
---
 arch/arm64/boot/dts/allwinner/sun50i-a64-bananapi-m64.dts | 4 
 arch/arm64/boot/dts/allwinner/sun50i-a64-nanopi-a64.dts   | 4 
 arch/arm64/boot/dts/allwinner/sun50i-a64-olinuxino.dts| 4 
 arch/arm64/boot/dts/allwinner/sun50i-a64-orangepi-win.dts | 4 
 arch/arm64/boot/dts/allwinner/sun50i-a64-pine64.dts   | 4 
 arch/arm64/boot/dts/allwinner/sun50i-a64-sopine-baseboard.dts | 4 
 6 files changed, 24 insertions(+)

diff --git a/arch/arm64/boot/dts/allwinner/sun50i-a64-bananapi-m64.dts b/arch/arm64/boot/dts/allwinner/sun50i-a64-bananapi-m64.dts
index 2250dec9974c..2fd343512d41 100644
--- a/arch/arm64/boot/dts/allwinner/sun50i-a64-bananapi-m64.dts
+++ b/arch/arm64/boot/dts/allwinner/sun50i-a64-bananapi-m64.dts
@@ -282,6 +282,10 @@
regulator-name = "vcc-rtc";
 };
 
+&simplefb_hdmi {
+   vcc-hdmi-supply = <&reg_dldo1>;
+};
+
 &uart0 {
pinctrl-names = "default";
pinctrl-0 = <&uart0_pins_a>;
diff --git a/arch/arm64/boot/dts/allwinner/sun50i-a64-nanopi-a64.dts b/arch/arm64/boot/dts/allwinner/sun50i-a64-nanopi-a64.dts
index e2dce48fa29a..98dbff19f5cc 100644
--- a/arch/arm64/boot/dts/allwinner/sun50i-a64-nanopi-a64.dts
+++ b/arch/arm64/boot/dts/allwinner/sun50i-a64-nanopi-a64.dts
@@ -195,6 +195,10 @@
regulator-name = "vcc-rtc";
 };
 
+&simplefb_hdmi {
+   vcc-hdmi-supply = <&reg_dldo1>;
+};
+
 &uart0 {
pinctrl-names = "default";
pinctrl-0 = <&uart0_pins_a>;
diff --git a/arch/arm64/boot/dts/allwinner/sun50i-a64-olinuxino.dts b/arch/arm64/boot/dts/allwinner/sun50i-a64-olinuxino.dts
index 3b3081b10ecb..3f531393eaee 100644
--- a/arch/arm64/boot/dts/allwinner/sun50i-a64-olinuxino.dts
+++ b/arch/arm64/boot/dts/allwinner/sun50i-a64-olinuxino.dts
@@ -214,6 +214,10 @@
regulator-name = "vcc-rtc";
 };
 
+&simplefb_hdmi {
+   vcc-hdmi-supply = <&reg_dldo1>;
+};
+
 &uart0 {
pinctrl-names = "default";
pinctrl-0 = <&uart0_pins_a>;
diff --git a/arch/arm64/boot/dts/allwinner/sun50i-a64-orangepi-win.dts b/arch/arm64/boot/dts/allwinner/sun50i-a64-orangepi-win.dts
index bf42690a3361..1221764f5719 100644
--- a/arch/arm64/boot/dts/allwinner/sun50i-a64-orangepi-win.dts
+++ b/arch/arm64/boot/dts/allwinner/sun50i-a64-orangepi-win.dts
@@ -191,6 +191,10 @@
regulator-name = "vcc-rtc";
 };
 
+&simplefb_hdmi {
+   vcc-hdmi-supply = <&reg_dldo1>;
+};
+
 &uart0 {
pinctrl-names = "default";
pinctrl-0 = <&uart0_pins_a>;
diff --git a/arch/arm64/boot/dts/allwinner/sun50i-a64-pine64.dts b/arch/arm64/boot/dts/allwinner/sun50i-a64-pine64.dts
index a75825798a71..1b9b92e541d2 100644
--- a/arch/arm64/boot/dts/allwinner/sun50i-a64-pine64.dts
+++ b/arch/arm64/boot/dts/allwinner/sun50i-a64-pine64.dts
@@ -229,6 +229,10 @@
regulator-name = "vcc-rtc";
 };
 
+&simplefb_hdmi {
+   vcc-hdmi-supply = <&reg_dldo1>;
+};
+
 /* On Euler connector */
 &spdif {
status = "disabled";
diff --git a/arch/arm64/boot/dts/allwinner/sun50i-a64-sopine-baseboard.dts b/arch/arm64/boot/dts/allwinner/sun50i-a64-sopine-baseboard.dts
index abe179de35d7..c21f2331add6 100644
--- a/arch/arm64/boot/dts/allwinner/sun50i-a64-sopine-baseboard.dts
+++ b/arch/arm64/boot/dts/allwinner/sun50i-a64-sopine-baseboard.dts
@@ -134,6 +134,10 @@
regulator-name = "vcc-wifi";
 };
 
+&simplefb_hdmi {
+   vcc-hdmi-supply = <&reg_dldo1>;
+};
+
 &uart0 {
pinctrl-names = "default";
pinctrl-0 = <&uart0_pins_a>;
-- 
2.15.1

[PATCH 6/7] arm64: allwinner: a64: add simplefb for A64 SoC

2018-03-16 Thread Icenowy Zheng

The A64 SoC features two display pipelines: one has an LCD output, the
other has an HDMI output.

Add support for simplefb for these pipelines on A64 SoC.

Signed-off-by: Icenowy Zheng 
---
 arch/arm64/boot/dts/allwinner/sun50i-a64.dtsi | 26 ++
 1 file changed, 26 insertions(+)

diff --git a/arch/arm64/boot/dts/allwinner/sun50i-a64.dtsi b/arch/arm64/boot/dts/allwinner/sun50i-a64.dtsi
index 1f92015503ea..7767d0761b2e 100644
--- a/arch/arm64/boot/dts/allwinner/sun50i-a64.dtsi
+++ b/arch/arm64/boot/dts/allwinner/sun50i-a64.dtsi
@@ -42,9 +42,11 @@
  * OTHER DEALINGS IN THE SOFTWARE.
  */
 
+#include 
 #include 
 #include 
 #include 
+#include 
 #include 
 
 / {
@@ -52,6 +54,30 @@
#address-cells = <1>;
#size-cells = <1>;
 
+   chosen {
+   #address-cells = <1>;
+   #size-cells = <1>;
+   ranges;
+
+   simplefb_lcd: framebuffer-lcd {
+   compatible = "allwinner,simple-framebuffer",
+"simple-framebuffer";
+   allwinner,pipeline = "mixer0-lcd0";
+   clocks = <&display_clocks CLK_MIXER0>,
+<&ccu CLK_TCON0>;
+   status = "disabled";
+   };
+
+   simplefb_hdmi: framebuffer-hdmi {
+   compatible = "allwinner,simple-framebuffer",
+"simple-framebuffer";
+   allwinner,pipeline = "mixer1-lcd1-hdmi";
+   clocks = <&display_clocks CLK_MIXER1>,
+<&ccu CLK_TCON1>, <&ccu CLK_HDMI>;
+   status = "disabled";
+   };
+   };
+
cpus {
#address-cells = <1>;
#size-cells = <0>;
-- 
2.15.1

[PATCH 5/7] arm64: allwinner: a64: add DE2 CCU related device tree nodes

2018-03-16 Thread Icenowy Zheng

As we have all necessary parts to enable the DE2 CCU on the Allwinner
A64 SoC, add the needed device tree nodes, including the SRAM controller
node, SRAM C node, DE2 bus node and DE2 CCU node.

Signed-off-by: Icenowy Zheng 
---
 arch/arm64/boot/dts/allwinner/sun50i-a64.dtsi | 42 +++
 1 file changed, 42 insertions(+)

diff --git a/arch/arm64/boot/dts/allwinner/sun50i-a64.dtsi b/arch/arm64/boot/dts/allwinner/sun50i-a64.dtsi
index 1b6dc31e7d91..1f92015503ea 100644
--- a/arch/arm64/boot/dts/allwinner/sun50i-a64.dtsi
+++ b/arch/arm64/boot/dts/allwinner/sun50i-a64.dtsi
@@ -148,6 +148,48 @@
#size-cells = <1>;
ranges;
 
+   de2@100 {
+   compatible = "allwinner,sun50i-a64-de2";
+   reg = <0x100 0x40>;
+   allwinner,sram = <&de2_sram 1>;
+   #address-cells = <1>;
+   #size-cells = <1>;
+   ranges = <0 0x100 0x40>;
+
+   display_clocks: clock@0 {
+   compatible = "allwinner,sun50i-a64-de2-clk";
+   reg = <0x0 0x10>;
+   clocks = <&ccu CLK_DE>,
+<&ccu CLK_BUS_DE>;
+   clock-names = "mod",
+ "bus";
+   resets = <&ccu RST_BUS_DE>;
+   #clock-cells = <1>;
+   #reset-cells = <1>;
+   };
+   };
+
+   sram-controller@1c0 {
+   compatible = "allwinner,sun50i-a64-sram-controller";
+   reg = <0x01c0 0x1000>;
+   #address-cells = <1>;
+   #size-cells = <1>;
+   ranges;
+
+   sram_c: sram@18000 {
+   compatible = "mmio-sram";
+   reg = <0x00018000 0x28000>;
+   #address-cells = <1>;
+   #size-cells = <1>;
+   ranges = <0 0x00018000 0x28000>;
+
+   de2_sram: sram-section@0 {
+   compatible = "allwinner,sun50i-a64-sram-c";
+   reg = <0x 0x28000>;
+   };
+   };
+   };
+
syscon: syscon@1c0 {
compatible = "allwinner,sun50i-a64-system-controller",
"syscon";
-- 
2.15.1

Re: [tip:perf/core 1/2] drivers//perf/qcom_l2_pmu.c:598:13: error: invalid storage class for function 'l2_cache_event_start'

2018-03-16 Thread Peter Zijlstra

On Sat, Mar 17, 2018 at 01:19:54AM +0800, kbuild test robot wrote:
> tree:   https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git perf/core
> head:   bbb68468641547d56c83012670bcaf77f3dacd64
> commit: 7eb709f29593aced51901cb53565477762800722 [1/2] perf: Fix sibling iteration

ARGH.. with the below folded in, it builds again. But let's see if the
0day finds more failures.


diff --git a/drivers/perf/qcom_l2_pmu.c b/drivers/perf/qcom_l2_pmu.c
index 1a7889f63c9a..842135cf35a3 100644
--- a/drivers/perf/qcom_l2_pmu.c
+++ b/drivers/perf/qcom_l2_pmu.c
@@ -541,6 +541,7 @@ static int l2_cache_event_init(struct perf_event *event)
 "Can't create mixed PMU group\n");
return -EINVAL;
}
+   }
 
cluster = get_cluster_pmu(l2cache_pmu, event->cpu);
if (!cluster) {

Re: arc_usr_cmpxchg and preemption

2018-03-16 Thread Peter Zijlstra

On Fri, Mar 16, 2018 at 10:54:52AM -0700, Vineet Gupta wrote:
> I'd say just bite the bullet, write the patch and we can refine it there !

Just be glad it's not futex.c proper ;-) I'll try and have a look later..

Re: [PATCH v4 1/5] PCI: endpoint: BAR width should not depend on sizeof dma_addr_t

2018-03-16 Thread Lorenzo Pieralisi

On Thu, Mar 08, 2018 at 02:33:26PM +0100, Niklas Cassel wrote:
> If a BAR supports 64-bit width or not depends on the hardware,
> and should thus not depend on sizeof(dma_addr_t).
> 
> Since this driver is generic, default to always using BAR width
> of 32-bits. 64-bit BARs can easily be tested by replacing
> PCI_BASE_ADDRESS_MEM_TYPE_32 with PCI_BASE_ADDRESS_MEM_TYPE_64
> in bar_flags.
> 
> Signed-off-by: Niklas Cassel 
> ---
> Note to Lorenzo/Bjorn:
> It is not trivial to convert the bar_size + bar_flags +
> struct pci_epf->bar member array to an array of struct resources,
> since we need to be able to store the addresses returned
> by dma_alloc_coherent(), which is of type dma_addr_t.
> struct resource uses resource_size_t, which is defined as phys_addr_t.
> E.g. ARTPEC-7 uses 64-bit dma_addr_t, but only 32-bit phys_addr_t.
> 
>  drivers/pci/endpoint/functions/pci-epf-test.c | 15 +--
>  1 file changed, 9 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/pci/endpoint/functions/pci-epf-test.c 
> b/drivers/pci/endpoint/functions/pci-epf-test.c
> index 800da09d9005..7c70433b11a7 100644
> --- a/drivers/pci/endpoint/functions/pci-epf-test.c
> +++ b/drivers/pci/endpoint/functions/pci-epf-test.c
> @@ -71,6 +71,14 @@ struct pci_epf_test_data {
>  };
>  
>  static int bar_size[] = { 512, 512, 1024, 16384, 131072, 1048576 };
> +static int bar_flags[] = {
> + PCI_BASE_ADDRESS_SPACE_MEMORY | PCI_BASE_ADDRESS_MEM_TYPE_32,
> + PCI_BASE_ADDRESS_SPACE_MEMORY | PCI_BASE_ADDRESS_MEM_TYPE_32,
> + PCI_BASE_ADDRESS_SPACE_MEMORY | PCI_BASE_ADDRESS_MEM_TYPE_32,
> + PCI_BASE_ADDRESS_SPACE_MEMORY | PCI_BASE_ADDRESS_MEM_TYPE_32,
> + PCI_BASE_ADDRESS_SPACE_MEMORY | PCI_BASE_ADDRESS_MEM_TYPE_32,
> + PCI_BASE_ADDRESS_SPACE_MEMORY | PCI_BASE_ADDRESS_MEM_TYPE_32
> +};

Niklas,

I think you are almost there. I have one question to address, though,
that could even simplify the patchset.

If, according to your own commit logs (and my reading of the code), the
Cadence driver makes a decision on the BAR size just by checking the
corresponding region size (I would be happy to hear the reason
underpinning that choice, BTW), why can't we do the same for DWC (i.e.
let the DWC driver decide whether a BAR should be 64 or 32 bits)?

This would mean that in this patch we would not bother about the BAR
32/64 size flag at all.

Thoughts ?

Lorenzo

>  
>  static int pci_epf_test_copy(struct pci_epf_test *epf_test)
>  {
> @@ -358,7 +366,6 @@ static void pci_epf_test_unbind(struct pci_epf *epf)
>  
>  static int pci_epf_test_set_bar(struct pci_epf *epf)
>  {
> - int flags;
>   int bar;
>   int ret;
>   struct pci_epf_bar *epf_bar;
> @@ -367,15 +374,11 @@ static int pci_epf_test_set_bar(struct pci_epf *epf)
>   struct pci_epf_test *epf_test = epf_get_drvdata(epf);
>   enum pci_barno test_reg_bar = epf_test->test_reg_bar;
>  
> - flags = PCI_BASE_ADDRESS_SPACE_MEMORY | PCI_BASE_ADDRESS_MEM_TYPE_32;
> - if (sizeof(dma_addr_t) == 0x8)
> - flags |= PCI_BASE_ADDRESS_MEM_TYPE_64;
> -
>   for (bar = BAR_0; bar <= BAR_5; bar++) {
>   epf_bar = &epf->bar[bar];
>   ret = pci_epc_set_bar(epc, epf->func_no, bar,
> epf_bar->phys_addr,
> -   epf_bar->size, flags);
> +   epf_bar->size, bar_flags[bar]);
>   if (ret) {
>   pci_epf_free_space(epf, epf_test->reg[bar], bar);
>   dev_err(dev, "failed to set BAR%d\n", bar);
> -- 
> 2.14.2
>

Re: Bug: Microblaze stopped booting after 0fa1c579349fdd90173381712ad78aa99c09d38b

2018-03-16 Thread Michal Simek

On 16.3.2018 16:18, Rob Herring wrote:
> On Wed, Mar 14, 2018 at 10:04 AM, Michal Simek  wrote:
>> On 12.3.2018 11:21, Michal Simek wrote:
>>> On 12.3.2018 08:52, Alvaro G. M. wrote:
>>>> On Fri, Mar 09, 2018 at 01:05:11PM -0600, Rob Herring wrote:
>>>>> On Fri, Mar 9, 2018 at 6:51 AM, Alvaro G. M.
>>>>> wrote:
>>>>>> Hi,
>>>>>>
>>>>>> I've found via git bisect that 0fa1c579349fdd90173381712ad78aa99c09d38b
>>>>>> makes microblaze unbootable.
>>>>>>
>>>>>> I'm sorry I can't provide any console output, as nothing appears at all,
>>>>>> even when setting earlyprintk (or at least I wasn't able to get anything
>>>>>> back!).
>>>>>
>>>>> Ah, looks like microblaze doesn't set CONFIG_NO_BOOTMEM and so
>>>>> memblock_virt_alloc() doesn't work for CONFIG_HAVE_MEMBLOCK &&
>>>>> !CONFIG_NO_BOOTMEM. AFAICT, microblaze doesn't really need bootmem and
>>>>> it can be removed, but I'm still investigating. Can you try out this
>>>>> branch[1].
>>>>>
>>>>> Rob
>>>>>
>>>>> [1] git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux.git
>>>>> microblaze-fixes
>>>>
>>>> Hi, Rob!
>>>>
>>>> This branch does indeed solve the issue. My microblaze system is now
>>>> booting as it did before, and everything seems normal now. Thanks!
>>>>
>>>> Tested-by: Alvaro Gamez Machado

>>>
>>> I have tested it and I can also confirm that your two patches are fixing
>>> issue with
>>>
>>> 809d0e2c: 00240034 2600 746f6f62 206d656d4.$&bootmem
>>> 809d0e3c: 6f6c6c61 666f2063 33353220 62203830alloc of 25308 b
>>> 809d0e4c: 73657479 69616620 2164656c ytes failed!
>>> 809d0e5c:  0029003c 0600 6e72654b<.).Kern
>>> 809d0e6c: 70206c65 63696e61 6e202d20 7320746fel panic - not s
>>>
>>> Can you please update that second commit with reasonable description and
>>> send it out? I will take it via my tree and will send pull request to Linus.
>>>
>>
>> I couldn't wait to fix current issue till 4.16 is done that's why I have
>> sent that patches with updated commit message to lkml.
> 
> Thanks for writing my commit msg. :) I got distracted looking at
> whether other arches got broken too and didn't get this sent out.
> 
> BTW, there is a more simple fix of just moving setup_memory() call to
> before unflattening if you prefer for 4.16.

I have sent a pull request to Linus with your two patches; a lot of
architectures are setting this up, so that's not a problem. I also need
to look at the rest of your patches and at how the in-kernel dtb is
handled, because there is a size limit which was fine for years but we
have reached the case where it is not enough. A simple extension is easy
but not a generic solution.

Thanks,
Michal

-- 
Michal Simek, Ing. (M.Eng), OpenPGP -> KeyID: FE3D1F91
w: www.monstr.eu p: +42-0-721842854
Maintainer of Linux kernel - Xilinx Microblaze
Maintainer of Linux kernel - Xilinx Zynq ARM and ZynqMP ARM64 SoCs
U-Boot custodian - Xilinx Microblaze/Zynq/ZynqMP SoCs





Re: [ANNOUNCE] Git v2.17.0-rc0

2018-03-16 Thread Junio C Hamano

Ævar Arnfjörð Bjarmason  writes:

> On Fri, Mar 16 2018, Junio C. Hamano jotted:
>
>>   gitweb: hard-depend on the Digest::MD5 5.8 module
>
> I've just noticed this now, but while this module is in 5.8 RedHat's
> butchered perl doesn't have it in the base system, thus this introduces
> the do-we-even-care regression that git's full test suite won't pass on
> a RedHat (or CentOS) base system, because the gitweb tests will fail to
> "use" Digest::MD5.
>
> I'm slightly leaning towards not caring about it, since there's no other
> perl distributor that does this sort of split-out of the core, and if
> you're on a RedHat system they're solving your package problems, so this
> really only impacts the edge case of git developers and redhat
> packagers, both of whom can just do "yum install -y perl-Digest-MD5" to
> fix it.

Thanks for noting.  I agree that this is not something that requires
more than a mention near the beginning of release notes.

I haven't wordsmithed it fully, but it should say something along
the lines of ...

 Documentation/RelNotes/2.16.0.txt | 10 ++
 1 file changed, 10 insertions(+)

diff --git a/Documentation/RelNotes/2.16.0.txt 
b/Documentation/RelNotes/2.16.0.txt
index 8f0461eefd..8b4c24200b 100644
--- a/Documentation/RelNotes/2.16.0.txt
+++ b/Documentation/RelNotes/2.16.0.txt
@@ -6,6 +6,16 @@ Backward compatibility notes and other notable changes.
  * Use of an empty string as a pathspec element that is used for
'everything matches' is now an error.
 
+ * Parts of Git that depend on Perl have required at least Perl 5.8
+   since Git v1.7.4 released in 2010, but we used to assume some core
+   modules from Perl distribution may not exist on the system and did
+   a conditional "eval { require <> }"; we no longer do this.
+   On a platform that ships a stripped-down Perl by default, the user
+   may have to install modules the platform chooses not to ship as
+   part of its core (e.g. Digest::MD5, File::Temp, File::Spec,
+   Net::SMTP, Net::Domain).  RedHat/CentOS excludes Digest::MD5 from
+   its base installation, for example.
+
 
 Updates since v2.15
 ---

Re: [PATCH 4.4 00/63] 4.4.122-stable review

2018-03-16 Thread Nathan Chancellor

On Fri, Mar 16, 2018 at 04:22:32PM +0100, Greg Kroah-Hartman wrote:
> This is the start of the stable review cycle for the 4.4.122 release.
> There are 63 patches in this series, all will be posted as a response
> to this one.  If anyone has any issues with these being applied, please
> let me know.
> 
> Responses should be made by Sun Mar 18 15:22:41 UTC 2018.
> Anything received after that time might be too late.
> 
> The whole patch series can be found in one patch at:
>   
> https://www.kernel.org/pub/linux/kernel/v4.x/stable-review/patch-4.4.122-rc1.gz
> or in the git tree and branch at:
>   
> git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git 
> linux-4.4.y
> and the diffstat can be found below.
> 
> thanks,
> 
> greg k-h
>

Merged, compiled, and flashed onto my OnePlus 5 (unfortunately, my Pixel
2 XL is going in for an RMA).

No initial issues noticed in general usage or dmesg.

Thanks!
Nathan

Re: [PATCH v9 08/61] page cache: Use xa_lock

2018-03-16 Thread Josef Bacik

On Tue, Mar 13, 2018 at 06:25:46AM -0700, Matthew Wilcox wrote:
> From: Matthew Wilcox 
> 
> Remove the address_space ->tree_lock and use the xa_lock newly added to
> the radix_tree_root.  Rename the address_space ->page_tree to ->i_pages,
> since we don't really care that it's a tree.
> 
> Signed-off-by: Matthew Wilcox 
> Acked-by: Jeff Layton 

Man my eyes started to glaze over about halfway through this one

Reviewed-by: Josef Bacik 

Thanks,

Josef

Re: [ANNOUNCE] Git v2.17.0-rc0

2018-03-16 Thread Junio C Hamano

Junio C Hamano  writes:

> I haven't wordsmithed it fully, but it should say something along
> the lines of ...
>
>  Documentation/RelNotes/2.16.0.txt | 10 ++
>  1 file changed, 10 insertions(+)

Eh, of course the addition should go to 2.17 release notes ;-)  I
just happened to be reviewing a topic forked earlier.

Re: arc_usr_cmpxchg and preemption

2018-03-16 Thread Max Filippov

> On Thu, Mar 15, 2018 at 12:03 PM, Alexey Brodkin 
>  wrote:
> Here's a brief analysis:
> ARM:  Looks like they got rid of that stuff in v4.4, see
>   commit db695c0509d6 ("ARM: remove user cmpxchg syscall").
>
> M68K: That's even uglier implementation which is really asking for
>   a facelift, look at sys_atomic_cmpxchg_32() here:
>   
> https://elixir.bootlin.com/linux/latest/source/arch/m68k/kernel/sys_m68k.c#L461
>
> MIPS: They do it via special sysmips syscall which among other things
>   might handle MIPS_ATOMIC_SET with mips_atomic_set()
>
> I don't immediately see if there are others, but really I'm not sure it's
> even worth trying to clean up all that since the effort might be spent
> pointlessly.

xtensa is another one. We used to have a buggy implementation in
arch/xtensa/kernel/entry.S:fast_syscall_xtensa which we still keep
disabled by default, just in case somebody wanted backwards
compatibility. I don't think it's worth fixing.

-- 
Thanks.
-- Max

[PATCH v5 3/9] sysctl: Warn when a clamped sysctl parameter is set out of range

2018-03-16 Thread Waiman Long

Even with clamped sysctl parameters, it is still not that
straightforward to figure out the exact range of those parameters. One
may try to write extreme parameter values to see if they get clamped.
To make it easier, a warning with the expected range will now be
printed into the kernel ring buffer when a clamped sysctl parameter
receives an out-of-range value.

The pr_warn_ratelimited() macro is used to limit the number of warning
messages that can be printed within a given period of time.
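
As a rough userspace illustration of the message described above, the
sketch below formats the same text into a buffer. It is only an
assumption-laden stand-in: demo_format_clamp_warning() is a hypothetical
helper, snprintf() replaces pr_warn_ratelimited(), and no rate limiting
is modelled.

```c
#include <assert.h>
#include <limits.h>
#include <stdio.h>

/*
 * Hypothetical helper: format the out-of-range notice into buf.
 * A NULL min or max is reported as -INT_MAX or INT_MAX, mirroring the
 * fallback used in the patch. Returns the snprintf() result.
 */
static int demo_format_clamp_warning(char *buf, size_t len, const char *name,
                                     const int *min, const int *max, int val)
{
    assert(buf && name);
    return snprintf(buf, len,
                    "\"%s\" was set out of range [%d, %d], clamped to %d.",
                    name,
                    min ? *min : -INT_MAX,
                    max ? *max : INT_MAX, val);
}
```

Printing the real bounds in the message is what saves the user from
probing the limits by trial and error.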

Signed-off-by: Waiman Long 
---
 kernel/sysctl.c | 44 
 1 file changed, 36 insertions(+), 8 deletions(-)

diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index af351ed..a9e3ed4 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -17,6 +17,7 @@
  * The list_for_each() macro wasn't appropriate for the sysctl loop.
  *  Removed it and replaced it with older style, 03/23/00, Bill Wendling
  */
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
 
 #include 
 #include 
@@ -2505,6 +2506,7 @@ static int proc_dointvec_minmax_sysadmin(struct ctl_table 
*table, int write,
  * @min: pointer to minimum allowable value
  * @max: pointer to maximum allowable value
  * @flags: pointer to flags
+ * @name: sysctl parameter name
  *
  * The do_proc_dointvec_minmax_conv_param structure provides the
  * minimum and maximum values for doing range checking for those sysctl
@@ -2514,6 +2516,7 @@ struct do_proc_dointvec_minmax_conv_param {
int *min;
int *max;
uint16_t *flags;
+   const char *name;
 };
 
 static int do_proc_dointvec_minmax_conv(bool *negp, unsigned long *lvalp,
@@ -2521,24 +2524,35 @@ static int do_proc_dointvec_minmax_conv(bool *negp, 
unsigned long *lvalp,
int write, void *data)
 {
struct do_proc_dointvec_minmax_conv_param *param = data;
+
if (write) {
int val = *negp ? -*lvalp : *lvalp;
+   bool clamped = false;
bool clamp = param->flags &&
   (*param->flags & CTL_FLAGS_CLAMP_RANGE);
 
if (param->min && *param->min > val) {
-   if (clamp)
+   if (clamp) {
val = *param->min;
-   else
+   clamped = true;
+   } else {
return -EINVAL;
+   }
}
if (param->max && *param->max < val) {
-   if (clamp)
+   if (clamp) {
val = *param->max;
-   else
+   clamped = true;
+   } else {
return -EINVAL;
+   }
}
*valp = val;
+   if (clamped && param->name)
+   pr_warn_ratelimited("\"%s\" was set out of range [%d, 
%d], clamped to %d.\n",
+   param->name,
+   param->min ? *param->min : -INT_MAX,
+   param->max ? *param->max :  INT_MAX, val);
} else {
int val = *valp;
if (val < 0) {
@@ -2576,6 +2590,7 @@ int proc_dointvec_minmax(struct ctl_table *table, int 
write,
.min = (int *) table->extra1,
.max = (int *) table->extra2,
.flags = &table->flags,
+   .name  = table->procname,
};
return do_proc_dointvec(table, write, buffer, lenp, ppos,
 do_proc_dointvec_minmax_conv, &param);
@@ -2586,6 +2601,7 @@ int proc_dointvec_minmax(struct ctl_table *table, int 
write,
  * @min: pointer to minimum allowable value
  * @max: pointer to maximum allowable value
  * @flags: pointer to flags
+ * @name: sysctl parameter name
  *
  * The do_proc_douintvec_minmax_conv_param structure provides the
  * minimum and maximum values for doing range checking for those sysctl
@@ -2595,6 +2611,7 @@ struct do_proc_douintvec_minmax_conv_param {
unsigned int *min;
unsigned int *max;
uint16_t *flags;
+   const char *name;
 };
 
 static int do_proc_douintvec_minmax_conv(unsigned long *lvalp,
@@ -2605,6 +2622,7 @@ static int do_proc_douintvec_minmax_conv(unsigned long 
*lvalp,
 
if (write) {
unsigned int val = *lvalp;
+   bool clamped = false;
bool clamp = param->flags &&
   (*param->flags & CTL_FLAGS_CLAMP_RANGE);
 
@@ -2612,18 +2630,27 @@ static int do_proc_douintvec_minmax_conv(unsigned long 
*lvalp,
return -EINVAL;
 
if (param->min && *param->min > val) {
-   if (clamp)
+   if (clamp) {
val = *param->min;
-   else
+   cla

[PATCH v5 2/9] proc/sysctl: Provide additional ctl_table.flags checks

2018-03-16 Thread Waiman Long

Checking code is added to provide the following additional
ctl_table.flags checks:

 1) No unknown flag is allowed.
 2) Minimum of a range cannot be larger than the maximum value.
 3) The signed and unsigned flags are mutually exclusive.
 4) The proc_handler should be consistent with the signed or unsigned
flags.

Two new flags are added to indicate if the min/max values are signed
or unsigned - CTL_FLAGS_SIGNED_RANGE & CTL_FLAGS_UNSIGNED_RANGE.
These two flags can optionally be enabled for range checking purposes,
but one of them must be set whenever CTL_FLAGS_CLAMP_RANGE is set.
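
Checks 1) to 3) listed above can be sketched in plain C roughly as
follows. This is not the kernel code: the DEMO_* bit values and
demo_flags_ok() are hypothetical, the bounds are passed as long
pointers for simplicity, and check 4) (proc_handler consistency) is
left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit assignments; the real values are not shown here. */
#define DEMO_CLAMP_RANGE    (1u << 0)
#define DEMO_SIGNED_RANGE   (1u << 1)
#define DEMO_UNSIGNED_RANGE (1u << 2)
#define DEMO_FLAGS_ALL \
    (DEMO_CLAMP_RANGE | DEMO_SIGNED_RANGE | DEMO_UNSIGNED_RANGE)

/* The flags field is 16 bits wide, so every known flag must fit in it. */
static_assert(DEMO_FLAGS_ALL <= UINT16_MAX, "flags must fit in 16 bits");

/* Returns true when flags and the optional [min, max] pair are consistent. */
static bool demo_flags_ok(uint16_t flags, const long *min, const long *max)
{
    uint16_t sign = DEMO_SIGNED_RANGE | DEMO_UNSIGNED_RANGE;

    if (flags & ~DEMO_FLAGS_ALL)        /* 1) unknown flag */
        return false;
    if ((flags & sign) == sign)         /* 3) signed and unsigned together */
        return false;
    if ((flags & DEMO_CLAMP_RANGE) && !(flags & sign))
        return false;                   /* clamping needs a signedness flag */
    if (min && max) {                   /* 2) min must not exceed max */
        if (flags & DEMO_UNSIGNED_RANGE)
            return (unsigned long)*min <= (unsigned long)*max;
        return *min <= *max;
    }
    return true;
}
```

Comparing the bounds with the signedness named by the flag matters
because the same bit pattern can order differently as signed and as
unsigned.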

Signed-off-by: Waiman Long 
---
 fs/proc/proc_sysctl.c  | 62 ++
 include/linux/sysctl.h | 16 +++--
 2 files changed, 76 insertions(+), 2 deletions(-)

diff --git a/fs/proc/proc_sysctl.c b/fs/proc/proc_sysctl.c
index 493c975..2863ea1 100644
--- a/fs/proc/proc_sysctl.c
+++ b/fs/proc/proc_sysctl.c
@@ -1092,6 +1092,66 @@ static int sysctl_check_table_array(const char *path, 
struct ctl_table *table)
return err;
 }
 
+static int sysctl_check_flags(const char *path, struct ctl_table *table)
+{
+   int err = 0;
+   uint16_t sign_flags = CTL_FLAGS_SIGNED_RANGE|CTL_FLAGS_UNSIGNED_RANGE;
+
+   if ((table->flags & ~CTL_TABLE_FLAGS_ALL) ||
+  ((table->flags & sign_flags) == sign_flags))
+   err = sysctl_err(path, table, "invalid flags");
+
+   if (table->flags & (CTL_FLAGS_CLAMP_RANGE | sign_flags)) {
+   int range_err = 0;
+   bool is_int = (table->maxlen == sizeof(int));
+
+   if (!is_int && (table->maxlen != sizeof(long))) {
+   range_err++;
+   } else if (!table->extra1 || !table->extra2) {
+   /* No min > max checking needed */
+   } else if (table->flags & CTL_FLAGS_UNSIGNED_RANGE) {
+   unsigned long min, max;
+
+   min = is_int ? *(unsigned int *)table->extra1
+: *(unsigned long *)table->extra1;
+   max = is_int ? *(unsigned int *)table->extra2
+: *(unsigned long *)table->extra2;
+   range_err += (min > max);
+   } else if (table->flags & CTL_FLAGS_SIGNED_RANGE) {
+
+   long min, max;
+
+   min = is_int ? *(int *)table->extra1
+: *(long *)table->extra1;
+   max = is_int ? *(int *)table->extra2
+: *(long *)table->extra2;
+   range_err += (min > max);
+   } else {
+   /*
+* Either CTL_FLAGS_UNSIGNED_RANGE or
+* CTL_FLAGS_SIGNED_RANGE should be set.
+*/
+   range_err++;
+   }
+
+   /*
+* proc_handler and flag consistency check.
+*/
+   if (((table->proc_handler == proc_douintvec_minmax)   ||
+(table->proc_handler == proc_doulongvec_minmax)) &&
+   !(table->flags & CTL_FLAGS_UNSIGNED_RANGE))
+   range_err++;
+
+   if ((table->proc_handler == proc_dointvec_minmax) &&
+  !(table->flags & CTL_FLAGS_SIGNED_RANGE))
+   range_err++;
+
+   if (range_err)
+   err |= sysctl_err(path, table, "Invalid range");
+   }
+   return err;
+}
+
 static int sysctl_check_table(const char *path, struct ctl_table *table)
 {
int err = 0;
@@ -,6 +1171,8 @@ static int sysctl_check_table(const char *path, struct 
ctl_table *table)
(table->proc_handler == proc_doulongvec_ms_jiffies_minmax)) 
{
if (!table->data)
err |= sysctl_err(path, table, "No data");
+   if (table->flags)
+   err |= sysctl_check_flags(path, table);
if (!table->maxlen)
err |= sysctl_err(path, table, "No maxlen");
else
diff --git a/include/linux/sysctl.h b/include/linux/sysctl.h
index e446e1f..088f032 100644
--- a/include/linux/sysctl.h
+++ b/include/linux/sysctl.h
@@ -134,14 +134,26 @@ struct ctl_table
  * the input value. No lower bound or upper bound checking will be
  * done if the corresponding minimum or maximum value isn't provided.
  *
+ * @CTL_FLAGS_SIGNED_RANGE: Set to indicate that the extra1 and extra2
+ * fields are pointers to minimum and maximum signed values of
+ * an allowable range.
+ *
+ * @CTL_FLAGS_UNSIGNED_RANGE: Set to indicate that the extra1 and extra2
+ * fields are pointers to minimum and maximum unsigned values of
+ * an allowable range.
+ *
  * At most 16 different flags are allowed.
  */
 enum ctl_t

[PATCH v5 1/9] sysctl: Add flags to support min/max range clamping

2018-03-16 Thread Waiman Long

When minimum/maximum values are specified for a sysctl parameter in
the ctl_table structure with the proc_dointvec_minmax() handler, an
update to that parameter will fail with an error if the given value is
outside of the required range.

There are use cases where it may be better to clamp the value of
the sysctl parameter to the given range without failing the update,
especially if the users are not aware of the actual range limits.
Reading the value back after the update will now be a good practice
to see if the provided value exceeds the range limits.

To provide this less restrictive form of range checking, a new flags
field is added to the ctl_table structure. The new field is a 16-bit
value that just fits into the hole left by the 16-bit umode_t field
without increasing the size of the structure.

When the CTL_FLAGS_CLAMP_RANGE flag is set in the ctl_table
entry, any update from userspace will be clamped to the given
range without error if either the proc_dointvec_minmax() or the
proc_douintvec_minmax() handler is used.

The clamped value is either the maximum or minimum value that is
closest to the input value provided by the user.
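
The difference between the old reject-on-error behaviour and the new
clamping behaviour can be illustrated with a small standalone sketch.
It is only a userspace approximation: demo_store_minmax() and
DEMO_CLAMP_RANGE are hypothetical stand-ins for the handler logic and
for CTL_FLAGS_CLAMP_RANGE.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for CTL_FLAGS_CLAMP_RANGE; the value is illustrative. */
#define DEMO_CLAMP_RANGE (1u << 0)

/*
 * Store val in *out after range checking against optional bounds.
 * A NULL min or max means "no bound on that side".
 * Returns 0 on success, or -EINVAL if val is out of range and
 * clamping is not enabled (in which case *out is left untouched).
 */
static int demo_store_minmax(int val, const int *min, const int *max,
                             unsigned int flags, int *out)
{
    bool clamp = flags & DEMO_CLAMP_RANGE;

    assert(out != NULL);

    if (min && *min > val) {
        if (!clamp)
            return -EINVAL;
        val = *min;     /* pull the value up to the lower bound */
    }
    if (max && *max < val) {
        if (!clamp)
            return -EINVAL;
        val = *max;     /* pull the value down to the upper bound */
    }
    *out = val;
    return 0;
}
```

With bounds of -10 and 10, writing 50 stores 10 when the flag is set
and fails with -EINVAL when it is not.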

Signed-off-by: Waiman Long 
---
 include/linux/sysctl.h | 20 
 kernel/sysctl.c| 48 +++-
 2 files changed, 59 insertions(+), 9 deletions(-)

diff --git a/include/linux/sysctl.h b/include/linux/sysctl.h
index b769ecf..e446e1f 100644
--- a/include/linux/sysctl.h
+++ b/include/linux/sysctl.h
@@ -116,6 +116,7 @@ struct ctl_table
void *data;
int maxlen;
umode_t mode;
+   uint16_t flags;
struct ctl_table *child;/* Deprecated */
proc_handler *proc_handler; /* Callback for text formatting */
struct ctl_table_poll *poll;
@@ -123,6 +124,25 @@ struct ctl_table
void *extra2;
 } __randomize_layout;
 
+/**
+ * enum ctl_table_flags - flags for the ctl table (struct ctl_table.flags)
+ *
+ * @CTL_FLAGS_CLAMP_RANGE: Set to indicate that the entry should be
+ * flexibly clamped to the provided min/max value in case the user
+ * provided a value outside of the given range. The clamped value is
+ * either the provided minimum or maximum value that is closest to
+ * the input value. No lower bound or upper bound checking will be
+ * done if the corresponding minimum or maximum value isn't provided.
+ *
+ * At most 16 different flags are allowed.
+ */
+enum ctl_table_flags {
+   CTL_FLAGS_CLAMP_RANGE   = BIT(0),
+   __CTL_FLAGS_MAX = BIT(1),
+};
+
+#define CTL_TABLE_FLAGS_ALL(__CTL_FLAGS_MAX - 1)
+
 struct ctl_node {
struct rb_node node;
struct ctl_table_header *header;
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index d2aa6b4..af351ed 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -2504,6 +2504,7 @@ static int proc_dointvec_minmax_sysadmin(struct ctl_table 
*table, int write,
  * struct do_proc_dointvec_minmax_conv_param - proc_dointvec_minmax() range 
checking structure
  * @min: pointer to minimum allowable value
  * @max: pointer to maximum allowable value
+ * @flags: pointer to flags
  *
  * The do_proc_dointvec_minmax_conv_param structure provides the
  * minimum and maximum values for doing range checking for those sysctl
@@ -2512,6 +2513,7 @@ static int proc_dointvec_minmax_sysadmin(struct ctl_table 
*table, int write,
 struct do_proc_dointvec_minmax_conv_param {
int *min;
int *max;
+   uint16_t *flags;
 };
 
 static int do_proc_dointvec_minmax_conv(bool *negp, unsigned long *lvalp,
@@ -2521,9 +2523,21 @@ static int do_proc_dointvec_minmax_conv(bool *negp, 
unsigned long *lvalp,
struct do_proc_dointvec_minmax_conv_param *param = data;
if (write) {
int val = *negp ? -*lvalp : *lvalp;
-   if ((param->min && *param->min > val) ||
-   (param->max && *param->max < val))
-   return -EINVAL;
+   bool clamp = param->flags &&
+  (*param->flags & CTL_FLAGS_CLAMP_RANGE);
+
+   if (param->min && *param->min > val) {
+   if (clamp)
+   val = *param->min;
+   else
+   return -EINVAL;
+   }
+   if (param->max && *param->max < val) {
+   if (clamp)
+   val = *param->max;
+   else
+   return -EINVAL;
+   }
*valp = val;
} else {
int val = *valp;
@@ -2552,7 +2566,8 @@ static int do_proc_dointvec_minmax_conv(bool *negp, 
unsigned long *lvalp,
  * This routine will ensure the values are within the range specified by
  * table->extra1 (min) and table->extra2 (max).
  *
- * Returns 0 on success or -EINVAL on write when the range check fails.
+ * Returns 0 on success or -EINVAL on writ

Re: [PATCH v5 0/2] Remove false-positive VLAs when using max()

2018-03-16 Thread Al Viro

On Fri, Mar 16, 2018 at 05:55:02PM +, Al Viro wrote:
> On Fri, Mar 16, 2018 at 10:29:16AM -0700, Linus Torvalds wrote:
> >t.c: In function ‘test’:
> >t.c:6:6: error: argument to variable-length array is too large
> > [-Werror=vla-larger-than=]
> >  int array[(1,100)];
> > 
> > Gcc people are crazy.
> 
> That's not them, that's C standard regarding ICE.  1,100 is *not* a
> constant expression as far as the standard is concerned, and that
> type is actually a VLA with the size that can be optimized into
> a compiler-calculated value.
> 
> Would you argue that in

s/argue/agree/, sorry

> void foo(char c)
> {
>   int a[(c<<1) + 10 - c + 2 - c];
> 
> a is not a VLA?

FWIW, 6.6 starts with
 constant-expression:
conditional-expression
for syntax, with 6.6p3 being "Constant expression shall not contain
assignment, increment, decrement, function call or comma operators,
except when they are contained in a subexpression that is not evaluated",
with "The operand of sizeof operator is usually not evaluated (6.5.3.4)"
as a footnote.

6.6p10 allows implementation to accept other forms of constant expressions,
but arguing that such-and-such construct surely must be recognized as one,
when there are perfectly portable ways to achieve the same...

Realistically, code like that can come only from macros, and one can wrap
the damn thing into 0 * sizeof(..., 0) + just fine there.  Which will
satisfy the conditions for sizeof argument not being evaluated...
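
A minimal sketch of that wrapping, assuming a C11 compiler: the comma
operator is hidden inside the operand of sizeof, which is not
evaluated, so the expression should still count as an integer constant
expression under 6.6p3. The names DEMO_CONST_100 and demo_array_elems()
are made up for illustration, and the (void) cast is only there to keep
unused-value warnings quiet.

```c
#include <assert.h>
#include <stddef.h>

/* Comma operator appears only inside the unevaluated sizeof operand. */
#define DEMO_CONST_100 (0 * sizeof(((void)0, 0)) + 100)

/* static_assert only accepts an integer constant expression. */
static_assert(DEMO_CONST_100 == 100, "expected a constant 100");

/* Returns the element count of an array sized by the wrapped expression. */
static size_t demo_array_elems(void)
{
    int array[DEMO_CONST_100];  /* intended to be an ordinary array, not a VLA */

    return sizeof(array) / sizeof(array[0]);
}
```

Building this with -Wvla is one way to check whether a given compiler
really treats the array as fixed-size.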

Re: [RESEND PATCH v2] sched/fair: Remove check in idle_balance against migration_cost

2018-03-16 Thread Rohit Jain


On 03/16/2018 10:42 AM, Peter Zijlstra wrote:



> You need to look at:
>
>   https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/
>
> my queue.git is the sporadic push of my quilt tree on top of that.


Thanks, found it! I will re-test and send v3 as needed.

[PATCH v5 9/9] ipc: Conserve sequence numbers in extended IPCMNI mode

2018-03-16 Thread Waiman Long

The mixing in of a sequence number into the IPC IDs is probably to
avoid ID reuse in userspace as much as possible. With extended IPCMNI
mode, the number of usable sequence numbers is greatly reduced, leading
to a higher chance of ID reuse.

To address this issue, we need to conserve the sequence number space
as much as possible. Right now, the sequence number is incremented
for every new ID created. In reality, we only need to increment the
sequence number when one or more IDs have been removed previously to
make sure that those IDs will not be reused when a new one is built.
This is being done in the extended IPCMNI mode.
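
The idea can be sketched outside the kernel as follows. This is a
simplified model of the extended-mode path only: struct demo_ids,
demo_next_seq(), demo_remove() and the tiny DEMO_SEQ_MAX are all
hypothetical, and locking and the ID table are ignored.

```c
#include <assert.h>
#include <stdbool.h>

#define DEMO_SEQ_MAX 3  /* tiny, hypothetical limit so wrap-around is visible */

struct demo_ids {
    unsigned short seq;  /* sequence number currently being handed out */
    bool deleted;        /* an ID was removed since the last build */
};

/* Return the sequence number for a newly built ID. */
static unsigned short demo_next_seq(struct demo_ids *ids)
{
    assert(ids);

    /* Advance only if a removal happened; otherwise reuse the number. */
    if (ids->deleted) {
        ids->seq++;
        if (ids->seq > DEMO_SEQ_MAX)
            ids->seq = 0;
        ids->deleted = false;
    }
    return ids->seq;
}

/* Record that some ID was removed. */
static void demo_remove(struct demo_ids *ids)
{
    assert(ids);
    ids->deleted = true;
}
```

Any number of removals between two builds costs a single sequence
number, which is where the saving comes from.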

Signed-off-by: Waiman Long 
---
 include/linux/ipc_namespace.h |  1 +
 ipc/ipc_sysctl.c  |  2 ++
 ipc/util.c| 29 ++---
 ipc/util.h|  1 +
 4 files changed, 26 insertions(+), 7 deletions(-)

diff --git a/include/linux/ipc_namespace.h b/include/linux/ipc_namespace.h
index b5630c8..9c86fd9 100644
--- a/include/linux/ipc_namespace.h
+++ b/include/linux/ipc_namespace.h
@@ -16,6 +16,7 @@
 struct ipc_ids {
int in_use;
unsigned short seq;
+   unsigned short deleted;
bool tables_initialized;
struct rw_semaphore rwsem;
struct idr ipcs_idr;
diff --git a/ipc/ipc_sysctl.c b/ipc/ipc_sysctl.c
index 5f7cfae..61a832d 100644
--- a/ipc/ipc_sysctl.c
+++ b/ipc/ipc_sysctl.c
@@ -111,6 +111,7 @@ static int proc_ipc_sem_dointvec(struct ctl_table *table, 
int write,
 static int int_max = INT_MAX;
 int ipc_mni __read_mostly = IPCMNI;
 int ipc_mni_shift __read_mostly = IPCMNI_SHIFT;
+bool ipc_mni_extended __read_mostly;
 
 static struct ctl_table ipc_kern_table[] = {
{
@@ -243,6 +244,7 @@ static int __init ipc_mni_extend(char *str)
 {
ipc_mni = IPCMNI_EXTEND;
ipc_mni_shift = IPCMNI_EXTEND_SHIFT;
+   ipc_mni_extended = true;
pr_info("IPCMNI extended to %d.\n", ipc_mni);
return 0;
 }
diff --git a/ipc/util.c b/ipc/util.c
index daee305..8b38a6f 100644
--- a/ipc/util.c
+++ b/ipc/util.c
@@ -118,7 +118,8 @@ int ipc_init_ids(struct ipc_ids *ids)
 {
int err;
ids->in_use = 0;
-   ids->seq = 0;
+   ids->deleted = false;
+   ids->seq = ipc_mni_extended ? 0 : -1; /* seq # is pre-incremented */
init_rwsem(&ids->rwsem);
err = rhashtable_init(&ids->key_ht, &ipc_kht_params);
if (err)
@@ -192,6 +193,11 @@ static struct kern_ipc_perm *ipc_findkey(struct ipc_ids 
*ids, key_t key)
return NULL;
 }
 
+/*
+ * To conserve sequence number space with extended ipc_mni when new ID
+ * is built, the sequence number is incremented only when one or more
+ * IDs have been removed previously.
+ */
 #ifdef CONFIG_CHECKPOINT_RESTORE
 /*
  * Specify desired id for next allocated IPC object.
@@ -205,9 +211,13 @@ static inline int ipc_buildid(int id, struct ipc_ids *ids,
  struct kern_ipc_perm *new)
 {
if (ids->next_id < 0) { /* default, behave as !CHECKPOINT_RESTORE */
-   new->seq = ids->seq++;
-   if (ids->seq > IPCID_SEQ_MAX)
-   ids->seq = 0;
+   if (!ipc_mni_extended || ids->deleted) {
+   ids->seq++;
+   if (ids->seq > IPCID_SEQ_MAX)
+   ids->seq = 0;
+   ids->deleted = false;
+   }
+   new->seq = ids->seq;
} else {
new->seq = ipcid_to_seqx(ids->next_id);
ids->next_id = -1;
@@ -223,9 +233,13 @@ static inline int ipc_buildid(int id, struct ipc_ids *ids,
 static inline int ipc_buildid(int id, struct ipc_ids *ids,
  struct kern_ipc_perm *new)
 {
-   new->seq = ids->seq++;
-   if (ids->seq > IPCID_SEQ_MAX)
-   ids->seq = 0;
+   if (!ipc_mni_extended || ids->deleted) {
+   ids->seq++;
+   if (ids->seq > IPCID_SEQ_MAX)
+   ids->seq = 0;
+   ids->deleted = false;
+   }
+   new->seq = ids->seq;
 
return (new->seq << SEQ_SHIFT) + id;
 }
@@ -435,6 +449,7 @@ void ipc_rmid(struct ipc_ids *ids, struct kern_ipc_perm 
*ipcp)
idr_remove(&ids->ipcs_idr, lid);
ipc_kht_remove(ids, ipcp);
ids->in_use--;
+   ids->deleted = true;
ipcp->deleted = true;
 
if (unlikely(lid == ids->max_id)) {
diff --git a/ipc/util.h b/ipc/util.h
index 6871ca9..e6c2055 100644
--- a/ipc/util.h
+++ b/ipc/util.h
@@ -17,6 +17,7 @@
 
 extern int ipc_mni;
 extern int ipc_mni_shift;
+extern bool ipc_mni_extended;
 
 #define SEQ_SHIFT  ipc_mni_shift
 #define SEQ_MASK   ((1 << ipc_mni_shift) - 1)
-- 
1.8.3.1

[PATCH v5 6/9] test_sysctl: Add range clamping test

2018-03-16 Thread Waiman Long

Add a range clamping test to verify that the input value will be
clamped if it exceeds the builtin maximum or minimum value.

Below is the expected test run result:

Running test: sysctl_test_0006 - run #0
Checking range minimum clamping ... ok
Checking range maximum clamping ... ok
Checking range minimum clamping ... ok
Checking range maximum clamping ... ok

Signed-off-by: Waiman Long 
---
 lib/test_sysctl.c| 29 ++
 tools/testing/selftests/sysctl/sysctl.sh | 52 
 2 files changed, 81 insertions(+)

diff --git a/lib/test_sysctl.c b/lib/test_sysctl.c
index 3dd801c..7bb4cf7 100644
--- a/lib/test_sysctl.c
+++ b/lib/test_sysctl.c
@@ -38,12 +38,18 @@
 
 static int i_zero;
 static int i_one_hundred = 100;
+static int signed_min = -10;
+static int signed_max = 10;
+static unsigned int unsigned_min = 10;
+static unsigned int unsigned_max = 30;
 
 struct test_sysctl_data {
int int_0001;
int int_0002;
int int_0003[4];
+   int range_0001;
 
+   unsigned int urange_0001;
unsigned int uint_0001;
 
char string_0001[65];
@@ -58,6 +64,9 @@ struct test_sysctl_data {
.int_0003[2] = 2,
.int_0003[3] = 3,
 
+   .range_0001 = 0,
+   .urange_0001 = 20,
+
.uint_0001 = 314,
 
.string_0001 = "(none)",
@@ -102,6 +111,26 @@ struct test_sysctl_data {
.mode   = 0644,
.proc_handler   = proc_dostring,
},
+   {
+   .procname   = "range_0001",
+   .data   = &test_data.range_0001,
+   .maxlen = sizeof(test_data.range_0001),
+   .mode   = 0644,
+   .proc_handler   = proc_dointvec_minmax,
+   .flags  = CTL_FLAGS_CLAMP_RANGE_SIGNED,
+   .extra1 = &signed_min,
+   .extra2 = &signed_max,
+   },
+   {
+   .procname   = "urange_0001",
+   .data   = &test_data.urange_0001,
+   .maxlen = sizeof(test_data.urange_0001),
+   .mode   = 0644,
+   .proc_handler   = proc_douintvec_minmax,
+   .flags  = CTL_FLAGS_CLAMP_RANGE_UNSIGNED,
+   .extra1 = &unsigned_min,
+   .extra2 = &unsigned_max,
+   },
{ }
 };
 
diff --git a/tools/testing/selftests/sysctl/sysctl.sh 
b/tools/testing/selftests/sysctl/sysctl.sh
index ec232c3..1aa1bba 100755
--- a/tools/testing/selftests/sysctl/sysctl.sh
+++ b/tools/testing/selftests/sysctl/sysctl.sh
@@ -34,6 +34,7 @@ ALL_TESTS="$ALL_TESTS 0002:1:1"
 ALL_TESTS="$ALL_TESTS 0003:1:1"
 ALL_TESTS="$ALL_TESTS 0004:1:1"
 ALL_TESTS="$ALL_TESTS 0005:3:1"
+ALL_TESTS="$ALL_TESTS 0006:1:1"
 
 test_modprobe()
 {
@@ -543,6 +544,38 @@ run_stringtests()
test_rc
 }
 
+# TARGET, RANGE_MIN & RANGE_MAX need to be defined before running test.
+run_range_clamping_test()
+{
+   rc=0
+
+   echo -n "Checking range minimum clamping ... "
+   VAL=$((RANGE_MIN - 1))
+   echo -n $VAL > "${TARGET}" 2> /dev/null
+   EXITVAL=$?
+   NEWVAL=$(cat "${TARGET}")
+   if [[ $EXITVAL -ne 0 || $NEWVAL -ne $RANGE_MIN ]]; then
+   echo "FAIL" >&2
+   rc=1
+   else
+   echo "ok"
+   fi
+
+   echo -n "Checking range maximum clamping ... "
+   VAL=$((RANGE_MAX + 1))
+   echo -n $VAL > "${TARGET}" 2> /dev/null
+   EXITVAL=$?
+   NEWVAL=$(cat "${TARGET}")
+   if [[ $EXITVAL -ne 0 || $NEWVAL -ne $RANGE_MAX ]]; then
+   echo "FAIL" >&2
+   rc=1
+   else
+   echo "ok"
+   fi
+
+   test_rc
+}
+
 sysctl_test_0001()
 {
TARGET="${SYSCTL}/int_0001"
@@ -600,6 +633,25 @@ sysctl_test_0005()
run_limit_digit_int_array
 }
 
+sysctl_test_0006()
+{
+   TARGET="${SYSCTL}/range_0001"
+   ORIG=$(cat "${TARGET}")
+   RANGE_MIN=-10
+   RANGE_MAX=10
+
+   run_range_clamping_test
+   set_orig
+
+   TARGET="${SYSCTL}/urange_0001"
+   ORIG=$(cat "${TARGET}")
+   RANGE_MIN=10
+   RANGE_MAX=30
+
+   run_range_clamping_test
+   set_orig
+}
+
 list_tests()
 {
echo "Test ID list:"
-- 
1.8.3.1

[PATCH v5 7/9] test_sysctl: Add ctl_table registration failure test

2018-03-16 Thread Waiman Long

Incorrect sysctl tables are constructed and fed to the
register_sysctl_table() function in the test_sysctl kernel module.
The function is supposed to fail the registration of those tables or
an error will be printed if no failure is returned.

The registration failures will cause other warning and error messages
to be printed into the dmesg log, though.

A new test is also added to sysctl.sh to look for those failure
messages in the dmesg log to see if anything unexpected happens.

Signed-off-by: Waiman Long 
---
 lib/test_sysctl.c| 41 
 tools/testing/selftests/sysctl/sysctl.sh | 15 
 2 files changed, 56 insertions(+)

diff --git a/lib/test_sysctl.c b/lib/test_sysctl.c
index 7bb4cf7..14853d5 100644
--- a/lib/test_sysctl.c
+++ b/lib/test_sysctl.c
@@ -154,13 +154,54 @@ struct test_sysctl_data {
{ }
 };
 
+static struct ctl_table fail_sysctl_table0[] = {
+   {
+   .procname   = "failed_sysctl0",
+   .data   = &test_data.range_0001,
+   .maxlen = sizeof(test_data.range_0001),
+   .mode   = 0644,
+   .proc_handler   = proc_dointvec_minmax,
+   .flags  = CTL_FLAGS_CLAMP_RANGE_SIGNED,
+   .extra1 = &signed_max,
+   .extra2 = &signed_min,
+   },
+   { }
+};
+
+static struct ctl_table fail_sysctl_root_table[] = {
+   {
+   .procname   = "debug",
+   .maxlen = 0,
+   .mode   = 0555,
+   },
+   { }
+};
+
+static struct ctl_table *fail_tables[] = {
+   fail_sysctl_table0, NULL,
+};
+
 static struct ctl_table_header *test_sysctl_header;
 
 static int __init test_sysctl_init(void)
 {
+   struct ctl_table_header *fail_sysctl_header;
+   int i;
+
test_sysctl_header = register_sysctl_table(test_sysctl_root_table);
if (!test_sysctl_header)
return -ENOMEM;
+
+   for (i = 0; fail_tables[i]; i++) {
+   fail_sysctl_root_table[0].child = fail_tables[i];
+   fail_sysctl_header = 
register_sysctl_table(fail_sysctl_root_table);
+   if (fail_sysctl_header) {
+   pr_err("fail_tables[%d] registration check failed!\n", 
i);
+   unregister_sysctl_table(fail_sysctl_header);
+   break;
+   }
+   }
+
return 0;
 }
 late_initcall(test_sysctl_init);
diff --git a/tools/testing/selftests/sysctl/sysctl.sh 
b/tools/testing/selftests/sysctl/sysctl.sh
index 1aa1bba..23acdee 100755
--- a/tools/testing/selftests/sysctl/sysctl.sh
+++ b/tools/testing/selftests/sysctl/sysctl.sh
@@ -35,6 +35,7 @@ ALL_TESTS="$ALL_TESTS 0003:1:1"
 ALL_TESTS="$ALL_TESTS 0004:1:1"
 ALL_TESTS="$ALL_TESTS 0005:3:1"
 ALL_TESTS="$ALL_TESTS 0006:1:1"
+ALL_TESTS="$ALL_TESTS 0007:1:1"
 
 test_modprobe()
 {
@@ -652,6 +653,20 @@ sysctl_test_0006()
set_orig
 }
 
+sysctl_test_0007()
+{
+   echo "Checking test_sysctl module registration failure test ..."
+   dmesg | grep "sysctl.*fail_tables.*failed"
+   if [[ $? -eq 0 ]]; then
+   echo "FAIL" >&2
+   rc=1
+   else
+   echo "ok"
+   fi
+
+   test_rc
+}
+
 list_tests()
 {
echo "Test ID list:"
-- 
1.8.3.1

[PATCH v5 8/9] ipc: Allow boot time extension of IPCMNI from 32k to 2M

2018-03-16 Thread Waiman Long

The maximum number of unique System V IPC identifiers was limited to
32k.  That limit should be big enough for most use cases.

However, there are some users out there requesting more. To satisfy
the needs of those users, a new boot time kernel option "ipcmni_extend"
is added to extend the IPCMNI value to 2M. This is a 64X increase which
hopefully is big enough for them.

This new option does have the side effect of reducing the maximum
number of unique sequence numbers from 64k down to 1k. So it is
a trade-off.

Signed-off-by: Waiman Long 
---
 Documentation/admin-guide/kernel-parameters.txt |  3 +++
 include/linux/ipc.h | 11 ++-
 ipc/ipc_sysctl.c| 12 +++-
 ipc/util.c  | 12 ++--
 ipc/util.h  | 18 +++---
 5 files changed, 41 insertions(+), 15 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt 
b/Documentation/admin-guide/kernel-parameters.txt
index 1d1d53f..2be35a4 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -1733,6 +1733,9 @@
ip= [IP_PNP]
See Documentation/filesystems/nfs/nfsroot.txt.
 
+   ipcmni_extend   [KNL] Extend the maximum number of unique System V
+   IPC identifiers from 32768 to 2097152.
+
irqaffinity=[SMP] Set the default irq affinity mask
The argument is a cpu list, as described above.
 
diff --git a/include/linux/ipc.h b/include/linux/ipc.h
index 821b2f2..3ecd869 100644
--- a/include/linux/ipc.h
+++ b/include/linux/ipc.h
@@ -8,7 +8,16 @@
 #include 
 #include 
 
-#define IPCMNI 32768  /* <= MAX_INT limit for ipc arrays (including sysctl 
changes) */
+/*
+ * By default, the ipc arrays can have up to 32k (15 bits) entries.
+ * When IPCMNI extension mode is turned on, the ipc arrays can have up
+ * to 2M (21 bits) entries. However, the space for sequence number will
+ * be shrunk from 16 bits to 10 bits.
+ */
+#define IPCMNI_SHIFT   15
+#define IPCMNI_EXTEND_SHIFT21
+#define IPCMNI (1 << IPCMNI_SHIFT)
+#define IPCMNI_EXTEND  (1 << IPCMNI_EXTEND_SHIFT)
 
 /* used by in-kernel data structures */
 struct kern_ipc_perm {
diff --git a/ipc/ipc_sysctl.c b/ipc/ipc_sysctl.c
index 0ad7088..5f7cfae 100644
--- a/ipc/ipc_sysctl.c
+++ b/ipc/ipc_sysctl.c
@@ -109,7 +109,8 @@ static int proc_ipc_sem_dointvec(struct ctl_table *table, 
int write,
 static int zero;
 static int one = 1;
 static int int_max = INT_MAX;
-static int ipc_mni = IPCMNI;
+int ipc_mni __read_mostly = IPCMNI;
+int ipc_mni_shift __read_mostly = IPCMNI_SHIFT;
 
 static struct ctl_table ipc_kern_table[] = {
{
@@ -237,3 +238,12 @@ static int __init ipc_sysctl_init(void)
 }
 
 device_initcall(ipc_sysctl_init);
+
+static int __init ipc_mni_extend(char *str)
+{
+   ipc_mni = IPCMNI_EXTEND;
+   ipc_mni_shift = IPCMNI_EXTEND_SHIFT;
+   pr_info("IPCMNI extended to %d.\n", ipc_mni);
+   return 0;
+}
+early_param("ipcmni_extend", ipc_mni_extend);
diff --git a/ipc/util.c b/ipc/util.c
index 4ed5a17..daee305 100644
--- a/ipc/util.c
+++ b/ipc/util.c
@@ -112,7 +112,7 @@ static int __init ipc_init(void)
  * @ids: ipc identifier set
  *
  * Set up the sequence range to use for the ipc identifier range (limited
- * below IPCMNI) then initialise the keys hashtable and ids idr.
+ * below ipc_mni) then initialise the keys hashtable and ids idr.
  */
 int ipc_init_ids(struct ipc_ids *ids)
 {
@@ -213,7 +213,7 @@ static inline int ipc_buildid(int id, struct ipc_ids *ids,
ids->next_id = -1;
}
 
-   return SEQ_MULTIPLIER * new->seq + id;
+   return (new->seq << SEQ_SHIFT) + id;
 }
 
 #else
@@ -227,7 +227,7 @@ static inline int ipc_buildid(int id, struct ipc_ids *ids,
if (ids->seq > IPCID_SEQ_MAX)
ids->seq = 0;
 
-   return SEQ_MULTIPLIER * new->seq + id;
+   return (new->seq << SEQ_SHIFT) + id;
 }
 
 #endif /* CONFIG_CHECKPOINT_RESTORE */
@@ -251,8 +251,8 @@ int ipc_addid(struct ipc_ids *ids, struct kern_ipc_perm 
*new, int limit)
kgid_t egid;
int id, err;
 
-   if (limit > IPCMNI)
-   limit = IPCMNI;
+   if (limit > ipc_mni)
+   limit = ipc_mni;
 
if (!ids->tables_initialized || ids->in_use >= limit)
return -ENOSPC;
@@ -769,7 +769,7 @@ static struct kern_ipc_perm *sysvipc_find_ipc(struct 
ipc_ids *ids, loff_t pos,
if (total >= ids->in_use)
return NULL;
 
-   for (; pos < IPCMNI; pos++) {
+   for (; pos < ipc_mni; pos++) {
ipc = idr_find(&ids->ipcs_idr, pos);
if (ipc != NULL) {
*new_pos = pos + 1;
diff --git a/ipc/util.h b/ipc/util.h
index af57394..6871ca9 100644
--- a/ipc/util.h
+++ b/ipc/util.h
@@ -15,7 +15,11 @@
 #include

[PATCH v5 0/9] ipc: Clamp *mni to the real IPCMNI limit & increase that limit

2018-03-16 Thread Waiman Long

v4->v5:
 - Revert the flags back to 16-bit so that there will be no change to
   the size of ctl_table.
 - Enhance the sysctl_check_flags() as requested by Luis to perform more
   checks to spot incorrect ctl_table entries.
 - Change the sysctl selftest to use dummy sysctls instead of production
   ones & enhance it to do more checks.
 - Add one more sysctl selftest for registration failure.
 - Add 2 ipc patches to add an extended mode to increase IPCMNI from
   32k to 2M.
 - Miscellaneous change to incorporate feedback comments from
   reviewers.

v3->v4:
 - Remove v3 patches 1 & 2 as they have been merged into the mm tree.
 - Change flags from uint16_t to unsigned int.
 - Remove CTL_FLAGS_OOR_WARNED and use pr_warn_ratelimited() instead.
 - Simplify the warning message code.
 - Add a new patch to fail the ctl_table registration with invalid flag.
 - Add a test case for range clamping in sysctl selftest.

v2->v3:
 - Fix kdoc comment errors.
 - Incorporate comments and suggestions from Luis R. Rodriguez.
 - Add a patch to fix a typo error in fs/proc/proc_sysctl.c.

v1->v2:
 - Add kdoc comments to the do_proc_do{u}intvec_minmax_conv_param
   structures.
 - Add a new flags field to the ctl_table structure for specifying
   whether range clamping should be activated instead of adding new
   sysctl parameter handlers.
 - Clamp the semmni value embedded in the multi-values sem parameter.

v1 patch: https://lkml.org/lkml/2018/2/19/453
v2 patch: https://lkml.org/lkml/2018/2/27/627
v3 patch: https://lkml.org/lkml/2018/3/1/716 
v4 patch: https://lkml.org/lkml/2018/3/12/867

The sysctl parameters msgmni, shmmni and semmni have an inherent limit
of IPCMNI (32k). However, users may not be aware of that because they
can write a value much higher than that without getting any error or
notification. Reading the parameters back will show the newly written
values which are not real.

Enforcing the limit by failing sysctl parameter write, however, may
cause regressions if existing user setup scripts set those parameters
above 32k as those scripts will now fail in this case.

To address this dilemma, a new flags field is introduced into
the ctl_table. The value CTL_FLAGS_CLAMP_RANGE can be added to any
ctl_table entries to enable a looser range clamping without returning
any error. For example,

  .flags = CTL_FLAGS_CLAMP_RANGE,

This flag value is now used for the range checking of shmmni,
msgmni and semmni without breaking existing applications. If any out
of range value is written to those sysctl parameters, the following
warning will be printed instead.

  sysctl: "shmmni" was set out of range [0, 32768], clamped to 32768.

Reading the values back will show 32768 instead of some fake values.
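
The "clamp instead of fail" behaviour described above can be sketched in
plain userspace C. This is only an illustration under stated assumptions,
not the handler from this patchset: clamp_range() and its parameters are
hypothetical names, and the real code lives in the _minmax sysctl handlers.

```c
#include <stdbool.h>

/*
 * Hypothetical userspace sketch of range clamping without an error.
 * Returns the value that would be stored; *clamped tells the caller
 * whether an out-of-range warning should be logged.
 */
static int clamp_range(int val, int min, int max, bool *clamped)
{
	*clamped = false;
	if (val < min) {
		val = min;
		*clamped = true;
	} else if (val > max) {
		val = max;
		*clamped = true;
	}
	return val;
}
```

With min = 0 and max = 32768, writing 100000 stores 32768 and sets the
clamped indication, while an in-range write is stored unchanged.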

New sysctl selftests are added to exercise new code added by this
patchset.

There are users out there requesting an increase in the IPCMNI value.
The last 2 patches attempt to do that by using a boot kernel parameter
"ipcmni_extend" to increase the IPCMNI limit from 32k to 2M.

Eric Biederman had posted an RFC patch to just scrap the IPCMNI limit
and open up the whole positive integer space for IPC IDs. A major
issue that I have with this approach is that SysV IPC has been in use
for over 20 years. We just don't know if there are user applications
that depend on the way that the IDs are built. So a drastic
change like this has the potential to break some applications.

I prefer a more conservative approach where users will observe no
change in behavior unless they explicitly opt in to enable the extended
mode. I could open up the whole positive integer space in this case
like what Eric did, but that will make the code more complex.  So I
just extend IPCMNI to 2M in this case and keep similar ID generation
logic.
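
To make the trade-off concrete, here is a rough userspace sketch of how an
ID splits into an index part and a sequence part. The shift values 15 and
21 and the "(seq << shift) + id" form come from the patches; the helper
names are hypothetical, and a 32-bit int is assumed.

```c
#include <limits.h>

/* Hypothetical helper: combine a sequence number and an index into an ID. */
static int ipc_build_id(int seq, int idx, int shift)
{
	return (seq << shift) + idx;
}

/* Number of distinct sequence numbers left in a positive int. */
static int ipc_seq_count(int shift)
{
	return (INT_MAX >> shift) + 1;
}

/* Recover the index part (low bits) of an ID. */
static int ipc_id_to_idx(int id, int shift)
{
	return id & ((1 << shift) - 1);
}

/* Recover the sequence part (high bits) of an ID. */
static int ipc_id_to_seq(int id, int shift)
{
	return id >> shift;
}
```

With a shift of 15 there are 32768 index slots and 65536 sequence numbers;
with a shift of 21 there are 2097152 index slots but only 1024 sequence
numbers.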

Waiman Long (9):
  sysctl: Add flags to support min/max range clamping
  proc/sysctl: Provide additional ctl_table.flags checks
  sysctl: Warn when a clamped sysctl parameter is set out of range
  ipc: Clamp msgmni and shmmni to the real IPCMNI limit
  ipc: Clamp semmni to the real IPCMNI limit
  test_sysctl: Add range clamping test
  test_sysctl: Add ctl_table registration failure test
  ipc: Allow boot time extension of IPCMNI from 32k to 2M
  ipc: Conserve sequence numbers in extended IPCMNI mode

 Documentation/admin-guide/kernel-parameters.txt |  3 +
 fs/proc/proc_sysctl.c   | 62 
 include/linux/ipc.h | 11 +++-
 include/linux/ipc_namespace.h   |  1 +
 include/linux/sysctl.h  | 32 +++
 ipc/ipc_sysctl.c| 33 ++-
 ipc/sem.c   | 25 
 ipc/util.c  | 41 -
 ipc/util.h  | 23 +---
 kernel/sysctl.c | 76 ++---
 lib/test_sysctl.c   | 70

[PATCH v5 5/9] ipc: Clamp semmni to the real IPCMNI limit

2018-03-16 Thread Waiman Long

For SysV semaphores, the semmni value is the last part of the 4-element
sem number array. To make semmni behave in a similar way to msgmni
and shmmni, we can't directly use the _minmax handler. Instead,
a special sem specific handler is added to check the last argument
to make sure that it is clamped to the [0, IPCMNI] range and prints
a warning message once when an out-of-range value is being written.
This does require duplicating some of the code in the _minmax handlers.

Signed-off-by: Waiman Long 
---
 ipc/ipc_sysctl.c | 12 +++-
 ipc/sem.c| 25 +
 ipc/util.h   |  4 
 3 files changed, 40 insertions(+), 1 deletion(-)

diff --git a/ipc/ipc_sysctl.c b/ipc/ipc_sysctl.c
index 088721e..0ad7088 100644
--- a/ipc/ipc_sysctl.c
+++ b/ipc/ipc_sysctl.c
@@ -88,12 +88,22 @@ static int proc_ipc_auto_msgmni(struct ctl_table *table, 
int write,
return proc_dointvec_minmax(&ipc_table, write, buffer, lenp, ppos);
 }
 
+static int proc_ipc_sem_dointvec(struct ctl_table *table, int write,
+   void __user *buffer, size_t *lenp, loff_t *ppos)
+{
+   int ret = proc_ipc_dointvec(table, write, buffer, lenp, ppos);
+
+   sem_check_semmni(table, current->nsproxy->ipc_ns);
+   return ret;
+}
+
 #else
 #define proc_ipc_doulongvec_minmax NULL
 #define proc_ipc_dointvec NULL
 #define proc_ipc_dointvec_minmax   NULL
 #define proc_ipc_dointvec_minmax_orphans   NULL
 #define proc_ipc_auto_msgmni  NULL
+#define proc_ipc_sem_dointvec NULL
 #endif
 
 static int zero;
@@ -177,7 +187,7 @@ static int proc_ipc_auto_msgmni(struct ctl_table *table, 
int write,
.data   = &init_ipc_ns.sem_ctls,
.maxlen = 4*sizeof(int),
.mode   = 0644,
-   .proc_handler   = proc_ipc_dointvec,
+   .proc_handler   = proc_ipc_sem_dointvec,
},
 #ifdef CONFIG_CHECKPOINT_RESTORE
{
diff --git a/ipc/sem.c b/ipc/sem.c
index a4af049..faf2caa 100644
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -2337,3 +2337,28 @@ static int sysvipc_sem_proc_show(struct seq_file *s, 
void *it)
return 0;
 }
 #endif
+
+#ifdef CONFIG_PROC_SYSCTL
+/*
+ * Check to see if semmni is out of range and clamp it if necessary.
+ */
+void sem_check_semmni(struct ctl_table *table, struct ipc_namespace *ns)
+{
+   bool clamped = false;
+
+   /*
+* Clamp semmni to the range [0, IPCMNI].
+*/
+   if (ns->sc_semmni < 0) {
+   ns->sc_semmni = 0;
+   clamped = true;
+   }
+   if (ns->sc_semmni > IPCMNI) {
+   ns->sc_semmni = IPCMNI;
+   clamped = true;
+   }
+   if (clamped)
+   pr_warn_ratelimited("sysctl: \"sem[3]\" was set out of range 
[%d, %d], clamped to %d.\n",
+0, IPCMNI, ns->sc_semmni);
+}
+#endif
diff --git a/ipc/util.h b/ipc/util.h
index 89b8ec1..af57394 100644
--- a/ipc/util.h
+++ b/ipc/util.h
@@ -206,6 +206,10 @@ int ipcget(struct ipc_namespace *ns, struct ipc_ids *ids,
 void free_ipcs(struct ipc_namespace *ns, struct ipc_ids *ids,
void (*free)(struct ipc_namespace *, struct kern_ipc_perm *));
 
+#ifdef CONFIG_PROC_SYSCTL
+extern void sem_check_semmni(struct ctl_table *table, struct ipc_namespace 
*ns);
+#endif
+
 #ifdef CONFIG_COMPAT
 #include 
 struct compat_ipc_perm {
-- 
1.8.3.1

[PATCH v5 4/9] ipc: Clamp msgmni and shmmni to the real IPCMNI limit

2018-03-16 Thread Waiman Long

A user can write arbitrary integer values to the msgmni and shmmni sysctl
parameters without getting an error, but the actual limit is really
IPCMNI (32k). This can mislead users as they think they can get a
value that is not real.

Enforcing the limit by failing the sysctl parameter write, however,
can break existing user applications. Instead, the range clamping flag
is set to enforce the limit without failing existing user code. Users
can easily figure out if the sysctl parameter value is out of range
by either reading back the parameter value or checking the kernel
ring buffer for a warning.

Signed-off-by: Waiman Long 
---
 ipc/ipc_sysctl.c | 9 +++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/ipc/ipc_sysctl.c b/ipc/ipc_sysctl.c
index 8ad93c2..088721e 100644
--- a/ipc/ipc_sysctl.c
+++ b/ipc/ipc_sysctl.c
@@ -99,6 +99,7 @@ static int proc_ipc_auto_msgmni(struct ctl_table *table, int 
write,
 static int zero;
 static int one = 1;
 static int int_max = INT_MAX;
+static int ipc_mni = IPCMNI;
 
 static struct ctl_table ipc_kern_table[] = {
{
@@ -120,7 +121,10 @@ static int proc_ipc_auto_msgmni(struct ctl_table *table, 
int write,
.data   = &init_ipc_ns.shm_ctlmni,
.maxlen = sizeof(init_ipc_ns.shm_ctlmni),
.mode   = 0644,
-   .proc_handler   = proc_ipc_dointvec,
+   .proc_handler   = proc_ipc_dointvec_minmax,
+   .extra1 = &zero,
+   .extra2 = &ipc_mni,
+   .flags  = CTL_FLAGS_CLAMP_RANGE_SIGNED,
},
{
.procname   = "shm_rmid_forced",
@@ -147,7 +151,8 @@ static int proc_ipc_auto_msgmni(struct ctl_table *table, 
int write,
.mode   = 0644,
.proc_handler   = proc_ipc_dointvec_minmax,
.extra1 = &zero,
-   .extra2 = &int_max,
+   .extra2 = &ipc_mni,
+   .flags  = CTL_FLAGS_CLAMP_RANGE_SIGNED,
},
{
.procname   = "auto_msgmni",
-- 
1.8.3.1

Re: [PATCH v3] vsprintf: Prevent crash when dereferencing invalid pointers

2018-03-16 Thread Andy Shevchenko

On Thu, 2018-03-15 at 16:26 +0100, Petr Mladek wrote:
> On Thu 2018-03-15 15:09:03, Andy Shevchenko wrote:
> > On Wed, 2018-03-14 at 15:09 +0100, Petr Mladek wrote:
> > > We already prevent crash when dereferencing some obviously broken
> > > pointers. But the handling is not consistent. Sometimes we print
> > > "(null)"
> > > only for pure NULL pointer, sometimes for pointers in the first
> > > page and 
> > 
> > 
> > > sometimes also for pointers in the last page (error codes).
> > 
> > I still think that printing a hex value of the error code is much
> > better
> > than some odd "(efault)".
> 
> Do you mean (err:0e)? Google gives rather confusing answers for this.

More like "(0x)" (we already have more than 512 error code numbers).


> I am not super excited about (efault). But it seems to be less
> cryptic and the style is more similar to (null).
> 
> Best Regards,
> Petr

-- 
Andy Shevchenko 
Intel Finland Oy

[PATCH] mm: add config for readahead window

2018-03-16 Thread Wei Wang

From: Wei Wang 

Change VM_MAX_READAHEAD value from the default 128KB to a configurable
value. This will allow the readahead window to grow to a maximum size
bigger than 128KB during boot, which could benefit sequential read
throughput and thus boot performance.
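
As a rough illustration of what the configured value means in pages, the
sketch below converts a readahead limit in kilobytes to a page count. The
helper name is hypothetical and the 4096-byte page size used in the note
afterwards is an assumption; the actual page size depends on the
architecture and configuration.

```c
/*
 * Hypothetical helper: convert a readahead limit given in kilobytes
 * into a number of pages for a given page size in bytes.
 */
static unsigned long readahead_kb_to_pages(unsigned long kb,
					   unsigned long page_size)
{
	return kb * 1024UL / page_size;
}
```

Assuming 4096-byte pages, the default of 128 corresponds to 32 pages, and a
value of 512 would correspond to 128 pages.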

Signed-off-by: Wei Wang 
---
 include/linux/mm.h | 2 +-
 mm/Kconfig | 8 
 2 files changed, 9 insertions(+), 1 deletion(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index ad06d42adb1a..d7dc6125833e 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2291,7 +2291,7 @@ int __must_check write_one_page(struct page *page);
 void task_dirty_inc(struct task_struct *tsk);
 
 /* readahead.c */
-#define VM_MAX_READAHEAD   128 /* kbytes */
+#define VM_MAX_READAHEAD   CONFIG_VM_MAX_READAHEAD_KB
 #define VM_MIN_READAHEAD   16  /* kbytes (includes current page) */
 
 int force_page_cache_readahead(struct address_space *mapping, struct file 
*filp,
diff --git a/mm/Kconfig b/mm/Kconfig
index c782e8fb7235..da9ff543bdb9 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -760,3 +760,11 @@ config GUP_BENCHMARK
  performance of get_user_pages_fast().
 
  See tools/testing/selftests/vm/gup_benchmark.c
+
+config VM_MAX_READAHEAD_KB
+   int "Default max readahead window size in Kilobytes"
+   default 128
+   help
+ This sets the VM_MAX_READAHEAD value to allow the readahead window
+ to grow to a maximum size of configured. Increasing this value will
+ benefit sequential read throughput.
-- 
2.16.2.804.g6dcf76e118-goog

Re: [PATCH v4 2/2] dt-bindings: introduce Command DB for QCOM SoCs

2018-03-16 Thread Stephen Boyd

Quoting Bjorn Andersson (2018-03-07 11:02:49)
> On Tue 06 Mar 07:57 PST 2018, Lina Iyer wrote:
> 
> > On Mon, Mar 05 2018 at 16:15 -0700, Bjorn Andersson wrote:
> > > On Mon 26 Feb 09:58 PST 2018, Lina Iyer wrote:
> 
> > > As such I think you should just describe only the 0x85fe + 0x2
> > > region here and to support the dynamic aspect of this from a system
> > > point of view you can have the boot loader read the information at
> > > 0xc3f000c and adjust the reserved memory. (Or just keep the step of
> > > manually update the dts without caring about the indirection)
> > > 
> > It would be incorrect and very board specific to just use the 0x85fe000
> > as the address. It is not how the SoC defines the location. Upon request
> > earlier, this memory location was added in DT and the location is
> > typical reference platform usage only.
> > 
> 
> The problem is that as the db resides in a chunk of memory in the middle
> of what Linux considers System RAM the DTS must specify this region as
> reserved. Which means that as you, like described above, update the
> dictionary something (in your scheme a person) has to update the
> reserved-memory region as well.
> 
> That's why I'm proposing that the appropriate implementation for this
> is to have the boot loader do the dictionary part of this and Linux only
> care about the actual reserved-memory region. This way you would still
> implement the dictionary lookup on a system level, but the Linux
> part no longer depend on a human updating the DTS to match the values of
> the dictionary.

Agreed. I thought SMEM had a similar design of a cookie in IMEM to
indicate location and size because coordinating changes across all the
various software images is a hard problem. But coordinating between
linux and the linux bootloader shouldn't be as hard.

> 
> 
> But if we stick with the approach of describing both these and hoping
> that the values in the first region matches the second (or should we add
> a sanity check in probe?). The memory reserve defined as 0xc3f000c + 8
> looks strange, is this system ram as well and what other things resides
> in that same page?
> 

Doesn't look like it could be RAM, the address is not very close to the
other one so I would guess it's something like IMEM. And there are two
32-bit numbers to describe address and size?

RE: [PATCH] storvsc: Set up correct queue depth values for IDE devices

2018-03-16 Thread Long Li

> > Subject: [PATCH] storvsc: Set up correct queue depth values for IDE
> > devices
> >
> > From: Long Li 
> >
> > Unlike SCSI and FC, we don't use multiple channels for IDE. So set
> > queue depth correctly for IDE.
> >
> > Also set the correct cmd_per_lun for all devices.
> >
> > Signed-off-by: Long Li 
> > ---
> >  drivers/scsi/storvsc_drv.c | 8 ++--
> >  1 file changed, 6 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/scsi/storvsc_drv.c b/drivers/scsi/storvsc_drv.c
> > index 8c51d628b52e..fba170640e9c 100644
> > --- a/drivers/scsi/storvsc_drv.c
> > +++ b/drivers/scsi/storvsc_drv.c
> > @@ -1722,15 +1722,19 @@ static int storvsc_probe(struct hv_device
> *device,
> > max_targets = STORVSC_MAX_TARGETS;
> > max_channels = STORVSC_MAX_CHANNELS;
> > /*
> > -* On Windows8 and above, we support sub-channels for
> storage.
> > +* On Windows8 and above, we support sub-channels for
> storage
> > +* on SCSI and FC controllers.
> >  * The number of sub-channels offerred is based on the
> number of
> >  * VCPUs in the guest.
> >  */
> > -   max_sub_channels = (num_cpus /
> storvsc_vcpus_per_sub_channel);
> > +   if (!dev_is_ide)
> > +   max_sub_channels =
> > +   num_cpus / storvsc_vcpus_per_sub_channel;
> 
> This calculation of the # of sub-channels doesn't get the right answer (and it
> didn't before this patch either).  storvsc_vcpus_per_sub_channel defaults to 4.
> If num_cpus is 8, max_sub_channels will be 2, but it should be 1.  The
> sub-channel count should not include the main channel since we add 1 to the
> sub-channel count below when calculating can_queue.

This is a good point. I will fix the code calculating can_queue.

> 
> Furthermore, this calculation is just a guess, in the sense that we're
> replicating the algorithm we think Hyper-V is using to determine the number
> of sub-channels to offer.   It turns out Hyper-V is changing that algorithm
> for particular devices in an upcoming new Azure VM size.  But the only use of
> max_sub_channels is in the calculation of can_queue below, so the impact is
> minimal.
> 
> > }
> >
> > scsi_driver.can_queue = (max_outstanding_req_per_channel *
> >  (max_sub_channels + 1));
> > +   scsi_driver.cmd_per_lun = scsi_driver.can_queue;
> 
> can_queue is defined as "int", while cmd_per_lun is defined as "short".
> The calculated value of can_queue could easily be over 32,767 with
> 15 sub-channels and max_outstanding_req_per_channel being 3036 for the
> default 1 Mbyte ring buffer.  So the assignment to cmd_per_lun could
> produce truncation and even a negative number.

This is a good catch. I think I should try calling blk_set_queue_depth() and 
pass the correct value. 
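
The truncation concern can be checked with a small userspace sketch. The
formula mirrors the can_queue calculation quoted above; the function names
are hypothetical, and clamping to SHRT_MAX is shown only as one possible
guard, not as the fix for this driver.

```c
#include <limits.h>
#include <stdbool.h>

/* Same shape as the can_queue calculation in the patch. */
static int storvsc_can_queue(int reqs_per_channel, int sub_channels)
{
	return reqs_per_channel * (sub_channels + 1);
}

/* True if the value survives assignment to a short unchanged. */
static bool fits_in_short(int v)
{
	return v >= SHRT_MIN && v <= SHRT_MAX;
}

/* Hypothetical guard: saturate instead of silently truncating. */
static short clamp_to_short(int v)
{
	if (v > SHRT_MAX)
		return SHRT_MAX;
	if (v < SHRT_MIN)
		return SHRT_MIN;
	return (short)v;
}
```

With 3036 requests per channel and 15 sub-channels, the result is 48576,
which does not fit in a short.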

> 
> More broadly, since max_outstanding_req_per_channel is based on the ring
> buffer size, these calculations imply that Hyper-V storvsp's queuing capacity
> is based on the ring buffer size.  I don't think that's the case.  From
> conversations with the storvsp folks, I think Hyper-V always removes entries
> from the guest->host ring buffer and then
> lets storvsp queue them separately.   So we don't want to be linking
> cmd_per_lun (or even can_queue, for that matter) to the ring buffer size.
> The current default ring buffer size of 1 Mbyte is probably 10x bigger than
> needed, and we want to be able to adjust that without ending up with
> can_queue and cmd_per_lun values that are too small.

cmd_per_lun needs to reflect the device capacity. What value do you propose? 
It's not a good idea to leave them constant. Setting those values is also 
important because we don't want to return BUSY when writing to a full ring 
buffer; that will slow down the SCSI stack.

Historically we use ring buffer size to calculate device properties (e.g. 
can_queue for SCSI host).

I agree this doesn't need to be based on the exact queuing capacity of ring 
buffer, maybe we can do 2X of that value (e.g. look at how block uses 
nr_request in MQ). Setting those values smaller is more conservative and I 
don't see an ill effect.

> 
> We would probably do better to set can_queue to a constant, and
> leave cmd_per_lun at its current value of 2048.   The can_queue
> value is already capped at 10240 in the blk-mq layer, so maybe that's a
> reasonable constant to use.

Actually this is not a good idea for smaller ring buffers. You'll see the 
problem when setting both ring buffer sizes to 10 pages.

> 
> Thoughts?
> 
> >
> > host = scsi_host_alloc(&scsi_driver,
> >sizeof(struct hv_host_device));
> > --
> > 2.14.1

Re: [PATCH v5 2/2] dt-bindings: introduce Command DB for QCOM SoCs

2018-03-16 Thread Stephen Boyd

Quoting Lina Iyer (2018-03-14 10:13:30)
> +Properties:
> +- compatible:
> +   Usage: required
> +   Value type: 
> +   Definition: Should be "qcom,cmd-db"
> +
> +- reg:
> +   Usage: required
> +   Value type: 
> +   Definition: The register address that points to the location of the
> +   Command DB in memory. Additionally, specify the address
> +   and size of the actual lacation in memory.

s/lacation/location/ (seems this was missed from last round)

> +
> +Example:
> +
> +   reserved-memory {
> +   [...]
> +   qcom,cmd-db@c3f000c {
> +   reg = <0x0 0xc3f000c 0x0 0x8>,
> + <0x0 0x85fe 0x0 0x2>;

I agree with Bjorn and replied so on v4.

> +   compatible = "qcom,cmd-db";
> +   };
> +   };
> -- 
> The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
> a Linux Foundation Collaborative Project
>

Re: [PATCH -next 00/22] remove in-kernel syscall invocations (part 2 == netdev)

2018-03-16 Thread David Miller

From: Dominik Brodowski 
Date: Fri, 16 Mar 2018 18:05:52 +0100

> The rationale of this change is described in patch 1 of part 1[*] as follows:
> 
>   The syscall entry points to the kernel defined by SYSCALL_DEFINEx()
>   and COMPAT_SYSCALL_DEFINEx() should only be called from userspace
>   through kernel entry points, but not from the kernel itself. This
>   will allow cleanups and optimizations to the entry paths *and* to
>   the parts of the kernel code which currently need to pretend to be
>   userspace in order to make use of syscalls.
> 
> At present, these patches are based on v4.16-rc5; there is one trivial
> conflict against net-next. Dave, I presume that you prefer to take them
> through net-next? If you want to, I can re-base them against net-next.
> If you prefer otherwise, though, I can route them as part of my whole
> syscall series.

So the transformations themselves are relatively trivial, so on that
aspect I don't have any problems with these changes.

But overall I have to wonder.

I imagine one of the things you'd like to do is declare that syscall
entries use a different (better) argument passing scheme.  For
example, passing values in registers instead of on the stack.

But in situations where you split out the system call function
completely into one of these "helpers", the compiler is going
to have two choices:

1) Expand the helper into the syscall function inline, thus we end up
   with two copies of the function.

2) Call the helper from the syscall function.  Well, then the compiler
   will need to pop the syscall-obtained arguments from the registers
   onto the stack.

So this doesn't seem like such a total win to me.

Maybe you can explain things better to ease my concerns.
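
For reference, the pattern being discussed looks roughly like the sketch
below. All names are hypothetical and the functions are plain userspace C;
in the kernel the entry point would be generated by SYSCALL_DEFINEx(), and
whether the compiler inlines the helper or emits a call is up to the
compiler.

```c
/* Hypothetical shared helper that holds the real logic. */
static long example_do_add(long a, long b)
{
	return a + b;
}

/*
 * Stand-in for a syscall entry point: a thin wrapper that only
 * forwards its arguments to the helper.
 */
long example_sys_add(long a, long b)
{
	return example_do_add(a, b);
}

/* In-kernel users call the helper directly, not the entry point. */
long example_kernel_user(void)
{
	return example_do_add(2, 3);
}
```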

About merging, I'm fine with you taking this via your tree.  I do not
see there being any terribly difficult conflicts arising (famous
last words).

Thanks.

Re: [PATCH v4 3/4] PCI: hv: Remove hbus->enum_sem

2018-03-16 Thread Lorenzo Pieralisi

On Fri, Mar 16, 2018 at 05:41:27PM +, Dexuan Cui wrote:
> > From: Lorenzo Pieralisi 
> > Sent: Friday, March 16, 2018 03:54
> > ...
> > Dexuan,
> > while applying/updating these patches I notice this one may be squashed
> > into: https://patchwork.ozlabs.org/patch/886266/
> > 
> > since they logically belong in the same patch. Are you OK with me doing
> > that ? Is my reading correct ?
> > Lorenzo
> 
> I'm OK. 
> I used two patches
> [PATCH v4 1/2] PCI: hv: Serialize the present and eject work items
> [PATCH v4 3/4] PCI: hv: Remove hbus->enum_sem
> only because the first fixed a real issue and hence IMO should go into
> stable kernels, and the second is only a cleanup patch, which doesn't
> need to go into stable kernels.
> 
> Either way is ok to me. 
> Please feel free to do whatever you think is better. :-)

OK, patch series reworked and queued in my pci/hv branch. Please have
a look and let me know if that looks OK to you; I won't ask Bjorn
to move it into -next till you give me the go-ahead.

Thanks,
Lorenzo

Re: [PATCH net-next] net: ethernet: ti: cpsw: enable vlan rx vlan offload

2018-03-16 Thread David Miller

From: Andrew Lunn 
Date: Fri, 16 Mar 2018 01:29:35 +0100

> On Thu, Mar 15, 2018 at 03:15:50PM -0500, Grygorii Strashko wrote:
>> In VLAN_AWARE mode CPSW can insert VLAN header encapsulation word on Host
>> port 0 egress (RX) before the packet data if RX_VLAN_ENCAP bit is set in
>> CPSW_CONTROL register. VLAN header encapsulation word has following format:
>> 
>>  HDR_PKT_Priority bits 29-31 - Header Packet VLAN prio (Highest prio: 7)
>>  HDR_PKT_CFI   bits 28 - Header Packet VLAN CFI bit.
>>  HDR_PKT_Vid   bits 27-16 - Header Packet VLAN ID
>>  PKT_Type bits 8-9 - Packet Type. Indicates whether the packet is
>>  VLAN-tagged, priority-tagged, or non-tagged.
>>  00: VLAN-tagged packet
>>  01: Reserved
>>  10: Priority-tagged packet
>>  11: Non-tagged packet
>> 
>> This feature can be used to implement TX VLAN offload in case of
>> VLAN-tagged packets and to insert VLAN tag in case Non-tagged packet was
>> received on port with PVID set. As per documentation, CPSW never modifies
>> packet data on Host egress (RX) and as result, without this feature
>> enabled, Host port will not be able to receive properly packets which
>> entered switch non-tagged through external Port with PVID set (when
>> non-tagged packet forwarded from external Port with PVID set to another
>> external Port - packet will be VLAN tagged properly).
> 
> So, i think it is time to discuss the future of this driver. It should
> really be replaced by a switchdev/DSA driver. There are plenty of
> carrots for a new driver: Better statistics, working ethtool support
> for all the PHYs, better user experience, etc. But maybe now it is
> time for the stick. Should we Maintainers decide that no new features
> should be added to the existing drivers, just bug fixes?

Andrew, I totally share your concerns.

However, I think the reality is that at best we can strongly urge
people to do such a large amount of work such as writing a new
switchdev/DSA driver for this cpsw hardware.

We can't really compel them.

And a stick could have the opposite of its intended effect.  If still
nobody wants to do the switchdev/DSA driver, then this existing one
rots and even worse we can end up with an out-of-tree version of this
driver that has the changes (such as this one) that people want.

I'd like to see the switchdev/DSA driver for cpsw as much as you do,
but I am not convinced that rejecting patches like this one will
necessarily make that happen.

Also, it would be a completely different situation if we had someone
working on the switchdev/DSA version already.

So as it stands I really don't think we can block this patch.

Thank you.
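
For readers following the quoted description, the encapsulation word
layout can be decoded roughly as below. This is only a userspace sketch:
the bit positions come from the quoted commit message, while the macro,
enum and function names are made up for illustration and are not the
driver's identifiers.

```c
#include <stdint.h>

/* Bit positions from the quoted layout; all names here are hypothetical. */
#define ENCAP_PRIO_SHIFT	29
#define ENCAP_PRIO_MASK		0x7u	/* bits 29-31 */
#define ENCAP_CFI_SHIFT		28
#define ENCAP_CFI_MASK		0x1u	/* bit 28 */
#define ENCAP_VID_SHIFT		16
#define ENCAP_VID_MASK		0xfffu	/* bits 16-27 */
#define ENCAP_TYPE_SHIFT	8
#define ENCAP_TYPE_MASK		0x3u	/* bits 8-9 */

enum encap_pkt_type {
	ENCAP_PKT_VLAN_TAG = 0,	/* 00: VLAN-tagged packet */
	ENCAP_PKT_RESERVED = 1,	/* 01: Reserved */
	ENCAP_PKT_PRIO_TAG = 2,	/* 10: Priority-tagged packet */
	ENCAP_PKT_UNTAGGED = 3,	/* 11: Non-tagged packet */
};

struct encap_info {
	unsigned int prio;
	unsigned int cfi;
	unsigned int vid;
	enum encap_pkt_type type;
};

/* Split one 32-bit encapsulation word into its fields. */
static inline struct encap_info encap_decode(uint32_t word)
{
	struct encap_info info;

	info.prio = (word >> ENCAP_PRIO_SHIFT) & ENCAP_PRIO_MASK;
	info.cfi = (word >> ENCAP_CFI_SHIFT) & ENCAP_CFI_MASK;
	info.vid = (word >> ENCAP_VID_SHIFT) & ENCAP_VID_MASK;
	info.type = (enum encap_pkt_type)((word >> ENCAP_TYPE_SHIFT) &
					  ENCAP_TYPE_MASK);
	return info;
}
```

A receive path would presumably branch on the type field to decide
whether a VLAN tag needs to be reported for the packet.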

Re: [PATCH 8/9] x86/dumpstack: Save first regs set for the executive summary

2018-03-16 Thread Josh Poimboeuf

On Fri, Mar 16, 2018 at 06:45:29PM +0100, Borislav Petkov wrote:
> > You're better off getting rid of the CR2 line from __show_regs(),
> > because it can be dangerously confusing. It's not actually part of the
> > saved register state at all, it's something entirely different. It's
> > like showing the current eflags rather than the eflags saved on the
> > faulting stack.
> 
> Yeah, __show_regs() goes and gets a bunch of registers at the time
> __show_regs() runs. Which is ok for those which don't change in between
> but CR2 is special.
> 
> We probably could improve that situation by having a struct fault_regs
> or so wrapping pt_regs and adding a bunch of fields like CR2 etc. Fault
> handlers would then populate fault_regs at fault time while we're atomic
> and then hand this struct down to the printing path.
> 
> The printing path would fill out the rest and this way we won't have any
> of that monkey business anymore.
> 
> Thoughts?

It would be nice if we could save *all* the printed registers before
they get a chance to change, but I don't know how feasible that is.
Some of the registers change in entry code, like CR3 and GS.

-- 
Josh
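
The fault_regs idea quoted above could look something like the following.
This is a userspace model under stated assumptions: struct regs_model
stands in for pt_regs, live_cr2 stands in for the hardware register, and
every name is hypothetical rather than taken from any posted patch.

```c
/* Stand-in for the saved register frame (pt_regs in the kernel). */
struct regs_model {
	unsigned long ip;
	unsigned long sp;
};

/* Wrapper that also keeps values which may change after the fault. */
struct fault_regs_model {
	struct regs_model regs;
	unsigned long cr2;	/* faulting address, sampled at fault time */
};

/* Models the live register; later faults would overwrite it. */
static unsigned long live_cr2;

/* Called once, early in the fault path, before anything can fault again. */
static inline void fault_regs_capture(struct fault_regs_model *fr,
				      const struct regs_model *regs)
{
	fr->regs = *regs;
	fr->cr2 = live_cr2;
}

/* The printing path reads the saved copy, never the live register. */
static inline unsigned long fault_regs_cr2(const struct fault_regs_model *fr)
{
	return fr->cr2;
}
```

The point of the wrapper is that the printing path no longer depends on
when it runs relative to later faults.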

Re: [PATCH] platform/x86: panasonic-laptop: add support to mute and hardware optical switch

2018-03-16 Thread Andy Shevchenko

On Fri, Mar 16, 2018 at 1:09 PM, Harald Welte  wrote:
> Hi Kenneth,
>
> thanks for your fix.  I'm not currently using any Panasonic Laptop devices
> anymore (After the CF-R series was discontinued), so I'm not able to properly
> maintain this driver anymore.
>
> However, your changes look sound to me (aside from that we don't use C++ style
> comments in kernel source code), hence:
>
> Acked-by: Harald Welte 

I'm not sure I've ever seen the original patch. Patchwork is also missing it.


-- 
With Best Regards,
Andy Shevchenko

Re: [PATCH v5 1/2] hwmon: (ucd9000) Add gpio chip interface

2018-03-16 Thread Eddie James




On 03/16/2018 08:40 AM, Guenter Roeck wrote:
> On 03/15/2018 03:21 PM, Eddie James wrote:
>> From: Christopher Bostic 
>>
>> Add a struct gpio_chip and define some methods so that this device's
>> I/O can be accessed via /sys/class/gpio.
>
> Sorry for not noticing earlier. The 0day reports should be addressed
> by selecting GPIOLIB in the Kconfig entry.

Getting kbuild recursive dependencies when I select GPIOLIB for ucd9000 :(

May have to do "depends on" instead and #ifdef GPIOLIB in ucd9000,
unless you have another recommendation?

Thanks
Eddie

> Guenter


Signed-off-by: Christopher Bostic 
Signed-off-by: Andrew Jeffery 
Signed-off-by: Eddie James 
---
  drivers/hwmon/pmbus/ucd9000.c | 201 
++

  1 file changed, 201 insertions(+)

diff --git a/drivers/hwmon/pmbus/ucd9000.c 
b/drivers/hwmon/pmbus/ucd9000.c

index b74dbec..a34ffc4 100644
--- a/drivers/hwmon/pmbus/ucd9000.c
+++ b/drivers/hwmon/pmbus/ucd9000.c
@@ -27,6 +27,7 @@
  #include 
  #include 
  #include 
+#include 
  #include "pmbus.h"
    enum chips { ucd9000, ucd90120, ucd90124, ucd90160, ucd9090, 
ucd90910 };

@@ -35,8 +36,18 @@
  #define UCD9000_NUM_PAGES    0xd6
  #define UCD9000_FAN_CONFIG_INDEX    0xe7
  #define UCD9000_FAN_CONFIG    0xe8
+#define UCD9000_GPIO_SELECT    0xfa
+#define UCD9000_GPIO_CONFIG    0xfb
  #define UCD9000_DEVICE_ID    0xfd
  +/* GPIO CONFIG bits */
+#define UCD9000_GPIO_CONFIG_ENABLE    BIT(0)
+#define UCD9000_GPIO_CONFIG_OUT_ENABLE    BIT(1)
+#define UCD9000_GPIO_CONFIG_OUT_VALUE    BIT(2)
+#define UCD9000_GPIO_CONFIG_STATUS    BIT(3)
+#define UCD9000_GPIO_INPUT    0
+#define UCD9000_GPIO_OUTPUT    1
+
  #define UCD9000_MON_TYPE(x)    (((x) >> 5) & 0x07)
  #define UCD9000_MON_PAGE(x)    ((x) & 0x0f)
  @@ -47,9 +58,15 @@
    #define UCD9000_NUM_FAN    4
  +#define UCD9000_GPIO_NAME_LEN    16
+#define UCD9090_NUM_GPIOS    23
+#define UCD901XX_NUM_GPIOS    26
+#define UCD90910_NUM_GPIOS    26
+
  struct ucd9000_data {
  u8 fan_data[UCD9000_NUM_FAN][I2C_SMBUS_BLOCK_MAX];
  struct pmbus_driver_info info;
+    struct gpio_chip gpio;
  };
  #define to_ucd9000_data(_info) container_of(_info, struct 
ucd9000_data, info)
  @@ -149,6 +166,188 @@ static int ucd9000_read_byte_data(struct 
i2c_client *client, int page, int reg)

  };
  MODULE_DEVICE_TABLE(of, ucd9000_of_match);
  +static int ucd9000_gpio_read_config(struct i2c_client *client,
+    unsigned int offset)
+{
+    int ret;
+
+    /* No page set required */
+    ret = i2c_smbus_write_byte_data(client, UCD9000_GPIO_SELECT, 
offset);

+    if (ret < 0)
+    return ret;
+
+    return i2c_smbus_read_byte_data(client, UCD9000_GPIO_CONFIG);
+}
+
+static int ucd9000_gpio_get(struct gpio_chip *gc, unsigned int offset)
+{
+    struct i2c_client *client  = gpiochip_get_data(gc);
+    int ret;
+
+    ret = ucd9000_gpio_read_config(client, offset);
+    if (ret < 0)
+    return ret;
+
+    return !!(ret & UCD9000_GPIO_CONFIG_STATUS);
+}
+
+static void ucd9000_gpio_set(struct gpio_chip *gc, unsigned int offset,
+ int value)
+{
+    struct i2c_client *client = gpiochip_get_data(gc);
+    int ret;
+
+    ret = ucd9000_gpio_read_config(client, offset);
+    if (ret < 0) {
+    dev_dbg(&client->dev, "failed to read GPIO %d config: %d\n",
+    offset, ret);
+    return;
+    }
+
+    if (value) {
+    if (ret & UCD9000_GPIO_CONFIG_STATUS)
+    return;
+
+    ret |= UCD9000_GPIO_CONFIG_STATUS;
+    } else {
+    if (!(ret & UCD9000_GPIO_CONFIG_STATUS))
+    return;
+
+    ret &= ~UCD9000_GPIO_CONFIG_STATUS;
+    }
+
+    ret |= UCD9000_GPIO_CONFIG_ENABLE;
+
+    /* Page set not required */
+    ret = i2c_smbus_write_byte_data(client, UCD9000_GPIO_CONFIG, ret);
+    if (ret < 0) {
+    dev_dbg(&client->dev, "Failed to write GPIO %d config: %d\n",
+    offset, ret);
+    return;
+    }
+
+    ret &= ~UCD9000_GPIO_CONFIG_ENABLE;
+
+    ret = i2c_smbus_write_byte_data(client, UCD9000_GPIO_CONFIG, ret);
+    if (ret < 0)
+    dev_dbg(&client->dev, "Failed to write GPIO %d config: %d\n",
+    offset, ret);
+}
+
+static int ucd9000_gpio_get_direction(struct gpio_chip *gc,
+  unsigned int offset)
+{
+    struct i2c_client *client = gpiochip_get_data(gc);
+    int ret;
+
+    ret = ucd9000_gpio_read_config(client, offset);
+    if (ret < 0)
+    return ret;
+
+    return !(ret & UCD9000_GPIO_CONFIG_OUT_ENABLE);
+}
+
+static int ucd9000_gpio_set_direction(struct gpio_chip *gc,
+  unsigned int offset, bool direction_out,
+  int requested_out)
+{
+    struct i2c_client *client = gpiochip_get_data(gc);
+    int ret, config, out_val;
+
+    ret = ucd9000_gpio_read_config(client, offset);
+    if (ret < 0)
+    return ret;
+
+    if (direction_out) {
+    out_val = requested_out ? UCD9000_GPIO_CONFIG_OUT_VALUE : 0;
+
+    if (ret & UCD9000_G

Re: [PATCH v2] netns: send uevent messages

2018-03-16 Thread David Miller

From: Christian Brauner 
Date: Fri, 16 Mar 2018 13:50:30 +0100

> +static int uevent_net_broadcast(struct sock *usk, struct sk_buff *skb,
> + struct netlink_ext_ack *extack)
> +{
> + int ret;
> + /* u64 to chars: 2^64 - 1 = 21 chars */
> + char buf[sizeof("SEQNUM=") + 21];
> + struct sk_buff *skbc;

I hate to be difficult, but please use reverse christmas tree ordering
for local variables.

> +static int uevent_net_rcv_skb(struct sk_buff *skb, struct nlmsghdr *nlh,
> +   struct netlink_ext_ack *extack)
> +{
> + int ret;
> + struct net *net;

Likewise.

Thank you.
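
For anyone unfamiliar with the convention: "reverse christmas tree" means
local declarations are ordered from the longest line to the shortest. A
small userspace sketch follows; the function name and body are
hypothetical and only the declaration ordering is the point.

```c
#include <inttypes.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Format "SEQNUM=<n>" into out; returns the length or -1 on error. */
static inline int seqnum_format(uint64_t seqnum, char *out, size_t outlen)
{
	/* Longest declaration first, shortest last. UINT64_MAX has 20
	 * decimal digits, so 21 leaves a spare byte, as in the patch.
	 */
	char buf[sizeof("SEQNUM=") + 21];
	size_t len;
	int ret;

	ret = snprintf(buf, sizeof(buf), "SEQNUM=%" PRIu64, seqnum);
	if (ret < 0 || (size_t)ret >= sizeof(buf))
		return -1;

	len = (size_t)ret;
	if (len + 1 > outlen)
		return -1;

	memcpy(out, buf, len + 1);
	return ret;
}
```

Applied to the first quoted hunk, the buf declaration would come first,
then skbc, then ret.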

Re: [PATCH 0/2] net: phy: relax error checking when creating sysfs link netdev->phydev

2018-03-16 Thread Grygorii Strashko



On 03/16/2018 12:34 PM, Florian Fainelli wrote:
> 
> 
> On 03/16/2018 10:22 AM, Andrew Lunn wrote:
>> On Wed, Mar 14, 2018 at 05:26:22PM -0500, Grygorii Strashko wrote:
>>> Some ethernet drivers (like TI CPSW) may connect and manage >1 Net PHYs per
>>> one netdevice, as result such drivers will produce warning during system
>>> boot and fail to connect second phy to netdevice when PHYLIB framework
>>> will try to create sysfs link netdev->phydev for second PHY
>>> in phy_attach_direct(), because sysfs link with the same name has been
>>> created already for the first PHY.
>>> As a result, the second CPSW external port will become unusable.
>>> This issue was introduced by commits:
>>> 5568363f0cb3 ("net: phy: Create sysfs reciprocal links for attached_dev/phydev")
>>> a3995460491d ("net: phy: Relax error checking on sysfs_create_link()")
>>
>> I wonder if it would be better to add a flag to the phydev that
>> indicates it is the second PHY connected to a MAC? Add a bit to
>> phydrv->mdiodrv.flags. If that bit is set, don't create the sysfs
>> file.
> 
> We could indeed do that, I am fine with Grygorii's approach though in
> making the creation more silent and non fatal.

The link phydev->netdev can still be created. And failure to create links
is a non-fatal error in my opinion.

> 
>>
>> For 99% of MAC drivers, having two PHYs is an error, so we want to aid
>> debug by reporting the sysfs error.
> That is true, either way is fine with me, really.
> 

An error will still be reported, just not as a warning, and it will be
non-fatal. So, with this patch set it will now be possible to continue
booting (over NFS, for example), connect to the system and gather logs.


-- 
regards,
-grygorii

Re: [PATCH v2 03/36] mm: use do_futex() instead of sys_futex() in mm_release()

2018-03-16 Thread Darren Hart

On Thu, Mar 15, 2018 at 08:04:56PM +0100, Dominik Brodowski wrote:
> sys_futex() is a wrapper to do_futex() which does not modify any
> values here:
> 
> - uaddr, val and val3 are kept the same
> 
> - op is masked with FUTEX_CMD_MASK, but is always set to FUTEX_WAKE.
>   Therefore, val2 is always 0.
> 
> - as utime is set to NULL, *timeout is NULL
> 
> Cc: Thomas Gleixner 
> Cc: Ingo Molnar 
> Cc: Peter Zijlstra 
> Cc: Darren Hart 
> Cc: Andrew Morton 
> Signed-off-by: Dominik Brodowski 

Hi Dominik,

I'm missing the "why" part here. What is it you are trying to address?

do_futex is not currently in use outside of the futex implementation,
while sys_futex is. This decouples the interface from the
implementation. While this is perhaps less critical within the
kernel, I don't see a compelling reason to increase the coupling
between the mm and futex implementations.

Without a compelling WHY, Nack from me.

-- 
Darren Hart
VMware Open Source Technology Center

Re: [PATCH][RFC] kernel.h: provide array iterator

2018-03-16 Thread Joe Perches

On Fri, 2018-03-16 at 16:27 +0100, Rasmus Villemoes wrote:
> On 2018-03-15 11:00, Kieran Bingham wrote:
> > Simplify array iteration with a helper to iterate each entry in an array.
> > Utilise the existing ARRAY_SIZE macro to identify the length of the array
> > and pointer arithmetic to process each item as a for loop.

I recall getting negative feedback on a similar proposal
a decade ago:

https://lkml.org/lkml/2007/2/13/25

Not sure this is different.
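
For context, a helper of the kind described in the quoted text might look
like the sketch below. The macro name and exact shape are assumptions for
illustration; the macro in the actual RFC may differ.

```c
#include <stddef.h>

#define ARRAY_SIZE(arr) (sizeof(arr) / sizeof((arr)[0]))

/*
 * Hypothetical iterator: walk elem over every entry of a true array.
 * It relies on ARRAY_SIZE, so it must not be used on a plain pointer.
 */
#define array_for_each(elem, arr) \
	for ((elem) = &(arr)[0]; \
	     (elem) < &(arr)[0] + ARRAY_SIZE(arr); \
	     (elem)++)

/* Example use: sum a small constant table. */
static inline int array_sum_demo(void)
{
	static const int values[] = { 1, 2, 3, 4 };
	const int *v;
	int sum = 0;

	array_for_each(v, values)
		sum += *v;

	return sum;
}
```

The usual objection to such helpers is that they hide the loop bounds
while saving very little typing.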

Re: [PATCH 4.14 064/109] dmaengine: bcm2835-dma: Use vchan_terminate_vdesc() instead of desc_free

2018-03-16 Thread Dan Rue

On Fri, Mar 16, 2018 at 04:23:33PM +0100, Greg Kroah-Hartman wrote:
> 4.14-stable review patch.  If anyone has any objections, please let me know.
> 
> --
> 
> From: Peter Ujfalusi 
> 
> 
> [ Upstream commit de92436ac40ffe9933230aa503e24dbb5ede9201 ]
> 
> To avoid race with vchan_complete, use the race free way to terminate
> running transfer.
> 
> Implement the device_synchronize callback to make sure that the terminated
> descriptor is freed.
> 
> Signed-off-by: Peter Ujfalusi 
> Acked-by: Eric Anholt 
> Signed-off-by: Vinod Koul 
> Signed-off-by: Sasha Levin 
> Signed-off-by: Greg Kroah-Hartman 

This patch is causing a build error on arm and arm64, see e.g.
https://kernelci.org/build/id/5aac017e59b5141cb1b3a4d5/

Builds are also failing for arm/arm64 on 4.15 and this patch seems to be a
problem there as well, but I have not verified it yet.

#
# make -j10 -k -s ARCH=arm64 CROSS_COMPILE=aarch64-linux-gnu- O=build-arm64 defconfig
#
#
# make -j10 -k -s ARCH=arm64 CROSS_COMPILE=aarch64-linux-gnu- O=build-arm64
#
arch/arm64/Makefile:48: Detected assembler with broken .inst; disassembly will be unreliable
../drivers/dma/bcm2835-dma.c: In function 'bcm2835_dma_terminate_all':
../drivers/dma/bcm2835-dma.c:815:3: error: implicit declaration of function 'vchan_terminate_vdesc' [-Werror=implicit-function-declaration]
   vchan_terminate_vdesc(&c->desc->vd);
   ^
cc1: some warnings being treated as errors
../scripts/Makefile.build:334: recipe for target 'drivers/dma/bcm2835-dma.o' failed
make[3]: *** [drivers/dma/bcm2835-dma.o] Error 1
make[3]: Target '__build' not remade because of errors.
../scripts/Makefile.build:587: recipe for target 'drivers/dma' failed
make[2]: *** [drivers/dma] Error 2
make[2]: Target '__build' not remade because of errors.
/home/buildslave/workspace/kernel-single-defconfig-builder/defconfig/defconfig/label/builder/Makefile:1031: recipe for target 'drivers' failed
make[1]: *** [drivers] Error 2
make[1]: Target '_all' not remade because of errors.
Makefile:146: recipe for target 'sub-make' failed
make: *** [sub-make] Error 2
make: Target '_all' not remade because of errors.

> ---
>  drivers/dma/bcm2835-dma.c |   10 +-
>  1 file changed, 9 insertions(+), 1 deletion(-)
> 
> --- a/drivers/dma/bcm2835-dma.c
> +++ b/drivers/dma/bcm2835-dma.c
> @@ -812,7 +812,7 @@ static int bcm2835_dma_terminate_all(str
>* c->desc is NULL and exit.)
>*/
>   if (c->desc) {
> - bcm2835_dma_desc_free(&c->desc->vd);
> + vchan_terminate_vdesc(&c->desc->vd);
>   c->desc = NULL;
>   bcm2835_dma_abort(c->chan_base);
>  
> @@ -836,6 +836,13 @@ static int bcm2835_dma_terminate_all(str
>   return 0;
>  }
>  
> +static void bcm2835_dma_synchronize(struct dma_chan *chan)
> +{
> + struct bcm2835_chan *c = to_bcm2835_dma_chan(chan);
> +
> + vchan_synchronize(&c->vc);
> +}
> +
>  static int bcm2835_dma_chan_init(struct bcm2835_dmadev *d, int chan_id,
>int irq, unsigned int irq_flags)
>  {
> @@ -942,6 +949,7 @@ static int bcm2835_dma_probe(struct plat
>   od->ddev.device_prep_dma_memcpy = bcm2835_dma_prep_dma_memcpy;
>   od->ddev.device_config = bcm2835_dma_slave_config;
>   od->ddev.device_terminate_all = bcm2835_dma_terminate_all;
> + od->ddev.device_synchronize = bcm2835_dma_synchronize;
>   od->ddev.src_addr_widths = BIT(DMA_SLAVE_BUSWIDTH_4_BYTES);
>   od->ddev.dst_addr_widths = BIT(DMA_SLAVE_BUSWIDTH_4_BYTES);
>   od->ddev.directions = BIT(DMA_DEV_TO_MEM) | BIT(DMA_MEM_TO_DEV) |
> 
>

Re: [PATCH v9 09/61] xarray: Replace exceptional entries

2018-03-16 Thread Josef Bacik

On Tue, Mar 13, 2018 at 06:25:47AM -0700, Matthew Wilcox wrote:
> From: Matthew Wilcox 
> 
> Introduce xarray value entries to replace the radix tree exceptional
> entry code.  This is a slight change in encoding to allow the use of an
> extra bit (we can now store BITS_PER_LONG - 1 bits in a value entry).
> It is also a change in emphasis; exceptional entries are intimidating
> and different.  As the comment explains, you can choose to store values
> or pointers in the xarray and they are both first-class citizens.
> 
> Signed-off-by: Matthew Wilcox 
> ---
>  arch/powerpc/include/asm/book3s/64/pgtable.h|   4 +-
>  arch/powerpc/include/asm/nohash/64/pgtable.h|   4 +-
>  drivers/gpu/drm/i915/i915_gem.c |  17 ++--
>  drivers/staging/lustre/lustre/mdc/mdc_request.c |   2 +-
>  fs/btrfs/compression.c  |   2 +-
>  fs/dax.c| 107 
> 
>  fs/proc/task_mmu.c  |   2 +-
>  include/linux/radix-tree.h  |  36 ++--
>  include/linux/swapops.h |  19 ++---
>  include/linux/xarray.h  |  54 
>  lib/idr.c   |  61 ++
>  lib/radix-tree.c|  21 ++---
>  mm/filemap.c|  10 +--
>  mm/khugepaged.c |   2 +-
>  mm/madvise.c|   2 +-
>  mm/memcontrol.c |   2 +-
>  mm/mincore.c|   2 +-
>  mm/readahead.c  |   2 +-
>  mm/shmem.c  |  10 +--
>  mm/swap.c   |   2 +-
>  mm/truncate.c   |  12 +--
>  mm/workingset.c |  12 ++-
>  tools/testing/radix-tree/idr-test.c |   6 +-
>  tools/testing/radix-tree/linux/radix-tree.h |   1 +
>  tools/testing/radix-tree/multiorder.c   |  47 +--
>  tools/testing/radix-tree/test.c |   2 +-
>  26 files changed, 223 insertions(+), 218 deletions(-)
> 



>  
> @@ -453,18 +449,14 @@ int ida_get_new_above(struct ida *ida, int start, int 
> *id)
>   new += bit;
>   if (new < 0)
>   return -ENOSPC;
> - if (ebit < BITS_PER_LONG) {
> - bitmap = (void *)((1UL << ebit) |
> - RADIX_TREE_EXCEPTIONAL_ENTRY);
> - radix_tree_iter_replace(root, &iter, slot,
> - bitmap);
> - *id = new;
> - return 0;
> + if (bit < BITS_PER_XA_VALUE) {
> + bitmap = xa_mk_value(1UL << bit);
> + } else {
> + bitmap = this_cpu_xchg(ida_bitmap, NULL);
> + if (!bitmap)
> + return -EAGAIN;
> + __set_bit(bit, bitmap->bitmap);
>   }
> - bitmap = this_cpu_xchg(ida_bitmap, NULL);
> - if (!bitmap)
> - return -EAGAIN;
> - __set_bit(bit, bitmap->bitmap);
>   radix_tree_iter_replace(root, &iter, slot, bitmap);
>   }
>  

This threw me off a bit, but we do *id = new below.

> @@ -495,9 +487,9 @@ void ida_remove(struct ida *ida, int id)
>   goto err;
>  
>   bitmap = rcu_dereference_raw(*slot);
> - if (radix_tree_exception(bitmap)) {
> + if (xa_is_value(bitmap)) {
>   btmp = (unsigned long *)slot;
> - offset += RADIX_TREE_EXCEPTIONAL_SHIFT;
> + offset += 1; /* Intimate knowledge of the xa_data encoding */
>   if (offset >= BITS_PER_LONG)
>   goto err;
>   } else {

Ick.



> @@ -393,11 +393,11 @@ void ida_check_conv(void)
>   for (i = 0; i < 100; i++) {
>   int err = ida_get_new(&ida, &id);
>   if (err == -EAGAIN) {
> - assert((i % IDA_BITMAP_BITS) == (BITS_PER_LONG - 2));
> + assert((i % IDA_BITMAP_BITS) == (BITS_PER_LONG - 1));
>   assert(ida_pre_get(&ida, GFP_KERNEL));
>   err = ida_get_new(&ida, &id);
>   } else {
> - assert((i % IDA_BITMAP_BITS) != (BITS_PER_LONG - 2));
> + assert((i % IDA_BITMAP_BITS) != (BITS_PER_LONG - 1));

Can we just use BITS_PER_XA_VALUE here?

Overall looks fine to me, I'm not married to changing any of the nits.

Reviewed-by: Josef Bacik 

Thanks,

Josef
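
As a reading aid for the encoding discussed above: if the low bit is the
tag that marks a value entry, BITS_PER_LONG - 1 bits remain for the
payload. The following is a userspace model with a _model suffix on every
name; it is a sketch of that assumption, not the kernel header.

```c
#include <stdint.h>

/* Tag an integer as a value entry: shift left by one, set bit 0. */
static inline void *xa_mk_value_model(unsigned long v)
{
	return (void *)(uintptr_t)((v << 1) | 1);
}

/* Recover the integer by dropping the tag bit. */
static inline unsigned long xa_to_value_model(const void *entry)
{
	return (unsigned long)((uintptr_t)entry >> 1);
}

/* Aligned pointers have bit 0 clear, so bit 0 distinguishes values. */
static inline int xa_is_value_model(const void *entry)
{
	return (int)((uintptr_t)entry & 1);
}
```

This is also why the ida_remove() hunk adds 1 to the bit offset: bit 0 of
the slot is the tag, so payload bit n lives at bit n + 1.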

Re: [PATCH v5 1/2] hwmon: (ucd9000) Add gpio chip interface

2018-03-16 Thread Guenter Roeck

On Fri, Mar 16, 2018 at 01:39:48PM -0500, Eddie James wrote:
> 
> 
> On 03/16/2018 08:40 AM, Guenter Roeck wrote:
> >On 03/15/2018 03:21 PM, Eddie James wrote:
> >>From: Christopher Bostic 
> >>
> >>Add a struct gpio_chip and define some methods so that this device's
> >>I/O can be accessed via /sys/class/gpio.
> >>
> >
> >Sorry for not noticing earlier. The 0day reports should be addressed by
> >selecting GPIOLIB
> >in the Kconfig entry.
> 
> Getting kbuild recursive dependencies when I select GPIOLIB for ucd9000 :(
> 
Sure, it would have been too easy otherwise.

> May have to do "depends on" instead and #ifdef GPIOLIB in ucd9000, unless
> you have another recommendation?
> 

Good news is that the code is all in one code block in the driver,
so make it #ifdef and 

...
#else
static void ucd9000_probe_gpio(...) { }
#endif

With that, you won't need "depends on ..."
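
Spelled out as a standalone sketch, the stub pattern looks like this.
DEMO_HAVE_GPIOLIB stands in for the real Kconfig symbol, and the struct
and function names are hypothetical, not the ucd9000 driver's.

```c
/* Minimal device state for the demonstration. */
struct demo_dev {
	int gpio_registered;
};

#ifdef DEMO_HAVE_GPIOLIB
/* Real implementation, built only when the feature is enabled. */
static inline void demo_probe_gpio(struct demo_dev *dev)
{
	dev->gpio_registered = 1;
}
#else
/* Empty stub: callers need no #ifdef of their own. */
static inline void demo_probe_gpio(struct demo_dev *dev)
{
	(void)dev;
}
#endif

/* The probe path calls the helper unconditionally. */
static inline int demo_probe(struct demo_dev *dev)
{
	dev->gpio_registered = 0;
	demo_probe_gpio(dev);
	return 0;
}
```

Built without the macro, the stub compiles away and the rest of probe is
unchanged.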

Thanks,
Guenter

> Thanks
> Eddie
> 
> >
> >Guenter
> >
> >>Signed-off-by: Christopher Bostic 
> >>Signed-off-by: Andrew Jeffery 
> >>Signed-off-by: Eddie James 
> >>---
> >>  drivers/hwmon/pmbus/ucd9000.c | 201
> >>++
> >>  1 file changed, 201 insertions(+)
> >>
> >>diff --git a/drivers/hwmon/pmbus/ucd9000.c
> >>b/drivers/hwmon/pmbus/ucd9000.c
> >>index b74dbec..a34ffc4 100644
> >>--- a/drivers/hwmon/pmbus/ucd9000.c
> >>+++ b/drivers/hwmon/pmbus/ucd9000.c
> >>@@ -27,6 +27,7 @@
> >>  #include 
> >>  #include 
> >>  #include 
> >>+#include 
> >>  #include "pmbus.h"
> >>    enum chips { ucd9000, ucd90120, ucd90124, ucd90160, ucd9090,
> >>ucd90910 };
> >>@@ -35,8 +36,18 @@
> >>  #define UCD9000_NUM_PAGES    0xd6
> >>  #define UCD9000_FAN_CONFIG_INDEX    0xe7
> >>  #define UCD9000_FAN_CONFIG    0xe8
> >>+#define UCD9000_GPIO_SELECT    0xfa
> >>+#define UCD9000_GPIO_CONFIG    0xfb
> >>  #define UCD9000_DEVICE_ID    0xfd
> >>  +/* GPIO CONFIG bits */
> >>+#define UCD9000_GPIO_CONFIG_ENABLE    BIT(0)
> >>+#define UCD9000_GPIO_CONFIG_OUT_ENABLE    BIT(1)
> >>+#define UCD9000_GPIO_CONFIG_OUT_VALUE    BIT(2)
> >>+#define UCD9000_GPIO_CONFIG_STATUS    BIT(3)
> >>+#define UCD9000_GPIO_INPUT    0
> >>+#define UCD9000_GPIO_OUTPUT    1
> >>+
> >>  #define UCD9000_MON_TYPE(x)    (((x) >> 5) & 0x07)
> >>  #define UCD9000_MON_PAGE(x)    ((x) & 0x0f)
> >>  @@ -47,9 +58,15 @@
> >>    #define UCD9000_NUM_FAN    4
> >>  +#define UCD9000_GPIO_NAME_LEN    16
> >>+#define UCD9090_NUM_GPIOS    23
> >>+#define UCD901XX_NUM_GPIOS    26
> >>+#define UCD90910_NUM_GPIOS    26
> >>+
> >>  struct ucd9000_data {
> >>  u8 fan_data[UCD9000_NUM_FAN][I2C_SMBUS_BLOCK_MAX];
> >>  struct pmbus_driver_info info;
> >>+    struct gpio_chip gpio;
> >>  };
> >>  #define to_ucd9000_data(_info) container_of(_info, struct
> >>ucd9000_data, info)
> >>  @@ -149,6 +166,188 @@ static int ucd9000_read_byte_data(struct
> >>i2c_client *client, int page, int reg)
> >>  };
> >>  MODULE_DEVICE_TABLE(of, ucd9000_of_match);
> >>  +static int ucd9000_gpio_read_config(struct i2c_client *client,
> >>+    unsigned int offset)
> >>+{
> >>+    int ret;
> >>+
> >>+    /* No page set required */
> >>+    ret = i2c_smbus_write_byte_data(client, UCD9000_GPIO_SELECT,
> >>offset);
> >>+    if (ret < 0)
> >>+    return ret;
> >>+
> >>+    return i2c_smbus_read_byte_data(client, UCD9000_GPIO_CONFIG);
> >>+}
> >>+
> >>+static int ucd9000_gpio_get(struct gpio_chip *gc, unsigned int offset)
> >>+{
> >>+    struct i2c_client *client  = gpiochip_get_data(gc);
> >>+    int ret;
> >>+
> >>+    ret = ucd9000_gpio_read_config(client, offset);
> >>+    if (ret < 0)
> >>+    return ret;
> >>+
> >>+    return !!(ret & UCD9000_GPIO_CONFIG_STATUS);
> >>+}
> >>+
> >>+static void ucd9000_gpio_set(struct gpio_chip *gc, unsigned int offset,
> >>+ int value)
> >>+{
> >>+    struct i2c_client *client = gpiochip_get_data(gc);
> >>+    int ret;
> >>+
> >>+    ret = ucd9000_gpio_read_config(client, offset);
> >>+    if (ret < 0) {
> >>+    dev_dbg(&client->dev, "failed to read GPIO %d config: %d\n",
> >>+    offset, ret);
> >>+    return;
> >>+    }
> >>+
> >>+    if (value) {
> >>+    if (ret & UCD9000_GPIO_CONFIG_STATUS)
> >>+    return;
> >>+
> >>+    ret |= UCD9000_GPIO_CONFIG_STATUS;
> >>+    } else {
> >>+    if (!(ret & UCD9000_GPIO_CONFIG_STATUS))
> >>+    return;
> >>+
> >>+    ret &= ~UCD9000_GPIO_CONFIG_STATUS;
> >>+    }
> >>+
> >>+    ret |= UCD9000_GPIO_CONFIG_ENABLE;
> >>+
> >>+    /* Page set not required */
> >>+    ret = i2c_smbus_write_byte_data(client, UCD9000_GPIO_CONFIG, ret);
> >>+    if (ret < 0) {
> >>+    dev_dbg(&client->dev, "Failed to write GPIO %d config: %d\n",
> >>+    offset, ret);
> >>+    return;
> >>+    }
> >>+
> >>+    ret &= ~UCD9000_GPIO_CONFIG_ENABLE;
> >>+
> >>+    ret = i2c_smbus_write_byte_data(client, UCD9000_GPIO_CONFIG, ret);
> >>+    if (ret < 0)
> >>+    dev_dbg(&client->dev, "Failed to write GPIO %d config

Re: [PATCH net-next] net: ethernet: ti: cpsw: enable vlan rx vlan offload

2018-03-16 Thread Grygorii Strashko

On 03/16/2018 01:37 PM, David Miller wrote:
> From: Andrew Lunn 
> Date: Fri, 16 Mar 2018 01:29:35 +0100
>
>> On Thu, Mar 15, 2018 at 03:15:50PM -0500, Grygorii Strashko wrote:
>>> In VLAN_AWARE mode CPSW can insert VLAN header encapsulation word on Host
>>> port 0 egress (RX) before the packet data if RX_VLAN_ENCAP bit is set in
>>> CPSW_CONTROL register. VLAN header encapsulation word has following format:
>>>
>>>  HDR_PKT_Priority bits 29-31 - Header Packet VLAN prio (Highest prio: 7)
>>>  HDR_PKT_CFI   bits 28 - Header Packet VLAN CFI bit.
>>>  HDR_PKT_Vid   bits 27-16 - Header Packet VLAN ID
>>>  PKT_Type bits 8-9 - Packet Type. Indicates whether the packet is
>>>  VLAN-tagged, priority-tagged, or non-tagged.
>>>  00: VLAN-tagged packet
>>>  01: Reserved
>>>  10: Priority-tagged packet
>>>  11: Non-tagged packet
>>>
>>> This feature can be used to implement TX VLAN offload in case of
>>> VLAN-tagged packets and to insert VLAN tag in case Non-tagged packet was
>>> received on port with PVID set. As per documentation, CPSW never modifies
>>> packet data on Host egress (RX) and as result, without this feature
>>> enabled, Host port will not be able to receive properly packets which
>>> entered switch non-tagged through external Port with PVID set (when
>>> non-tagged packet forwarded from external Port with PVID set to another
>>> external Port - packet will be VLAN tagged properly).
>>
>> So, i think it is time to discuss the future of this driver. It should
>> really be replaced by a switchdev/DSA driver. There are plenty of
>> carrots for a new driver: Better statistics, working ethtool support
>> for all the PHYs, better user experience, etc. But maybe now it is
>> time for the stick. Should we Maintainers decide that no new features
>> should be added to the existing drivers, just bug fixes?
>
> Andrew, I totally share your concerns.
>
> However, I think the reality is that at best we can strongly urge
> people to do such a large amount of work such as writing a new
> switchdev/DSA driver for this cpsw hardware.
>
> We can't really compel them.
>
> And a stick could have the opposite of its intended effect.  If still
> nobody wants to do the switchdev/DSA driver, then this existing one
> rots and even worse we can end up with an out-of-tree version of this
> driver that has the changes (such as this one) that people want.

Yeh :( This one was created to satisfy a real customer use case.
So we'll have to carry it internally anyway, but having it on LKML will
allow a broader number of people to be involved in review, testing and
fixing. And the same code will have to be part of the DSA switch driver
as well; it will just be more stable at the time of migration to DSA.

> I'd like to see the switchdev/DSA driver for cpsw as much as you do,
> but I am not convinced that rejecting patches like this one will
> necessarily make that happen.

+1. Hope this work will be started as soon as possible.

--
regards,
-grygorii

Re: [PATCH v2 0/5] Allow compile-testing NO_DMA (core)

2018-03-16 Thread Christoph Hellwig

Thanks Geert,

applied to the dma-mapping tree for 4.17.

Re: [PATCH v6 02/15] fs, dax: prepare for dax-specific address_space_operations

2018-03-16 Thread Christoph Hellwig

Looks good,

Reviewed-by: Christoph Hellwig

Re: [PATCH v6 03/15] block, dax: remove dead code in blkdev_writepages()

2018-03-16 Thread Christoph Hellwig

Looks good,

Reviewed-by: Christoph Hellwig

Re: [PATCH v9 10/61] xarray: Change definition of sibling entries

2018-03-16 Thread Josef Bacik

On Tue, Mar 13, 2018 at 06:25:48AM -0700, Matthew Wilcox wrote:
> From: Matthew Wilcox 
> 
> Instead of storing a pointer to the slot containing the canonical entry,
> store the offset of the slot.  Produces slightly more efficient code
> (~300 bytes) and simplifies the implementation.
> 
> Signed-off-by: Matthew Wilcox 
> ---
>  include/linux/xarray.h | 93 
> ++
>  lib/radix-tree.c   | 66 +++
>  2 files changed, 112 insertions(+), 47 deletions(-)
> 

Reviewed-by: Josef Bacik 

Thanks,

Josef

Re: [PATCH v6 04/15] xfs, dax: introduce xfs_dax_aops

2018-03-16 Thread Christoph Hellwig

Looks good,

Reviewed-by: Christoph Hellwig

Re: [PATCH 13/22] signal: Move addr_lsb into the _sigfault union for clarity

2018-03-16 Thread Dave Hansen

On 01/15/2018 04:40 PM, Eric W. Biederman wrote:
> The addr_lsb field is only valid and available when the
> signal is SIGBUS and the si_code is BUS_MCEERR_AR or BUS_MCEERR_AO.
> Document this with a comment and place the field in the _sigfault union
> to make this clear.
> 
> All of the fields stay in the same physical location so both the old
> and new definitions of struct siginfo will continue to work.

This breaks the ABI and breaks protection keys.  The physical locations
*DO* change.

Before this patch:

#define si_pkey _sifields._sigfault._pkey
(gdb) print &((siginfo_t *)0)->_sifields._sigfault._pkey
$1 = (__u32 *) 0x20 

and after:

+#define si_pkey	_sifields._sigfault._addr_pkey._pkey
(gdb) print &((siginfo_t *)0)->_sifields._sigfault._addr_pkey._pkey
$1 = (__u32 *) 0x1c

Can we revert this, please?
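
The offset change can be reproduced outside the kernel with offsetof().
The two structs below are simplified models, not the real siginfo
definitions: the dummy padding members and all names are my assumptions,
and uint64_t stands in for the pointer fields so that the numbers hold on
an LP64 target such as x86-64.

```c
#include <stddef.h>
#include <stdint.h>

/* Model of the old layout: addr_lsb sits before the union. */
struct layout_before {
	int signo, err, code;
	union {
		struct {
			uint64_t addr;
			short addr_lsb;
			union {
				struct { uint64_t lower, upper; } addr_bnd;
				uint32_t pkey;
			} u;
		} sigfault;
	} fields;
};

/* Model of the new layout: addr_lsb moved inside the union. */
struct layout_after {
	int signo, err, code;
	union {
		struct {
			uint64_t addr;
			union {
				short addr_lsb;
				struct {
					short dummy_bnd;
					uint64_t lower, upper;
				} addr_bnd;
				struct {
					short dummy_pkey;
					uint32_t pkey;
				} addr_pkey;
			} u;
		} sigfault;
	} fields;
};

/* Byte offset of pkey in the old model. */
static inline size_t pkey_offset_before(void)
{
	return offsetof(struct layout_before, fields.sigfault.u.pkey);
}

/* Byte offset of pkey in the new model. */
static inline size_t pkey_offset_after(void)
{
	return offsetof(struct layout_after, fields.sigfault.u.addr_pkey.pkey);
}
```

In this model the union in the old layout starts at the next 8-byte
boundary after addr_lsb, while in the new layout pkey only needs 4-byte
alignment after a 2-byte dummy, so the field moves four bytes earlier.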

Re: [PATCH v6 11/15] mm, fs, dax: handle layout changes to pinned dax mappings

2018-03-16 Thread Christoph Hellwig

Looks good (at least up to my comprehension of the mm code :))

Reviewed-by: Christoph Hellwig

Re: [PATCH 2/2] kprobe: fix: Add ftrace_ops_assist_func to kprobe blacklist

2018-03-16 Thread Steven Rostedt

On Fri, 16 Mar 2018 13:53:01 -0400 (EDT)
Mathieu Desnoyers  wrote:

> Would the general approach you envision be based on emitting all code
> generated by compilation of all objects under kernel/tracing and
> kernel/events into a specific "nokprobes" text section of the kernel ?
> Perhaps we could create a specific linker scripts for those directories,
> or do you have in mind a neater way to do this ?

I was thinking of adding it to the objtool work. I need to consolidate
the recordmcount code and objtool for doing this, and that is on my
agenda (it has to do with some of the current "urgent" needs).

But this will take a bit of effort. In the meantime, I'm not against
just taking the whack-a-mole approach and adding the functions that the
original patch selected as nokprobes.

-- Steve

Re: [PATCH v2 03/36] mm: use do_futex() instead of sys_futex() in mm_release()

2018-03-16 Thread Andy Lutomirski

On Fri, Mar 16, 2018 at 6:43 PM, Darren Hart  wrote:
> On Thu, Mar 15, 2018 at 08:04:56PM +0100, Dominik Brodowski wrote:
>> sys_futex() is a wrapper to do_futex() which does not modify any
>> values here:
>>
>> - uaddr, val and val3 are kept the same
>>
>> - op is masked with FUTEX_CMD_MASK, but is always set to FUTEX_WAKE.
>>   Therefore, val2 is always 0.
>>
>> - as utime is set to NULL, *timeout is NULL
>>
>> Cc: Thomas Gleixner 
>> Cc: Ingo Molnar 
>> Cc: Peter Zijlstra 
>> Cc: Darren Hart 
>> Cc: Andrew Morton 
>> Signed-off-by: Dominik Brodowski 
>
> Hi Dominik,
>
> I'm missing the "why" part here. What is it you are trying to address?
>
> do_futex is not currently in use outside of the futex implementation,
> while sys_futex is. This decouples the interface from the
> implementation. While this is perhaps less critical within the
> kernel, I don't see a compelling reason to increase the coupling
> between the mm and futex implementations.
>
> Without a compelling WHY, Nack from me.
>

We want to make some changes to the way that the syscall entry code
invokes syscalls, and these changes will make it impossible to call
sys_xyz() functions from the kernel.  So we can make sys_futex() be a
trivial wrapper around a new ksys_futex(), or we can do a patch like
this.

Re: [PATCH v6 12/15] xfs: require mmap lock for xfs_break_layouts()

2018-03-16 Thread Christoph Hellwig

On Thu, Mar 15, 2018 at 08:52:29AM -0700, Dan Williams wrote:
> In preparation for adding coordination between truncate operations and
> busy dax-pages, extend xfs_break_layouts() to assume it must be called
> with the mmap lock held. This locking scheme will be required for
> coordinating the break of 'dax layouts' (non-idle dax (ZONE_DEVICE)
> pages mapped into the file's address space).

This requirement wasn't really there in the last series, why do we
require it now?

As far as I can tell all we'd need is to just drop this assert:

> - ASSERT(xfs_isilocked(ip, XFS_IOLOCK_SHARED|XFS_IOLOCK_EXCL));
> + ASSERT(xfs_isilocked(ip, XFS_IOLOCK_SHARED | XFS_IOLOCK_EXCL
> + | XFS_MMAPLOCK_EXCL));

entirely.

>   while ((error = break_layout(inode, false) == -EWOULDBLOCK)) {
>   xfs_iunlock(ip, *iolock);
>   error = break_layout(inode, true);
> - *iolock = XFS_IOLOCK_EXCL;
> + *iolock &= ~XFS_IOLOCK_SHARED;
> + *iolock |= XFS_IOLOCK_EXCL;
>   xfs_ilock(ip, *iolock);

And take this one hunk from your patch.

To enable the DAX use case.
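
A minimal sketch of why that hunk matters, assuming hypothetical flag values
(the SKETCH_* constants and both helper names are made up for illustration
and are not the XFS definitions). Plain assignment drops any other lock bits
the caller passed in, while the mask-and-set form only upgrades shared to
exclusive:

```c
/* Hypothetical flag values, chosen only for this illustration. */
#define SKETCH_IOLOCK_EXCL	(1u << 0)
#define SKETCH_IOLOCK_SHARED	(1u << 1)
#define SKETCH_MMAPLOCK_EXCL	(1u << 2)

/* Old form: overwrite the lock mode, losing every other bit. */
unsigned int relock_by_assignment(unsigned int iolock)
{
	(void)iolock;
	return SKETCH_IOLOCK_EXCL;
}

/* New form: clear only the shared bit, then set the exclusive bit. */
unsigned int relock_by_masking(unsigned int iolock)
{
	iolock &= ~SKETCH_IOLOCK_SHARED;
	iolock |= SKETCH_IOLOCK_EXCL;
	return iolock;
}
```

With a caller that holds both the shared iolock and the mmap lock, the first
helper would re-take only the iolock, while the second keeps the mmap lock
bit in the mode that is passed back to the lock call.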

Re: [PATCH v9 11/61] xarray: Add definition of struct xarray

2018-03-16 Thread Josef Bacik

On Tue, Mar 13, 2018 at 06:25:49AM -0700, Matthew Wilcox wrote:
> From: Matthew Wilcox 
> 
> This is a direct replacement for struct radix_tree_root.  Some of the
> struct members have changed name; convert those, and use a #define so
> that radix_tree users continue to work without change.
> 
> Signed-off-by: Matthew Wilcox 

Reviewed-by: Josef Bacik 

Thanks,

Josef

Re: [PATCH v9 09/61] xarray: Replace exceptional entries

2018-03-16 Thread Matthew Wilcox

On Fri, Mar 16, 2018 at 02:53:50PM -0400, Josef Bacik wrote:
> On Tue, Mar 13, 2018 at 06:25:47AM -0700, Matthew Wilcox wrote:
> > @@ -453,18 +449,14 @@ int ida_get_new_above(struct ida *ida, int start, int *id)
> > new += bit;
> > if (new < 0)
> > return -ENOSPC;
> > -   if (ebit < BITS_PER_LONG) {
> > -   bitmap = (void *)((1UL << ebit) |
> > -   RADIX_TREE_EXCEPTIONAL_ENTRY);
> > -   radix_tree_iter_replace(root, &iter, slot,
> > -   bitmap);
> > -   *id = new;
> > -   return 0;
> > +   if (bit < BITS_PER_XA_VALUE) {
> > +   bitmap = xa_mk_value(1UL << bit);
> > +   } else {
> > +   bitmap = this_cpu_xchg(ida_bitmap, NULL);
> > +   if (!bitmap)
> > +   return -EAGAIN;
> > +   __set_bit(bit, bitmap->bitmap);
> > }
> > -   bitmap = this_cpu_xchg(ida_bitmap, NULL);
> > -   if (!bitmap)
> > -   return -EAGAIN;
> > -   __set_bit(bit, bitmap->bitmap);
> > radix_tree_iter_replace(root, &iter, slot, bitmap);
> > }
> >  
> 
> This threw me off a bit, but we do *id = new below.

Yep.  Fortunately, I have a test-suite for the IDA, so I'm relatively
sure this works.

> > @@ -495,9 +487,9 @@ void ida_remove(struct ida *ida, int id)
> > goto err;
> >  
> > bitmap = rcu_dereference_raw(*slot);
> > -   if (radix_tree_exception(bitmap)) {
> > +   if (xa_is_value(bitmap)) {
> > btmp = (unsigned long *)slot;
> > -   offset += RADIX_TREE_EXCEPTIONAL_SHIFT;
> > +   offset += 1; /* Intimate knowledge of the xa_data encoding */
> > if (offset >= BITS_PER_LONG)
> > goto err;
> > } else {
> 
> Ick.

I know.  I feel quite ashamed of this code.  I do have a rewrite to use
the XArray, but I didn't want to include it as part of *this* merge request.
And that rewrite decodes the value into an unsigned long, sets the bit,
reencodes it as an xa_value and stores it.
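
Roughly, the decode/set/re-encode sequence looks like the following
user-space sketch. The sketch_* helpers are hypothetical stand-ins that
assume a value entry is the integer shifted left by one with the low bit set,
which is what the "offset += 1" in the hunk above relies on:

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

/* Stand-in for encoding an integer as a tagged value entry. */
void *sketch_mk_value(unsigned long v)
{
	return (void *)(uintptr_t)((v << 1) | 1);
}

/* Stand-in for decoding a tagged value entry back to an integer. */
unsigned long sketch_to_value(const void *entry)
{
	return (unsigned long)((uintptr_t)entry >> 1);
}

/* A value entry is recognised by its low tag bit. */
int sketch_is_value(const void *entry)
{
	return (int)((uintptr_t)entry & 1);
}

/* Decode the entry, set one bit in the payload, and re-encode it. */
void *sketch_set_bit_in_value(void *entry, unsigned int bit)
{
	unsigned long v;

	/* One bit of the word is taken by the tag, so it is not usable. */
	assert(bit < sizeof(unsigned long) * CHAR_BIT - 1);

	v = sketch_to_value(entry);
	v |= 1UL << bit;
	return sketch_mk_value(v);
}
```

The point of going through the helpers is that the caller never needs to
know where the payload sits inside the word.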

> > @@ -393,11 +393,11 @@ void ida_check_conv(void)
> > for (i = 0; i < 100; i++) {
> > int err = ida_get_new(&ida, &id);
> > if (err == -EAGAIN) {
> > -   assert((i % IDA_BITMAP_BITS) == (BITS_PER_LONG - 2));
> > +   assert((i % IDA_BITMAP_BITS) == (BITS_PER_LONG - 1));
> > assert(ida_pre_get(&ida, GFP_KERNEL));
> > err = ida_get_new(&ida, &id);
> > } else {
> > -   assert((i % IDA_BITMAP_BITS) != (BITS_PER_LONG - 2));
> > +   assert((i % IDA_BITMAP_BITS) != (BITS_PER_LONG - 1));
> 
> Can we just use BITS_PER_XA_VALUE here?

Yes!  I'll change that.

> Reviewed-by: Josef Bacik 

Thanks!

Re: [PATCH v6 13/15] xfs: communicate lock drop events from xfs_break_layouts()

2018-03-16 Thread Christoph Hellwig

On Thu, Mar 15, 2018 at 08:52:39AM -0700, Dan Williams wrote:
> In preparation for adding a new layout type, teach xfs_break_layouts()
> to return a positive number if it needed to drop locks while trying to
> break leases. For all layouts to be successfully broken each layout type
> needs to be able to assert that the layouts were broken with the locks
> held.
> 
> The existing xfs_break_layouts() is pushed down a level to
> xfs_break_leased_layouts() and the new xfs_break_layouts() will
> coordinate interpreting the return code from the low level 'break'
> helpers.

With that the subject line is rather confusing, given that the
externally visible xfs_break_layouts does not communicate the lock
drop events.  So maybe this should just be titled something about
refactoring.  Or just merged into the next patch which reshuffles
everything again anyway.

>  int
> -xfs_break_layouts(
> +xfs_break_leased_layouts(
>   struct inode*inode,
>   uint*iolock)
>  {
>   struct xfs_inode*ip = XFS_I(inode);
>   int error;
> -
> - ASSERT(xfs_isilocked(ip, XFS_IOLOCK_SHARED | XFS_IOLOCK_EXCL
> - | XFS_MMAPLOCK_EXCL));
> + int did_unlock = 0;
>  
>   while ((error = break_layout(inode, false) == -EWOULDBLOCK)) {
>   xfs_iunlock(ip, *iolock);
> + did_unlock = 1;
>   error = break_layout(inode, true);
>   *iolock &= ~XFS_IOLOCK_SHARED;
>   *iolock |= XFS_IOLOCK_EXCL;
>   xfs_ilock(ip, *iolock);
>   }
>  
> - return error;
> + if (error < 0)
> + return error;
> + return did_unlock;

And I suspect the cleaner interface would be to just pass a 
bool *did_unlock argument.
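
Something along these lines, as a user-space sketch only. The sketch_inode
type and the sketch_* functions are hypothetical stand-ins for the inode,
break_layout() and the lock calls, so that the shape of the interface can be
seen in isolation:

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for an inode with an outstanding lease. */
struct sketch_inode {
	int pending;	/* non-zero while a lease still has to be broken */
	int locked;	/* models whether the caller's lock is held */
};

/* Stand-in for break_layout(): refuses to wait unless told it may. */
static int sketch_break_layout(struct sketch_inode *inode, bool wait)
{
	if (inode->pending) {
		if (!wait)
			return -EWOULDBLOCK;
		inode->pending = 0;
	}
	return 0;
}

/*
 * The return value carries only 0 or a negative errno; the lock drop
 * is reported separately through *did_unlock.
 */
int sketch_break_leased_layouts(struct sketch_inode *inode, bool *did_unlock)
{
	int error;

	*did_unlock = false;
	/* The assignment is parenthesized so error holds the return code. */
	while ((error = sketch_break_layout(inode, false)) == -EWOULDBLOCK) {
		inode->locked = 0;	/* stands in for the unlock call */
		*did_unlock = true;
		error = sketch_break_layout(inode, true);
		inode->locked = 1;	/* stands in for the relock call */
	}
	return error;
}

/* Usage example: report whether the lock had to be dropped. */
bool sketch_lock_was_dropped(int pending)
{
	struct sketch_inode inode = { .pending = pending, .locked = 1 };
	bool did_unlock;
	int error = sketch_break_leased_layouts(&inode, &did_unlock);

	return error == 0 && did_unlock;
}
```

That way callers do not have to tell a positive "did unlock" return apart
from an error code.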

Re: [PATCH v6 14/15] xfs: prepare xfs_break_layouts() for another layout type

2018-03-16 Thread Christoph Hellwig

Looks fine:

Reviewed-by: Christoph Hellwig

Re: [PATCH v6 12/15] xfs: require mmap lock for xfs_break_layouts()

2018-03-16 Thread Dan Williams

On Fri, Mar 16, 2018 at 12:04 PM, Christoph Hellwig  wrote:
> On Thu, Mar 15, 2018 at 08:52:29AM -0700, Dan Williams wrote:
>> In preparation for adding coordination between truncate operations and
>> busy dax-pages, extend xfs_break_layouts() to assume it must be called
>> with the mmap lock held. This locking scheme will be required for
>> coordinating the break of 'dax layouts' (non-idle dax (ZONE_DEVICE)
>> pages mapped into the file's address space).
>
> This requirement wasn't really there in the last series, why do we
> require it now?

It seems I misinterpreted your feedback.

>
> As far as I can tell all we'd need is to just drop this assert:
>
>> - ASSERT(xfs_isilocked(ip, XFS_IOLOCK_SHARED|XFS_IOLOCK_EXCL));
>> + ASSERT(xfs_isilocked(ip, XFS_IOLOCK_SHARED | XFS_IOLOCK_EXCL
>> + | XFS_MMAPLOCK_EXCL));
>
> entirely.
>
>>   while ((error = break_layout(inode, false) == -EWOULDBLOCK)) {
>>   xfs_iunlock(ip, *iolock);
>>   error = break_layout(inode, true);
>> - *iolock = XFS_IOLOCK_EXCL;
>> + *iolock &= ~XFS_IOLOCK_SHARED;
>> + *iolock |= XFS_IOLOCK_EXCL;
>>   xfs_ilock(ip, *iolock);
>
> And take this one hunk from your patch.
>
> To enable the DAX use case.

Yeah, that looks good to me.

Re: [PATCH v6 15/15] xfs, dax: introduce xfs_break_dax_layouts()

2018-03-16 Thread Christoph Hellwig

Looks fine,

Reviewed-by: Christoph Hellwig

Re: [PATCH 0/2] net: phy: relax error checking when creating sysfs link netdev->phydev

2018-03-16 Thread Florian Fainelli

On March 16, 2018 11:42:21 AM PDT, Grygorii Strashko wrote:
>
>
>On 03/16/2018 12:34 PM, Florian Fainelli wrote:
>> 
>> 
>> On 03/16/2018 10:22 AM, Andrew Lunn wrote:
>>> On Wed, Mar 14, 2018 at 05:26:22PM -0500, Grygorii Strashko wrote:
>>>> Some ethernet drivers (like TI CPSW) may connect and manage >1 Net PHYs per
>>>> one netdevice, as result such drivers will produce warning during system
>>>> boot and fail to connect second phy to netdevice when PHYLIB framework
>>>> will try to create sysfs link netdev->phydev for second PHY
>>>> in phy_attach_direct(), because sysfs link with the same name has been
>>>> created already for the first PHY.
>>>> As result, second CPSW external port will became unusable.
>>>> This issue was introduced by commits:
>>>> 5568363f0cb3 ("net: phy: Create sysfs reciprocal links for attached_dev/phydev"
>>>> a3995460491d ("net: phy: Relax error checking on sysfs_create_link()"
>>>
>>> I wonder if it would be better to add a flag to the phydev that
>>> indicates it is the second PHY connected to a MAC? Add a bit to
>>> phydrv->mdiodrv.flags. If that bit is set, don't create the sysfs
>>> file.
>> 
>> We could indeed do that, I am fine with Grygorii's approach though in
>> making the creation more silent and non fatal.
>
>The link phydev->netdev still can be created. And failure to create links
>is non fatal error in my opinion.

They should not be fatal I agree, but it's nice to know when you are doing 
something wrong anyway.

>
>> 
>>>
>>> For 99% of MAC drivers, having two PHYs is an error, so we want to aid
>>> debug by reporting the sysfs error.
>> That is true, either way is fine with me, really.
>> 
>
>Error still will be reported, just not warning and it will be non-fatal.
>So, with this patch set it will be possible now to continue boot (NFS for
>example), connect to the system and gather logs.

The point Andrew is trying to make is that you address one particular failure 
in the PHY creation path when using > 1 PHY devices with a network device. 
Using a flag would easily allow us to be more future proof with other parts of 
PHYLIB  for your particular use case if that becomes necessary. This gives you 
less incentive to fix this use case though.

-- 
Florian

Re: [PATCH v9 12/61] xarray: Define struct xa_node

2018-03-16 Thread Josef Bacik

On Tue, Mar 13, 2018 at 06:25:50AM -0700, Matthew Wilcox wrote:
> From: Matthew Wilcox 
> 
> This is a direct replacement for struct radix_tree_node.  A couple of
> struct members have changed name, so convert those.  Use a #define so
> that radix tree users continue to work without change.
> 
> Signed-off-by: Matthew Wilcox 

Reviewed-by: Josef Bacik 

Thanks,

Josef

Re: [PATCH v9 13/61] xarray: Add documentation

2018-03-16 Thread Josef Bacik

On Tue, Mar 13, 2018 at 06:25:51AM -0700, Matthew Wilcox wrote:
> From: Matthew Wilcox 
> 
> This is documentation on how to use the XArray, not details about its
> internal implementation.
> 
> Signed-off-by: Matthew Wilcox 

I'm just going to assume you know what you are talking about here

Acked-by: Josef Bacik 

Thanks,

Josef

[PATCH 0/4] hmm: fixes and documentations v2

2018-03-16 Thread jglisse

From: Jérôme Glisse 

Removed a pointless VM_BUG_ON(), Cc'ed stable where appropriate, and split
the last patch into _many_ smaller patches to make it easier to review.
The end result is the same, modulo comments I received so far and the extra
documentation I added while splitting things up. Below is the previous cover
letter (everything in it is still true):

--

All patches only impact HMM user, there is no implication outside HMM.

The first patch improves the documentation to better reflect what HMM is. The
second patch fixes #if/#else placement in hmm.h. The third patch adds a call on
mm release, which helps device drivers that use HMM to clean up early when
a process quits. Finally, the last patch modifies the CPU snapshot and page
fault helpers to simplify device drivers. The nouveau patchset I posted last
week already depends on all of those patches.

You can find them in a hmm-for-4.17 branch:

git://people.freedesktop.org/~glisse/linux
https://cgit.freedesktop.org/~glisse/linux/log/?h=hmm-for-4.17

Cc: Ralph Campbell 
Cc: Evgeny Baskakov 
Cc: Mark Hairgrove 
Cc: John Hubbard 

Jérôme Glisse (12):
  mm/hmm: fix header file if/else/endif maze
  mm/hmm: hmm_pfns_bad() was accessing wrong struct
  mm/hmm: use struct for hmm_vma_fault(), hmm_vma_get_pfns() parameters
  mm/hmm: remove HMM_PFN_READ flag and ignore peculiar architecture
  mm/hmm: use uint64_t for HMM pfn instead of defining hmm_pfn_t to
ulong
  mm/hmm: cleanup special vma handling (VM_SPECIAL)
  mm/hmm: do not differentiate between empty entry or missing directory
  mm/hmm: rename HMM_PFN_DEVICE_UNADDRESSABLE to HMM_PFN_DEVICE_PRIVATE
  mm/hmm: move hmm_pfns_clear() closer to where it is use
  mm/hmm: factor out pte and pmd handling to simplify hmm_vma_walk_pmd()
  mm/hmm: change hmm_vma_fault() to allow write fault on page basis
  mm/hmm: use device driver encoding for HMM pfn

Ralph Campbell (2):
  mm/hmm: documentation editorial update to HMM documentation
  mm/hmm: HMM should have a callback before MM is destroyed v2

 Documentation/vm/hmm.txt | 360 +-
 MAINTAINERS  |   1 +
 include/linux/hmm.h  | 156 ---
 mm/hmm.c | 495 +--
 4 files changed, 582 insertions(+), 430 deletions(-)

-- 
2.14.3

[PATCH 02/14] mm/hmm: fix header file if/else/endif maze

2018-03-16 Thread jglisse

From: Jérôme Glisse 

The #if/#else/#endif for IS_ENABLED(CONFIG_HMM) were wrong.

Signed-off-by: Jérôme Glisse 
Acked-by: Balbir Singh 
Cc: sta...@vger.kernel.org
Cc: Andrew Morton 
Cc: Ralph Campbell 
Cc: John Hubbard 
Cc: Evgeny Baskakov 
---
 include/linux/hmm.h | 7 +++
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index 325017ad9311..ef6044d08cc5 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -498,6 +498,9 @@ struct hmm_device {
 struct hmm_device *hmm_device_new(void *drvdata);
 void hmm_device_put(struct hmm_device *hmm_device);
 #endif /* CONFIG_DEVICE_PRIVATE || CONFIG_DEVICE_PUBLIC */
+#else /* IS_ENABLED(CONFIG_HMM) */
+static inline void hmm_mm_destroy(struct mm_struct *mm) {}
+static inline void hmm_mm_init(struct mm_struct *mm) {}
 #endif /* IS_ENABLED(CONFIG_HMM) */
 
 /* Below are for HMM internal use only! Not to be used by device driver! */
@@ -513,8 +516,4 @@ static inline void hmm_mm_destroy(struct mm_struct *mm) {}
 static inline void hmm_mm_init(struct mm_struct *mm) {}
 #endif /* IS_ENABLED(CONFIG_HMM_MIRROR) */
 
-
-#else /* IS_ENABLED(CONFIG_HMM) */
-static inline void hmm_mm_destroy(struct mm_struct *mm) {}
-static inline void hmm_mm_init(struct mm_struct *mm) {}
 #endif /* LINUX_HMM_H */
-- 
2.14.3

[PATCH 08/14] mm/hmm: cleanup special vma handling (VM_SPECIAL)

2018-03-16 Thread jglisse

From: Jérôme Glisse 

Special vmas (those with any of the VM_SPECIAL flags) cannot be accessed
by a device because there is no consistent model across device drivers for
those vmas and their backing memory.

This patch passes the hmm_range struct directly to hmm_pfns_special(),
as it always affects the whole vma and thus the whole range.

It also makes behavior consistent: after this patch both hmm_vma_fault()
and hmm_vma_get_pfns() return -EINVAL when facing such a vma. Previously
hmm_vma_fault() returned 0 and hmm_vma_get_pfns() returned -EINVAL, but
both filled the HMM pfn array with special entries.

Signed-off-by: Jérôme Glisse 
Cc: Evgeny Baskakov 
Cc: Ralph Campbell 
Cc: Mark Hairgrove 
Cc: John Hubbard 
---
 mm/hmm.c | 40 
 1 file changed, 20 insertions(+), 20 deletions(-)

diff --git a/mm/hmm.c b/mm/hmm.c
index f674b73e7f4a..04595a994542 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -281,14 +281,6 @@ static int hmm_vma_do_fault(struct mm_walk *walk,
return -EAGAIN;
 }
 
-static void hmm_pfns_special(uint64_t *pfns,
-unsigned long addr,
-unsigned long end)
-{
-   for (; addr < end; addr += PAGE_SIZE, pfns++)
-   *pfns = HMM_PFN_SPECIAL;
-}
-
 static int hmm_pfns_bad(unsigned long addr,
unsigned long end,
struct mm_walk *walk)
@@ -486,6 +478,14 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
return 0;
 }
 
+static void hmm_pfns_special(struct hmm_range *range)
+{
+   unsigned long addr = range->start, i = 0;
+
+   for (; addr < range->end; addr += PAGE_SIZE, i++)
+   range->pfns[i] = HMM_PFN_SPECIAL;
+}
+
 /*
  * hmm_vma_get_pfns() - snapshot CPU page table for a range of virtual addresses
  * @range: range being snapshoted and all needed informations
@@ -509,12 +509,6 @@ int hmm_vma_get_pfns(struct hmm_range *range)
struct mm_walk mm_walk;
struct hmm *hmm;
 
-   /* FIXME support hugetlb fs */
-   if (is_vm_hugetlb_page(vma) || (vma->vm_flags & VM_SPECIAL)) {
-   hmm_pfns_special(range->pfns, range->start, range->end);
-   return -EINVAL;
-   }
-
/* Sanity check, this really should not happen ! */
if (range->start < vma->vm_start || range->start >= vma->vm_end)
return -EINVAL;
@@ -528,6 +522,12 @@ int hmm_vma_get_pfns(struct hmm_range *range)
if (!hmm->mmu_notifier.ops)
return -EINVAL;
 
+   /* FIXME support hugetlb fs */
+   if (is_vm_hugetlb_page(vma) || (vma->vm_flags & VM_SPECIAL)) {
+   hmm_pfns_special(range);
+   return -EINVAL;
+   }
+
/* Initialize range to track CPU page table update */
spin_lock(&hmm->lock);
range->valid = true;
@@ -693,6 +693,12 @@ int hmm_vma_fault(struct hmm_range *range, bool write, bool block)
if (!hmm->mmu_notifier.ops)
return -EINVAL;
 
+   /* FIXME support hugetlb fs */
+   if (is_vm_hugetlb_page(vma) || (vma->vm_flags & VM_SPECIAL)) {
+   hmm_pfns_special(range);
+   return -EINVAL;
+   }
+
/* Initialize range to track CPU page table update */
spin_lock(&hmm->lock);
range->valid = true;
@@ -710,12 +716,6 @@ int hmm_vma_fault(struct hmm_range *range, bool write, bool block)
return 0;
}
 
-   /* FIXME support hugetlb fs */
-   if (is_vm_hugetlb_page(vma) || (vma->vm_flags & VM_SPECIAL)) {
-   hmm_pfns_special(range->pfns, range->start, range->end);
-   return 0;
-   }
-
hmm_vma_walk.fault = true;
hmm_vma_walk.write = write;
hmm_vma_walk.block = block;
-- 
2.14.3

[PATCH 01/14] mm/hmm: documentation editorial update to HMM documentation

2018-03-16 Thread jglisse

From: Ralph Campbell 

This patch updates the documentation for HMM to fix minor typos and
phrasing to be a bit more readable.

Signed-off-by: Ralph Campbell 
Signed-off-by: Jérôme Glisse 
Cc: Stephen  Bates 
Cc: Jason Gunthorpe 
Cc: Logan Gunthorpe 
Cc: Evgeny Baskakov 
Cc: Mark Hairgrove 
Cc: John Hubbard 
---
 Documentation/vm/hmm.txt | 360 ---
 MAINTAINERS  |   1 +
 2 files changed, 187 insertions(+), 174 deletions(-)

diff --git a/Documentation/vm/hmm.txt b/Documentation/vm/hmm.txt
index 4d3aac9f4a5d..c7418c56a0ac 100644
--- a/Documentation/vm/hmm.txt
+++ b/Documentation/vm/hmm.txt
@@ -1,151 +1,159 @@
 Heterogeneous Memory Management (HMM)
 
-Transparently allow any component of a program to use any memory region of said
-program with a device without using device specific memory allocator. This is
-becoming a requirement to simplify the use of advance heterogeneous computing
-where GPU, DSP or FPGA are use to perform various computations.
-
-This document is divided as follow, in the first section i expose the problems
-related to the use of a device specific allocator. The second section i expose
-the hardware limitations that are inherent to many platforms. The third section
-gives an overview of HMM designs. The fourth section explains how CPU page-
-table mirroring works and what is HMM purpose in this context. Fifth section
-deals with how device memory is represented inside the kernel. Finaly the last
-section present the new migration helper that allow to leverage the device DMA
-engine.
-
-
-1) Problems of using device specific memory allocator:
-2) System bus, device memory characteristics
-3) Share address space and migration
+Provide infrastructure and helpers to integrate non conventional memory (device
+memory like GPU on board memory) into regular kernel code path. Corner stone of
+this being specialize struct page for such memory (see sections 5 to 7 of this
+document).
+
+HMM also provide optional helpers for SVM (Share Virtual Memory) ie allowing a
+device to transparently access program address coherently with the CPU meaning
+that any valid pointer on the CPU is also a valid pointer for the device. This
+is becoming a mandatory to simplify the use of advance heterogeneous computing
+where GPU, DSP, or FPGA are used to perform various computations on behalf of
+a process.
+
+This document is divided as follows: in the first section I expose the problems
+related to using device specific memory allocators. In the second section, I
+expose the hardware limitations that are inherent to many platforms. The third
+section gives an overview of the HMM design. The fourth section explains how
+CPU page-table mirroring works and what is HMM's purpose in this context. The
+fifth section deals with how device memory is represented inside the kernel.
+Finally, the last section presents a new migration helper that allows lever-
+aging the device DMA engine.
+
+
+1) Problems of using a device specific memory allocator:
+2) I/O bus, device memory characteristics
+3) Shared address space and migration
 4) Address space mirroring implementation and API
 5) Represent and manage device memory from core kernel point of view
-6) Migrate to and from device memory
+6) Migration to and from device memory
 7) Memory cgroup (memcg) and rss accounting
 
 
 ---
 
-1) Problems of using device specific memory allocator:
+1) Problems of using a device specific memory allocator:
 
-Device with large amount of on board memory (several giga bytes) like GPU have
-historically manage their memory through dedicated driver specific API. This
-creates a disconnect between memory allocated and managed by device driver and
-regular application memory (private anonymous, share memory or regular file
-back memory). From here on i will refer to this aspect as split address space.
-I use share address space to refer to the opposite situation ie one in which
-any memory region can be use by device transparently.
+Devices with a large amount of on board memory (several giga bytes) like GPUs
+have historically managed their memory through dedicated driver specific APIs.
+This creates a disconnect between memory allocated and managed by a device
+driver and regular application memory (private anonymous, shared memory, or
+regular file backed memory). From here on I will refer to this aspect as split
+address space. I use shared address space to refer to the opposite situation:
+i.e., one in which any application memory region can be used by a device
+transparently.
 
 Split address space because device can only access memory allocated through the
-device specific API. This imply that all memory object in a program are not
-equal from device point of view which complicate large program that rely on a
-wide set of libraries.
+device specific API. This implies that all memory objects in a program are not

[PATCH 09/14] mm/hmm: do not differentiate between empty entry or missing directory

2018-03-16 Thread jglisse

From: Jérôme Glisse 

There is no point in differentiating between a range for which there
is not even a directory (and thus no entries) and an empty entry (where
pte_none() or pmd_none() returns true).

Simply drop the distinction, i.e. remove the HMM_PFN_EMPTY flag and merge the
now duplicate hmm_vma_walk_hole() and hmm_vma_walk_clear() functions.

Signed-off-by: Jérôme Glisse 
Cc: Evgeny Baskakov 
Cc: Ralph Campbell 
Cc: Mark Hairgrove 
Cc: John Hubbard 
---
 include/linux/hmm.h |  8 +++-
 mm/hmm.c| 45 +++--
 2 files changed, 18 insertions(+), 35 deletions(-)

diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index 78b3ed6d7977..6d2b6bf6da4b 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -84,7 +84,6 @@ struct hmm;
  * HMM_PFN_VALID: pfn is valid
  * HMM_PFN_WRITE: CPU page table has write permission set
  * HMM_PFN_ERROR: corresponding CPU page table entry points to poisoned memory
- * HMM_PFN_EMPTY: corresponding CPU page table entry is pte_none()
  * HMM_PFN_SPECIAL: corresponding CPU page table entry is special; i.e., the
  *  result of vm_insert_pfn() or vm_insert_page(). Therefore, it should not
  *  be mirrored by a device, because the entry will never have HMM_PFN_VALID
@@ -94,10 +93,9 @@ struct hmm;
 #define HMM_PFN_VALID (1 << 0)
 #define HMM_PFN_WRITE (1 << 1)
 #define HMM_PFN_ERROR (1 << 2)
-#define HMM_PFN_EMPTY (1 << 3)
-#define HMM_PFN_SPECIAL (1 << 4)
-#define HMM_PFN_DEVICE_UNADDRESSABLE (1 << 5)
-#define HMM_PFN_SHIFT 6
+#define HMM_PFN_SPECIAL (1 << 3)
+#define HMM_PFN_DEVICE_UNADDRESSABLE (1 << 4)
+#define HMM_PFN_SHIFT 5
 
 /*
  * hmm_pfn_to_page() - return struct page pointed to by a valid HMM pfn
diff --git a/mm/hmm.c b/mm/hmm.c
index 04595a994542..2118e42cb838 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -305,6 +305,16 @@ static void hmm_pfns_clear(uint64_t *pfns,
*pfns = 0;
 }
 
+/*
+ * hmm_vma_walk_hole() - handle a range back by no pmd or no pte
+ * @start: range virtual start address (inclusive)
+ * @end: range virtual end address (exclusive)
+ * @walk: mm_walk structure
+ * Returns: 0 on success, -EAGAIN after page fault, or page fault error
+ *
+ * This is an helper call whenever pmd_none() or pte_none() returns true
+ * or when there is no directory covering the range.
+ */
 static int hmm_vma_walk_hole(unsigned long addr,
 unsigned long end,
 struct mm_walk *walk)
@@ -314,31 +324,6 @@ static int hmm_vma_walk_hole(unsigned long addr,
uint64_t *pfns = range->pfns;
unsigned long i;
 
-   hmm_vma_walk->last = addr;
-   i = (addr - range->start) >> PAGE_SHIFT;
-   for (; addr < end; addr += PAGE_SIZE, i++) {
-   pfns[i] = HMM_PFN_EMPTY;
-   if (hmm_vma_walk->fault) {
-   int ret;
-
-   ret = hmm_vma_do_fault(walk, addr, &pfns[i]);
-   if (ret != -EAGAIN)
-   return ret;
-   }
-   }
-
-   return hmm_vma_walk->fault ? -EAGAIN : 0;
-}
-
-static int hmm_vma_walk_clear(unsigned long addr,
- unsigned long end,
- struct mm_walk *walk)
-{
-   struct hmm_vma_walk *hmm_vma_walk = walk->private;
-   struct hmm_range *range = hmm_vma_walk->range;
-   uint64_t *pfns = range->pfns;
-   unsigned long i;
-
hmm_vma_walk->last = addr;
i = (addr - range->start) >> PAGE_SHIFT;
for (; addr < end; addr += PAGE_SIZE, i++) {
@@ -397,10 +382,10 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
if (!pmd_devmap(pmd) && !pmd_trans_huge(pmd))
goto again;
if (pmd_protnone(pmd))
-   return hmm_vma_walk_clear(start, end, walk);
+   return hmm_vma_walk_hole(start, end, walk);
 
if (write_fault && !pmd_write(pmd))
-   return hmm_vma_walk_clear(start, end, walk);
+   return hmm_vma_walk_hole(start, end, walk);
 
pfn = pmd_pfn(pmd) + pte_index(addr);
flag |= pmd_write(pmd) ? HMM_PFN_WRITE : 0;
@@ -419,7 +404,7 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
pfns[i] = 0;
 
if (pte_none(pte)) {
-   pfns[i] = HMM_PFN_EMPTY;
+   pfns[i] = 0;
if (hmm_vma_walk->fault)
goto fault;
continue;
@@ -470,8 +455,8 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
 
 fault:
pte_unmap(ptep);
-   /* Fault all pages in range */
-   return hmm_vma_walk_clear(start, end, walk);
+   /* Fault all pages in range if ask for */
+   return hmm_vma_walk_hole(start, end, walk);
}
pte_unmap(ptep - 1);
 
-- 
2.14.3
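
For reference, the entry layout that results from this patch can be sketched
in user space as below. The flag values and the shift are taken from the hunk
above; the SKETCH_* names and the three helper functions are hypothetical and
only illustrate that a hole is now an all-zero entry rather than a flagged
one:

```c
#include <stdint.h>

/* Flag bits and shift as they stand after this patch. */
#define SKETCH_PFN_VALID		(1u << 0)
#define SKETCH_PFN_WRITE		(1u << 1)
#define SKETCH_PFN_ERROR		(1u << 2)
#define SKETCH_PFN_SPECIAL		(1u << 3)
#define SKETCH_PFN_DEVICE_UNADDRESSABLE	(1u << 4)
#define SKETCH_PFN_SHIFT		5

/* Pack a page frame number above the flag bits and mark it valid. */
uint64_t sketch_pfn_encode(uint64_t pfn, int writable)
{
	uint64_t entry = (pfn << SKETCH_PFN_SHIFT) | SKETCH_PFN_VALID;

	if (writable)
		entry |= SKETCH_PFN_WRITE;
	return entry;
}

/* Recover the page frame number by discarding the flag bits. */
uint64_t sketch_pfn_decode(uint64_t entry)
{
	return entry >> SKETCH_PFN_SHIFT;
}

/* Without an EMPTY flag, a hole is simply an entry with no bits set. */
int sketch_pfn_is_hole(uint64_t entry)
{
	return entry == 0;
}
```

Because every valid entry carries the VALID bit, an encoded pfn of 0 is still
distinguishable from a hole.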

[PATCH 06/14] mm/hmm: remove HMM_PFN_READ flag and ignore peculiar architecture

2018-03-16 Thread jglisse

From: Jérôme Glisse 

Only peculiar architectures allow write without read, so assume that
any valid pfn allows read. Note that we do not care about write-only
mappings because they do not make sense with things like atomic compare
and exchange or any other operation that lets you get the memory value
through it.

Signed-off-by: Jérôme Glisse 
Cc: Evgeny Baskakov 
Cc: Ralph Campbell 
Cc: Mark Hairgrove 
Cc: John Hubbard 
---
 include/linux/hmm.h | 14 ++
 mm/hmm.c| 28 
 2 files changed, 30 insertions(+), 12 deletions(-)

diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index b65e527dd120..4bdc58ffe9f3 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -84,7 +84,6 @@ struct hmm;
  *
  * Flags:
  * HMM_PFN_VALID: pfn is valid
- * HMM_PFN_READ:  CPU page table has read permission set
  * HMM_PFN_WRITE: CPU page table has write permission set
  * HMM_PFN_ERROR: corresponding CPU page table entry points to poisoned memory
  * HMM_PFN_EMPTY: corresponding CPU page table entry is pte_none()
@@ -97,13 +96,12 @@ struct hmm;
 typedef unsigned long hmm_pfn_t;
 
 #define HMM_PFN_VALID (1 << 0)
-#define HMM_PFN_READ (1 << 1)
-#define HMM_PFN_WRITE (1 << 2)
-#define HMM_PFN_ERROR (1 << 3)
-#define HMM_PFN_EMPTY (1 << 4)
-#define HMM_PFN_SPECIAL (1 << 5)
-#define HMM_PFN_DEVICE_UNADDRESSABLE (1 << 6)
-#define HMM_PFN_SHIFT 7
+#define HMM_PFN_WRITE (1 << 1)
+#define HMM_PFN_ERROR (1 << 2)
+#define HMM_PFN_EMPTY (1 << 3)
+#define HMM_PFN_SPECIAL (1 << 4)
+#define HMM_PFN_DEVICE_UNADDRESSABLE (1 << 5)
+#define HMM_PFN_SHIFT 6
 
 /*
  * hmm_pfn_t_to_page() - return struct page pointed to by a valid hmm_pfn_t
diff --git a/mm/hmm.c b/mm/hmm.c
index 49f0f6b337ed..fa3c605c4b96 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -374,11 +374,9 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
hmm_pfn_t *pfns = range->pfns;
unsigned long addr = start, i;
bool write_fault;
-   hmm_pfn_t flag;
pte_t *ptep;
 
i = (addr - range->start) >> PAGE_SHIFT;
-   flag = vma->vm_flags & VM_READ ? HMM_PFN_READ : 0;
write_fault = hmm_vma_walk->fault & hmm_vma_walk->write;
 
 again:
@@ -390,6 +388,7 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
 
if (pmd_devmap(*pmdp) || pmd_trans_huge(*pmdp)) {
unsigned long pfn;
+   hmm_pfn_t flag = 0;
pmd_t pmd;
 
/*
@@ -454,7 +453,6 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
} else if (write_fault)
goto fault;
pfns[i] |= HMM_PFN_DEVICE_UNADDRESSABLE;
-   pfns[i] |= flag;
} else if (is_migration_entry(entry)) {
if (hmm_vma_walk->fault) {
pte_unmap(ptep);
@@ -474,7 +472,7 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
if (write_fault && !pte_write(pte))
goto fault;
 
-   pfns[i] = hmm_pfn_t_from_pfn(pte_pfn(pte)) | flag;
+   pfns[i] = hmm_pfn_t_from_pfn(pte_pfn(pte));
pfns[i] |= pte_write(pte) ? HMM_PFN_WRITE : 0;
continue;
 
@@ -536,6 +534,17 @@ int hmm_vma_get_pfns(struct hmm_range *range)
list_add_rcu(&range->list, &hmm->ranges);
spin_unlock(&hmm->lock);
 
+   if (!(vma->vm_flags & VM_READ)) {
+   /*
+* If vma do not allow read assume it does not allow write as
+* only peculiar architecture allow write without read and this
+* is not a case we care about (some operation like atomic no
+* longer make sense).
+*/
+   hmm_pfns_clear(range->pfns, range->start, range->end);
+   return 0;
+   }
+
hmm_vma_walk.fault = false;
hmm_vma_walk.range = range;
mm_walk.private = &hmm_vma_walk;
@@ -690,6 +699,17 @@ int hmm_vma_fault(struct hmm_range *range, bool write, 
bool block)
list_add_rcu(&range->list, &hmm->ranges);
spin_unlock(&hmm->lock);
 
+   if (!(vma->vm_flags & VM_READ)) {
+   /*
+* If vma do not allow read assume it does not allow write as
+* only peculiar architecture allow write without read and this
+* is not a case we care about (some operation like atomic no
+* longer make sense).
+*/
+   hmm_pfns_clear(range->pfns, range->start, range->end);
+   return 0;
+   }
+
/* FIXME support hugetlb fs */
if (is_vm_hugetlb_page(vma) || (vma->vm_flags & VM_SPECIAL)) {
hmm_pfns_special(range->pfns, range->start, range->end);
-- 
2.14.3

[PATCH 05/14] mm/hmm: use struct for hmm_vma_fault(), hmm_vma_get_pfns() parameters

2018-03-16 Thread jglisse

From: Jérôme Glisse 

Both hmm_vma_fault() and hmm_vma_get_pfns() took a hmm_range struct as
a parameter and initialized that struct from their other parameters.
Have the callers of those functions do this instead, as they are likely
to do so already, and pass only this struct to both functions. This
shortens the function signatures and makes it easier to add new
parameters in the future by simply adding them to the structure.

Signed-off-by: Jérôme Glisse 
Cc: Evgeny Baskakov 
Cc: Ralph Campbell 
Cc: Mark Hairgrove 
Cc: John Hubbard 
---
 include/linux/hmm.h | 18 -
 mm/hmm.c            | 78 +++--
 2 files changed, 33 insertions(+), 63 deletions(-)

diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index 61b0e1c05ee1..b65e527dd120 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -274,6 +274,7 @@ void hmm_mirror_unregister(struct hmm_mirror *mirror);
 /*
  * struct hmm_range - track invalidation lock on virtual address range
  *
+ * @vma: the vm area struct for the range
  * @list: all range lock are on a list
  * @start: range virtual start address (inclusive)
  * @end: range virtual end address (exclusive)
@@ -281,6 +282,7 @@ void hmm_mirror_unregister(struct hmm_mirror *mirror);
  * @valid: pfns array did not change since it has been fill by an HMM function
  */
 struct hmm_range {
+   struct vm_area_struct   *vma;
struct list_head list;
unsigned long   start;
unsigned long   end;
@@ -301,12 +303,8 @@ struct hmm_range {
  *
  * IF YOU DO NOT FOLLOW THE ABOVE RULE THE SNAPSHOT CONTENT MIGHT BE INVALID !
  */
-int hmm_vma_get_pfns(struct vm_area_struct *vma,
-struct hmm_range *range,
-unsigned long start,
-unsigned long end,
-hmm_pfn_t *pfns);
-bool hmm_vma_range_done(struct vm_area_struct *vma, struct hmm_range *range);
+int hmm_vma_get_pfns(struct hmm_range *range);
+bool hmm_vma_range_done(struct hmm_range *range);
 
 
 /*
@@ -327,13 +325,7 @@ bool hmm_vma_range_done(struct vm_area_struct *vma, struct 
hmm_range *range);
  *
  * See the function description in mm/hmm.c for further documentation.
  */
-int hmm_vma_fault(struct vm_area_struct *vma,
- struct hmm_range *range,
- unsigned long start,
- unsigned long end,
- hmm_pfn_t *pfns,
- bool write,
- bool block);
+int hmm_vma_fault(struct hmm_range *range, bool write, bool block);
 #endif /* IS_ENABLED(CONFIG_HMM_MIRROR) */
 
 
diff --git a/mm/hmm.c b/mm/hmm.c
index 64d9e7dae712..49f0f6b337ed 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -490,11 +490,7 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
 
 /*
  * hmm_vma_get_pfns() - snapshot CPU page table for a range of virtual 
addresses
- * @vma: virtual memory area containing the virtual address range
- * @range: used to track snapshot validity
- * @start: range virtual start address (inclusive)
- * @end: range virtual end address (exclusive)
- * @entries: array of hmm_pfn_t: provided by the caller, filled in by function
+ * @range: range being snapshoted and all needed informations
  * Returns: -EINVAL if invalid argument, -ENOMEM out of memory, 0 success
  *
  * This snapshots the CPU page table for a range of virtual addresses. Snapshot
@@ -508,26 +504,23 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
  * NOT CALLING hmm_vma_range_done() IF FUNCTION RETURNS 0 WILL LEAD TO SERIOUS
  * MEMORY CORRUPTION ! YOU HAVE BEEN WARNED !
  */
-int hmm_vma_get_pfns(struct vm_area_struct *vma,
-struct hmm_range *range,
-unsigned long start,
-unsigned long end,
-hmm_pfn_t *pfns)
+int hmm_vma_get_pfns(struct hmm_range *range)
 {
+   struct vm_area_struct *vma = range->vma;
struct hmm_vma_walk hmm_vma_walk;
struct mm_walk mm_walk;
struct hmm *hmm;
 
/* FIXME support hugetlb fs */
if (is_vm_hugetlb_page(vma) || (vma->vm_flags & VM_SPECIAL)) {
-   hmm_pfns_special(pfns, start, end);
+   hmm_pfns_special(range->pfns, range->start, range->end);
return -EINVAL;
}
 
/* Sanity check, this really should not happen ! */
-   if (start < vma->vm_start || start >= vma->vm_end)
+   if (range->start < vma->vm_start || range->start >= vma->vm_end)
return -EINVAL;
-   if (end < vma->vm_start || end > vma->vm_end)
+   if (range->end < vma->vm_start || range->end > vma->vm_end)
return -EINVAL;
 
hmm = hmm_register(vma->vm_mm);
@@ -538,9 +531,6 @@ int hmm_vma_get_pfns(struct vm_area_struct *vma,
return -EINVAL;
 
/* Initialize range to track CPU page table update */
-   range->start = start;
-   range->pfns = pfns;
-   range->end = end;
spin_lock(&hmm->lock);
range->valid = tru

[PATCH 03/14] mm/hmm: HMM should have a callback before MM is destroyed v2

2018-03-16 Thread jglisse

From: Ralph Campbell 

The hmm_mirror_register() function registers a callback for when
the CPU pagetable is modified. Normally, the device driver will
call hmm_mirror_unregister() when the process using the device is
finished. However, if the process exits uncleanly, the mm_struct
can be destroyed with no warning to the device driver.

Changed since v1:
  - dropped VM_BUG_ON()
  - cc stable

Signed-off-by: Ralph Campbell 
Signed-off-by: Jérôme Glisse 
Cc: sta...@vger.kernel.org
Cc: Evgeny Baskakov 
Cc: Mark Hairgrove 
Cc: John Hubbard 
---
 include/linux/hmm.h | 10 ++
 mm/hmm.c            | 18 +-
 2 files changed, 27 insertions(+), 1 deletion(-)

diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index ef6044d08cc5..61b0e1c05ee1 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -218,6 +218,16 @@ enum hmm_update_type {
  * @update: callback to update range on a device
  */
 struct hmm_mirror_ops {
+   /* release() - release hmm_mirror
+*
+* @mirror: pointer to struct hmm_mirror
+*
+* This is called when the mm_struct is being released.
+* The callback should make sure no references to the mirror occur
+* after the callback returns.
+*/
+   void (*release)(struct hmm_mirror *mirror);
+
/* sync_cpu_device_pagetables() - synchronize page tables
 *
 * @mirror: pointer to struct hmm_mirror
diff --git a/mm/hmm.c b/mm/hmm.c
index 320545b98ff5..6088fa6ed137 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -160,6 +160,21 @@ static void hmm_invalidate_range(struct hmm *hmm,
up_read(&hmm->mirrors_sem);
 }
 
+static void hmm_release(struct mmu_notifier *mn, struct mm_struct *mm)
+{
+   struct hmm *hmm = mm->hmm;
+   struct hmm_mirror *mirror;
+   struct hmm_mirror *mirror_next;
+
+   down_write(&hmm->mirrors_sem);
+   list_for_each_entry_safe(mirror, mirror_next, &hmm->mirrors, list) {
+   list_del_init(&mirror->list);
+   if (mirror->ops->release)
+   mirror->ops->release(mirror);
+   }
+   up_write(&hmm->mirrors_sem);
+}
+
 static void hmm_invalidate_range_start(struct mmu_notifier *mn,
   struct mm_struct *mm,
   unsigned long start,
@@ -185,6 +200,7 @@ static void hmm_invalidate_range_end(struct mmu_notifier 
*mn,
 }
 
 static const struct mmu_notifier_ops hmm_mmu_notifier_ops = {
+   .release= hmm_release,
.invalidate_range_start = hmm_invalidate_range_start,
.invalidate_range_end   = hmm_invalidate_range_end,
 };
@@ -230,7 +246,7 @@ void hmm_mirror_unregister(struct hmm_mirror *mirror)
struct hmm *hmm = mirror->hmm;
 
down_write(&hmm->mirrors_sem);
-   list_del(&mirror->list);
+   list_del_init(&mirror->list);
up_write(&hmm->mirrors_sem);
 }
 EXPORT_SYMBOL(hmm_mirror_unregister);
-- 
2.14.3
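
The teardown pattern in the patch above (walk the list with a "safe"
iterator, unlink each entry, then call an optional callback) can be
sketched in user space roughly as follows. This is only an illustration:
struct mirror, struct mirror_ops, release_all() and mark_released() are
hypothetical names, a plain singly linked list stands in for the kernel
list helpers, and no locking is shown.

```c
#include <assert.h>
#include <stddef.h>

struct mirror;

/* Hypothetical ops table; release is optional, as in the patch. */
struct mirror_ops {
	void (*release)(struct mirror *m);
};

struct mirror {
	const struct mirror_ops *ops;
	struct mirror *next;
	int released;
};

/* Example callback that only records that it ran. */
void mark_released(struct mirror *m)
{
	m->released = 1;
}

/*
 * Unlink every mirror and invoke its release callback if one is set.
 * Returns the number of entries that were unlinked.
 */
int release_all(struct mirror **head)
{
	struct mirror *m = *head;
	int unlinked = 0;

	while (m) {
		/* Save the successor first: the callback may reuse m. */
		struct mirror *next = m->next;

		m->next = NULL;
		if (m->ops && m->ops->release)
			m->ops->release(m);
		m = next;
		unlinked++;
	}
	*head = NULL;
	return unlinked;
}
```

Saving the next pointer before the callback runs is what makes the walk
safe even if the callback tears the entry down.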

[PATCH 07/14] mm/hmm: use uint64_t for HMM pfn instead of defining hmm_pfn_t to ulong

2018-03-16 Thread jglisse

From: Jérôme Glisse 

All device drivers we care about use 64-bit page table entries. To
match this and to avoid a useless define, convert all HMM pfns to
directly use uint64_t. This is a first step toward allowing drivers to
directly use the pfn values returned by HMM (saving the memory and CPU
cycles used for conversion between the two).

Signed-off-by: Jérôme Glisse 
Cc: Evgeny Baskakov 
Cc: Ralph Campbell 
Cc: Mark Hairgrove 
Cc: John Hubbard 
---
 include/linux/hmm.h | 46 +-
 mm/hmm.c            | 26 +-
 2 files changed, 34 insertions(+), 38 deletions(-)

diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index 4bdc58ffe9f3..78b3ed6d7977 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -80,8 +80,6 @@
 struct hmm;
 
 /*
- * hmm_pfn_t - HMM uses its own pfn type to keep several flags per page
- *
  * Flags:
  * HMM_PFN_VALID: pfn is valid
  * HMM_PFN_WRITE: CPU page table has write permission set
@@ -93,8 +91,6 @@ struct hmm;
  *  set and the pfn value is undefined.
  * HMM_PFN_DEVICE_UNADDRESSABLE: unaddressable device memory (ZONE_DEVICE)
  */
-typedef unsigned long hmm_pfn_t;
-
 #define HMM_PFN_VALID (1 << 0)
 #define HMM_PFN_WRITE (1 << 1)
 #define HMM_PFN_ERROR (1 << 2)
@@ -104,14 +100,14 @@ typedef unsigned long hmm_pfn_t;
 #define HMM_PFN_SHIFT 6
 
 /*
- * hmm_pfn_t_to_page() - return struct page pointed to by a valid hmm_pfn_t
- * @pfn: hmm_pfn_t to convert to struct page
- * Returns: struct page pointer if pfn is a valid hmm_pfn_t, NULL otherwise
+ * hmm_pfn_to_page() - return struct page pointed to by a valid HMM pfn
+ * @pfn: HMM pfn value to get corresponding struct page from
+ * Returns: struct page pointer if pfn is a valid HMM pfn, NULL otherwise
  *
- * If the hmm_pfn_t is valid (ie valid flag set) then return the struct page
- * matching the pfn value stored in the hmm_pfn_t. Otherwise return NULL.
+ * If the uint64_t is valid (ie valid flag set) then return the struct page
+ * matching the pfn value stored in the HMM pfn. Otherwise return NULL.
  */
-static inline struct page *hmm_pfn_t_to_page(hmm_pfn_t pfn)
+static inline struct page *hmm_pfn_to_page(uint64_t pfn)
 {
if (!(pfn & HMM_PFN_VALID))
return NULL;
@@ -119,11 +115,11 @@ static inline struct page *hmm_pfn_t_to_page(hmm_pfn_t 
pfn)
 }
 
 /*
- * hmm_pfn_t_to_pfn() - return pfn value store in a hmm_pfn_t
- * @pfn: hmm_pfn_t to extract pfn from
- * Returns: pfn value if hmm_pfn_t is valid, -1UL otherwise
+ * hmm_pfn_to_pfn() - return pfn value store in a HMM pfn
+ * @pfn: HMM pfn value to extract pfn from
+ * Returns: pfn value if HMM pfn is valid, -1UL otherwise
  */
-static inline unsigned long hmm_pfn_t_to_pfn(hmm_pfn_t pfn)
+static inline unsigned long hmm_pfn_to_pfn(uint64_t pfn)
 {
if (!(pfn & HMM_PFN_VALID))
return -1UL;
@@ -131,21 +127,21 @@ static inline unsigned long hmm_pfn_t_to_pfn(hmm_pfn_t 
pfn)
 }
 
 /*
- * hmm_pfn_t_from_page() - create a valid hmm_pfn_t value from struct page
- * @page: struct page pointer for which to create the hmm_pfn_t
- * Returns: valid hmm_pfn_t for the page
+ * hmm_pfn_from_page() - create a valid HMM pfn value from struct page
+ * @page: struct page pointer for which to create the HMM pfn
+ * Returns: valid HMM pfn for the page
  */
-static inline hmm_pfn_t hmm_pfn_t_from_page(struct page *page)
+static inline uint64_t hmm_pfn_from_page(struct page *page)
 {
return (page_to_pfn(page) << HMM_PFN_SHIFT) | HMM_PFN_VALID;
 }
 
 /*
- * hmm_pfn_t_from_pfn() - create a valid hmm_pfn_t value from pfn
- * @pfn: pfn value for which to create the hmm_pfn_t
- * Returns: valid hmm_pfn_t for the pfn
+ * hmm_pfn_from_pfn() - create a valid HMM pfn value from pfn
+ * @pfn: pfn value for which to create the HMM pfn
+ * Returns: valid HMM pfn for the pfn
  */
-static inline hmm_pfn_t hmm_pfn_t_from_pfn(unsigned long pfn)
+static inline uint64_t hmm_pfn_from_pfn(unsigned long pfn)
 {
return (pfn << HMM_PFN_SHIFT) | HMM_PFN_VALID;
 }
@@ -284,7 +280,7 @@ struct hmm_range {
struct list_head list;
unsigned long   start;
unsigned long   end;
-   hmm_pfn_t   *pfns;
+   uint64_t *pfns;
bool valid;
 };
 
@@ -307,7 +303,7 @@ bool hmm_vma_range_done(struct hmm_range *range);
 
 /*
  * Fault memory on behalf of device driver. Unlike handle_mm_fault(), this will
- * not migrate any device memory back to system memory. The hmm_pfn_t array 
will
+ * not migrate any device memory back to system memory. The HMM pfn array will
  * be updated with the fault result and current snapshot of the CPU page table
  * for the range.
  *
@@ -316,7 +312,7 @@ bool hmm_vma_range_done(struct hmm_range *range);
  * function returns -EAGAIN.
  *
  * Return value does not reflect if the fault was successful for every single
- * address or not. Therefore, the caller must to inspect the

[PATCH 04/14] mm/hmm: hmm_pfns_bad() was accessing wrong struct

2018-03-16 Thread jglisse

From: Jérôme Glisse 

The private field of the mm_walk struct points to an hmm_vma_walk struct,
not to the desired hmm_range struct. Fix this to get the proper struct pointer.

Signed-off-by: Jérôme Glisse 
Cc: sta...@vger.kernel.org
Cc: Evgeny Baskakov 
Cc: Ralph Campbell 
Cc: Mark Hairgrove 
Cc: John Hubbard 
---
 mm/hmm.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/mm/hmm.c b/mm/hmm.c
index 6088fa6ed137..64d9e7dae712 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -293,7 +293,8 @@ static int hmm_pfns_bad(unsigned long addr,
unsigned long end,
struct mm_walk *walk)
 {
-   struct hmm_range *range = walk->private;
+   struct hmm_vma_walk *hmm_vma_walk = walk->private;
+   struct hmm_range *range = hmm_vma_walk->range;
hmm_pfn_t *pfns = range->pfns;
unsigned long i;
 
-- 
2.14.3
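
The bug class fixed above is easy to reproduce with any void * "private"
pointer: nothing stops the callee from assigning it to the wrong struct
type. A minimal user-space sketch, with hypothetical stand-in types
(struct range, struct vma_walk, struct walk) rather than the kernel
definitions:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for hmm_range, hmm_vma_walk and mm_walk. */
struct range {
	unsigned long start;
	unsigned long end;
};

struct vma_walk {
	struct range *range;
	int fault;
};

struct walk {
	void *private;	/* set by the caller to a struct vma_walk */
};

/*
 * Correct access: private holds a vma_walk, so reach the range through
 * it. Assigning walk->private straight to a struct range pointer would
 * compile without a warning but read the wrong memory.
 */
unsigned long walk_range_start(const struct walk *w)
{
	const struct vma_walk *vw = w->private;

	return vw->range->start;
}
```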

Re: [PATCH 4/4] mm/hmm: change CPU page table snapshot functions to simplify drivers

2018-03-16 Thread Jerome Glisse

On Thu, Mar 15, 2018 at 10:08:21PM -0700, John Hubbard wrote:
> On 03/15/2018 11:37 AM, jgli...@redhat.com wrote:
> > From: Jérôme Glisse 
> > 
> > This change hmm_vma_fault() and hmm_vma_get_pfns() API to allow HMM
> > to directly write entry that can match any device page table entry
> > format. Device driver now provide an array of flags value and we use
> > enum to index this array for each flag.
> > 
> > This also allow the device driver to ask for write fault on a per page
> > basis making API more flexible to service multiple device page faults
> > in one go.
> > 
> 
> Hi Jerome,
> 
> This is a large patch, so I'm going to review it in two passes. The first 
> pass is just an overview plus the hmm.h changes (now), and tomorrow I will
> review the hmm.c, which is where the real changes are.
> 
> Overview: the hmm.c changes are doing several things, and it is difficult to
> review, because refactoring, plus new behavior, makes diffs less useful here.
> It would probably be good to split the hmm.c changes into a few patches, such
> as:
> 
>   -- HMM_PFN_FLAG_* changes, plus function signature changes (mm_range* 
>being passed to functions), and
> -- New behavior in the page handling loops, and 
>   -- Refactoring into new routines (hmm_vma_handle_pte, and others)
> 
> That way, reviewers can see more easily that things are correct. 

So I resent the patchset and split this patch into many (11 patches now).
I included your comments so far in the new version, so it is probably better
if you look at the new one.

[...]

> > - * HMM_PFN_READ:  CPU page table has read permission set
> 
> So why is it that we don't need the _READ flag anymore? I looked at the 
> corresponding
> hmm.c but still don't quite get it. Is it that we just expect that _READ is
> always set if there is an entry at all? Or something else?

I explained why in the commit message: !READ together with WRITE makes no
sense, so now VALID implies READ, as does WRITE (WRITE by itself is
meaningless without VALID).

Cheers,
Jérôme
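
The rule described above (VALID implies READ, and WRITE only means
something together with VALID) can be sketched as below. The bit
positions and the shift of 6 follow the defines in the patch; the names
PFN_VALID, PFN_WRITE, PFN_SHIFT and the entry_*() helpers are
hypothetical, and returning UINT64_MAX for an invalid entry is an
assumption made for this sketch.

```c
#include <assert.h>
#include <stdint.h>

#define PFN_VALID (1ULL << 0)
#define PFN_WRITE (1ULL << 1)
#define PFN_SHIFT 6

/* Build an entry: flags live in the low bits, the pfn above them. */
uint64_t entry_from_pfn(uint64_t pfn, int writable)
{
	return (pfn << PFN_SHIFT) | PFN_VALID | (writable ? PFN_WRITE : 0);
}

/* With no READ flag, a valid entry is readable by definition. */
int entry_can_read(uint64_t e)
{
	return (e & PFN_VALID) != 0;
}

/* WRITE without VALID carries no meaning, so require both. */
int entry_can_write(uint64_t e)
{
	return (e & PFN_VALID) && (e & PFN_WRITE);
}

/* Recover the pfn, or UINT64_MAX if the entry is not valid. */
uint64_t entry_to_pfn(uint64_t e)
{
	return (e & PFN_VALID) ? e >> PFN_SHIFT : UINT64_MAX;
}
```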

Re: [PATCH v5 1/3] x86/msr: Add AMD Core Perf Extension MSRs

2018-03-16 Thread Thomas Gleixner

On Fri, 16 Mar 2018, Paolo Bonzini wrote:

> On 06/03/2018 22:03, Radim Krcmar wrote:
> >>  /* Fam 15h MSRs */
> >>  #define MSR_F15H_PERF_CTL 0xc0010200
> >> +#define MSR_F15H_PERF_CTL0 MSR_F15H_PERF_CTL
> >> +#define MSR_F15H_PERF_CTL1 (MSR_F15H_PERF_CTL + 2)
> >> +#define MSR_F15H_PERF_CTL2 (MSR_F15H_PERF_CTL + 4)
> >> +#define MSR_F15H_PERF_CTL3 (MSR_F15H_PERF_CTL + 6)
> >> +#define MSR_F15H_PERF_CTL4 (MSR_F15H_PERF_CTL + 8)
> >> +#define MSR_F15H_PERF_CTL5 (MSR_F15H_PERF_CTL + 10)
> >> +
> >>  #define MSR_F15H_PERF_CTR 0xc0010201
> >> +#define MSR_F15H_PERF_CTR0 MSR_F15H_PERF_CTR
> >> +#define MSR_F15H_PERF_CTR1 (MSR_F15H_PERF_CTR + 2)
> >> +#define MSR_F15H_PERF_CTR2 (MSR_F15H_PERF_CTR + 4)
> >> +#define MSR_F15H_PERF_CTR3 (MSR_F15H_PERF_CTR + 6)
> >> +#define MSR_F15H_PERF_CTR4 (MSR_F15H_PERF_CTR + 8)
> >> +#define MSR_F15H_PERF_CTR5 (MSR_F15H_PERF_CTR + 10)
> >> +
> > x86 maintainers,
> > 
> > are you ok with this going through the kvm tree?

yes.

Acked-by: Thomas Gleixner
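
For reference, the quoted defines place the control and counter MSRs in
interleaved pairs, so register n sits at base + 2 * n. A small sketch of
that arithmetic; the two base values come from the quoted patch, while
f15h_perf_ctl() and f15h_perf_ctr() are hypothetical helper names:

```c
#include <assert.h>
#include <stdint.h>

#define MSR_F15H_PERF_CTL 0xc0010200u
#define MSR_F15H_PERF_CTR 0xc0010201u

/* Control MSR for counter n (n in 0..5 for the defines in the patch). */
uint32_t f15h_perf_ctl(unsigned int n)
{
	return MSR_F15H_PERF_CTL + 2 * n;
}

/* Counter MSR for counter n; always the control MSR plus one. */
uint32_t f15h_perf_ctr(unsigned int n)
{
	return MSR_F15H_PERF_CTR + 2 * n;
}
```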

Re: [PATCH 13/22] signal: Move addr_lsb into the _sigfault union for clarity

2018-03-16 Thread Dave Hansen

On 03/16/2018 12:00 PM, Dave Hansen wrote:
> On 01/15/2018 04:40 PM, Eric W. Biederman wrote:
>> The addr_lsb fields is only valid and available when the
>> signal is SIGBUS and the si_code is BUS_MCEERR_AR or BUS_MCEERR_AO.
>> Document this with a comment and place the field in the _sigfault union
>> to make this clear.
>>
>> All of the fields stay in the same physical location so both the old
>> and new definitions of struct siginfo will continue to work.
> 
> This breaks the ABI and breaks protection keys.  The physical locations
> *DO* change.
> 
> Before this patch:
> 
> #define si_pkey _sifields._sigfault._pkey
> (gdb) print &((siginfo_t *)0)->_sifields._sigfault._pkey
> $1 = (__u32 *) 0x20 
> 
> and after:
> 
> +#define si_pkey _sifields._sigfault._addr_pkey._pkey
> (gdb) print &((siginfo_t *)0)->_sifields._sigfault._addr_pkey._pkey
> $1 = (__u32 *) 0x1c
> 
> Can we revert this, please?

It does not revert cleanly so I reverted it manually.  Patch doing that
is attached.  Should we do this, or is there a better option?

index e698ec1..8f8e3ef 100644

---

 b/include/linux/compat.h |   12 ++--
 b/include/uapi/asm-generic/siginfo.h |   14 +++---
 2 files changed, 5 insertions(+), 21 deletions(-)

diff -puN include/linux/compat.h~revert-b68a68d3dcc15ebbf23cbe91af1abf57591bd96b include/linux/compat.h
--- a/include/linux/compat.h~revert-b68a68d3dcc15ebbf23cbe91af1abf57591bd96b	2018-03-16 12:02:22.156310058 -0700
+++ b/include/linux/compat.h	2018-03-16 12:03:11.341309936 -0700
@@ -222,23 +222,15 @@ typedef struct compat_siginfo {
 #ifdef __ARCH_SI_TRAPNO
 			int _trapno;	/* TRAP # which caused the signal */
 #endif
+			short int _addr_lsb;	/* Valid LSB of the reported address. */
 			union {
-/*
- * used when si_code=BUS_MCEERR_AR or
- * used when si_code=BUS_MCEERR_AO
- */
-short int _addr_lsb;	/* Valid LSB of the reported address. */
 /* used when si_code=SEGV_BNDERR */
 struct {
-	compat_uptr_t _dummy_bnd;
 	compat_uptr_t _lower;
 	compat_uptr_t _upper;
 } _addr_bnd;
 /* used when si_code=SEGV_PKUERR */
-struct {
-	compat_uptr_t _dummy_pkey;
-	u32 _pkey;
-} _addr_pkey;
+u32 _pkey;
 			};
 		} _sigfault;
 
diff -puN include/uapi/asm-generic/siginfo.h~revert-b68a68d3dcc15ebbf23cbe91af1abf57591bd96b include/uapi/asm-generic/siginfo.h
--- a/include/uapi/asm-generic/siginfo.h~revert-b68a68d3dcc15ebbf23cbe91af1abf57591bd96b	2018-03-16 12:02:22.157310058 -0700
+++ b/include/uapi/asm-generic/siginfo.h	2018-03-16 12:03:37.071309872 -0700
@@ -94,23 +94,15 @@ typedef struct siginfo {
 			unsigned int _flags;	/* see ia64 si_flags */
 			unsigned long _isr;	/* isr */
 #endif
+			short _addr_lsb; /* LSB of the reported address */
 			union {
-/*
- * used when si_code=BUS_MCEERR_AR or
- * used when si_code=BUS_MCEERR_AO
- */
-short _addr_lsb; /* LSB of the reported address */
 /* used when si_code=SEGV_BNDERR */
 struct {
-	void *_dummy_bnd;
 	void __user *_lower;
 	void __user *_upper;
 } _addr_bnd;
 /* used when si_code=SEGV_PKUERR */
-struct {
-	void *_dummy_pkey;
-	__u32 _pkey;
-} _addr_pkey;
+__u32 _pkey;
 			};
 		} _sigfault;
 
@@ -150,7 +142,7 @@ typedef struct siginfo {
 #define si_addr_lsb	_sifields._sigfault._addr_lsb
 #define si_lower	_sifields._sigfault._addr_bnd._lower
 #define si_upper	_sifields._sigfault._addr_bnd._upper
-#define si_pkey		_sifields._sigfault._addr_pkey._pkey
+#define si_pkey		_sifields._sigfault._pkey
 #define si_band		_sifields._sigpoll._band
 #define si_fd		_sifields._sigpoll._fd
 #define si_call_addr	_sifields._sigsys._call_addr
_
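
Offsets like the ones read out of gdb above can also be checked in code
with offsetof(). The sketch below uses two simplified, hypothetical
layouts (struct fault_flat and struct fault_nested); they are not the
kernel's siginfo definitions, and the "short" padding member in the
nested layout is an assumption chosen only to show how moving a small
field into a union can shift a later member. It does not settle which
offsets in this thread are correct.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flat layout: the short sits outside the union. */
struct fault_flat {
	void *addr;
	short addr_lsb;
	union {
		struct { void *lower; void *upper; } bnd;
		uint32_t pkey;
	} u;
};

/* Hypothetical nested layout: the short moved into the union. */
struct fault_nested {
	void *addr;
	union {
		short addr_lsb;
		struct { short pad; void *lower; void *upper; } bnd;
		struct { short pad; uint32_t pkey; } key;
	} u;
};

/* Offset of pkey in the flat layout: it starts the union. */
size_t flat_pkey_offset(void)
{
	return offsetof(struct fault_flat, u.pkey);
}

/* Offset of pkey in the nested layout: union start plus padding. */
size_t nested_pkey_offset(void)
{
	return offsetof(struct fault_nested, u.key.pkey);
}
```

Comparing the two helper results on the target ABI, or pinning them with
a static assertion, catches a layout change at build time instead of in
a debugger.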

[PATCH v6 0/2] hwmon: (ucd9000) Add gpio and debugfs interfaces

2018-03-16 Thread Eddie James

The ucd9000 series chips have gpio pins. Add a gpio chip interface to the ucd
device so that users can query and set the state of the gpio pins.

Add a debugfs interface using the existing pmbus debugfs directory to provide
MFR_STATUS and the status of the gpi faults to users.

Changes since v5:
 - enclose gpio code in #ifdef GPIOLIB
 - don't initialize buffers for mfr_status; set last char to 0 instead
 - cap the size argument to bin2hex

Changes since v4:
 - max-sized buffers for smbus transfers
 - used bin2hex instead of my own code

Changes since v3:
 - remove setting of gpio_chip->owner
 - format the mfr_status data
 - switch to #ifdef rather than #if IS_ENABLED for debugfs

Changes since v2:
 - split the gpio registration into its own function

Changes since v1:
 - dropped dev_err messages
 - made gpio chip registration conditional on having gpio pins
 - made mfr_status debugfs attribute more simple

Christopher Bostic (2):
  hwmon: (ucd9000) Add gpio chip interface
  hwmon: (ucd9000) Add debugfs attributes to provide mfr_status

 drivers/hwmon/pmbus/ucd9000.c | 350 +-
 1 file changed, 349 insertions(+), 1 deletion(-)

-- 
1.8.3.1
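
The v5 to v6 changelog items (cap the length given to bin2hex, terminate
the string by hand instead of zeroing the buffer) amount to something
like the following user-space sketch. format_status() is a hypothetical
helper that does its own hex conversion instead of calling the kernel's
bin2hex(), and BLOCK_MAX stands in for I2C_SMBUS_BLOCK_MAX.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define BLOCK_MAX 32	/* stand-in for I2C_SMBUS_BLOCK_MAX */

/*
 * Write two hex digits per input byte, then a newline and a NUL.
 * len is capped at BLOCK_MAX, so out must hold 2 * BLOCK_MAX + 2 bytes.
 * Returns the number of characters written, not counting the NUL.
 */
size_t format_status(char *out, const unsigned char *buf, size_t len)
{
	static const char digits[] = "0123456789abcdef";
	char *p = out;
	size_t i;

	if (len > BLOCK_MAX)
		len = BLOCK_MAX;

	for (i = 0; i < len; i++) {
		*p++ = digits[buf[i] >> 4];
		*p++ = digits[buf[i] & 0x0f];
	}
	*p++ = '\n';
	*p = '\0';

	return (size_t)(p - out);
}
```

Capping len before the loop is what keeps an oversized length from
overrunning the fixed-size output buffer.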

[PATCH v6 2/2] hwmon: (ucd9000) Add debugfs attributes to provide mfr_status

2018-03-16 Thread Eddie James

From: Christopher Bostic 

Expose the gpiN_fault fields of mfr_status as individual debugfs
attributes. This provides a way for users to be easily notified of gpi
faults. Also provide the whole mfr_status register in debugfs.

Signed-off-by: Christopher Bostic 
Signed-off-by: Andrew Jeffery 
Signed-off-by: Eddie James 
---
 drivers/hwmon/pmbus/ucd9000.c | 138 +-
 1 file changed, 137 insertions(+), 1 deletion(-)

diff --git a/drivers/hwmon/pmbus/ucd9000.c b/drivers/hwmon/pmbus/ucd9000.c
index ef2c5bf..88c98fb 100644
--- a/drivers/hwmon/pmbus/ucd9000.c
+++ b/drivers/hwmon/pmbus/ucd9000.c
@@ -19,6 +19,7 @@
  * Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
  */
 
+#include 
 #include 
 #include 
 #include 
@@ -37,6 +38,7 @@
 #define UCD9000_NUM_PAGES  0xd6
 #define UCD9000_FAN_CONFIG_INDEX   0xe7
 #define UCD9000_FAN_CONFIG 0xe8
+#define UCD9000_MFR_STATUS 0xf3
 #define UCD9000_GPIO_SELECT  0xfa
 #define UCD9000_GPIO_CONFIG  0xfb
 #define UCD9000_DEVICE_ID  0xfd
@@ -64,15 +66,24 @@
 #define UCD901XX_NUM_GPIOS 26
 #define UCD90910_NUM_GPIOS 26
 
+#define UCD9000_DEBUGFS_NAME_LEN   24
+#define UCD9000_GPI_COUNT  8
+
 struct ucd9000_data {
u8 fan_data[UCD9000_NUM_FAN][I2C_SMBUS_BLOCK_MAX];
struct pmbus_driver_info info;
 #ifdef CONFIG_GPIOLIB
struct gpio_chip gpio;
 #endif
+   struct dentry *debugfs;
 };
 #define to_ucd9000_data(_info) container_of(_info, struct ucd9000_data, info)
 
+struct ucd9000_debugfs_entry {
+   struct i2c_client *client;
+   u8 index;
+};
+
 static int ucd9000_get_fan_config(struct i2c_client *client, int fan)
 {
int fan_config = 0;
@@ -359,6 +370,122 @@ static void ucd9000_probe_gpio(struct i2c_client *client,
 }
 #endif /* CONFIG_GPIOLIB */
 
+#ifdef CONFIG_DEBUG_FS
+static int ucd9000_get_mfr_status(struct i2c_client *client, u8 *buffer)
+{
+   int ret = pmbus_set_page(client, 0);
+
+   if (ret < 0)
+   return ret;
+
+   return i2c_smbus_read_block_data(client, UCD9000_MFR_STATUS, buffer);
+}
+
+static int ucd9000_debugfs_show_mfr_status_bit(void *data, u64 *val)
+{
+   struct ucd9000_debugfs_entry *entry = data;
+   struct i2c_client *client = entry->client;
+   u8 buffer[I2C_SMBUS_BLOCK_MAX];
+   int ret;
+
+   ret = ucd9000_get_mfr_status(client, buffer);
+   if (ret < 0)
+   return ret;
+
+   /*
+* Attribute only created for devices with gpi fault bits at bits
+* 16-23, which is the second byte of the response.
+*/
+   *val = !!(buffer[1] & BIT(entry->index));
+
+   return 0;
+}
+DEFINE_DEBUGFS_ATTRIBUTE(ucd9000_debugfs_mfr_status_bit,
+ucd9000_debugfs_show_mfr_status_bit, NULL, "%1lld\n");
+
+static ssize_t ucd9000_debugfs_read_mfr_status(struct file *file,
+  char __user *buf, size_t count,
+  loff_t *ppos)
+{
+   struct i2c_client *client = file->private_data;
+   u8 buffer[I2C_SMBUS_BLOCK_MAX];
+   char str[(I2C_SMBUS_BLOCK_MAX * 2) + 2];
+   char *res;
+   int rc;
+
+   rc = ucd9000_get_mfr_status(client, buffer);
+   if (rc < 0)
+   return rc;
+
+   res = bin2hex(str, buffer, min(rc, I2C_SMBUS_BLOCK_MAX));
+   *res++ = '\n';
+   *res++ = 0;
+
+   return simple_read_from_buffer(buf, count, ppos, str, res - str);
+}
+
+static const struct file_operations ucd9000_debugfs_show_mfr_status_fops = {
+   .llseek = noop_llseek,
+   .read = ucd9000_debugfs_read_mfr_status,
+   .open = simple_open,
+};
+
+static int ucd9000_init_debugfs(struct i2c_client *client,
+   const struct i2c_device_id *mid,
+   struct ucd9000_data *data)
+{
+   struct dentry *debugfs;
+   struct ucd9000_debugfs_entry *entries;
+   int i;
+   char name[UCD9000_DEBUGFS_NAME_LEN];
+
+   debugfs = pmbus_get_debugfs_dir(client);
+   if (!debugfs)
+   return -ENOENT;
+
+   data->debugfs = debugfs_create_dir(client->name, debugfs);
+   if (!data->debugfs)
+   return -ENOENT;
+
+   /*
+* Of the chips this driver supports, only the UCD9090, UCD90160,
+* and UCD90910 report GPI faults in their MFR_STATUS register, so only
+* create the GPI fault debugfs attributes for those chips.
+*/
+   if (mid->driver_data == ucd9090 || mid->driver_data == ucd90160 ||
+   mid->driver_data == ucd90910) {
+   entries = devm_kzalloc(&client->dev,
+  sizeof(*entries) * UCD9000_GPI_COUNT,
+  GFP_KERNEL);
+   if (!entries)
+   return -ENOMEM;
+
+   for (i = 0; i < UCD9000_GPI_COUNT; i++) {
+

Re: [PATCH] net: ethernet: arc: Fix a potential memory leak if an optional regulator is deferred

2018-03-16 Thread David Miller

From: Christophe JAILLET 
Date: Wed, 14 Mar 2018 22:09:34 +0100

> diff --git a/drivers/net/ethernet/arc/emac_rockchip.c 
> b/drivers/net/ethernet/arc/emac_rockchip.c
> index 16f9bee992fe..8ee9dfd0e363 100644
> --- a/drivers/net/ethernet/arc/emac_rockchip.c
> +++ b/drivers/net/ethernet/arc/emac_rockchip.c
> @@ -169,8 +169,10 @@ static int emac_rockchip_probe(struct platform_device 
> *pdev)
>   /* Optional regulator for PHY */
>   priv->regulator = devm_regulator_get_optional(dev, "phy");
>   if (IS_ERR(priv->regulator)) {
> - if (PTR_ERR(priv->regulator) == -EPROBE_DEFER)
> - return -EPROBE_DEFER;
> + if (PTR_ERR(priv->regulator) == -EPROBE_DEFER) {
> + ret = -EPROBE_DEFER;
> + goto out_clk_disable;
> + }

Please build test your changes.

There is no 'ret' variable in this function, perhaps you meant 'err'.
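
For illustration, the shape the fix is aiming for (record the error in
the function's existing error variable, then jump to the cleanup label)
looks roughly like this in a self-contained sketch. fake_probe() and
clk_enabled are hypothetical stand-ins for the driver's probe function
and clock handling, and EPROBE_DEFER is defined locally because it is a
kernel-internal errno value.

```c
#include <assert.h>

#define EPROBE_DEFER 517	/* kernel-internal value, defined here for the sketch */

int clk_enabled;	/* hypothetical stand-in for an enabled clock */

/*
 * regulator_err simulates the result of looking up the optional
 * regulator: 0 on success, a negative errno on failure.
 */
int fake_probe(int regulator_err)
{
	int err;

	clk_enabled = 1;	/* resource acquired earlier in probe */

	if (regulator_err) {
		err = regulator_err;
		goto out_clk_disable;
	}

	return 0;

out_clk_disable:
	clk_enabled = 0;	/* undo the earlier step before returning */
	return err;
}
```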

[PATCH v6 1/2] hwmon: (ucd9000) Add gpio chip interface

2018-03-16 Thread Eddie James

From: Christopher Bostic 

Add a struct gpio_chip and define some methods so that this device's
I/O can be accessed via /sys/class/gpio.

Signed-off-by: Christopher Bostic 
Signed-off-by: Andrew Jeffery 
Signed-off-by: Eddie James 
---
 drivers/hwmon/pmbus/ucd9000.c | 212 ++
 1 file changed, 212 insertions(+)

diff --git a/drivers/hwmon/pmbus/ucd9000.c b/drivers/hwmon/pmbus/ucd9000.c
index b74dbec..ef2c5bf 100644
--- a/drivers/hwmon/pmbus/ucd9000.c
+++ b/drivers/hwmon/pmbus/ucd9000.c
@@ -27,6 +27,8 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 #include "pmbus.h"
 
 enum chips { ucd9000, ucd90120, ucd90124, ucd90160, ucd9090, ucd90910 };
@@ -35,8 +37,18 @@
 #define UCD9000_NUM_PAGES  0xd6
 #define UCD9000_FAN_CONFIG_INDEX   0xe7
 #define UCD9000_FAN_CONFIG 0xe8
+#define UCD9000_GPIO_SELECT  0xfa
+#define UCD9000_GPIO_CONFIG  0xfb
 #define UCD9000_DEVICE_ID  0xfd
 
+/* GPIO CONFIG bits */
+#define UCD9000_GPIO_CONFIG_ENABLE BIT(0)
+#define UCD9000_GPIO_CONFIG_OUT_ENABLE BIT(1)
+#define UCD9000_GPIO_CONFIG_OUT_VALUE  BIT(2)
+#define UCD9000_GPIO_CONFIG_STATUS BIT(3)
+#define UCD9000_GPIO_INPUT 0
+#define UCD9000_GPIO_OUTPUT 1
+
 #define UCD9000_MON_TYPE(x) (((x) >> 5) & 0x07)
 #define UCD9000_MON_PAGE(x) ((x) & 0x0f)
 
@@ -47,9 +59,17 @@
 
 #define UCD9000_NUM_FAN 4
 
+#define UCD9000_GPIO_NAME_LEN  16
+#define UCD9090_NUM_GPIOS  23
+#define UCD901XX_NUM_GPIOS 26
+#define UCD90910_NUM_GPIOS 26
+
 struct ucd9000_data {
u8 fan_data[UCD9000_NUM_FAN][I2C_SMBUS_BLOCK_MAX];
struct pmbus_driver_info info;
+#ifdef CONFIG_GPIOLIB
+   struct gpio_chip gpio;
+#endif
 };
 #define to_ucd9000_data(_info) container_of(_info, struct ucd9000_data, info)
 
@@ -149,6 +169,196 @@ static int ucd9000_read_byte_data(struct i2c_client 
*client, int page, int reg)
 };
 MODULE_DEVICE_TABLE(of, ucd9000_of_match);
 
+#ifdef CONFIG_GPIOLIB
+static int ucd9000_gpio_read_config(struct i2c_client *client,
+   unsigned int offset)
+{
+   int ret;
+
+   /* No page set required */
+   ret = i2c_smbus_write_byte_data(client, UCD9000_GPIO_SELECT, offset);
+   if (ret < 0)
+   return ret;
+
+   return i2c_smbus_read_byte_data(client, UCD9000_GPIO_CONFIG);
+}
+
+static int ucd9000_gpio_get(struct gpio_chip *gc, unsigned int offset)
+{
+   struct i2c_client *client  = gpiochip_get_data(gc);
+   int ret;
+
+   ret = ucd9000_gpio_read_config(client, offset);
+   if (ret < 0)
+   return ret;
+
+   return !!(ret & UCD9000_GPIO_CONFIG_STATUS);
+}
+
+static void ucd9000_gpio_set(struct gpio_chip *gc, unsigned int offset,
+int value)
+{
+   struct i2c_client *client = gpiochip_get_data(gc);
+   int ret;
+
+   ret = ucd9000_gpio_read_config(client, offset);
+   if (ret < 0) {
+   dev_dbg(&client->dev, "failed to read GPIO %d config: %d\n",
+   offset, ret);
+   return;
+   }
+
+   if (value) {
+   if (ret & UCD9000_GPIO_CONFIG_STATUS)
+   return;
+
+   ret |= UCD9000_GPIO_CONFIG_STATUS;
+   } else {
+   if (!(ret & UCD9000_GPIO_CONFIG_STATUS))
+   return;
+
+   ret &= ~UCD9000_GPIO_CONFIG_STATUS;
+   }
+
+   ret |= UCD9000_GPIO_CONFIG_ENABLE;
+
+   /* Page set not required */
+   ret = i2c_smbus_write_byte_data(client, UCD9000_GPIO_CONFIG, ret);
+   if (ret < 0) {
+   dev_dbg(&client->dev, "Failed to write GPIO %d config: %d\n",
+   offset, ret);
+   return;
+   }
+
+   ret &= ~UCD9000_GPIO_CONFIG_ENABLE;
+
+   ret = i2c_smbus_write_byte_data(client, UCD9000_GPIO_CONFIG, ret);
+   if (ret < 0)
+   dev_dbg(&client->dev, "Failed to write GPIO %d config: %d\n",
+   offset, ret);
+}
+
+static int ucd9000_gpio_get_direction(struct gpio_chip *gc,
+ unsigned int offset)
+{
+   struct i2c_client *client = gpiochip_get_data(gc);
+   int ret;
+
+   ret = ucd9000_gpio_read_config(client, offset);
+   if (ret < 0)
+   return ret;
+
+   return !(ret & UCD9000_GPIO_CONFIG_OUT_ENABLE);
+}
+
+static int ucd9000_gpio_set_direction(struct gpio_chip *gc,
+ unsigned int offset, bool direction_out,
+ int requested_out)
+{
+   struct i2c_client *client = gpiochip_get_data(gc);
+   int ret, config, out_val;
+
+   ret = ucd9000_gpio_read_config(client, offset);
+   if (ret < 0)
+   return ret;
+
+   if (direction_out) {
+   out_val = requested_out ? UCD9000_GPIO_CONFIG_OUT_VALUE : 0;
+
+   if (ret & UCD900

Re: [PATCH v5 0/2] Remove false-positive VLAs when using max()

2018-03-16 Thread Linus Torvalds

On Fri, Mar 16, 2018 at 10:55 AM, Al Viro  wrote:
>
> That's not them, that's C standard regarding ICE.

Yes. The C standard talks about "integer constant expression". I know.
It's come up in this very thread before.

The C standard at no point talks about - or forbids - "variable length
arrays". That never comes up in the whole standard, I checked.

So we are right now hindered by a _syntactic_ check, without any way
to have a _semantic_ check.

That's my problem. The warnings are misleading and imply semantics.

And apparently there is no way to actually check semantics.

> 1,100 is *not* a constant expression as far as the standard is concerned,

I very much know.

But it sure isn't "variable" either as far as the standard is
concerned, because the standard doesn't even have that concept (it
uses "variable" for argument numbers and for variables).

So being pedantic doesn't really change anything.

> Would you argue that in
> void foo(char c)
> {
> int a[(c<<1) + 10 - c + 2 - c];

Yeah, I don't think that even counts as a constant value, even if it
can be optimized to one. I would not at all be unhappy to see
__builtin_constant_p() return zero.

But that is very much different from the syntax issue.

So I would like to get some way to get both type-checking and constant
checking without the annoying syntax issue.

> expr, constant_expression is not a constant_expression.  And in
> this particular case the standard is not insane - the only reason
> for using that is typechecking and _that_ can be achieved without
> violating 6.6p6:
> sizeof(expr,0) * 0 + ICE
> *is* an integer constant expression, and it gives you exact same
> typechecking.  So if somebody wants to play odd games, they can
> do that just fine, without complicating the logics for compilers...

Now that actually looks like a good trick. Maybe we can use that
instead of the comma expression that causes problems.

And using sizeof() to make sure that __builtin_choose_expr()
really gets an integer constant expression and that there should be no
ambiguity looks good.
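
A minimal sketch of the sizeof() trick quoted above, assuming GCC or
Clang. The names TYPECHECKED_SIZE, table and counter are hypothetical,
made up for illustration, and are not part of the patch. The array is
declared at file scope, where a VLA is not allowed, so it should only
compile if the size really is an integer constant expression:

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper, not from the patch: 'expr' is parsed and
 * type-checked but never evaluated, because it only appears inside
 * the operand of sizeof. The sizeof term is multiplied by 0, so the
 * value is just 'n' (converted to size_t).
 */
#define TYPECHECKED_SIZE(expr, n) (sizeof((expr), 0) * 0 + (n))

static int counter;

/* File scope: no VLAs allowed here, so the size must be an ICE. */
static int table[TYPECHECKED_SIZE(counter++, 100)];

size_t table_len(void)
{
	return sizeof(table) / sizeof(table[0]);
}

int counter_value(void)
{
	return counter;
}

/* Small self-check: 100 elements, and counter++ never ran. */
void check_table(void)
{
	assert(table_len() == 100);
	assert(counter_value() == 0);
}
```

The comma operator is still there, but it sits inside sizeof, which is
an unevaluated operand, so it does not stop the whole expression from
being a constant expression.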

Hmm.

This works for me, and I'm being *very* careful (those casts to
pointer types are inside that sizeof, because it's not an integral
type, and non-integral casts are not valid in an ICE either) but
somebody needs to check gcc-4.4:

  #define __typecheck(a,b) \
(!!(sizeof((typeof(a)*)1==(typeof(b)*)1)))

  #define __no_side_effects(a,b) \
(__builtin_constant_p(a)&&__builtin_constant_p(b))

  #define __safe_cmp(a,b) \
(__typecheck(a,b) && __no_side_effects(a,b))

  #define __cmp(a,b,op) ((a)op(b)?(a):(b))

  #define __cmp_once(a,b,op) ({ \
typeof(a) __a = (a);\
typeof(b) __b = (b);\
__cmp(__a,__b,op); })

  #define __careful_cmp(a,b,op) \
__builtin_choose_expr(__safe_cmp(a,b), __cmp(a,b,op), \
__cmp_once(a,b,op))

  #define min(a,b)  __careful_cmp(a,b,<)
  #define max(a,b)  __careful_cmp(a,b,>)
  #define min_t(t,a,b)  __careful_cmp((t)(a),(t)(b),<)
  #define max_t(t,a,b)  __careful_cmp((t)(a),(t)(b),>)

and yes, it does cause new warnings for that

comparison between ‘enum tis_defaults’ and ‘enum tpm2_const’

in drivers/char/tpm/tpm_tis_core.h due to

   #define TIS_TIMEOUT_A_MAX   max(TIS_SHORT_TIMEOUT, TPM2_TIMEOUT_A)

but technically that warning is actually correct, I'm just confused
why gcc cares about the cast placement or something.

That warning is easy to fix by turning it into a "max_t(int, enum1,
enum2)" and that is technically the right thing to do, it's just not
warned about for some odd reason with the current code.
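
For reference, a small sketch of why the macros above fall back to
__cmp_once() when the arguments are not compile-time constants. It
assumes GNU C (statement expressions and __typeof__). CMP, CMP_ONCE
and next_value() are hypothetical stand-ins for illustration, not
the kernel's names:

```c
#include <assert.h>

/* Same shape as __cmp above: each argument may be evaluated twice. */
#define CMP(a, b, op) ((a) op (b) ? (a) : (b))

/* Same shape as __cmp_once above: each argument is evaluated once. */
#define CMP_ONCE(a, b, op) ({		\
	__typeof__(a) a_ = (a);		\
	__typeof__(b) b_ = (b);		\
	CMP(a_, b_, op); })

static int calls;

/* An argument with a side effect: returns 1, 2, 3, ... */
static int next_value(void)
{
	return ++calls;
}

int max_plain(void)
{
	calls = 0;
	return CMP(next_value(), 0, >);
}

int max_once(void)
{
	calls = 0;
	return CMP_ONCE(next_value(), 0, >);
}

int call_count(void)
{
	return calls;
}

void check_cmp(void)
{
	/* Plain macro: called for the test and again for the result. */
	assert(max_plain() == 2 && call_count() == 2);
	/* Once-only macro: a single call, result is the first value. */
	assert(max_once() == 1 && call_count() == 1);
}
```

The plain form is only safe when both arguments are side-effect
free, which is what the __builtin_constant_p() test stands in for.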

Kees - is there some online "gcc-4.4 checker" somewhere? This does
seem to work with my gcc. I actually tested some of those files you
pointed at now.

  Linus
 include/linux/kernel.h | 77 +-
 1 file changed, 20 insertions(+), 57 deletions(-)

diff --git a/include/linux/kernel.h b/include/linux/kernel.h
index 3fd291503576..23c31bf1d7fb 100644
--- a/include/linux/kernel.h
+++ b/include/linux/kernel.h
@@ -787,37 +787,29 @@ static inline void ftrace_dump(enum ftrace_dump_mode oops_dump_mode) { }
  * strict type-checking.. See the
  * "unnecessary" pointer comparison.
  */
-#define __min(t1, t2, min1, min2, x, y) ({		\
-	t1 min1 = (x);	\
-	t2 min2 = (y);	\
-	(void) (&min1 == &min2);			\
-	min1 < min2 ? min1 : min2; })
+#define __typecheck(a,b) \
+	(!!(sizeof((typeof(a)*)1==(typeof(b)*)1)))

-/**
- * min - return minimum of two values of the same or compatible types
- * @x: first value
- * @y: second value
- */
-#define min(x, y)	\
-	__min(typeof(x), typeof(y),			\
-	  __UNIQUE_ID(min1_), __UNIQUE_ID(min2_),	\
-	  x, y)
+#define __no_side_effects(a,b) \
+	(__builtin_constant_p(a)&&__builtin_constant_p(b))

-#define __max(t1, t2, max1, max2, x, y) ({		\
-	t1 max1 = (x);	\
-	t2 max2 = (y);	\
-	(void) (&max1 == &max2);			\
-	max1 > max2

[PATCH 30/35] x86/ldt: Define LDT_END_ADDR

2018-03-16 Thread Joerg Roedel

From: Joerg Roedel 

It marks the end of the address-space range reserved for the
LDT. The LDT-code will use it when unmapping the LDT for
user-space.

Signed-off-by: Joerg Roedel 
---
 arch/x86/include/asm/pgtable_32_types.h | 2 ++
 arch/x86/include/asm/pgtable_64_types.h | 2 ++
 arch/x86/kernel/ldt.c   | 2 +-
 3 files changed, 5 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/pgtable_32_types.h b/arch/x86/include/asm/pgtable_32_types.h
index eb2e97a..02bd445 100644
--- a/arch/x86/include/asm/pgtable_32_types.h
+++ b/arch/x86/include/asm/pgtable_32_types.h
@@ -51,6 +51,8 @@ extern bool __vmalloc_start_set; /* set once high_memory is set */
 #define LDT_BASE_ADDR  \
((CPU_ENTRY_AREA_BASE - PAGE_SIZE) & PMD_MASK)
 
+#define LDT_END_ADDR   (LDT_BASE_ADDR + PMD_SIZE)
+
 #define PKMAP_BASE \
((LDT_BASE_ADDR - PAGE_SIZE) & PMD_MASK)
 
diff --git a/arch/x86/include/asm/pgtable_64_types.h b/arch/x86/include/asm/pgtable_64_types.h
index e57003a..15188baa 100644
--- a/arch/x86/include/asm/pgtable_64_types.h
+++ b/arch/x86/include/asm/pgtable_64_types.h
@@ -90,12 +90,14 @@ typedef struct { pteval_t pte; } pte_t;
 # define __VMEMMAP_BASE_AC(0xffd4, UL)
 # define LDT_PGD_ENTRY _AC(-112, UL)
 # define LDT_BASE_ADDR (LDT_PGD_ENTRY << PGDIR_SHIFT)
+#define  LDT_END_ADDR  (LDT_BASE_ADDR + PGDIR_SIZE)
 #else
 # define VMALLOC_SIZE_TB   _AC(32, UL)
 # define __VMALLOC_BASE_AC(0xc900, UL)
 # define __VMEMMAP_BASE_AC(0xea00, UL)
 # define LDT_PGD_ENTRY _AC(-3, UL)
 # define LDT_BASE_ADDR (LDT_PGD_ENTRY << PGDIR_SHIFT)
+#define  LDT_END_ADDR  (LDT_BASE_ADDR + PGDIR_SIZE)
 #endif
 
 #ifdef CONFIG_RANDOMIZE_MEMORY
diff --git a/arch/x86/kernel/ldt.c b/arch/x86/kernel/ldt.c
index 26d713e..f3c2fbf 100644
--- a/arch/x86/kernel/ldt.c
+++ b/arch/x86/kernel/ldt.c
@@ -202,7 +202,7 @@ static void free_ldt_pgtables(struct mm_struct *mm)
 #ifdef CONFIG_PAGE_TABLE_ISOLATION
struct mmu_gather tlb;
unsigned long start = LDT_BASE_ADDR;
-   unsigned long end = start + (1UL << PGDIR_SHIFT);
+   unsigned long end = LDT_END_ADDR;
 
if (!static_cpu_has(X86_FEATURE_PTI))
return;
-- 
2.7.4

[PATCH 04/35] x86/entry/32: Put ESPFIX code into a macro

2018-03-16 Thread Joerg Roedel

From: Joerg Roedel 

This makes it easier to split up the shared iret code path.

Signed-off-by: Joerg Roedel 
---
 arch/x86/entry/entry_32.S | 97 ---
 1 file changed, 49 insertions(+), 48 deletions(-)

diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S
index e659776..0289bde 100644
--- a/arch/x86/entry/entry_32.S
+++ b/arch/x86/entry/entry_32.S
@@ -221,6 +221,54 @@
POP_GS_EX
 .endm
 
+.macro CHECK_AND_APPLY_ESPFIX
+#ifdef CONFIG_X86_ESPFIX32
+#define GDT_ESPFIX_SS PER_CPU_VAR(gdt_page) + (GDT_ENTRY_ESPFIX_SS * 8)
+
+   ALTERNATIVE "jmp .Lend_\@", "", X86_BUG_ESPFIX
+
+   movl PT_EFLAGS(%esp), %eax   # mix EFLAGS, SS and CS
+   /*
+* Warning: PT_OLDSS(%esp) contains the wrong/random values if we
+* are returning to the kernel.
+* See comments in process.c:copy_thread() for details.
+*/
+   movb PT_OLDSS(%esp), %ah
+   movb PT_CS(%esp), %al
+   andl $(X86_EFLAGS_VM | (SEGMENT_TI_MASK << 8) | SEGMENT_RPL_MASK), %eax
+   cmpl $((SEGMENT_LDT << 8) | USER_RPL), %eax
+   jne .Lend_\@   # returning to user-space with LDT SS
+
+   /*
+* Setup and switch to ESPFIX stack
+*
+* We're returning to userspace with a 16 bit stack. The CPU will not
+* restore the high word of ESP for us on executing iret... This is an
+* "official" bug of all the x86-compatible CPUs, which we can work
+* around to make dosemu and wine happy. We do this by preloading the
+* high word of ESP with the high word of the userspace ESP while
+* compensating for the offset by changing to the ESPFIX segment with
+* a base address that matches for the difference.
+*/
+   mov %esp, %edx  /* load kernel esp */
+   mov PT_OLDESP(%esp), %eax   /* load userspace esp */
+   mov %dx, %ax   /* eax: new kernel esp */
+   sub %eax, %edx  /* offset (low word is 0) */
+   shr $16, %edx
+   mov %dl, GDT_ESPFIX_SS + 4  /* bits 16..23 */
+   mov %dh, GDT_ESPFIX_SS + 7  /* bits 24..31 */
+   pushl   $__ESPFIX_SS
+   pushl   %eax   /* new kernel esp */
+   /*
+* Disable interrupts, but do not irqtrace this section: we
+* will soon execute iret and the tracer was already set to
+* the irqstate after the IRET:
+*/
+   DISABLE_INTERRUPTS(CLBR_ANY)
+   lss (%esp), %esp   /* switch to espfix segment */
+.Lend_\@:
+#endif /* CONFIG_X86_ESPFIX32 */
+.endm
 /*
  * %eax: prev task
  * %edx: next task
@@ -548,21 +596,7 @@ ENTRY(entry_INT80_32)
 restore_all:
TRACE_IRQS_IRET
 .Lrestore_all_notrace:
-#ifdef CONFIG_X86_ESPFIX32
-   ALTERNATIVE "jmp .Lrestore_nocheck", "", X86_BUG_ESPFIX
-
-   movl PT_EFLAGS(%esp), %eax   # mix EFLAGS, SS and CS
-   /*
-* Warning: PT_OLDSS(%esp) contains the wrong/random values if we
-* are returning to the kernel.
-* See comments in process.c:copy_thread() for details.
-*/
-   movb PT_OLDSS(%esp), %ah
-   movb PT_CS(%esp), %al
-   andl $(X86_EFLAGS_VM | (SEGMENT_TI_MASK << 8) | SEGMENT_RPL_MASK), %eax
-   cmpl $((SEGMENT_LDT << 8) | USER_RPL), %eax
-   je .Lldt_ss # returning to user-space with LDT SS
-#endif
+   CHECK_AND_APPLY_ESPFIX
 .Lrestore_nocheck:
RESTORE_REGS 4  # skip orig_eax/error_code
 .Lirq_return:
@@ -575,39 +609,6 @@ ENTRY(iret_exc )
jmp common_exception
 .previous
_ASM_EXTABLE(.Lirq_return, iret_exc)
-
-#ifdef CONFIG_X86_ESPFIX32
-.Lldt_ss:
-/*
- * Setup and switch to ESPFIX stack
- *
- * We're returning to userspace with a 16 bit stack. The CPU will not
- * restore the high word of ESP for us on executing iret... This is an
- * "official" bug of all the x86-compatible CPUs, which we can work
- * around to make dosemu and wine happy. We do this by preloading the
- * high word of ESP with the high word of the userspace ESP while
- * compensating for the offset by changing to the ESPFIX segment with
- * a base address that matches for the difference.
- */
-#define GDT_ESPFIX_SS PER_CPU_VAR(gdt_page) + (GDT_ENTRY_ESPFIX_SS * 8)
-   mov %esp, %edx  /* load kernel esp */
-   mov PT_OLDESP(%esp), %eax   /* load userspace esp */
-   mov %dx, %ax   /* eax: new kernel esp */
-   sub %eax, %edx  /* offset (low word is 0) */
-   shr $16, %edx
-   mov %dl, GDT_ESPFIX_SS + 4  /* bits 16..23 */
-   mov %dh, GDT_ESPFIX_SS + 7  /* bits 24..31 */
-   pushl   $__ESPFIX_SS
-   pushl   %eax/* new kernel esp

[PATCH 26/35] x86/mm/pti: Clone CPU_ENTRY_AREA on PMD level on x86_32

2018-03-16 Thread Joerg Roedel

From: Joerg Roedel 

Cloning on the P4D level would clone the complete kernel
address space into the user-space page-tables for PAE
kernels. Cloning on PMD level is fine for PAE and legacy
paging.

Signed-off-by: Joerg Roedel 
---
 arch/x86/mm/pti.c | 20 
 1 file changed, 20 insertions(+)

diff --git a/arch/x86/mm/pti.c b/arch/x86/mm/pti.c
index 96a690e..3ffd923 100644
--- a/arch/x86/mm/pti.c
+++ b/arch/x86/mm/pti.c
@@ -312,6 +312,7 @@ pti_clone_pmds(unsigned long start, unsigned long end, pmdval_t clear)
}
 }
 
+#ifdef CONFIG_X86_64
 /*
  * Clone a single p4d (i.e. a top-level entry on 4-level systems and a
  * next-level entry on 5-level systems.
@@ -335,6 +336,25 @@ static void __init pti_clone_user_shared(void)
pti_clone_p4d(CPU_ENTRY_AREA_BASE);
 }
 
+#else /* CONFIG_X86_64 */
+
+/*
+ * On 32 bit PAE systems with 1GB of Kernel address space there is only
+ * one pgd/p4d for the whole kernel. Cloning that would map the whole
+ * address space into the user page-tables, making PTI useless. So clone
+ * the page-table on the PMD level to prevent that.
+ */
+static void __init pti_clone_user_shared(void)
+{
+   unsigned long start, end;
+
+   start = CPU_ENTRY_AREA_BASE;
+   end   = start + (PAGE_SIZE * CPU_ENTRY_AREA_PAGES);
+
+   pti_clone_pmds(start, end, _PAGE_GLOBAL);
+}
+#endif /* CONFIG_X86_64 */
+
 /*
  * Clone the ESPFIX P4D into the user space visinble page table
  */
-- 
2.7.4

[PATCH 32/35] x86/ldt: Enable LDT user-mapping for PAE

2018-03-16 Thread Joerg Roedel

From: Joerg Roedel 

This adds the needed special case for PAE to get the LDT
mapped into the user page-table when PTI is enabled. The big
difference to the other paging modes is that we don't have a
full top-level PGD entry available for the LDT, but only a PMD
entry.

Signed-off-by: Joerg Roedel 
---
 arch/x86/include/asm/mmu_context.h |  4 ---
 arch/x86/kernel/ldt.c  | 53 ++
 2 files changed, 53 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
index c931b88..af96cfb 100644
--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -70,11 +70,7 @@ struct ldt_struct {
 
 static inline void *ldt_slot_va(int slot)
 {
-#ifdef CONFIG_X86_64
return (void *)(LDT_BASE_ADDR + LDT_SLOT_STRIDE * slot);
-#else
-   BUG();
-#endif
 }
 
 /*
diff --git a/arch/x86/kernel/ldt.c b/arch/x86/kernel/ldt.c
index 8ab7df9..7787451 100644
--- a/arch/x86/kernel/ldt.c
+++ b/arch/x86/kernel/ldt.c
@@ -126,6 +126,57 @@ static void do_sanity_check(struct mm_struct *mm,
}
 }
 
+#ifdef CONFIG_X86_PAE
+
+static pmd_t *pgd_to_pmd_walk(pgd_t *pgd, unsigned long va)
+{
+   p4d_t *p4d;
+   pud_t *pud;
+
+   if (pgd->pgd == 0)
+   return NULL;
+
+   p4d = p4d_offset(pgd, va);
+   if (p4d_none(*p4d))
+   return NULL;
+
+   pud = pud_offset(p4d, va);
+   if (pud_none(*pud))
+   return NULL;
+
+   return pmd_offset(pud, va);
+}
+
+static void map_ldt_struct_to_user(struct mm_struct *mm)
+{
+   pgd_t *k_pgd = pgd_offset(mm, LDT_BASE_ADDR);
+   pgd_t *u_pgd = kernel_to_user_pgdp(k_pgd);
+   pmd_t *k_pmd, *u_pmd;
+
+   k_pmd = pgd_to_pmd_walk(k_pgd, LDT_BASE_ADDR);
+   u_pmd = pgd_to_pmd_walk(u_pgd, LDT_BASE_ADDR);
+
+   if (static_cpu_has(X86_FEATURE_PTI) && !mm->context.ldt)
+   set_pmd(u_pmd, *k_pmd);
+}
+
+static void sanity_check_ldt_mapping(struct mm_struct *mm)
+{
+   pgd_t *k_pgd = pgd_offset(mm, LDT_BASE_ADDR);
+   pgd_t *u_pgd = kernel_to_user_pgdp(k_pgd);
+   bool had_kernel, had_user;
+   pmd_t *k_pmd, *u_pmd;
+
+   k_pmd  = pgd_to_pmd_walk(k_pgd, LDT_BASE_ADDR);
+   u_pmd  = pgd_to_pmd_walk(u_pgd, LDT_BASE_ADDR);
+   had_kernel = (k_pmd->pmd != 0);
+   had_user   = (u_pmd->pmd != 0);
+
+   do_sanity_check(mm, had_kernel, had_user);
+}
+
+#else /* !CONFIG_X86_PAE */
+
 static void map_ldt_struct_to_user(struct mm_struct *mm)
 {
pgd_t *pgd = pgd_offset(mm, LDT_BASE_ADDR);
@@ -143,6 +194,8 @@ static void sanity_check_ldt_mapping(struct mm_struct *mm)
do_sanity_check(mm, had_kernel, had_user);
 }
 
+#endif /* CONFIG_X86_PAE */
+
 /*
  * If PTI is enabled, this maps the LDT into the kernelmode and
  * usermode tables for the given mm.
-- 
2.7.4

[PATCH 33/35] x86/pti: Allow CONFIG_PAGE_TABLE_ISOLATION for x86_32

2018-03-16 Thread Joerg Roedel

From: Joerg Roedel 

Allow PTI to be compiled on x86_32.

Signed-off-by: Joerg Roedel 
---
 security/Kconfig | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/security/Kconfig b/security/Kconfig
index b0cb9a5..93d85fd 100644
--- a/security/Kconfig
+++ b/security/Kconfig
@@ -57,7 +57,7 @@ config SECURITY_NETWORK
 config PAGE_TABLE_ISOLATION
bool "Remove the kernel mapping in user mode"
default y
-   depends on X86_64 && !UML
+   depends on X86 && !UML
help
  This feature reduces the number of hardware side channels by
  ensuring that the majority of kernel addresses are not mapped
-- 
2.7.4
