date:20190612

Re: [Xen-devel] [RFC PATCH 02/16] x86/xen: cpuid support in xenhost_t

2019-06-12 Thread Andrew Cooper

On 09/05/2019 18:25, Ankur Arora wrote:
> xen_cpuid_base() is used to probe and setup features early in a
> guest's lifetime.
>
> We want this to behave differently depending on xenhost->type: for
> instance, local xenhosts cannot intercept the cpuid instruction at all.
>
> Add op (*cpuid_base)() in xenhost_ops_t.
>
> Signed-off-by: Ankur Arora 

What is the real layout of hypervisor nesting here?

When Xen is at L0, all HVM guests get working CPUID faulting to combat
this problem, because CPUID faulting can be fully emulated even on older
Intel hardware, and AMD hardware.

It is a far cleaner way of fixing the problem.

~Andrew

Re: [Xen-devel] [RFC PATCH 04/16] x86/xen: hypercall support for xenhost_t

2019-06-12 Thread Andrew Cooper

On 09/05/2019 18:25, Ankur Arora wrote:
> Allow for different hypercall implementations for different xenhost types.
> Nested xenhost, which has two underlying xenhosts, can use both
> simultaneously.
>
> The hypercall macros (HYPERVISOR_*) implicitly use the default xenhost.
> A new macro (hypervisor_*) takes xenhost_t * as a parameter and does the
> right thing.
>
> TODO:
>   - Multicalls for now assume the default xenhost
>   - xen_hypercall_* symbols are only generated for the default xenhost.
>
> Signed-off-by: Ankur Arora 

Again, what is the hypervisor nesting and/or guest layout here?

I can't think of any case where a single piece of software can
legitimately have two hypercall pages, because if it has one working
one, it is by definition a guest, and therefore not privileged enough to
use the outer one.

~Andrew

Re: [RESEND PATCH v1 1/5] of/platform: Speed up of_find_device_by_node()

2019-06-12 Thread Rob Herring

On Wed, Jun 12, 2019 at 1:29 PM Saravana Kannan  wrote:
>
> On Wed, Jun 12, 2019 at 11:19 AM Rob Herring  wrote:
> >
> > On Wed, Jun 12, 2019 at 11:08 AM Greg Kroah-Hartman
> >  wrote:
> > >
> > > On Wed, Jun 12, 2019 at 10:53:09AM -0600, Rob Herring wrote:
> > > > On Wed, Jun 12, 2019 at 8:22 AM Greg Kroah-Hartman
> > > >  wrote:
> > > > >
> > > > > On Wed, Jun 12, 2019 at 07:53:39AM -0600, Rob Herring wrote:
> > > > > > On Tue, Jun 11, 2019 at 3:52 PM Sandeep Patil  
> > > > > > wrote:
> > > > > > >
> > > > > > > On Tue, Jun 11, 2019 at 01:56:25PM -0700, 'Saravana Kannan' via 
> > > > > > > kernel-team wrote:
> > > > > > > > On Tue, Jun 11, 2019 at 8:18 AM Frank Rowand 
> > > > > > > >  wrote:
> > > > > > > > >
> > > > > > > > > Hi Saravana,
> > > > > > > > >
> > > > > > > > > On 6/10/19 10:36 AM, Rob Herring wrote:
> > > > > > > > > > Why are you resending this rather than replying to Frank's 
> > > > > > > > > > last
> > > > > > > > > > comments on the original?
> > > > > > > > >
> > > > > > > > > Adding on a different aspect...  The independent replies from 
> > > > > > > > > three different
> > > > > > > > > maintainers (Rob, Mark, myself) pointed out architectural 
> > > > > > > > > issues with the
> > > > > > > > > patch series.  There were also some implementation issues 
> > > > > > > > > brought out.
> > > > > > > > > (Although I refrained from bringing up most of my 
> > > > > > > > > implementation issues
> > > > > > > > > as they are not relevant until architecture issues are 
> > > > > > > > > resolved.)
> > > > > > > >
> > > > > > > > Right, I'm not too worried about the implementation issues 
> > > > > > > > before we
> > > > > > > > settle on the architectural issues. Those are easy to fix.
> > > > > > > >
> > > > > > > > Honestly, the main points that the maintainers raised are:
> > > > > > > > 1) This is a configuration property and not describing the 
> > > > > > > > device.
> > > > > > > > Just use the implicit dependencies coming from existing 
> > > > > > > > bindings.
> > > > > > > >
> > > > > > > > I gave a bunch of reasons for why I think it isn't an OS 
> > > > > > > > configuration
> > > > > > > > property. But even if that's not something the maintainers can 
> > > > > > > > agree
> > > > > > > > to, I gave a concrete example (cyclic dependencies between clock
> > > > > > > > provider hardware) where the implicit dependencies would 
> > > > > > > > prevent one
> > > > > > > > of the devices from probing till the end of time. So even if the
> > > > > > > > maintainers don't agree we should always look at "depends-on" to
> > > > > > > > decide the dependencies, we still need some means to override 
> > > > > > > > the
> > > > > > > > implicit dependencies where they don't match the real 
> > > > > > > > dependency. Can
> > > > > > > > we use depends-on as an override when the implicit dependencies 
> > > > > > > > aren't
> > > > > > > > correct?
> > > > > > > >
> > > > > > > > 2) This doesn't need to be solved because this is just 
> > > > > > > > optimizing
> > > > > > > > probing or saving power ("we should get rid of this auto 
> > > > > > > > disabling"):
> > > > > > > >
> > > > > > > > I explained why this patch series is not just about optimizing 
> > > > > > > > probe
> > > > > > > > ordering or saving power. And why we can't ignore auto disabling
> > > > > > > > (because it's more than just auto disabling). The kernel is 
> > > > > > > > currently
> > > > > > > > broken when trying to use modules in ARM SoCs (probably in other
> > > > > > > > systems/archs too, but I can't speak for those).
> > > > > > > >
> > > > > > > > 3) Concerns about backwards compatibility
> > > > > > > >
> > > > > > > > I pointed out why the current scheme (depends-on being the only 
> > > > > > > > source
> > > > > > > > of dependency) doesn't break compatibility. And if we go with
> > > > > > > > "depends-on" as an override what we could do to keep backwards
> > > > > > > > compatibility. Happy to hear more thoughts or discuss options.
> > > > > > > >
> > > > > > > > 4) How the "sync_state" would work for a device that supplies 
> > > > > > > > multiple
> > > > > > > > functionalities but a limited driver.
> > > > > > >
> > > > > > > 
> > > > > > > > To be clear, all of the above are _real_ problems that stop us from
> > > > > > > > efficiently loading device drivers as modules for Android.
> > > > > > >
> > > > > > > So, if 'depends-on' doesn't seem like the right approach and 
> > > > > > > "going back to
> > > > > > > the drawing board" is the ask, could you please point us in the 
> > > > > > > right
> > > > > > > direction?
> > > > > >
> > > > > > Use the dependencies which are already there in DT. That's clocks,
> > > > > > pinctrl, regulators, interrupts, gpio at a minimum. I'm simply not
> > > > > > going to accept duplicating all those dependencies in DT. The 
> > > > > > downside
> > > > > > for the kernel is you have to address these one by one and can't 
> > > > > > have
> > >

Re: [PATCHv5 2/2] mtd: spi-nor: cadence-quadspi: add reset control

2019-06-12 Thread Dinh Nguyen




On 6/12/19 10:07 AM, tudor.amba...@microchip.com wrote:
> 
> 
> On 06/12/2019 05:37 PM, Dinh Nguyen wrote:
>> Get the reset control properties for the QSPI controller and bring them
>> out of reset. Most will have just one reset bit, but there is an additional
>> OCP reset bit that is used for ECC. The OCP reset bit also needs to be
>> de-asserted. [1]
>>
>> This patch is needed for the case where a bootloader leaves the QSPI
>> controller in a reset state, or in a state where init cannot complete
>> successfully; it puts the QSPI controller into a clean state.
>>
>> [1] 
>> https://www.intel.com/content/www/us/en/programmable/hps/arria-10/hps.html#reg_soc_top/sfo1429890575955.html
>>
>> Suggested-by: Tien-Fong Chee 
>> Signed-off-by: Dinh Nguyen 
>> ---
>> v5: remove udelay(not needed) on tested hardware
>> group reset assert/deassert together
>> update commit message with reasoning for patch
>> v4: fix compile error
>> v3: return full error by using PTR_ERR(rstc)
>> move reset control calls until after the clock enables
>> use udelay(2) to be safe
>> Add optional OCP(Open Core Protocol) reset signal
>> v2: use devm_reset_control_get_optional_exclusive
>> print an error message
>> return -EPROBE_DEFER
>> ---
>>  drivers/mtd/spi-nor/cadence-quadspi.c | 26 ++
>>  1 file changed, 26 insertions(+)
>>
>> diff --git a/drivers/mtd/spi-nor/cadence-quadspi.c 
>> b/drivers/mtd/spi-nor/cadence-quadspi.c
>> index 792628750eec..f8b1009e488c 100644
>> --- a/drivers/mtd/spi-nor/cadence-quadspi.c
>> +++ b/drivers/mtd/spi-nor/cadence-quadspi.c
>> @@ -34,6 +34,7 @@
>>  #include 
>>  #include 
>>  #include 
>> +#include 
>>  #include 
>>  #include 
>>  #include 
>> @@ -1336,6 +1337,8 @@ static int cqspi_probe(struct platform_device *pdev)
>>  struct cqspi_st *cqspi;
>>  struct resource *res;
>>  struct resource *res_ahb;
>> +struct reset_control *rstc;
>> +struct reset_control *rstc_ocp;
>>  const struct cqspi_driver_platdata *ddata;
>>  int ret;
>>  int irq;
>> @@ -1402,6 +1405,29 @@ static int cqspi_probe(struct platform_device *pdev)
>>  goto probe_clk_failed;
>>  }
>>  
>> +/* Obtain QSPI reset control */
>> +rstc = devm_reset_control_get_optional_exclusive(dev, "qspi");
>> +if (IS_ERR(rstc)) {
>> +dev_err(dev, "Cannot get QSPI reset.\n");
>> +return PTR_ERR(rstc);
>> +}
>> +
>> +rstc_ocp = devm_reset_control_get_optional_exclusive(dev, "qspi-ocp");
>> +if (IS_ERR(rstc_ocp)) {
>> +dev_err(dev, "Cannot get QSPI OCP reset.\n");
>> +return PTR_ERR(rstc_ocp);
>> +}
>> +
>> +if (rstc) {
> 
> Hi, Dinh,
> 
> reset_control_assert/deassert() already have checks for null, you can call 
> them
> directly without checking for null.
> 
>> +reset_control_assert(rstc);
>> +reset_control_deassert(rstc);
> 
> Is there any difference between:
> reset_control_assert(rstc);
> reset_control_assert(rstc_ocp);
> 
> reset_control_deassert(rstc);
> reset_control_deassert(rstc_ocp);
> 
> and:
> 
> reset_control_assert(rstc);
> reset_control_deassert(rstc);
> 
> reset_control_assert(rstc_ocp);
> reset_control_deassert(rstc_ocp);
> 
> Which one would you choose?
> 

I prefer grouping the assert/deassert for each reset pointer together.

Dinh
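
For illustration, the two orderings discussed above can be compared in a
small userspace sketch. Everything in it is hypothetical: struct mock_reset,
mock_reset_assert(), mock_reset_deassert() and the event log are stand-ins,
not the kernel reset controller API. The mock helpers accept NULL in the way
the review comment describes for reset_control_assert()/deassert(), so an
absent optional reset needs no explicit check at the call site.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for the kernel's struct reset_control. */
struct mock_reset {
	const char *name;
};

/* Records the order of reset operations so the orderings can be compared. */
static char reset_log[128];

static inline void log_event(const char *what, const char *name)
{
	size_t used = strlen(reset_log);

	snprintf(reset_log + used, sizeof(reset_log) - used, "%s:%s ", what, name);
}

/* A NULL (absent optional) reset is silently accepted. */
static inline int mock_reset_assert(struct mock_reset *rst)
{
	if (!rst)
		return 0;
	log_event("assert", rst->name);
	return 0;
}

static inline int mock_reset_deassert(struct mock_reset *rst)
{
	if (!rst)
		return 0;
	log_event("deassert", rst->name);
	return 0;
}

/* Ordering 1: assert both lines, then deassert both lines. */
static inline void pulse_interleaved(struct mock_reset *a, struct mock_reset *b)
{
	mock_reset_assert(a);
	mock_reset_assert(b);
	mock_reset_deassert(a);
	mock_reset_deassert(b);
}

/* Ordering 2: pulse each line completely before touching the next one. */
static inline void pulse_per_line(struct mock_reset *a, struct mock_reset *b)
{
	mock_reset_assert(a);
	mock_reset_deassert(a);
	mock_reset_assert(b);
	mock_reset_deassert(b);
}
```

As I read the reply above, pulse_per_line() corresponds to the stated
preference. Whether the hardware cares about the difference is exactly the
open question in the review, and this sketch does not answer it.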

Re: [PATCH net-next v2 1/1] net: stmmac: use GPIO descriptors in stmmac_mdio_reset

2019-06-12 Thread Andrew Lunn

On Wed, Jun 12, 2019 at 09:31:15PM +0200, Martin Blumenstingl wrote:
> Switch stmmac_mdio_reset to use GPIO descriptors. GPIO core handles the
> "snps,reset-gpio" for GPIO descriptors so we don't need to take care of
> it inside the driver anymore.
> 
> The advantage of this is that we now preserve the GPIO flags which are
> passed via devicetree. This is required on some newer Amlogic boards
> which use an Open Drain pin for the reset GPIO. This pin can only output
> a LOW signal or switch to input mode but it cannot output a HIGH signal.
> There are already devicetree bindings for these special cases and GPIO
> core already takes care of them but only if we use GPIO descriptors
> instead of GPIO numbers.
> 
> Signed-off-by: Martin Blumenstingl 
> Reviewed-by: Linus Walleij 

Reviewed-by: Andrew Lunn 

Andrew

[GIT PULL] SELinux fixes for v5.2 (#2)

2019-06-12 Thread Paul Moore

Hi Linus,

Three patches for v5.2; one fixes a problem where we weren't correctly
logging raw SELinux labels, the other two fix problems where we
weren't properly checking calls to kmemdup().  Please merge for the
next v5.2-rc release.

Thanks,
-Paul
--
The following changes since commit 05174c95b83f8aca0c47b87115abb7a6387aafa5:

 selinux: do not report error on connect(AF_UNSPEC) (2019-05-20 21:46:02 -0400)

are available in the Git repository at:

 git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/selinux.git
   tags/selinux-pr-20190612

for you to fetch changes up to fec6375320c6399c708fa9801f8cfbf950fee623:

 selinux: fix a missing-check bug in selinux_sb_eat_lsm_opts()
   (2019-06-12 12:27:26 -0400)


selinux/stable-5.2 PR 20190612


Gen Zhang (2):
 selinux: fix a missing-check bug in selinux_add_mnt_opt( )
 selinux: fix a missing-check bug in selinux_sb_eat_lsm_opts()

Ondrej Mosnacek (1):
 selinux: log raw contexts as untrusted strings

security/selinux/avc.c   | 10 --
security/selinux/hooks.c | 39 ---
2 files changed, 36 insertions(+), 13 deletions(-)

-- 
paul moore
www.paul-moore.com

Re: MHI code review

2019-06-12 Thread Daniele Palmas

Hi Sujeev,

On Wed, 12 Jun 2019 at 19:54, Sujeev Dias wrote:
>
> Hi Daniele
>
> Sorry for the delayed response.  Yes, we will be pushing a new series very
> soon that will have support for 55 as well.  The series that's pushed should
> already work for SDX20, 24 and 55.   There are some new features related to
> 55 that are not yet in the series.
>

Great, thanks for the update! I'll wait for your new patch set.

Thanks,
Daniele

> Thanks
> Sujeev
>
> -Original Message-
> From: Daniele Palmas 
> Sent: Tuesday, April 30, 2019 8:11 AM
> To: sd...@codeaurora.org
> Cc: linux-kernel@vger.kernel.org; tru...@codeaurora.org; dnl...@gmail.com
> Subject: Re: MHI code review
>
> Hi Sujeev,
>
> > Hi Greg Kroah-Hartman\Arnd Bergmann and community
> >
> > Thank you for all the feedback, I believe I have addressed all the
> > comments from previous patches. Also, I am excluding mhi network
> > driver in this series. I still have some modifications to do.
> >
> > Please review the new patch series and share your feedback.
> >
> > Thanks again
> >
> > Sincerely,
> > Sujeev
>
> are you going to continue working on this series?
>
> Can this series be used with PCIe SDX20/24/55 based modems?
>
> If yes, it would really be important to have this integrated into an
> official kernel.
>
> Thanks,
> Daniele
>

Re: perf build failure with newer glibc headers

2019-06-12 Thread Arnaldo Carvalho de Melo

On Wed, Jun 12, 2019 at 03:23:12PM -0400, Laura Abbott wrote:
> Hi,
> 
> While doing some build experiments, I found a compile failure with perf and 
> jvmti:
> 
> BUILDSTDERR:   gcc -Wp,-MD,./.xsk.o.d -Wp,-MT,xsk.o -O2 -g -pipe -Wall 
> -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2 -Wp,-D_GLIBCXX_ASSERTIONS 
> -fexceptions -fstack-protector-strong -grecord-gcc-jvmti/jvmti_agent.c:48:21: 
> error: static declaration of 'gettid' follows non-static declaration
> BUILDSTDERR:48 | static inline pid_t gettid(void)
> BUILDSTDERR:   | ^~
> BUILDSTDERR: In file included from /usr/include/unistd.h:1170,
> BUILDSTDERR:  from jvmti/jvmti_agent.c:33:
> BUILDSTDERR: /usr/include/bits/unistd_ext.h:40:16: note: previous declaration 
> of 'gettid' was here
> BUILDSTDERR:40 | extern __pid_t gettid (void) __THROW;
> BUILDSTDERR:   |^~
> 
> 
> This is with the newer glibc headers that came into Fedora earlier this week
> (glibc-2.29.9000-27.fc31)  It looks like the newer headers now define gettid
> so the in file gettid no longer works. Note this was a custom build with
> jvmti enabled as regular Fedora doesn't have it enabled which is why this
> wasn't reported elsewhere.
> 
> I don't know enough about either the glibc headers or perf to make a 
> suggestion
> on how to fix this but I'm happy to test.

Bummer, I hadn't noticed this because my fedora:rawhide perf build test
container wasn't building the jvmti code:

Makefile.config:925: No openjdk development package found, please install JDK 
package, e.g. openjdk-8-jdk, java-1.8.0-openjdk-devel

i.e.:

[perfbuilder@c0326e8b6511 perf]$ cat 
/tmp/build/perf/feature/test-jvmti.make.output 
test-jvmti.c:2:10: fatal error: jvmti.h: No such file or directory
2 | #include 
  |  ^
compilation terminated.
[perfbuilder@c0326e8b6511 perf]$

Installing it I get:

[root@2d7fe307ad20 perf]# rpm -qa | grep openjdk
java-1.8.0-openjdk-1.8.0.212.b04-4.fc31.x86_64
java-1.8.0-openjdk-headless-1.8.0.212.b04-4.fc31.x86_64
java-1.8.0-openjdk-devel-1.8.0.212.b04-4.fc31.x86_64
[root@2d7fe307ad20 perf]# cat
/tmp/build/perf/feature/test-jvmti.make.output 
[root@2d7fe307ad20 perf]# ls -la /tmp/build/perf/feature/test-jvmti.bin 
-rwxr-xr-x. 1 root root 21592 Jun 12 20:48
/tmp/build/perf/feature/test-jvmti.bin
[root@2d7fe307ad20 perf]# 

And reproduce the problem you reported:

jvmti/jvmti_agent.c:48:21: error: static declaration of ‘gettid’ follows
non-static declaration
   48 | static inline pid_t gettid(void)
  | ^~
In file included from /usr/include/unistd.h:1170,
 from jvmti/jvmti_agent.c:33:

So we'll need a feature test that defines some HAVE_GETTID, which then
ifdefs out our inline copy. Working on it.
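
A rough sketch of how such a guard could look, not the actual perf change.
HAVE_GETTID is the macro name floated above and is assumed to be set by a
build-time feature test; agent_gettid() is a hypothetical wrapper name used
here so the sketch builds whether or not the C library declares gettid().
The approach described above would instead wrap the existing inline gettid()
copy in #ifndef HAVE_GETTID.

```c
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Assumption: the build system defines HAVE_GETTID when the C library
 * already provides gettid(). Without it, fall back to the raw syscall,
 * which is what an inline copy would typically do.
 */
static inline pid_t agent_gettid(void)
{
#ifdef HAVE_GETTID
	return gettid();
#else
	return (pid_t)syscall(SYS_gettid);
#endif
}
```

Either way the point is the same: only one definition of the function is
visible to the compiler, so the static/non-static clash cannot occur.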

Thanks for the report!

- Arnaldo

Re: [PATCH] locking/static_key: always define static_branch_deferred_inc

2019-06-12 Thread Jakub Kicinski

On Wed, 12 Jun 2019 16:25:16 -0400, Willem de Bruijn wrote:
> On Wed, Jun 12, 2019 at 3:59 PM Jakub Kicinski
>  wrote:
> >
> > On Wed, 12 Jun 2019 15:44:09 -0400, Willem de Bruijn wrote:  
> > > From: Willem de Bruijn 
> > >
> > > This interface is currently only defined if CONFIG_JUMP_LABEL. Make it
> > > available also when jump labels are disabled.
> > >
> > > Fixes: ad282a8117d50 ("locking/static_key: Add support for deferred 
> > > static branches")
> > > Signed-off-by: Willem de Bruijn 
> > >
> > > ---
> > >
> > > The original patch went into 5.2-rc1, but this interface is not yet
> > > used, so this could target either 5.2 or 5.3.  
> >
> > Can we drop the Fixes tag?  It's an ugly omission but not a bug fix.
> >
> > Are you planning to switch clean_acked_data_enable() to the helper once
> > merged?  
> 
> Definitely, can do.
> 
> Perhaps it's easiest to send both as a single patch set through net-next, 
> then?

I'd think so too, perhaps we can get a blessing from Peter for that :)
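
For readers skimming the thread, the general shape of the change under
discussion is "define the helper outside the config #ifdef". The following
is a userspace sketch only: the struct layouts and static_branch_inc() are
simplified hypothetical stand-ins, not the kernel definitions, and only the
name static_branch_deferred_inc comes from the patch subject.

```c
/* Simplified stand-in for a static key; the real one is more involved. */
struct static_key {
	int enabled;
};

static inline void static_branch_inc(struct static_key *key)
{
	key->enabled++;
}

#ifdef CONFIG_JUMP_LABEL
/* With jump labels: extra state for deferred (rate-limited) updates. */
struct static_key_deferred {
	struct static_key key;
	unsigned long timeout;
};
#else
/* Without jump labels: just the plain key. */
struct static_key_deferred {
	struct static_key key;
};
#endif

/*
 * Defined after the #ifdef/#else/#endif rather than inside one branch,
 * so callers compile the same way in both configurations.
 */
#define static_branch_deferred_inc(x)	static_branch_inc(&(x)->key)
```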

Re: [RFC 00/10] Process-local memory allocations for hiding KVM secrets

2019-06-12 Thread Andy Lutomirski




> On Jun 12, 2019, at 1:41 PM, Dave Hansen  wrote:
> 
> On 6/12/19 1:27 PM, Andy Lutomirski wrote:
>>> We've discussed having per-cpu page tables where a given PGD is
>>> only in use from one CPU at a time.  I *think* this scheme still
>>> works in such a case, it just adds one more PGD entry that would
>>> have to be context-switched.
>> Fair warning: Linus is on record as absolutely hating this idea. He
>> might change his mind, but it’s an uphill battle.
> 
> Just to be clear, are you referring to the per-cpu PGDs, or to this
> patch set with a per-mm kernel area?

per-CPU PGDs

[PATCH v2 3/4] arm64: dts: meson: use the generic Ethernet PHY reset GPIO bindings

2019-06-12 Thread Martin Blumenstingl

The snps,reset-gpio bindings are deprecated in favour of the generic
"Ethernet PHY reset" bindings.

Replace snps,reset-gpio from the  node with reset-gpios in the
ethernet-phy node. The old snps,reset-active-low property is now encoded
directly as GPIO flag inside the reset-gpios property.

snps,reset-delays-us is converted to reset-assert-us and
reset-deassert-us. reset-assert-us is the second cell from
snps,reset-delays-us while reset-deassert-us was the third cell.

Instead of blindly copying over the old values (which seem strange since
they gave the PHY one second to come out of reset), this also
updates the delays based on the datasheets:
- the Realtek RTL8211F PHY needs a 10ms assert delay (the datasheet
  mentions: "For a complete PHY reset, this pin must be asserted low
  for at least 10ms") and a 30ms deassert delay (the datasheet
  mentions: "Wait for a further 30ms (for internal circuits settling
  time) before accessing the PHY register"). This applies to the
  following boards: GXBB NanoPi K2, GXBB Odroid-C2, GXBB Vega S95
  variants, GXBB Wetek variants, GXL P230, GXM Khadas VIM2, GXM Nexbox
  A1, GXM Q200, GXM RBox Pro boards.
- the ICPlus IP101GR PHY needs a 10ms assert delay (the datasheet
  mentions: "Trst | Reset period | 10ms") and a deassert delay of 10ms
  as well (the datasheet mentions: "Tclk_MII_rdy | MII/RMII clock
  output ready after reset released | 10ms"). This applies to the GXBB
  Nexbox A95X board.
- the Micrel KSZ9031 seems to require a 100us delay but use the same
  (seemingly safe) values from RTL8211F due to lack of a board to verify
  this. This applies to the GXBB P200 board.

The GXBB P201 board is left out from this conversion because it doesn't
have a dedicated PHY node (because it's not clear which PHY is used on
that board).
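
The cell mapping described above can be written down as a small C sketch.
The struct, function and macro names are hypothetical and exist only for
illustration. The example triplet in the comment is illustrative, with the
third cell chosen to match the "one second" remark above. The first cell has
no counterpart in the new properties as described here, so it is dropped.

```c
#include <stdint.h>

/* Hypothetical container for the two generic PHY reset delays. */
struct phy_reset_delays {
	uint32_t assert_us;	/* maps to reset-assert-us */
	uint32_t deassert_us;	/* maps to reset-deassert-us */
};

/*
 * Plain 1:1 conversion of a snps,reset-delays-us triplet:
 * cell 1 (second) becomes the assert delay, cell 2 (third) becomes the
 * deassert delay, cell 0 is not carried over.
 * Example: { 0, 10000, 1000000 } -> assert 10000 us, deassert 1000000 us.
 */
static inline struct phy_reset_delays
convert_snps_reset_delays(const uint32_t cells[3])
{
	struct phy_reset_delays d = {
		.assert_us = cells[1],
		.deassert_us = cells[2],
	};

	return d;
}

/* Datasheet-based RTL8211F values quoted above: 10ms assert, 30ms deassert. */
#define RTL8211F_RESET_ASSERT_US	10000u
#define RTL8211F_RESET_DEASSERT_US	30000u
```

The patch does not stop at the 1:1 conversion; it replaces the converted
values with the datasheet-based ones, which is why the constants are shown
separately.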

Signed-off-by: Martin Blumenstingl 
---
 arch/arm64/boot/dts/amlogic/meson-gxbb-nanopi-k2.dts  |  9 +
 .../arm64/boot/dts/amlogic/meson-gxbb-nexbox-a95x.dts |  8 
 arch/arm64/boot/dts/amlogic/meson-gxbb-odroidc2.dts   |  9 +
 arch/arm64/boot/dts/amlogic/meson-gxbb-p200.dts   |  9 +
 arch/arm64/boot/dts/amlogic/meson-gxbb-vega-s95.dtsi  |  9 +
 arch/arm64/boot/dts/amlogic/meson-gxbb-wetek.dtsi |  8 
 arch/arm64/boot/dts/amlogic/meson-gxl-s905d-p230.dts  | 11 ++-
 arch/arm64/boot/dts/amlogic/meson-gxm-khadas-vim2.dts | 10 +-
 arch/arm64/boot/dts/amlogic/meson-gxm-nexbox-a1.dts   |  8 
 arch/arm64/boot/dts/amlogic/meson-gxm-q200.dts| 11 ++-
 arch/arm64/boot/dts/amlogic/meson-gxm-rbox-pro.dts|  8 
 11 files changed, 53 insertions(+), 47 deletions(-)

diff --git a/arch/arm64/boot/dts/amlogic/meson-gxbb-nanopi-k2.dts 
b/arch/arm64/boot/dts/amlogic/meson-gxbb-nanopi-k2.dts
index 849c01650c4d..c34c1c90ccb6 100644
--- a/arch/arm64/boot/dts/amlogic/meson-gxbb-nanopi-k2.dts
+++ b/arch/arm64/boot/dts/amlogic/meson-gxbb-nanopi-k2.dts
@@ -154,10 +154,6 @@
 
amlogic,tx-delay-ns = <2>;
 
-   snps,reset-gpio = < GPIOZ_14 0>;
-   snps,reset-delays-us = <0 1 100>;
-   snps,reset-active-low;
-
mdio {
compatible = "snps,dwmac-mdio";
#address-cells = <1>;
@@ -166,6 +162,11 @@
eth_phy0: ethernet-phy@0 {
/* Realtek RTL8211F (0x001cc916) */
reg = <0>;
+
+   reset-assert-us = <1>;
+   reset-deassert-us = <3>;
+   reset-gpios = < GPIOZ_14 GPIO_ACTIVE_LOW>;
+
interrupt-parent = <_intc>;
/* MAC_INTR on GPIOZ_15 */
interrupts = <29 IRQ_TYPE_LEVEL_LOW>;
diff --git a/arch/arm64/boot/dts/amlogic/meson-gxbb-nexbox-a95x.dts 
b/arch/arm64/boot/dts/amlogic/meson-gxbb-nexbox-a95x.dts
index 3c54f26eef15..b636912a2715 100644
--- a/arch/arm64/boot/dts/amlogic/meson-gxbb-nexbox-a95x.dts
+++ b/arch/arm64/boot/dts/amlogic/meson-gxbb-nexbox-a95x.dts
@@ -162,10 +162,6 @@
phy-handle = <_phy0>;
phy-mode = "rmii";
 
-   snps,reset-gpio = < GPIOZ_14 0>;
-   snps,reset-delays-us = <0 1 100>;
-   snps,reset-active-low;
-
mdio {
compatible = "snps,dwmac-mdio";
#address-cells = <1>;
@@ -174,6 +170,10 @@
eth_phy0: ethernet-phy@0 {
/* IC Plus IP101GR (0x02430c54) */
reg = <0>;
+
+   reset-assert-us = <1>;
+   reset-deassert-us = <1>;
+   reset-gpios = < GPIOZ_14 GPIO_ACTIVE_LOW>;
};
};
 };
diff --git a/arch/arm64/boot/dts/amlogic/meson-gxbb-odroidc2.dts 
b/arch/arm64/boot/dts/amlogic/meson-gxbb-odroidc2.dts
index 5a139e7b1c60..9972b1515da6 100644
--- a/arch/arm64/boot/dts/amlogic/meson-gxbb-odroidc2.dts
+++ b/arch/arm64/boot/dts/amlogic/meson-gxbb-odroidc2.dts
@@ -126,10 +126,6 @@

[PATCH v2 2/4] ARM: dts: meson: switch to the generic Ethernet PHY reset bindings

2019-06-12 Thread Martin Blumenstingl

The snps,reset-gpio bindings are deprecated in favour of the generic
"Ethernet PHY reset" bindings.

Replace snps,reset-gpio from the  node with reset-gpios in the
ethernet-phy node. The old snps,reset-active-low property is now encoded
directly as GPIO flag inside the reset-gpios property.

snps,reset-delays-us is converted to reset-assert-us and
reset-deassert-us. reset-assert-us is the second cell from
snps,reset-delays-us while reset-deassert-us was the third cell.
Instead of blindly copying over the old values (which seem strange since
they gave the PHY one second to come out of reset), this also
updates the delays based on the datasheets:
- RTL8211F PHY on the Odroid-C1 and MXIII-Plus needs a 10ms assert
  delay (the datasheet mentions: "For a complete PHY reset, this pin
  must be asserted low for at least 10ms") and a 30ms deassert delay
  (the datasheet mentions: "Wait for a further 30ms (for internal
  circuits settling time) before accessing the PHY register"). The
  old settings used 10ms for assert and 1000ms for deassert.
- IP101GR PHY on the EC-100 and MXQ needs a 10ms assert delay (the
  datasheet mentions: "Trst | Reset period | 10ms") and a 10ms deassert
  delay as well (the datasheet mentions: "Tclk_MII_rdy | MII/RMII clock
  output ready after reset released | 10ms"). The old settings used
  10ms for assert and 1000ms for deassert.

No functional changes intended.

Signed-off-by: Martin Blumenstingl 
---
 arch/arm/boot/dts/meson8b-ec100.dts   | 9 +
 arch/arm/boot/dts/meson8b-mxq.dts | 9 +
 arch/arm/boot/dts/meson8b-odroidc1.dts| 9 +
 arch/arm/boot/dts/meson8m2-mxiii-plus.dts | 8 
 4 files changed, 19 insertions(+), 16 deletions(-)

diff --git a/arch/arm/boot/dts/meson8b-ec100.dts 
b/arch/arm/boot/dts/meson8b-ec100.dts
index 9bf4249cb60d..96d239d8334e 100644
--- a/arch/arm/boot/dts/meson8b-ec100.dts
+++ b/arch/arm/boot/dts/meson8b-ec100.dts
@@ -234,10 +234,6 @@
phy-handle = <_phy0>;
phy-mode = "rmii";
 
-   snps,reset-gpio = < GPIOH_4 0>;
-   snps,reset-delays-us = <0 1 100>;
-   snps,reset-active-low;
-
mdio {
compatible = "snps,dwmac-mdio";
#address-cells = <1>;
@@ -246,6 +242,11 @@
eth_phy0: ethernet-phy@0 {
/* IC Plus IP101A/G (0x02430c54) */
reg = <0>;
+
+   reset-assert-us = <1>;
+   reset-deassert-us = <1>;
+   reset-gpios = < GPIOH_4 GPIO_ACTIVE_LOW>;
+
icplus,select-interrupt;
interrupt-parent = <_intc>;
/* GPIOH_3 */
diff --git a/arch/arm/boot/dts/meson8b-mxq.dts 
b/arch/arm/boot/dts/meson8b-mxq.dts
index ef602ab45efd..bb27b34eb346 100644
--- a/arch/arm/boot/dts/meson8b-mxq.dts
+++ b/arch/arm/boot/dts/meson8b-mxq.dts
@@ -91,10 +91,6 @@
phy-handle = <_phy0>;
phy-mode = "rmii";
 
-   snps,reset-gpio = < GPIOH_4 0>;
-   snps,reset-delays-us = <0 1 100>;
-   snps,reset-active-low;
-
mdio {
compatible = "snps,dwmac-mdio";
#address-cells = <1>;
@@ -103,6 +99,11 @@
eth_phy0: ethernet-phy@0 {
/* IC Plus IP101A/G (0x02430c54) */
reg = <0>;
+
+   reset-assert-us = <1>;
+   reset-deassert-us = <1>;
+   reset-gpios = < GPIOH_4 GPIO_ACTIVE_LOW>;
+
icplus,select-interrupt;
interrupt-parent = <_intc>;
/* GPIOH_3 */
diff --git a/arch/arm/boot/dts/meson8b-odroidc1.dts 
b/arch/arm/boot/dts/meson8b-odroidc1.dts
index 018695b2b83a..86c4614e0a38 100644
--- a/arch/arm/boot/dts/meson8b-odroidc1.dts
+++ b/arch/arm/boot/dts/meson8b-odroidc1.dts
@@ -176,10 +176,6 @@
  {
status = "okay";
 
-   snps,reset-gpio = < GPIOH_4 GPIO_ACTIVE_HIGH>;
-   snps,reset-active-low;
-   snps,reset-delays-us = <0 1 3>;
-
pinctrl-0 = <_rgmii_pins>;
pinctrl-names = "default";
 
@@ -195,6 +191,11 @@
/* Realtek RTL8211F (0x001cc916) */
eth_phy: ethernet-phy@0 {
reg = <0>;
+
+   reset-assert-us = <1>;
+   reset-deassert-us = <3>;
+   reset-gpios = < GPIOH_4 GPIO_ACTIVE_LOW>;
+
interrupt-parent = <_intc>;
/* GPIOH_3 */
interrupts = <17 IRQ_TYPE_LEVEL_LOW>;
diff --git a/arch/arm/boot/dts/meson8m2-mxiii-plus.dts 
b/arch/arm/boot/dts/meson8m2-mxiii-plus.dts
index 59b07a55e461..d54477b1001c 100644
--- a/arch/arm/boot/dts/meson8m2-mxiii-plus.dts
+++ b/arch/arm/boot/dts/meson8m2-mxiii-plus.dts
@@ -73,10 +73,6 @@
 
amlogic,tx-delay-ns = <4>;
 
-   snps,reset-gpio = < GPIOH_4 0>;
-

[PATCH v2 1/4] arm64: dts: meson: g12a: x96-max: fix the Ethernet PHY reset line

2019-06-12 Thread Martin Blumenstingl

The Odroid-N2 schematics show that the following pins are used for the
reset and interrupt lines:
- GPIOZ_14 is the PHY interrupt line
- GPIOZ_15 is the PHY reset line

The GPIOZ_14 and GPIOZ_15 pins are special. The datasheet describes that
they are "3.3V input tolerant open drain (OD) output pins". This means
the GPIO controller can drive the output LOW to reset the PHY. To
release the reset it can only switch the pin to input mode. The output
cannot be driven HIGH for these pins.
This requires configuring the reset line as GPIO_OPEN_DRAIN because
otherwise the PHY will be stuck in "reset" state (because driving the
pin HIGH seems to result in the same signal as driving it LOW).

The reset line works together with a pull-up resistor (R143 in the
Odroid-N2 schematics). The SoC can drive GPIOZ_15 LOW to assert the PHY
reset. However, since the SoC can't drive the pin HIGH (to release the
reset) we switch the mode to INPUT and let the pull-up resistor take
care of driving the reset line HIGH.

Switch to GPIOZ_15 for the PHY reset line instead of using GPIOZ_14
(which actually is the interrupt line).
Move from the "snps" specific resets to the MDIO framework's
reset-gpios because only the latter honors the GPIO flags.
Use the GPIO flags (GPIO_ACTIVE_LOW | GPIO_OPEN_DRAIN) to match with
the pull-up resistor because this will:
- drive the output LOW to reset the PHY (= active low)
- switch the pin to INPUT mode so the pull-up will take the PHY out of
  reset
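
The behaviour of the two flags can be modelled in a few lines of C. This is
a hypothetical model of the pin and the external pull-up, not gpiolib code;
all names are made up for the sketch.

```c
#include <stdbool.h>

/* The only two states an open-drain pin can take in this model. */
enum pin_mode {
	PIN_INPUT,		/* pin released, level set by the pull-up */
	PIN_OUTPUT_LOW,		/* pin actively driven LOW */
};

/* Hypothetical open-drain, active-low reset line with a pull-up resistor. */
struct od_reset_line {
	enum pin_mode mode;
};

/*
 * Active low: asserting the reset drives the pin LOW.
 * Open drain: deasserting never drives HIGH, it only releases the pin.
 */
static inline void od_reset_set(struct od_reset_line *line, bool assert_reset)
{
	line->mode = assert_reset ? PIN_OUTPUT_LOW : PIN_INPUT;
}

/* Level seen by the PHY: LOW while driven, HIGH via pull-up when released. */
static inline bool od_reset_line_is_high(const struct od_reset_line *line)
{
	return line->mode == PIN_INPUT;
}
```

Note there is deliberately no "output HIGH" state in the model, which mirrors
the pin limitation described in the datasheet quote above.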

Fixes: 51d116557b2044 ("arm64: dts: meson-g12a-x96-max: Add Gigabit Ethernet 
Support")
Signed-off-by: Martin Blumenstingl 
---
 arch/arm64/boot/dts/amlogic/meson-g12a-x96-max.dts | 7 ---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/arch/arm64/boot/dts/amlogic/meson-g12a-x96-max.dts 
b/arch/arm64/boot/dts/amlogic/meson-g12a-x96-max.dts
index 98bc56e650a0..de58d7817836 100644
--- a/arch/arm64/boot/dts/amlogic/meson-g12a-x96-max.dts
+++ b/arch/arm64/boot/dts/amlogic/meson-g12a-x96-max.dts
@@ -176,6 +176,10 @@
reg = <0>;
max-speed = <1000>;
eee-broken-1000t;
+
+   reset-assert-us = <1>;
+   reset-deassert-us = <3>;
+   reset-gpios = < GPIOZ_15 (GPIO_ACTIVE_LOW | 
GPIO_OPEN_DRAIN)>;
};
 };
 
@@ -186,9 +190,6 @@
phy-mode = "rgmii";
phy-handle = <_phy>;
amlogic,tx-delay-ns = <2>;
-   snps,reset-gpio = < GPIOZ_14 0>;
-   snps,reset-delays-us = <0 1 100>;
-   snps,reset-active-low;
 };
 
 _ef {
-- 
2.22.0

[PATCH v2 0/4] Ethernet PHY reset GPIO updates for Amlogic SoCs

2019-06-12 Thread Martin Blumenstingl

While trying to add the Ethernet PHY interrupt on the X96 Max I found
that the current reset line definition is incorrect. Patch #1 fixes
this.

Since the fix requires moving from the deprecated "snps,reset-gpio"
property to the generic Ethernet PHY reset bindings I decided to move
all Amlogic boards over to the non-deprecated bindings. That's what
patches #2 and #3 do.

Finally I found that Odroid-N2 doesn't define the Ethernet PHY's reset
GPIO yet. I don't have that board so I can't test whether it really
works but based on the schematics it should. 

This series is a partial successor to "stmmac: honor the GPIO flags
for the PHY reset GPIO" from [0]. I decided not to take Linus W.'s
Reviewed-by from patch #4 of that series because I had to change the
wording and I want to be sure that he's happy with that now.

One quick note regarding patches #1 and #4: I decided to violate the
"max 80 characters per line" (by 4 characters) limit because I find
that the result is easier to read than it would be if I split the
line.


Changes since v1 at [1]:
- fixed the reset deassert delay for RTL8211F PHYs - spotted by Robin
  Murphy (thank you). according to the public RTL8211E datasheet the
  correct values seem to be: 10ms assert, 30ms deassert
- fixed the reset assert and deassert delays for IP101GR PHYs. There
  are two values given in the public datasheet, use the higher one
  (10ms instead of 2.5)
- update the patch descriptions to quote the datasheets (the RTL8211F
  quotes are taken from the public RTL8211E datasheet because as far
  as I can tell the reset sequence is identical on both PHYs)


[0] https://patchwork.kernel.org/cover/10983801/
[1] https://patchwork.kernel.org/cover/10985155/


Martin Blumenstingl (4):
  arm64: dts: meson: g12a: x96-max: fix the Ethernet PHY reset line
  ARM: dts: meson: switch to the generic Ethernet PHY reset bindings
  arm64: dts: meson: use the generic Ethernet PHY reset GPIO bindings
  arm64: dts: meson: g12b: odroid-n2: add the Ethernet PHY reset line

 arch/arm/boot/dts/meson8b-ec100.dts   |  9 +
 arch/arm/boot/dts/meson8b-mxq.dts |  9 +
 arch/arm/boot/dts/meson8b-odroidc1.dts|  9 +
 arch/arm/boot/dts/meson8m2-mxiii-plus.dts |  8 
 arch/arm64/boot/dts/amlogic/meson-g12a-x96-max.dts|  7 ---
 arch/arm64/boot/dts/amlogic/meson-g12b-odroid-n2.dts  |  4 
 arch/arm64/boot/dts/amlogic/meson-gxbb-nanopi-k2.dts  |  9 +
 .../arm64/boot/dts/amlogic/meson-gxbb-nexbox-a95x.dts |  8 
 arch/arm64/boot/dts/amlogic/meson-gxbb-odroidc2.dts   |  9 +
 arch/arm64/boot/dts/amlogic/meson-gxbb-p200.dts   |  9 +
 arch/arm64/boot/dts/amlogic/meson-gxbb-vega-s95.dtsi  |  9 +
 arch/arm64/boot/dts/amlogic/meson-gxbb-wetek.dtsi |  8 
 arch/arm64/boot/dts/amlogic/meson-gxl-s905d-p230.dts  | 11 ++-
 arch/arm64/boot/dts/amlogic/meson-gxm-khadas-vim2.dts | 10 +-
 arch/arm64/boot/dts/amlogic/meson-gxm-nexbox-a1.dts   |  8 
 arch/arm64/boot/dts/amlogic/meson-gxm-q200.dts| 11 ++-
 arch/arm64/boot/dts/amlogic/meson-gxm-rbox-pro.dts|  8 
 17 files changed, 80 insertions(+), 66 deletions(-)

-- 
2.22.0

[PATCH v2 4/4] arm64: dts: meson: g12b: odroid-n2: add the Ethernet PHY reset line

2019-06-12 Thread Martin Blumenstingl

The reset line of the RTL8211F PHY is routed to the GPIOZ_15 pad.
Describe this in the device tree so the PHY framework can bring the PHY
into a known state when initializing it. GPIOZ_15 can only drive its
output LOW (to reset the PHY); it cannot drive the output HIGH to take
the PHY out of reset. The datasheet states it's a "3.3V input tolerant
open drain (OD) output pin". Instead there's a pull-up resistor
on the board to take the PHY out of reset. The GPIO itself will be set
to INPUT mode to take the PHY out of reset and LOW to reset the PHY,
which is achieved with the flags (GPIO_ACTIVE_LOW | GPIO_OPEN_DRAIN).

Signed-off-by: Martin Blumenstingl 
---
 arch/arm64/boot/dts/amlogic/meson-g12b-odroid-n2.dts | 4 
 1 file changed, 4 insertions(+)

diff --git a/arch/arm64/boot/dts/amlogic/meson-g12b-odroid-n2.dts b/arch/arm64/boot/dts/amlogic/meson-g12b-odroid-n2.dts
index 4146cd84989c..f911bbdc4e70 100644
--- a/arch/arm64/boot/dts/amlogic/meson-g12b-odroid-n2.dts
+++ b/arch/arm64/boot/dts/amlogic/meson-g12b-odroid-n2.dts
@@ -186,6 +186,10 @@
/* Realtek RTL8211F (0x001cc916) */ 
reg = <0>;
max-speed = <1000>;
+
+   reset-assert-us = <10000>;
+   reset-deassert-us = <30000>;
+   reset-gpios = < GPIOZ_15 (GPIO_ACTIVE_LOW | GPIO_OPEN_DRAIN)>;
};
 };
 
-- 
2.22.0

INFO: task syz-executor can't die for more than 143 seconds.

2019-06-12 Thread syzbot


Hello,

syzbot found the following crash on:

HEAD commit:81a72c79 Add linux-next specific files for 20190612
git tree:   linux-next
console output: https://syzkaller.appspot.com/x/log.txt?x=1451d31ea0
kernel config:  https://syzkaller.appspot.com/x/.config?x=8aa46bbce201b8b6
dashboard link: https://syzkaller.appspot.com/bug?extid=8ab2d0f39fb79fe6ca40
compiler:   gcc (GCC) 9.0.0 20181231 (experimental)
syz repro:  https://syzkaller.appspot.com/x/repro.syz?x=1250ae3ea0
C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=1568557aa0

IMPORTANT: if you fix the bug, please add the following tag to the commit:
Reported-by: syzbot+8ab2d0f39fb79fe6c...@syzkaller.appspotmail.com

INFO: task syz-executor050:8619 can't die for more than 143 seconds.
syz-executor050 R  running task27536  8619   8618 0x4006
Call Trace:

Showing all locks held in the system:
1 lock held by khungtaskd/1046:
 #0: f58b83ec (rcu_read_lock){}, at:  
debug_show_all_locks+0x5f/0x27e kernel/locking/lockdep.c:5262

1 lock held by rsyslogd/8504:
 #0: b8867a10 (>f_pos_lock){+.+.}, at: __fdget_pos+0xee/0x110  
fs/file.c:801

2 locks held by getty/8594:
 #0: 8c94b07f (>ldisc_sem){}, at:  
ldsem_down_read+0x33/0x40 drivers/tty/tty_ldsem.c:341
 #1: 6c5169d5 (>atomic_read_lock){+.+.}, at:  
n_tty_read+0x232/0x1b70 drivers/tty/n_tty.c:2156

2 locks held by getty/8595:
 #0: 42bd87ed (>ldisc_sem){}, at:  
ldsem_down_read+0x33/0x40 drivers/tty/tty_ldsem.c:341
 #1: 9ebc0e1a (>atomic_read_lock){+.+.}, at:  
n_tty_read+0x232/0x1b70 drivers/tty/n_tty.c:2156

2 locks held by getty/8596:
 #0: ad647db4 (>ldisc_sem){}, at:  
ldsem_down_read+0x33/0x40 drivers/tty/tty_ldsem.c:341
 #1: f68a3152 (>atomic_read_lock){+.+.}, at:  
n_tty_read+0x232/0x1b70 drivers/tty/n_tty.c:2156

2 locks held by getty/8597:
 #0: 72ec45a9 (>ldisc_sem){}, at:  
ldsem_down_read+0x33/0x40 drivers/tty/tty_ldsem.c:341
 #1: daa58f5f (>atomic_read_lock){+.+.}, at:  
n_tty_read+0x232/0x1b70 drivers/tty/n_tty.c:2156

2 locks held by getty/8598:
 #0: 7698feb5 (>ldisc_sem){}, at:  
ldsem_down_read+0x33/0x40 drivers/tty/tty_ldsem.c:341
 #1: 17a6b41f (>atomic_read_lock){+.+.}, at:  
n_tty_read+0x232/0x1b70 drivers/tty/n_tty.c:2156

2 locks held by getty/8599:
 #0: f5a5df8a (>ldisc_sem){}, at:  
ldsem_down_read+0x33/0x40 drivers/tty/tty_ldsem.c:341
 #1: 3ed47aa1 (>atomic_read_lock){+.+.}, at:  
n_tty_read+0x232/0x1b70 drivers/tty/n_tty.c:2156

2 locks held by getty/8600:
 #0: ab9f490c (>ldisc_sem){}, at:  
ldsem_down_read+0x33/0x40 drivers/tty/tty_ldsem.c:341
 #1: 332ddba5 (>atomic_read_lock){+.+.}, at:  
n_tty_read+0x232/0x1b70 drivers/tty/n_tty.c:2156

1 lock held by syz-executor050/8619:

=

NMI backtrace for cpu 0
CPU: 0 PID: 1046 Comm: khungtaskd Not tainted 5.2.0-rc4-next-20190612 #13
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS  
Google 01/01/2011

Call Trace:
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x172/0x1f0 lib/dump_stack.c:113
 nmi_cpu_backtrace.cold+0x63/0xa4 lib/nmi_backtrace.c:101
 nmi_trigger_cpumask_backtrace+0x1be/0x236 lib/nmi_backtrace.c:62
 arch_trigger_cpumask_backtrace+0x14/0x20 arch/x86/kernel/apic/hw_nmi.c:38
 trigger_all_cpu_backtrace include/linux/nmi.h:146 [inline]
 check_hung_uninterruptible_tasks kernel/hung_task.c:249 [inline]
 watchdog+0xb88/0x12b0 kernel/hung_task.c:333
 kthread+0x354/0x420 kernel/kthread.c:255
 ret_from_fork+0x24/0x30 arch/x86/entry/entry_64.S:352
Sending NMI from CPU 0 to CPUs 1:
NMI backtrace for cpu 1
CPU: 1 PID: 8619 Comm: syz-executor050 Not tainted 5.2.0-rc4-next-20190612  
#13
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS  
Google 01/01/2011

RIP: 0010:get_current arch/x86/include/asm/current.h:15 [inline]
RIP: 0010:__sanitizer_cov_trace_pc+0x8/0x50 kernel/kcov.c:101
Code: f4 ff ff ff e8 9d fa e9 ff 48 c7 05 ce b0 15 09 00 00 00 00 e9 a4 e9  
ff ff 90 90 90 90 90 90 90 90 90 55 48 89 e5 48 8b 75 08 <65> 48 8b 04 25  
c0 fd 01 00 65 8b 15 f0 fa 90 7e 81 e2 00 01 1f 00

RSP: 0018:8880a9acfd80 EFLAGS: 0206
RAX: 1d4720d9 RBX: 1600 RCX: 81682654
RDX:  RSI: 8168263c RDI: ea3906c8
RBP: 8880a9acfd80 R08: 88808fe146c0 R09: 0002
R10: 88808fe14f78 R11: 88808fe146c0 R12: ea390688
R13: dc00 R14: ea390680 R15: 049a5000
FS:  55a3e880() GS:8880ae90() knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
CR2: ff600400 CR3: a3d59000 CR4: 001406e0
DR0:  DR1:  DR2: 
DR3:  DR6: fffe0ff0 DR7: 0400
Call Trace:
 page_to_boot_pfn in

Re: [RFC 00/10] Process-local memory allocations for hiding KVM secrets

2019-06-12 Thread Dave Hansen

On 6/12/19 1:27 PM, Andy Lutomirski wrote:
>> We've discussed having per-cpu page tables where a given PGD is
>> only in use from one CPU at a time.  I *think* this scheme still
>> works in such a case, it just adds one more PGD entry that would
>> have to context-switched.
> Fair warning: Linus is on record as absolutely hating this idea. He
> might change his mind, but it’s an uphill battle.

Just to be clear, are you referring to the per-cpu PGDs, or to this
patch set with a per-mm kernel area?

Re: linux-next: Tree for Jun 12 (kernel/bpf/verifier)

2019-06-12 Thread Randy Dunlap

On 6/12/19 12:00 AM, Stephen Rothwell wrote:
> Hi all,
> 
> Changes since 20190611:
> 

on x86_64:

ld: kernel/bpf/verifier.o: in function `check_mem_access':
verifier.c:(.text+0x4b90): undefined reference to `bpf_xdp_sock_is_valid_access'
ld: kernel/bpf/verifier.o: in function `convert_ctx_accesses':
verifier.c:(.text+0x79b7): undefined reference to `bpf_xdp_sock_convert_ctx_access'


Full randconfig file is attached.


-- 
~Randy
#
# Automatically generated file; DO NOT EDIT.
# Linux/x86_64 5.2.0-rc4 Kernel Configuration
#

#
# Compiler: gcc (SUSE Linux) 4.8.5
#
CONFIG_CC_IS_GCC=y
CONFIG_GCC_VERSION=40805
CONFIG_CLANG_VERSION=0
CONFIG_CC_HAS_ASM_GOTO=y
CONFIG_CC_HAS_WARN_MAYBE_UNINITIALIZED=y
CONFIG_CC_DISABLE_WARN_MAYBE_UNINITIALIZED=y
CONFIG_IRQ_WORK=y
CONFIG_BUILDTIME_EXTABLE_SORT=y
CONFIG_THREAD_INFO_IN_TASK=y

#
# General setup
#
CONFIG_INIT_ENV_ARG_LIMIT=32
CONFIG_COMPILE_TEST=y
CONFIG_LOCALVERSION=""
CONFIG_BUILD_SALT=""
CONFIG_HAVE_KERNEL_GZIP=y
CONFIG_HAVE_KERNEL_BZIP2=y
CONFIG_HAVE_KERNEL_LZMA=y
CONFIG_HAVE_KERNEL_XZ=y
CONFIG_HAVE_KERNEL_LZO=y
CONFIG_HAVE_KERNEL_LZ4=y
# CONFIG_KERNEL_GZIP is not set
CONFIG_KERNEL_BZIP2=y
# CONFIG_KERNEL_LZMA is not set
# CONFIG_KERNEL_XZ is not set
# CONFIG_KERNEL_LZO is not set
# CONFIG_KERNEL_LZ4 is not set
CONFIG_DEFAULT_HOSTNAME="(none)"
CONFIG_SYSVIPC=y
CONFIG_POSIX_MQUEUE=y
# CONFIG_CROSS_MEMORY_ATTACH is not set
# CONFIG_USELIB is not set
CONFIG_AUDIT=y
CONFIG_HAVE_ARCH_AUDITSYSCALL=y
CONFIG_AUDITSYSCALL=y

#
# IRQ subsystem
#
CONFIG_GENERIC_IRQ_PROBE=y
CONFIG_GENERIC_IRQ_SHOW=y
CONFIG_GENERIC_IRQ_EFFECTIVE_AFF_MASK=y
CONFIG_GENERIC_PENDING_IRQ=y
CONFIG_GENERIC_IRQ_MIGRATION=y
CONFIG_GENERIC_IRQ_CHIP=y
CONFIG_IRQ_DOMAIN=y
CONFIG_IRQ_DOMAIN_HIERARCHY=y
CONFIG_GENERIC_IRQ_MATRIX_ALLOCATOR=y
CONFIG_GENERIC_IRQ_RESERVATION_MODE=y
CONFIG_IRQ_FORCED_THREADING=y
CONFIG_SPARSE_IRQ=y
CONFIG_GENERIC_IRQ_DEBUGFS=y
# end of IRQ subsystem

CONFIG_CLOCKSOURCE_WATCHDOG=y
CONFIG_ARCH_CLOCKSOURCE_DATA=y
CONFIG_ARCH_CLOCKSOURCE_INIT=y
CONFIG_CLOCKSOURCE_VALIDATE_LAST_CYCLE=y
CONFIG_GENERIC_TIME_VSYSCALL=y
CONFIG_GENERIC_CLOCKEVENTS=y
CONFIG_GENERIC_CLOCKEVENTS_BROADCAST=y
CONFIG_GENERIC_CLOCKEVENTS_MIN_ADJUST=y
CONFIG_GENERIC_CMOS_UPDATE=y

#
# Timers subsystem
#
CONFIG_TICK_ONESHOT=y
CONFIG_NO_HZ_COMMON=y
# CONFIG_HZ_PERIODIC is not set
CONFIG_NO_HZ_IDLE=y
# CONFIG_NO_HZ_FULL is not set
CONFIG_NO_HZ=y
CONFIG_HIGH_RES_TIMERS=y
# end of Timers subsystem

CONFIG_PREEMPT_NONE=y
# CONFIG_PREEMPT_VOLUNTARY is not set
# CONFIG_PREEMPT is not set

#
# CPU/Task time and stats accounting
#
CONFIG_TICK_CPU_ACCOUNTING=y
# CONFIG_VIRT_CPU_ACCOUNTING_GEN is not set
# CONFIG_IRQ_TIME_ACCOUNTING is not set
# CONFIG_BSD_PROCESS_ACCT is not set
CONFIG_TASKSTATS=y
CONFIG_TASK_DELAY_ACCT=y
CONFIG_TASK_XACCT=y
CONFIG_TASK_IO_ACCOUNTING=y
CONFIG_PSI=y
# CONFIG_PSI_DEFAULT_DISABLED is not set
# end of CPU/Task time and stats accounting

# CONFIG_CPU_ISOLATION is not set

#
# RCU Subsystem
#
CONFIG_TREE_RCU=y
CONFIG_RCU_EXPERT=y
CONFIG_SRCU=y
CONFIG_TREE_SRCU=y
CONFIG_RCU_STALL_COMMON=y
CONFIG_RCU_NEED_SEGCBLIST=y
CONFIG_RCU_FANOUT=64
CONFIG_RCU_FANOUT_LEAF=16
# CONFIG_RCU_FAST_NO_HZ is not set
# CONFIG_RCU_NOCB_CPU is not set
# end of RCU Subsystem

CONFIG_BUILD_BIN2C=y
# CONFIG_IKCONFIG is not set
# CONFIG_IKHEADERS is not set
CONFIG_LOG_BUF_SHIFT=17
CONFIG_LOG_CPU_MAX_BUF_SHIFT=12
CONFIG_PRINTK_SAFE_LOG_BUF_SHIFT=13
CONFIG_HAVE_UNSTABLE_SCHED_CLOCK=y
CONFIG_ARCH_SUPPORTS_NUMA_BALANCING=y
CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH=y
CONFIG_ARCH_SUPPORTS_INT128=y
CONFIG_CGROUPS=y
# CONFIG_MEMCG is not set
CONFIG_CGROUP_SCHED=y
# CONFIG_FAIR_GROUP_SCHED is not set
CONFIG_RT_GROUP_SCHED=y
CONFIG_CGROUP_PIDS=y
CONFIG_CGROUP_RDMA=y
# CONFIG_CGROUP_FREEZER is not set
# CONFIG_CPUSETS is not set
CONFIG_CGROUP_DEVICE=y
CONFIG_CGROUP_CPUACCT=y
# CONFIG_CGROUP_PERF is not set
# CONFIG_CGROUP_BPF is not set
# CONFIG_CGROUP_DEBUG is not set
CONFIG_NAMESPACES=y
# CONFIG_UTS_NS is not set
# CONFIG_IPC_NS is not set
# CONFIG_USER_NS is not set
CONFIG_PID_NS=y
CONFIG_NET_NS=y
# CONFIG_CHECKPOINT_RESTORE is not set
# CONFIG_SCHED_AUTOGROUP is not set
CONFIG_SYSFS_DEPRECATED=y
CONFIG_SYSFS_DEPRECATED_V2=y
# CONFIG_RELAY is not set
CONFIG_BLK_DEV_INITRD=y
CONFIG_INITRAMFS_SOURCE=""
CONFIG_RD_GZIP=y
CONFIG_RD_BZIP2=y
# CONFIG_RD_LZMA is not set
CONFIG_RD_XZ=y
# CONFIG_RD_LZO is not set
CONFIG_RD_LZ4=y
CONFIG_CC_OPTIMIZE_FOR_PERFORMANCE=y
# CONFIG_CC_OPTIMIZE_FOR_SIZE is not set
CONFIG_SYSCTL_EXCEPTION_TRACE=y
CONFIG_HAVE_PCSPKR_PLATFORM=y
CONFIG_BPF=y
CONFIG_EXPERT=y
CONFIG_MULTIUSER=y
CONFIG_SGETMASK_SYSCALL=y
CONFIG_SYSFS_SYSCALL=y
CONFIG_FHANDLE=y
# CONFIG_POSIX_TIMERS is not set
CONFIG_PRINTK=y
CONFIG_PRINTK_NMI=y
CONFIG_BUG=y
# CONFIG_ELF_CORE is not set
# CONFIG_PCSPKR_PLATFORM is not set
CONFIG_BASE_FULL=y
CONFIG_FUTEX=y
CONFIG_FUTEX_PI=y
# CONFIG_EPOLL is not set
CONFIG_SIGNALFD=y
CONFIG_TIMERFD=y
# CONFIG_EVENTFD is not set
# CONFIG_SHMEM is not set
# CONFIG_AIO is not set
# CONFIG_IO_URING is not set

[PATCH v3] platform/chrome: cros_ec_lpc: Choose Microchip EC at runtime

2019-06-12 Thread Enric Balletbo i Serra

On many boards, communication between the kernel and the Embedded
Controller happens over an LPC bus. In these cases, the kernel config
CONFIG_CROS_EC_LPC is enabled. Some of these LPC boards contain a
Microchip Embedded Controller (MEC) that is different from the regular
EC. On these devices, the same LPC bus is used, but the protocol is
a little different. In these cases, the CONFIG_CROS_EC_LPC_MEC kernel
config is enabled. Currently, the kernel decides at compile time whether
or not to use the MEC variant, and when that kernel option is selected
it breaks the other boards. We would like some kind of runtime detection
to avoid this.

This patch adds that detection mechanism by probing the protocol at
runtime. First we assume that a MEC variant is connected, and if the
protocol fails we fall back to the regular EC. This adds a bit of
overhead because we try to read twice on those LPC boards that don't
contain a MEC variant, but it is a better solution than having to select
the EC variant at compile time.
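
The probing order described above can be sketched in plain C roughly as
follows. This is only a userspace model under stated assumptions:
struct lpc_ops, struct fake_board and select_ops() are hypothetical
names for illustration and are not taken from cros_ec_lpc.c.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical transport operations; names are illustrative only. */
struct lpc_ops {
	const char *name;
	/* Returns true if a probe read looks like a valid EC response. */
	bool (*probe_read)(const void *hw);
};

/* Stand-in for the hardware: records which protocol the board speaks. */
struct fake_board {
	bool has_mec;
};

static bool mec_probe_read(const void *hw)
{
	return ((const struct fake_board *)hw)->has_mec;
}

static bool regular_probe_read(const void *hw)
{
	return !((const struct fake_board *)hw)->has_mec;
}

static const struct lpc_ops mec_ops = { "mec", mec_probe_read };
static const struct lpc_ops regular_ops = { "regular", regular_probe_read };

/*
 * Try the MEC variant first, then fall back to the regular protocol.
 * Returns NULL if neither answers; *reads counts the probe attempts.
 */
static const struct lpc_ops *select_ops(const struct fake_board *hw,
					int *reads)
{
	const struct lpc_ops *candidates[] = { &mec_ops, &regular_ops };
	size_t i;

	*reads = 0;
	for (i = 0; i < sizeof(candidates) / sizeof(candidates[0]); i++) {
		(*reads)++;
		if (candidates[i]->probe_read(hw))
			return candidates[i];
	}
	return NULL;
}
```

In this model a MEC board is detected after one read, while a non-MEC
board needs two, which is the extra cost the commit message mentions.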

While here, also fix the alignment in the Kconfig file for this config
option, replacing the spaces with tabs.

Signed-off-by: Enric Balletbo i Serra 
Reviewed-by: Ezequiel Garcia 
Tested-by: Nick Crews 
---
Hi,

This is another attempt to solve the issue of selecting the CrOS MEC
variant at runtime. My first thought was to check for a device ID: the
MEC1322 has a register that contains the device ID. However, I am not
sure if we can read that register from the host without modifying the
firmware. Also, I am not sure if the MEC1322 is the only device in use
that supports that LPC protocol variant, so I ended up with a simpler
solution: check whether the protocol fails or not. Some background on
this issue can be found in [1] and [2].

The patch has been tested on:
 - Acer Chromebook R11 (Cyan - MEC variant)
 - Pixel Chromebook 2015 (Samus - non-MEC variant)
 - Dell Chromebook 11 (Wolf - non-MEC variant)
 - Toshiba Chromebook (Leon - non-MEC variant)

Best regards,
 Enric

[1] https://bugs.chromium.org/p/chromium/issues/detail?id=932626
[2] https://chromium-review.googlesource.com/c/chromiumos/overlays/chromiumos-overlay/+/1474254

Changes in v3:
- Kconfig: Split across multiple lines to keep it under 80 characters.
- Improve kernel-doc as suggested by Nick Crews.
- Convert msg in write function to const.
- Add rb and tb tags.

Changes in v2:
- Remove global bool to indicate the kind of variant as suggested by Ezequiel.
- Create an internal operations struct to allow different variants.

 drivers/platform/chrome/Kconfig   | 29 +++--
 drivers/platform/chrome/Makefile  |  2 +-
 drivers/platform/chrome/cros_ec_lpc.c | 77 ---
 drivers/platform/chrome/cros_ec_lpc_reg.c | 38 +++
 drivers/platform/chrome/cros_ec_lpc_reg.h | 17 -
 drivers/platform/chrome/wilco_ec/Kconfig  |  2 +-
 6 files changed, 89 insertions(+), 76 deletions(-)

diff --git a/drivers/platform/chrome/Kconfig b/drivers/platform/chrome/Kconfig
index 2826f7136f65..453e69733842 100644
--- a/drivers/platform/chrome/Kconfig
+++ b/drivers/platform/chrome/Kconfig
@@ -83,28 +83,17 @@ config CROS_EC_SPI
  'pre-amble' bytes before the response actually starts.
 
 config CROS_EC_LPC
-tristate "ChromeOS Embedded Controller (LPC)"
-depends on MFD_CROS_EC && ACPI && (X86 || COMPILE_TEST)
-help
-  If you say Y here, you get support for talking to the ChromeOS EC
-  over an LPC bus. This uses a simple byte-level protocol with a
-  checksum. This is used for userspace access only. The kernel
-  typically has its own communication methods.
-
-  To compile this driver as a module, choose M here: the
-  module will be called cros_ec_lpc.
-
-config CROS_EC_LPC_MEC
-   bool "ChromeOS Embedded Controller LPC Microchip EC (MEC) variant"
-   depends on CROS_EC_LPC
-   default n
+   tristate "ChromeOS Embedded Controller (LPC)"
+   depends on MFD_CROS_EC && ACPI && (X86 || COMPILE_TEST)
help
- If you say Y here, a variant LPC protocol for the Microchip EC
- will be used. Note that this variant is not backward compatible
- with non-Microchip ECs.
+ If you say Y here, you get support for talking to the ChromeOS EC
+ over an LPC bus, including the LPC Microchip EC (MEC) variant.
+ This uses a simple byte-level protocol with a checksum. This is
+ used for userspace access only. The kernel typically has its own
+ communication methods.
 
- If you have a ChromeOS Embedded Controller Microchip EC variant
- choose Y here.
+ To compile this driver as a module, choose M here: the
+ module will be called cros_ec_lpcs.
 
 config CROS_EC_PROTO
 bool
diff --git a/drivers/platform/chrome/Makefile b/drivers/platform/chrome/Makefile
index 1b2f1dcfcd5c..da9aa08d9fa6 100644
--- a/drivers/platform/chrome/Makefile
+++

[PATCH] binder: fix possible UAF when freeing buffer

2019-06-12 Thread Todd Kjos

There is a race between the binder driver cleaning
up a completed transaction via binder_free_transaction()
and a user calling binder_ioctl(BC_FREE_BUFFER) to
release a buffer. It doesn't matter which runs first, but
the two must be protected against running concurrently,
which can result in a UAF.
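
To make the locking idea concrete, here is a minimal userspace sketch in
which both paths clear the buffer/transaction back-pointers under one
shared lock. It assumes a pthread mutex as a stand-in for the per-process
inner lock; the struct and function names are invented for this sketch
and are not the binder driver's.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

struct txn;
struct buf { struct txn *transaction; };
struct txn { struct buf *buffer; };

/* Stands in for the per-process inner lock (hypothetical name). */
static pthread_mutex_t inner_lock = PTHREAD_MUTEX_INITIALIZER;

/* Path A: the driver tears down a completed transaction. */
static void free_transaction(struct txn *t)
{
	pthread_mutex_lock(&inner_lock);
	if (t->buffer)
		t->buffer->transaction = NULL;
	pthread_mutex_unlock(&inner_lock);
	/* The real code would free t after this point. */
}

/* Path B: user space releases the buffer. */
static void free_buf(struct buf *b)
{
	pthread_mutex_lock(&inner_lock);
	if (b->transaction) {
		b->transaction->buffer = NULL;
		b->transaction = NULL;
	}
	pthread_mutex_unlock(&inner_lock);
}
```

Whichever path takes the lock first severs the link, so the other path
sees a NULL pointer instead of following a pointer to freed memory.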

Signed-off-by: Todd Kjos 
---
 drivers/android/binder.c | 16 ++--
 1 file changed, 14 insertions(+), 2 deletions(-)

diff --git a/drivers/android/binder.c b/drivers/android/binder.c
index 748ac489ef7eb..bc26b5511f0a9 100644
--- a/drivers/android/binder.c
+++ b/drivers/android/binder.c
@@ -1941,8 +1941,18 @@ static void binder_free_txn_fixups(struct binder_transaction *t)
 
 static void binder_free_transaction(struct binder_transaction *t)
 {
-   if (t->buffer)
-   t->buffer->transaction = NULL;
+   struct binder_proc *target_proc = t->to_proc;
+
+   if (target_proc) {
+   binder_inner_proc_lock(target_proc);
+   if (t->buffer)
+   t->buffer->transaction = NULL;
+   binder_inner_proc_unlock(target_proc);
+   }
+   /*
+* If the transaction has no target_proc, then
+* t->buffer->transaction has already been cleared.
+*/
binder_free_txn_fixups(t);
kfree(t);
binder_stats_deleted(BINDER_STAT_TRANSACTION);
@@ -3551,10 +3561,12 @@ static void binder_transaction(struct binder_proc *proc,
 static void
 binder_free_buf(struct binder_proc *proc, struct binder_buffer *buffer)
 {
+   binder_inner_proc_lock(proc);
if (buffer->transaction) {
buffer->transaction->buffer = NULL;
buffer->transaction = NULL;
}
+   binder_inner_proc_unlock(proc);
if (buffer->async_transaction && buffer->target_node) {
struct binder_node *buf_node;
struct binder_work *w;
-- 
2.22.0.rc2.383.gf4fbbf30c2-goog

[GIT PULL] cpupower update for Linux 5.2-rc6

2019-06-12 Thread Shuah Khan


Hi Rafael,

Please pull the following update for Linux 5.2-rc6 or 5.3 depending on
your pull request schedule for Linus.

This cpupower update for Linux 5.2-rc6 consists of a fix and a minor
spelling correction.

diff is attached.

thanks,
-- Shuah


The following changes since commit f2c7c76c5d0a443053e94adb9f0918fa2fb85c3a:

  Linux 5.2-rc3 (2019-06-02 13:55:33 -0700)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux tags/linux-cpupower-5.2-rc6


for you to fetch changes up to 04507c0a9385cc8280f794a36bfff567c8cc1042:

  cpupower : frequency-set -r option misses the last cpu in related cpu list (2019-06-04 09:06:50 -0600)



linux-cpupower-5.2-rc6

This cpupower update for Linux 5.2-rc6 consists of a fix and a minor
spelling correction.


Abhishek Goel (1):
  cpupower : frequency-set -r option misses the last cpu in related cpu list


Nick Black (1):
  cpupower: correct spelling of interval

 tools/power/cpupower/man/cpupower-monitor.1 | 2 +-
 tools/power/cpupower/po/cs.po   | 2 +-
 tools/power/cpupower/po/de.po   | 2 +-
 tools/power/cpupower/po/fr.po   | 2 +-
 tools/power/cpupower/po/it.po   | 2 +-
 tools/power/cpupower/po/pt.po   | 2 +-
 tools/power/cpupower/utils/cpufreq-set.c| 2 ++
 7 files changed, 8 insertions(+), 6 deletions(-)


diff --git a/tools/power/cpupower/man/cpupower-monitor.1 b/tools/power/cpupower/man/cpupower-monitor.1
index 914cbb9d9cd0..70a56476f4b0 100644
--- a/tools/power/cpupower/man/cpupower-monitor.1
+++ b/tools/power/cpupower/man/cpupower-monitor.1
@@ -61,7 +61,7 @@ Only display specific monitors. Use the monitor string(s) provided by \-l option
 .PP
 \-i seconds
 .RS 4
-Measure intervall.
+Measure interval.
 .RE
 .PP
 \-c
diff --git a/tools/power/cpupower/po/cs.po b/tools/power/cpupower/po/cs.po
index cb22c45c5069..bfc7e1702ec9 100644
--- a/tools/power/cpupower/po/cs.po
+++ b/tools/power/cpupower/po/cs.po
@@ -98,7 +98,7 @@ msgstr ""
 
 #: utils/idle_monitor/cpupower-monitor.c:74
 #, c-format
-msgid "\t -i: time intervall to measure for in seconds (default 1)\n"
+msgid "\t -i: time interval to measure for in seconds (default 1)\n"
 msgstr ""
 
 #: utils/idle_monitor/cpupower-monitor.c:75
diff --git a/tools/power/cpupower/po/de.po b/tools/power/cpupower/po/de.po
index 840c17cc450a..70887bb8ba95 100644
--- a/tools/power/cpupower/po/de.po
+++ b/tools/power/cpupower/po/de.po
@@ -95,7 +95,7 @@ msgstr ""
 
 #: utils/idle_monitor/cpupower-monitor.c:74
 #, c-format
-msgid "\t -i: time intervall to measure for in seconds (default 1)\n"
+msgid "\t -i: time interval to measure for in seconds (default 1)\n"
 msgstr ""
 
 #: utils/idle_monitor/cpupower-monitor.c:75
diff --git a/tools/power/cpupower/po/fr.po b/tools/power/cpupower/po/fr.po
index b46ca2548f86..b6e505b34e4a 100644
--- a/tools/power/cpupower/po/fr.po
+++ b/tools/power/cpupower/po/fr.po
@@ -95,7 +95,7 @@ msgstr ""
 
 #: utils/idle_monitor/cpupower-monitor.c:74
 #, c-format
-msgid "\t -i: time intervall to measure for in seconds (default 1)\n"
+msgid "\t -i: time interval to measure for in seconds (default 1)\n"
 msgstr ""
 
 #: utils/idle_monitor/cpupower-monitor.c:75
diff --git a/tools/power/cpupower/po/it.po b/tools/power/cpupower/po/it.po
index f80c4ddb9bda..a1deeb52c9e0 100644
--- a/tools/power/cpupower/po/it.po
+++ b/tools/power/cpupower/po/it.po
@@ -95,7 +95,7 @@ msgstr ""
 
 #: utils/idle_monitor/cpupower-monitor.c:74
 #, c-format
-msgid "\t -i: time intervall to measure for in seconds (default 1)\n"
+msgid "\t -i: time interval to measure for in seconds (default 1)\n"
 msgstr ""
 
 #: utils/idle_monitor/cpupower-monitor.c:75
diff --git a/tools/power/cpupower/po/pt.po b/tools/power/cpupower/po/pt.po
index 990f5267ffe8..902186585bb9 100644
--- a/tools/power/cpupower/po/pt.po
+++ b/tools/power/cpupower/po/pt.po
@@ -93,7 +93,7 @@ msgstr ""
 
 #: utils/idle_monitor/cpupower-monitor.c:74
 #, c-format
-msgid "\t -i: time intervall to measure for in seconds (default 1)\n"
+msgid "\t -i: time interval to measure for in seconds (default 1)\n"
 msgstr ""
 
 #: utils/idle_monitor/cpupower-monitor.c:75
diff --git a/tools/power/cpupower/utils/cpufreq-set.c b/tools/power/cpupower/utils/cpufreq-set.c
index f49bc4aa2a08..6ed82fba5aaa 100644
--- a/tools/power/cpupower/utils/cpufreq-set.c
+++ b/tools/power/cpupower/utils/cpufreq-set.c
@@ -305,6 +305,8 @@ int cmd_freq_set(int argc, char **argv)
 bitmask_setbit(cpus_chosen, cpus->cpu);
 cpus = cpus->next;
 			}
+			/* Set the last cpu in related cpus list */
+			bitmask_setbit(cpus_chosen, cpus->cpu);
 			cpufreq_put_related_cpus(cpus);
 		}
 	}
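
The off-by-one addressed by the cpufreq-set.c hunk above can be
reproduced with a small standalone model. The struct below is a minimal
stand-in for a singly linked CPU list, not the cpupower type, and a plain
unsigned long replaces the bitmask helpers.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal stand-in for a singly linked CPU list; not the cpupower type. */
struct cpu_node {
	unsigned int cpu;
	struct cpu_node *next;
};

/* Loop shaped like the pre-fix code: it stops before visiting the tail. */
static unsigned long mask_skipping_tail(const struct cpu_node *n)
{
	unsigned long mask = 0;

	while (n->next) {
		mask |= 1UL << n->cpu;
		n = n->next;
	}
	return mask;
}

/* Same loop plus the extra step from the fix: also set the tail node. */
static unsigned long mask_with_tail(const struct cpu_node *n)
{
	unsigned long mask = 0;

	while (n->next) {
		mask |= 1UL << n->cpu;
		n = n->next;
	}
	mask |= 1UL << n->cpu;
	return mask;
}
```

For a list holding CPUs 0, 1 and 2, the first function sets only bits 0
and 1, while the second sets all three.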

Re: [RFC 00/10] Process-local memory allocations for hiding KVM secrets

2019-06-12 Thread Andy Lutomirski




> On Jun 12, 2019, at 12:55 PM, Dave Hansen  wrote:
> 
>> On 6/12/19 10:08 AM, Marius Hillenbrand wrote:
>> This patch series proposes to introduce a region for what we call
>> process-local memory into the kernel's virtual address space. 
> 
> It might be fun to cc some x86 folks on this series.  They might have
> some relevant opinions. ;)
> 
> A few high-level questions:
> 
> Why go to all this trouble to hide guest state like registers if all the
> guest data itself is still mapped?
> 
> Where's the context-switching code?  Did I just miss it?
> 
> We've discussed having per-cpu page tables where a given PGD is only in
> use from one CPU at a time.  I *think* this scheme still works in such a
> case, it just adds one more PGD entry that would have to context-switched.

Fair warning: Linus is on record as absolutely hating this idea. He might 
change his mind, but it’s an uphill battle.

Re: [PATCH] locking/static_key: always define static_branch_deferred_inc

2019-06-12 Thread Willem de Bruijn

On Wed, Jun 12, 2019 at 3:59 PM Jakub Kicinski
 wrote:
>
> On Wed, 12 Jun 2019 15:44:09 -0400, Willem de Bruijn wrote:
> > From: Willem de Bruijn 
> >
> > This interface is currently only defined if CONFIG_JUMP_LABEL. Make it
> > available also when jump labels are disabled.
> >
> > Fixes: ad282a8117d50 ("locking/static_key: Add support for deferred static branches")
> > Signed-off-by: Willem de Bruijn 
> >
> > ---
> >
> > The original patch went into 5.2-rc1, but this interface is not yet
> > used, so this could target either 5.2 or 5.3.
>
> Can we drop the Fixes tag?  It's an ugly omission but not a bug fix.
>
> Are you planning to switch clean_acked_data_enable() to the helper once
> merged?

Definitely, can do.

Perhaps it's easiest to send both as a single patch set through net-next, then?
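
For readers outside this thread, the pattern being discussed (a helper
that should exist whether or not an optional feature is compiled in) can
be sketched as below. This is a simplified userspace model: the demo_
types and the DEMO_JUMP_LABEL switch are hypothetical and only mimic the
shape of the kernel header, not its contents.

```c
#include <assert.h>

/* Hypothetical stand-ins; the kernel's real types differ. */
struct demo_key { int enabled; };
struct demo_key_deferred { struct demo_key key; };

static void demo_branch_inc(struct demo_key *k)
{
	k->enabled++;
}

#ifdef DEMO_JUMP_LABEL
/* Helpers that only make sense with the optional feature would go here. */
#endif

/*
 * Defined outside the #ifdef so callers compile in both configurations.
 * It simply forwards to the plain increment on the embedded key.
 */
#define demo_branch_deferred_inc(x) demo_branch_inc(&(x)->key)
```

A caller can then use demo_branch_deferred_inc() unconditionally instead
of wrapping every call site in its own #ifdef.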

Re: [RESEND PATCH v1 1/5] of/platform: Speed up of_find_device_by_node()

2019-06-12 Thread Frank Rowand

On 6/12/19 12:29 PM, Saravana Kannan wrote:
> On Wed, Jun 12, 2019 at 11:19 AM Rob Herring  wrote:
>>
>> On Wed, Jun 12, 2019 at 11:08 AM Greg Kroah-Hartman
>>  wrote:
>>>
>>> On Wed, Jun 12, 2019 at 10:53:09AM -0600, Rob Herring wrote:
>>>> On Wed, Jun 12, 2019 at 8:22 AM Greg Kroah-Hartman
>>>>  wrote:
>>>>>
>>>>> On Wed, Jun 12, 2019 at 07:53:39AM -0600, Rob Herring wrote:
>>>>>> On Tue, Jun 11, 2019 at 3:52 PM Sandeep Patil wrote:
>>>>>>>
>>>>>>> On Tue, Jun 11, 2019 at 01:56:25PM -0700, 'Saravana Kannan' via kernel-team wrote:
>>>>>>>> On Tue, Jun 11, 2019 at 8:18 AM Frank Rowand wrote:
>>>>>>>>>
>>>>>>>>> Hi Saravana,
>>>>>>>>>
>>>>>>>>> On 6/10/19 10:36 AM, Rob Herring wrote:
>>>>>>>>>> Why are you resending this rather than replying to Frank's last
>>>>>>>>>> comments on the original?
>>>>>>>>>
>>>>>>>>> Adding on a different aspect...  The independent replies from three different
>>>>>>>>> maintainers (Rob, Mark, myself) pointed out architectural issues with the
>>>>>>>>> patch series.  There were also some implementation issues brought out.
>>>>>>>>> (Although I refrained from bringing up most of my implementation issues
>>>>>>>>> as they are not relevant until architecture issues are resolved.)
>>>>>>>>
>>>>>>>> Right, I'm not too worried about the implementation issues before we
>>>>>>>> settle on the architectural issues. Those are easy to fix.
>>>>>>>>
>>>>>>>> Honestly, the main points that the maintainers raised are:
>>>>>>>> 1) This is a configuration property and not describing the device.
>>>>>>>> Just use the implicit dependencies coming from existing bindings.
>>>>>>>>
>>>>>>>> I gave a bunch of reasons for why I think it isn't an OS configuration
>>>>>>>> property. But even if that's not something the maintainers can agree
>>>>>>>> to, I gave a concrete example (cyclic dependencies between clock
>>>>>>>> provider hardware) where the implicit dependencies would prevent one
>>>>>>>> of the devices from probing till the end of time. So even if the
>>>>>>>> maintainers don't agree we should always look at "depends-on" to
>>>>>>>> decide the dependencies, we still need some means to override the
>>>>>>>> implicit dependencies where they don't match the real dependency. Can
>>>>>>>> we use depends-on as an override when the implicit dependencies aren't
>>>>>>>> correct?
>>>>>>>>
>>>>>>>> 2) This doesn't need to be solved because this is just optimizing
>>>>>>>> probing or saving power ("we should get rid of this auto disabling"):
>>>>>>>>
>>>>>>>> I explained why this patch series is not just about optimizing probe
>>>>>>>> ordering or saving power. And why we can't ignore auto disabling
>>>>>>>> (because it's more than just auto disabling). The kernel is currently
>>>>>>>> broken when trying to use modules in ARM SoCs (probably in other
>>>>>>>> systems/archs too, but I can't speak for those).
>>>>>>>>
>>>>>>>> 3) Concerns about backwards compatibility
>>>>>>>>
>>>>>>>> I pointed out why the current scheme (depends-on being the only source
>>>>>>>> of dependency) doesn't break compatibility. And if we go with
>>>>>>>> "depends-on" as an override what we could do to keep backwards
>>>>>>>> compatibility. Happy to hear more thoughts or discuss options.
>>>>>>>>
>>>>>>>> 4) How the "sync_state" would work for a device that supplies multiple
>>>>>>>> functionalities but a limited driver.
>>>>>>>
>>>>>>>
>>>>>>> To be clear, all of above are _real_ problems that stops us from efficiently
>>>>>>> load device drivers as modules for Android.
>>>>>>>
>>>>>>> So, if 'depends-on' doesn't seem like the right approach and "going back to
>>>>>>> the drawing board" is the ask, could you please point us in the right
>>>>>>> direction?
>>>>>>
>>>>>> Use the dependencies which are already there in DT. That's clocks,
>>>>>> pinctrl, regulators, interrupts, gpio at a minimum. I'm simply not
>>>>>> going to accept duplicating all those dependencies in DT. The downside
>>>>>> for the kernel is you have to address these one by one and can't have
>>>>>> a generic property the driver core code can parse. After that's in
>>>>>> place, then maybe we can consider handling any additional dependencies
>>>>>> not already captured in DT. Once all that is in place, we can probably
>>>>>> sort device and/or driver lists to optimize the probe order (maybe the
>>>>>> driver core already does that now?).
>>>>>>
>>>>>> Get rid of the auto disabling of clocks and regulators in
>>>>>> late_initcall. It's simply not a valid marker that boot is done when
>>>>>> modules are involved. We probably can't get rid of it as lot's of
>>>>>> platforms rely on that, so it will have to be opt out. Make it the
>>>>>> platform's responsibility for ensuring a consistent state.
>>>>>>
>>>>>> Perhaps we need a 'boot done' or 'stop deferring probe' trigger from
>>>>>> userspace in order to make progress if dependencies are missing.
>>>>>
>>>>> People have tried to do

Re: [PATCH v4 0/7] cpufreq support for Raspberry Pi

2019-06-12 Thread Stefan Wahren

Hi Nicolas,

Am 12.06.19 um 20:24 schrieb Nicolas Saenz Julienne:
> Hi all,
> this aims at adding cpufreq support to the Raspberry Pi family of
> boards.
>
> The series first factors out 'pllb' from clk-bcm2835 and creates a new
> clk driver that operates it over RPi's firmware interface[1]. We are
> forced to do so as the firmware 'owns' the pll and we're not allowed to
> change through the register interface directly as we might race with the
> over-temperature and under-voltage protections provided by the firmware.
>
> Next it creates a minimal cpufreq driver that populates the CPU's opp
> table, and registers cpufreq-dt. Which is needed as the firmware
> controls the max and min frequencies available.
>
> This was tested on a RPi3b+ and RPI2b, both using multi_v7_defconfig and
> arm64's defconfig.
>
this whole series is:

Acked-by: Stefan Wahren 

Thanks

Re: WARNING in binder_transaction_buffer_release

2019-06-12 Thread Todd Kjos

On Wed, Jun 12, 2019 at 12:23 PM Eric Biggers  wrote:
>
> On Mon, May 20, 2019 at 07:18:06AM -0700, syzbot wrote:
> > Hello,
> >
> > syzbot found the following crash on:
> >
> > HEAD commit:72cf0b07 Merge tag 'sound-fix-5.2-rc1' of git://git.kernel..
> > git tree:   upstream
> > console output: https://syzkaller.appspot.com/x/log.txt?x=17c7d4bca0
> > kernel config:  https://syzkaller.appspot.com/x/.config?x=d103f114f9010324
> > dashboard link: https://syzkaller.appspot.com/bug?extid=8b3c354d33c4ac78bfad
> > compiler:   gcc (GCC) 9.0.0 20181231 (experimental)
> > userspace arch: i386
> > syz repro:  https://syzkaller.appspot.com/x/repro.syz?x=15b99b44a0
> >
> > IMPORTANT: if you fix the bug, please add the following tag to the commit:
> > Reported-by: syzbot+8b3c354d33c4ac78b...@syzkaller.appspotmail.com
> >
> > WARNING: CPU: 1 PID: 8535 at drivers/android/binder.c:2368
> > binder_transaction_buffer_release+0x673/0x8f0 drivers/android/binder.c:2368
> > Kernel panic - not syncing: panic_on_warn set ...
> > CPU: 1 PID: 8535 Comm: syz-executor.2 Not tainted 5.1.0+ #19
> > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
> > Google 01/01/2011
> > Call Trace:
> >  __dump_stack lib/dump_stack.c:77 [inline]
> >  dump_stack+0x172/0x1f0 lib/dump_stack.c:113
> >  panic+0x2cb/0x715 kernel/panic.c:214
> >  __warn.cold+0x20/0x4c kernel/panic.c:571
> >  report_bug+0x263/0x2b0 lib/bug.c:186
> >  fixup_bug arch/x86/kernel/traps.c:179 [inline]
> >  fixup_bug arch/x86/kernel/traps.c:174 [inline]
> >  do_error_trap+0x11b/0x200 arch/x86/kernel/traps.c:272
> >  do_invalid_op+0x37/0x50 arch/x86/kernel/traps.c:291
> >  invalid_op+0x14/0x20 arch/x86/entry/entry_64.S:986
> > RIP: 0010:binder_transaction_buffer_release+0x673/0x8f0
> > drivers/android/binder.c:2368
> > Code: 31 ff 41 89 c5 89 c6 e8 7b 04 1f fc 45 85 ed 0f 85 1f 41 01 00 49 8d
> > 47 40 48 89 85 50 fe ff ff e9 9d fa ff ff e8 dd 02 1f fc <0f> 0b e9 7f fc ff
> > ff e8 d1 02 1f fc 48 89 d8 45 31 c9 4c 89 fe 4c
> > RSP: 0018:88807b2775f0 EFLAGS: 00010293
> > RAX: 888092b1e040 RBX: 0060 RCX: 111012563caa
> > RDX:  RSI: 85519e13 RDI: 888097a2d248
> > RBP: 88807b2777d8 R08: 888092b1e040 R09: ed100f64eee3
> > R10: ed100f64eee2 R11: 88807b277717 R12: 88808fd2c340
> > R13: 0068 R14: 88807b2777b0 R15: 88809f7ea580
> >  binder_transaction+0x153d/0x6620 drivers/android/binder.c:3484
> >  binder_thread_write+0x87e/0x2820 drivers/android/binder.c:3792
> >  binder_ioctl_write_read drivers/android/binder.c:4836 [inline]
> >  binder_ioctl+0x102f/0x1833 drivers/android/binder.c:5013
> >  __do_compat_sys_ioctl fs/compat_ioctl.c:1052 [inline]
> >  __se_compat_sys_ioctl fs/compat_ioctl.c:998 [inline]
> >  __ia32_compat_sys_ioctl+0x195/0x620 fs/compat_ioctl.c:998
> >  do_syscall_32_irqs_on arch/x86/entry/common.c:337 [inline]
> >  do_fast_syscall_32+0x27b/0xd7d arch/x86/entry/common.c:408
> >  entry_SYSENTER_compat+0x70/0x7f arch/x86/entry/entry_64_compat.S:139
> > RIP: 0023:0xf7f9e849
> > Code: 85 d2 74 02 89 0a 5b 5d c3 8b 04 24 c3 8b 14 24 c3 8b 3c 24 c3 90 90
> > 90 90 90 90 90 90 90 90 90 90 51 52 55 89 e5 0f 34 cd 80 <5d> 5a 59 c3 90 90
> > 90 90 eb 0d 90 90 90 90 90 90 90 90 90 90 90 90
> > RSP: 002b:f7f9a0cc EFLAGS: 0296 ORIG_RAX: 0036
> > RAX: ffda RBX: 0004 RCX: c0306201
> > RDX: 2140 RSI:  RDI: 
> > RBP:  R08:  R09: 
> > R10:  R11:  R12: 
> > R13:  R14:  R15: 
> > Kernel Offset: disabled
> > Rebooting in 86400 seconds..
> >
> >
> > ---
> > This bug is generated by a bot. It may contain errors.
> > See https://goo.gl/tpsmEJ for more information about syzbot.
> > syzbot engineers can be reached at syzkal...@googlegroups.com.
> >
> > syzbot will keep track of this bug report. See:
> > https://goo.gl/tpsmEJ#status for how to communicate with syzbot.
> > syzbot can test patches for this bug, for details see:
> > https://goo.gl/tpsmEJ#testing-patches
> >
>
> Are any of the binder maintainers planning to fix this?  This seems to be the
> only open syzbot report for binder on the upstream kernel.

Taking a look.

>
> - Eric

[PATCH 1/2] thermal/drivers/core: Add init section table for self-encapsulation

2019-06-12 Thread Daniel Lezcano

Currently the governors are declared in their respective files but they
export their [un]register functions which in turn call the [un]register
governors core's functions. That implies a cyclic dependency which is
not desirable. There is a way to self-encapsulate the governors by letting
them declare themselves in a __init section table.

Define the table in the asm generic linker description like the other
tables and provide the specific macros to deal with.

Reviewed-by: Amit Kucheria 
Signed-off-by: Daniel Lezcano 
---
 drivers/thermal/thermal_core.h| 15 +++
 include/asm-generic/vmlinux.lds.h | 11 +++
 2 files changed, 26 insertions(+)

diff --git a/drivers/thermal/thermal_core.h b/drivers/thermal/thermal_core.h
index 0df190ed82a7..be901e84aa65 100644
--- a/drivers/thermal/thermal_core.h
+++ b/drivers/thermal/thermal_core.h
@@ -15,6 +15,21 @@
 /* Initial state of a cooling device during binding */
 #define THERMAL_NO_TARGET -1UL
 
+/* Init section thermal table */
+extern struct thermal_governor *__governor_thermal_table[];
+extern struct thermal_governor *__governor_thermal_table_end[];
+
+#define THERMAL_TABLE_ENTRY(table, name)   \
+   (static typeof(name) *__thermal_table_entry_##name  \
+   __used __section(__##table##_thermal_table) = &name)
+
+#define THERMAL_GOVERNOR_DECLARE(name) THERMAL_TABLE_ENTRY(governor, name)
+
+#define for_each_governor_table(__governor)\
+   for (__governor = __governor_thermal_table; \
+__governor < __governor_thermal_table_end; \
+__governor++)
+
 /*
  * This structure is used to describe the behavior of
  * a certain cooling device on a certain trip point
diff --git a/include/asm-generic/vmlinux.lds.h 
b/include/asm-generic/vmlinux.lds.h
index f8f6f04c4453..8312fdc2b2fa 100644
--- a/include/asm-generic/vmlinux.lds.h
+++ b/include/asm-generic/vmlinux.lds.h
@@ -239,6 +239,16 @@
 #define ACPI_PROBE_TABLE(name)
 #endif
 
+#ifdef CONFIG_THERMAL
+#define THERMAL_TABLE(name)\
+   . = ALIGN(8);   \
+   __##name##_thermal_table = .;   \
+   KEEP(*(__##name##_thermal_table))   \
+   __##name##_thermal_table_end = .;
+#else
+#define THERMAL_TABLE(name)
+#endif
+
 #define KERNEL_DTB()   \
STRUCT_ALIGN(); \
__dtb_start = .;\
@@ -609,6 +619,7 @@
IRQCHIP_OF_MATCH_TABLE()\
ACPI_PROBE_TABLE(irqchip)   \
ACPI_PROBE_TABLE(timer) \
+   THERMAL_TABLE(governor) \
EARLYCON_TABLE()\
LSM_TABLE()
 
-- 
2.17.1
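
The table mechanism described in this patch can be tried outside the kernel. The following is a minimal userspace sketch, assuming GCC or Clang with GNU ld on an ELF target. The names struct governor, GOVERNOR_DECLARE, gov_table, for_each_governor, governor_count and governor_find are hypothetical stand-ins and are not part of the patch. Instead of a linker script, it relies on the __start_/__stop_ symbols that GNU ld generates for sections whose name is a valid C identifier.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for struct thermal_governor. */
struct governor {
	const char *name;
};

/*
 * Emit a pointer to the governor into the "gov_table" section.
 * "used" stops the compiler from dropping the otherwise unreferenced entry.
 */
#define GOVERNOR_DECLARE(gov)					\
	static struct governor *const gov_entry_##gov		\
	__attribute__((used, section("gov_table"))) = &gov

/*
 * GNU ld provides __start_<sec>/__stop_<sec> for sections whose name is a
 * valid C identifier. The kernel patch defines its own start/end symbols
 * in vmlinux.lds.h instead.
 */
extern struct governor *const __start_gov_table[];
extern struct governor *const __stop_gov_table[];

#define for_each_governor(pos)					\
	for (pos = __start_gov_table; pos < __stop_gov_table; pos++)

/* Each governor registers itself; no central list has to name it. */
static struct governor gov_step_wise = { .name = "step_wise" };
GOVERNOR_DECLARE(gov_step_wise);

static struct governor gov_fair_share = { .name = "fair_share" };
GOVERNOR_DECLARE(gov_fair_share);

/* Count the entries the linker collected into the table. */
size_t governor_count(void)
{
	struct governor *const *pos;
	size_t n = 0;

	for_each_governor(pos)
		n++;
	return n;
}

/* Look a governor up by name, or return NULL if it is not in the table. */
struct governor *governor_find(const char *name)
{
	struct governor *const *pos;

	for_each_governor(pos)
		if (strcmp((*pos)->name, name) == 0)
			return *pos;
	return NULL;
}
```

The core code only iterates between the two boundary symbols, so adding a governor does not require touching the core at all. The order of the entries inside the section is left to the toolchain and should not be relied upon.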

[PATCH 2/2] thermal/drivers/core: Use governor table to initialize

2019-06-12 Thread Daniel Lezcano

Now that the governor table is in place and the macro allows browsing the
table, declare the governors so that their entries are added to the
governor table in the init section.

The [un]register_thermal_governors functions no longer need to use the
exported governor-specific [un]register functions, which in turn call
[un]register_thermal_governor. The governors are fully
self-encapsulated.

The cyclic dependency is no longer needed, remove it.

Reviewed-by: Amit Kucheria 
Signed-off-by: Daniel Lezcano 
---
 drivers/thermal/fair_share.c  | 12 +--
 drivers/thermal/gov_bang_bang.c   | 11 +--
 drivers/thermal/power_allocator.c | 11 +--
 drivers/thermal/step_wise.c   | 11 +--
 drivers/thermal/thermal_core.c| 52 +--
 drivers/thermal/thermal_core.h| 40 
 drivers/thermal/user_space.c  | 12 +--
 7 files changed, 34 insertions(+), 115 deletions(-)

diff --git a/drivers/thermal/fair_share.c b/drivers/thermal/fair_share.c
index d3469fbc5207..bda2afc63471 100644
--- a/drivers/thermal/fair_share.c
+++ b/drivers/thermal/fair_share.c
@@ -129,14 +129,4 @@ static struct thermal_governor thermal_gov_fair_share = {
.name   = "fair_share",
.throttle   = fair_share_throttle,
 };
-
-int thermal_gov_fair_share_register(void)
-{
-   return thermal_register_governor(&thermal_gov_fair_share);
-}
-
-void thermal_gov_fair_share_unregister(void)
-{
-   thermal_unregister_governor(&thermal_gov_fair_share);
-}
-
+THERMAL_GOVERNOR_DECLARE(thermal_gov_fair_share);
diff --git a/drivers/thermal/gov_bang_bang.c b/drivers/thermal/gov_bang_bang.c
index fc5e5057f0de..c5e19c7d63da 100644
--- a/drivers/thermal/gov_bang_bang.c
+++ b/drivers/thermal/gov_bang_bang.c
@@ -126,13 +126,4 @@ static struct thermal_governor thermal_gov_bang_bang = {
.name   = "bang_bang",
.throttle   = bang_bang_control,
 };
-
-int thermal_gov_bang_bang_register(void)
-{
-   return thermal_register_governor(&thermal_gov_bang_bang);
-}
-
-void thermal_gov_bang_bang_unregister(void)
-{
-   thermal_unregister_governor(&thermal_gov_bang_bang);
-}
+THERMAL_GOVERNOR_DECLARE(thermal_gov_bang_bang);
diff --git a/drivers/thermal/power_allocator.c 
b/drivers/thermal/power_allocator.c
index 3055f9a12a17..44636475b2a3 100644
--- a/drivers/thermal/power_allocator.c
+++ b/drivers/thermal/power_allocator.c
@@ -651,13 +651,4 @@ static struct thermal_governor thermal_gov_power_allocator 
= {
.unbind_from_tz = power_allocator_unbind,
.throttle   = power_allocator_throttle,
 };
-
-int thermal_gov_power_allocator_register(void)
-{
-   return thermal_register_governor(&thermal_gov_power_allocator);
-}
-
-void thermal_gov_power_allocator_unregister(void)
-{
-   thermal_unregister_governor(&thermal_gov_power_allocator);
-}
+THERMAL_GOVERNOR_DECLARE(thermal_gov_power_allocator);
diff --git a/drivers/thermal/step_wise.c b/drivers/thermal/step_wise.c
index ee047ca43084..6cd251ab56fc 100644
--- a/drivers/thermal/step_wise.c
+++ b/drivers/thermal/step_wise.c
@@ -218,13 +218,4 @@ static struct thermal_governor thermal_gov_step_wise = {
.name   = "step_wise",
.throttle   = step_wise_throttle,
 };
-
-int thermal_gov_step_wise_register(void)
-{
-   return thermal_register_governor(&thermal_gov_step_wise);
-}
-
-void thermal_gov_step_wise_unregister(void)
-{
-   thermal_unregister_governor(&thermal_gov_step_wise);
-}
+THERMAL_GOVERNOR_DECLARE(thermal_gov_step_wise);
diff --git a/drivers/thermal/thermal_core.c b/drivers/thermal/thermal_core.c
index 3ac0e2b564e2..533530529607 100644
--- a/drivers/thermal/thermal_core.c
+++ b/drivers/thermal/thermal_core.c
@@ -243,36 +243,42 @@ int thermal_build_list_of_policies(char *buf)
return count;
 }
 
-static int __init thermal_register_governors(void)
+static void __init thermal_unregister_governors(void)
 {
-   int result;
+   struct thermal_governor **governor;
 
-   result = thermal_gov_step_wise_register();
-   if (result)
-   return result;
+   for_each_governor_table(governor)
+   thermal_unregister_governor(*governor);
+}
 
-   result = thermal_gov_fair_share_register();
-   if (result)
-   return result;
+static int __init thermal_register_governors(void)
+{
+   int ret = 0;
+   struct thermal_governor **governor;
 
-   result = thermal_gov_bang_bang_register();
-   if (result)
-   return result;
+   for_each_governor_table(governor) {
+   ret = thermal_register_governor(*governor);
+   if (ret) {
+   pr_err("Failed to register governor: '%s'",
+  (*governor)->name);
+   break;
+   }
 
-   result = thermal_gov_user_space_register();
-   if (result)
-   return result;
+   pr_info("Registered thermal governor '%s'",
+

Re: [PATCH] arm64: defconfig: Enable FSL_EDMA driver

2019-06-12 Thread Li Yang

On Thu, May 9, 2019 at 10:15 PM Shawn Guo  wrote:
>
> On Mon, Apr 22, 2019 at 01:30:56PM -0500, Li Yang wrote:
> > Enables the FSL EDMA driver by default.  This also works around an issue
> > that imx-i2c driver keeps deferring the probe because the DMA is not
> > ready.  And currently the DMA engine framework can not correctly tell
> > if the DMA channels will truly become available later (it will never be
> > available if the DMA driver is not enabled).
> >
> > This will cause indefinite messages like below:
> > [3.335829] imx-i2c 218.i2c: can't get pinctrl, bus recovery not 
> > supported
> > [3.344455] ina2xx 0-0040: power monitor ina220 (Rshunt = 1000 uOhm)
> > [3.350917] lm90 0-004c: 0-004c supply vcc not found, using dummy 
> > regulator
> > [3.362089] imx-i2c 218.i2c: can't get pinctrl, bus recovery not 
> > supported
> > [3.370741] ina2xx 0-0040: power monitor ina220 (Rshunt = 1000 uOhm)
> > [3.377205] lm90 0-004c: 0-004c supply vcc not found, using dummy 
> > regulator
> > [3.388455] imx-i2c 218.i2c: can't get pinctrl, bus recovery not 
> > supported
> > .
> >
> > Signed-off-by: Li Yang 
>
> Applied, thanks.

Hi Shawn,

Is it possible to move this patch to the -fix series so that it can
reach the mainline earlier?  Without this workaround, mainline fails to
boot on platforms using this device.

I see Rob added a new API driver_deferred_probe_check_state() last
year.  Probably we should update the imx-i2c driver to use the new API
for optional dependencies to avoid this kind of situation completely?

Regards,
Leo

[PATCH v3 00/14] pwm-meson: cleanups and improvements

2019-06-12 Thread Martin Blumenstingl

This series consists of various cleanups and improvements for the
pwm-meson driver.

Patches 1 to 6 are small code cleanups with the goal of making the code
easier to read.

Patches 7 to 9 are reworking the way the per-channel settings are
accessed. This is a first preparation step for adding full support to
meson_pwm_get_state() in the pwm-meson driver. Patch 7 makes struct
meson_pwm_channel accessible from struct meson_pwm because
meson_pwm_get_state() cannot use pwm_get_chip_data(). Patch 8 removes
redundant switch/case statements and ensures that we don't have to
add another redundant one for the upcoming full meson_pwm_get_state()
implementation. Patch 9 gets rid of meson_pwm_add_channels() and moves
the pwm_set_chip_data() call to meson_pwm_request() (like all other PWM
drivers do - except two).

Patch 10 is based on a suggestion by Uwe to simplify the calculation of
the values which the PWM IP requires. The nice benefit of this is that
we have an easier calculation which we can do "in reverse" for the
meson_pwm_get_state() (which calculates nanoseconds from the hardware
values).

Patch 11 implements reading the period and duty cycle in the
meson_pwm_get_state() callback.

Patch 12 removes some internal caching which we don't need anymore now
that meson_pwm_get_state() is fully implemented. The PWM core now takes care
of not calling pwm_ops.apply() if "nothing has changed".

Patch 13 adds support for PWM_POLARITY_INVERSED when disabling the
output as suggested by Uwe.

Patch 14 completes this series by adding some documentation to the
driver. Thanks to Neil for summarizing how the hardware works
internally.

Due to the changed PWM calculation in patch 10 I have verified that
we don't break any existing boards. The patch itself contains two
examples which show that the new calculation improves precision. I
made screenshots of the measurements in pulseview [0] for the second
case ("PWM LED on Khadas VIM"):
- old algorithm: [1]
- new algorithm: [2]

Dependencies:
This series applies on top of Neil's patch "pwm: pwm-meson: update with
SPDX Licence identifier" [3]

Changes since v1 at [4]:
- fixed MESON_NUM_PWM vs MESON_NUM_PWMS typo in patch #7
- add another example to patch #10 where the pre_div has changed with
  the new calculation. the generated PWM signal is still the same as
  measuring shows
- added Neil's Reviewed-by's and Uwe's Acked-by (thank you!)

Changes since v2 at [5]:
- fix the SoC name in the documentation patch (#14). The link points
  to the S912 datasheet so we shouldn't call it the "S922X datasheet".
  Spotted by Chris Moore (thank you!)
- add the link to the S922X datasheet in the documentation patch (#14)
  because that SoC generation contains an updated version of the IP
  block with hardware support for "inversion" and "constant mode"
- put my Signed-off-by after all Reviewed-by/Acked-by to indicate that
  I was the one who put the R-b/A-b there (spotted by Uwe - thank you)
- added Uwe's Reviewed-by to three patches (thank you!)


[0] https://sigrok.org/wiki/PulseView
[1] https://abload.de/img/old-algormjs9.png
[2] https://abload.de/img/new-algo4ckjo.png
[3] https://patchwork.kernel.org/patch/10951319/
[4] https://patchwork.kernel.org/cover/10961073/
[5] https://patchwork.kernel.org/cover/10983279/


Martin Blumenstingl (14):
  pwm: meson: unify the parameter list of meson_pwm_{enable,disable}
  pwm: meson: use devm_clk_get_optional() to get the input clock
  pwm: meson: use GENMASK and FIELD_PREP for the lo and hi values
  pwm: meson: change MISC_CLK_SEL_WIDTH to MISC_CLK_SEL_MASK
  pwm: meson: don't duplicate the polarity internally
  pwm: meson: pass struct pwm_device to meson_pwm_calc()
  pwm: meson: add the meson_pwm_channel data to struct meson_pwm
  pwm: meson: add the per-channel register offsets and bits in a struct
  pwm: meson: move pwm_set_chip_data() to meson_pwm_request()
  pwm: meson: simplify the calculation of the pre-divider and count
  pwm: meson: read the full hardware state in meson_pwm_get_state()
  pwm: meson: don't cache struct pwm_state internally
  pwm: meson: add support PWM_POLARITY_INVERSED when disabling
  pwm: meson: add documentation to the driver

 drivers/pwm/pwm-meson.c | 327 +---
 1 file changed, 173 insertions(+), 154 deletions(-)

-- 
2.22.0

[PATCH v3 03/14] pwm: meson: use GENMASK and FIELD_PREP for the lo and hi values

2019-06-12 Thread Martin Blumenstingl

meson_pwm_calc() ensures that "lo" is always less than 16 bits wide
(otherwise it would overflow into the "hi" part of the REG_PWM_{A,B}
register).
Use GENMASK and FIELD_PREP for the lo and hi values to make it easier to
spot how wide these are internally. Additionally this is a preparation
step for the .get_state() implementation where the GENMASK() for lo and
hi becomes handy because it can be used with FIELD_GET() to extract the
values from the REG_PWM_{A,B} register.

No functional changes intended.

Reviewed-by: Neil Armstrong 
Reviewed-by: Uwe Kleine-König 
Signed-off-by: Martin Blumenstingl 
---
 drivers/pwm/pwm-meson.c | 8 ++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/drivers/pwm/pwm-meson.c b/drivers/pwm/pwm-meson.c
index 35b38c7201c3..c62a3ac924d0 100644
--- a/drivers/pwm/pwm-meson.c
+++ b/drivers/pwm/pwm-meson.c
@@ -5,6 +5,8 @@
  * Copyright (C) 2014 Amlogic, Inc.
  */
 
+#include 
+#include 
 #include 
 #include 
 #include 
@@ -20,7 +22,8 @@
 
 #define REG_PWM_A  0x0
 #define REG_PWM_B  0x4
-#define PWM_HIGH_SHIFT 16
+#define PWM_LOW_MASK   GENMASK(15, 0)
+#define PWM_HIGH_MASK  GENMASK(31, 16)
 
 #define REG_MISC_AB0x8
 #define MISC_B_CLK_EN  BIT(23)
@@ -217,7 +220,8 @@ static void meson_pwm_enable(struct meson_pwm *meson, 
struct pwm_device *pwm)
value |= clk_enable;
writel(value, meson->base + REG_MISC_AB);
 
-   value = (channel->hi << PWM_HIGH_SHIFT) | channel->lo;
+   value = FIELD_PREP(PWM_HIGH_MASK, channel->hi) |
+   FIELD_PREP(PWM_LOW_MASK, channel->lo);
writel(value, meson->base + offset);
 
value = readl(meson->base + REG_MISC_AB);
-- 
2.22.0
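
To illustrate why the mask-based form is equivalent to the old shift-based one, here is a small userspace sketch. The helpers genmask32(), mask_shift(), field_prep() and field_get() are hypothetical simplified models of the kernel's GENMASK(), FIELD_PREP() and FIELD_GET() macros (32-bit only, no compile-time checks), and pwm_pack() is an assumed name for the expression changed in meson_pwm_enable().

```c
#include <assert.h>
#include <stdint.h>

/* Bits h..l set, inclusive, for 0 <= l <= h <= 31; models GENMASK(h, l). */
static inline uint32_t genmask32(unsigned int h, unsigned int l)
{
	return (~UINT32_C(0) >> (31 - h)) & (~UINT32_C(0) << l);
}

/* Position of the lowest set bit of a mask (0 for an empty mask). */
static inline unsigned int mask_shift(uint32_t mask)
{
	unsigned int shift = 0;

	while (mask && !(mask & 1)) {
		mask >>= 1;
		shift++;
	}
	return shift;
}

/* Move a value into the field described by mask; models FIELD_PREP(). */
static inline uint32_t field_prep(uint32_t mask, uint32_t val)
{
	return (val << mask_shift(mask)) & mask;
}

/* Extract the field described by mask; models FIELD_GET(). */
static inline uint32_t field_get(uint32_t mask, uint32_t reg)
{
	return (reg & mask) >> mask_shift(mask);
}

#define PWM_LOW_MASK	genmask32(15, 0)
#define PWM_HIGH_MASK	genmask32(31, 16)

/* Build the REG_PWM_{A,B} value from the two 16-bit counters. */
uint32_t pwm_pack(uint16_t hi, uint16_t lo)
{
	return field_prep(PWM_HIGH_MASK, hi) | field_prep(PWM_LOW_MASK, lo);
}
```

For 16-bit inputs the result matches ((uint32_t)hi << 16) | lo, and the same masks can be reused with field_get() to read the counters back, which is what the later .get_state() patch depends on.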

[PATCH v3 06/14] pwm: meson: pass struct pwm_device to meson_pwm_calc()

2019-06-12 Thread Martin Blumenstingl

meson_pwm_calc() is the last function that accepts a struct
meson_pwm_channel. meson_pwm_enable(), meson_pwm_disable() and
meson_pwm_apply() for example are all taking a struct pwm_device as
parameter. When they need the struct meson_pwm_channel these functions
simply call pwm_get_chip_data() internally.

Make meson_pwm_calc() consistent with the other functions in the
meson-pwm driver by passing struct pwm_device to it as well. The value
of the "id" parameter is actually pwm->hwpwm, but the driver never read
the "id" parameter, which is why there's no replacement for it in the
new code.

No functional changes.

Reviewed-by: Neil Armstrong 
Signed-off-by: Martin Blumenstingl 
---
 drivers/pwm/pwm-meson.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/pwm/pwm-meson.c b/drivers/pwm/pwm-meson.c
index 39ea119add7b..d6eb4d04d5c9 100644
--- a/drivers/pwm/pwm-meson.c
+++ b/drivers/pwm/pwm-meson.c
@@ -114,10 +114,10 @@ static void meson_pwm_free(struct pwm_chip *chip, struct 
pwm_device *pwm)
clk_disable_unprepare(channel->clk);
 }
 
-static int meson_pwm_calc(struct meson_pwm *meson,
- struct meson_pwm_channel *channel,
+static int meson_pwm_calc(struct meson_pwm *meson, struct pwm_device *pwm,
  struct pwm_state *state)
 {
+   struct meson_pwm_channel *channel = pwm_get_chip_data(pwm);
unsigned int duty, period, pre_div, cnt, duty_cnt;
unsigned long fin_freq = -1;
u64 fin_ps;
@@ -280,7 +280,7 @@ static int meson_pwm_apply(struct pwm_chip *chip, struct 
pwm_device *pwm,
if (state->period != channel->state.period ||
state->duty_cycle != channel->state.duty_cycle ||
state->polarity != channel->state.polarity) {
-   err = meson_pwm_calc(meson, channel, state);
+   err = meson_pwm_calc(meson, pwm, state);
if (err < 0)
return err;
 
-- 
2.22.0

[PATCH v3 02/14] pwm: meson: use devm_clk_get_optional() to get the input clock

2019-06-12 Thread Martin Blumenstingl

Simplify the code which fetches the input clock for a PWM channel by
using devm_clk_get_optional().
This comes with a small functional change: previously all errors except
EPROBE_DEFER were ignored. Now all other errors are also treated as
errors. If no input clock is present devm_clk_get_optional() will return
NULL instead of an error which matches the behavior of the old code.

Reviewed-by: Neil Armstrong 
Reviewed-by: Uwe Kleine-König 
Signed-off-by: Martin Blumenstingl 
---
 drivers/pwm/pwm-meson.c | 11 +++
 1 file changed, 3 insertions(+), 8 deletions(-)

diff --git a/drivers/pwm/pwm-meson.c b/drivers/pwm/pwm-meson.c
index 3fbbc4128ce8..35b38c7201c3 100644
--- a/drivers/pwm/pwm-meson.c
+++ b/drivers/pwm/pwm-meson.c
@@ -474,14 +474,9 @@ static int meson_pwm_init_channels(struct meson_pwm *meson,
 
snprintf(name, sizeof(name), "clkin%u", i);
 
-   channel->clk_parent = devm_clk_get(dev, name);
-   if (IS_ERR(channel->clk_parent)) {
-   err = PTR_ERR(channel->clk_parent);
-   if (err == -EPROBE_DEFER)
-   return err;
-
-   channel->clk_parent = NULL;
-   }
+   channel->clk_parent = devm_clk_get_optional(dev, name);
+   if (IS_ERR(channel->clk_parent))
+   return PTR_ERR(channel->clk_parent);
}
 
return 0;
-- 
2.22.0
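
The behaviour change described above can be modelled in userspace. This is only a sketch under stated assumptions: err_ptr(), ptr_err() and is_err() are simplified models of the kernel's ERR_PTR(), PTR_ERR() and IS_ERR(), while struct clk, clk_lookup() and clk_lookup_optional() are hypothetical stand-ins for the clock framework. The "optional" variant turns -ENOENT into NULL and passes every other error through.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX_ERRNO 4095

/* Simplified models of ERR_PTR(), PTR_ERR() and IS_ERR(). */
static inline void *err_ptr(long err)
{
	return (void *)(intptr_t)err;
}

static inline long ptr_err(const void *p)
{
	return (long)(intptr_t)p;
}

static inline int is_err(const void *p)
{
	return (uintptr_t)p >= (uintptr_t)-MAX_ERRNO;
}

/* Hypothetical clock object and lookup table. */
struct clk {
	const char *name;
};

static struct clk clkin0 = { .name = "clkin0" };

/* Strict lookup: a missing clock is reported as -ENOENT. */
struct clk *clk_lookup(const char *name)
{
	if (strcmp(name, "clkin0") == 0)
		return &clkin0;
	if (strcmp(name, "broken") == 0)
		return err_ptr(-EIO);	/* any error other than "not there" */
	return err_ptr(-ENOENT);
}

/* Optional lookup: "not there" becomes NULL, other errors are kept. */
struct clk *clk_lookup_optional(const char *name)
{
	struct clk *clk = clk_lookup(name);

	if (clk == err_ptr(-ENOENT))
		return NULL;
	return clk;
}
```

A caller then only needs a single is_err() check: a NULL result means "no input clock", and anything flagged by is_err() is a real failure to propagate.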

[PATCH v3 09/14] pwm: meson: move pwm_set_chip_data() to meson_pwm_request()

2019-06-12 Thread Martin Blumenstingl

All existing PWM drivers (except pwm-meson and two other ones) call
pwm_set_chip_data() from their pwm_ops.request() callback. Now that we
can access the struct meson_pwm_channel from struct meson_pwm we can do
the same.

Move the call to pwm_set_chip_data() to meson_pwm_request() and drop the
custom meson_pwm_add_channels(). This makes the implementation
consistent with other drivers and makes it slightly more obvious
that pwm_get_chip_data() cannot be used from pwm_ops.get_state() (because
that's called by the PWM core before pwm_ops.request()).

No functional changes intended.

Reviewed-by: Neil Armstrong 
Signed-off-by: Martin Blumenstingl 
---
 drivers/pwm/pwm-meson.c | 22 --
 1 file changed, 8 insertions(+), 14 deletions(-)

diff --git a/drivers/pwm/pwm-meson.c b/drivers/pwm/pwm-meson.c
index ac7e188155fd..27915d6475e3 100644
--- a/drivers/pwm/pwm-meson.c
+++ b/drivers/pwm/pwm-meson.c
@@ -98,12 +98,16 @@ static inline struct meson_pwm *to_meson_pwm(struct 
pwm_chip *chip)
 
 static int meson_pwm_request(struct pwm_chip *chip, struct pwm_device *pwm)
 {
-   struct meson_pwm_channel *channel = pwm_get_chip_data(pwm);
+   struct meson_pwm *meson = to_meson_pwm(chip);
+   struct meson_pwm_channel *channel;
struct device *dev = chip->dev;
int err;
 
-   if (!channel)
-   return -ENODEV;
+   channel = pwm_get_chip_data(pwm);
+   if (channel)
+   return 0;
+
+   channel = &meson->channels[pwm->hwpwm];
 
if (channel->clk_parent) {
err = clk_set_parent(channel->clk, channel->clk_parent);
@@ -124,7 +128,7 @@ static int meson_pwm_request(struct pwm_chip *chip, struct 
pwm_device *pwm)
 
chip->ops->get_state(chip, pwm, &channel->state);
 
-   return 0;
+   return pwm_set_chip_data(pwm, channel);
 }
 
 static void meson_pwm_free(struct pwm_chip *chip, struct pwm_device *pwm)
@@ -460,14 +464,6 @@ static int meson_pwm_init_channels(struct meson_pwm *meson)
return 0;
 }
 
-static void meson_pwm_add_channels(struct meson_pwm *meson)
-{
-   unsigned int i;
-
-   for (i = 0; i < meson->chip.npwm; i++)
-   pwm_set_chip_data(&meson->chip.pwms[i], &meson->channels[i]);
-}
-
 static int meson_pwm_probe(struct platform_device *pdev)
 {
struct meson_pwm *meson;
@@ -503,8 +499,6 @@ static int meson_pwm_probe(struct platform_device *pdev)
return err;
}
 
-   meson_pwm_add_channels(meson);
-
platform_set_drvdata(pdev, meson);
 
return 0;
-- 
2.22.0

[PATCH v3 04/14] pwm: meson: change MISC_CLK_SEL_WIDTH to MISC_CLK_SEL_MASK

2019-06-12 Thread Martin Blumenstingl

MISC_CLK_SEL_WIDTH is only used in one place where it's converted into
a bit-mask. Rename and change the macro to be a bit-mask so that
conversion is not needed anymore. No functional changes intended.

Reviewed-by: Neil Armstrong 
Signed-off-by: Martin Blumenstingl 
---
 drivers/pwm/pwm-meson.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/pwm/pwm-meson.c b/drivers/pwm/pwm-meson.c
index c62a3ac924d0..84b28ba0f903 100644
--- a/drivers/pwm/pwm-meson.c
+++ b/drivers/pwm/pwm-meson.c
@@ -33,7 +33,7 @@
 #define MISC_A_CLK_DIV_SHIFT   8
 #define MISC_B_CLK_SEL_SHIFT   6
 #define MISC_A_CLK_SEL_SHIFT   4
-#define MISC_CLK_SEL_WIDTH 2
+#define MISC_CLK_SEL_MASK  0x3
 #define MISC_B_EN  BIT(1)
 #define MISC_A_EN  BIT(0)
 
@@ -463,7 +463,7 @@ static int meson_pwm_init_channels(struct meson_pwm *meson,
 
channel->mux.reg = meson->base + REG_MISC_AB;
channel->mux.shift = mux_reg_shifts[i];
-   channel->mux.mask = BIT(MISC_CLK_SEL_WIDTH) - 1;
+   channel->mux.mask = MISC_CLK_SEL_MASK;
channel->mux.flags = 0;
channel->mux.lock = &meson->lock;
channel->mux.table = NULL;
-- 
2.22.0

[PATCH v3 11/14] pwm: meson: read the full hardware state in meson_pwm_get_state()

2019-06-12 Thread Martin Blumenstingl

Update the meson_pwm_get_state() implementation to take care of all
information in the registers instead of only reading the "enabled"
state.

The PWM output is only enabled if two conditions are met:
1. the per-channel clock is enabled
2. the PWM output is enabled

Calculate the PWM period and duty cycle using the reverse formula which
we already have in meson_pwm_calc() and update struct pwm_state with the
results.

As a result, /sys/kernel/debug/pwm now shows the PWM state set by
the bootloader (or firmware) after booting Linux.

Reviewed-by: Neil Armstrong 
Signed-off-by: Martin Blumenstingl 
---
 drivers/pwm/pwm-meson.c | 52 ++---
 1 file changed, 49 insertions(+), 3 deletions(-)

diff --git a/drivers/pwm/pwm-meson.c b/drivers/pwm/pwm-meson.c
index 9afa1e5aaebf..010212166d5d 100644
--- a/drivers/pwm/pwm-meson.c
+++ b/drivers/pwm/pwm-meson.c
@@ -287,19 +287,65 @@ static int meson_pwm_apply(struct pwm_chip *chip, struct 
pwm_device *pwm,
return 0;
 }
 
+static unsigned int meson_pwm_cnt_to_ns(struct pwm_chip *chip,
+   struct pwm_device *pwm, u32 cnt)
+{
+   struct meson_pwm *meson = to_meson_pwm(chip);
+   struct meson_pwm_channel *channel;
+   unsigned long fin_freq;
+   u32 fin_ns;
+
+   /* to_meson_pwm() can only be used after .get_state() is called */
+   channel = &meson->channels[pwm->hwpwm];
+
+   fin_freq = clk_get_rate(channel->clk);
+   if (fin_freq == 0)
+   return 0;
+
+   fin_ns = div_u64(NSEC_PER_SEC, fin_freq);
+
+   return cnt * fin_ns * (channel->pre_div + 1);
+}
+
 static void meson_pwm_get_state(struct pwm_chip *chip, struct pwm_device *pwm,
struct pwm_state *state)
 {
struct meson_pwm *meson = to_meson_pwm(chip);
-   u32 value, mask;
+   struct meson_pwm_channel_data *channel_data;
+   struct meson_pwm_channel *channel;
+   u32 value, tmp;
 
if (!state)
return;
 
-   mask = meson_pwm_per_channel_data[pwm->hwpwm].pwm_en_mask;
+   channel = &meson->channels[pwm->hwpwm];
+   channel_data = &meson_pwm_per_channel_data[pwm->hwpwm];
 
value = readl(meson->base + REG_MISC_AB);
-   state->enabled = (value & mask) != 0;
+
+   tmp = channel_data->pwm_en_mask | channel_data->clk_en_mask;
+   state->enabled = (value & tmp) == tmp;
+
+   tmp = value >> channel_data->clk_div_shift;
+   channel->pre_div = FIELD_GET(MISC_CLK_DIV_MASK, tmp);
+
+   value = readl(meson->base + channel_data->reg_offset);
+
+   channel->lo = FIELD_GET(PWM_LOW_MASK, value);
+   channel->hi = FIELD_GET(PWM_HIGH_MASK, value);
+
+   if (channel->lo == 0) {
+   state->period = meson_pwm_cnt_to_ns(chip, pwm, channel->hi);
+   state->duty_cycle = state->period;
+   } else if (channel->lo >= channel->hi) {
+   state->period = meson_pwm_cnt_to_ns(chip, pwm,
+   channel->lo + channel->hi);
+   state->duty_cycle = meson_pwm_cnt_to_ns(chip, pwm,
+   channel->hi);
+   } else {
+   state->period = 0;
+   state->duty_cycle = 0;
+   }
 }
 
 static const struct pwm_ops meson_pwm_ops = {
-- 
2.22.0
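
The reverse calculation can be checked with plain integer arithmetic. The sketch below mirrors the conversion and the branching of the patch as posted; it makes no claim about how the hardware counters should be interpreted beyond that. cnt_to_ns() and pwm_decode() are hypothetical names, the clock rate is passed in as a parameter instead of being read with clk_get_rate(), and 64-bit results are used here to keep the sketch free of overflow.

```c
#include <assert.h>
#include <stdint.h>

#define NSEC_PER_SEC UINT64_C(1000000000)

/*
 * Convert a counter value to nanoseconds. The input period is truncated
 * to whole nanoseconds first, as in the patch, so a 24 MHz input clock
 * is treated as 41 ns per tick rather than 41.67 ns.
 */
uint64_t cnt_to_ns(unsigned long fin_freq, uint32_t pre_div, uint32_t cnt)
{
	uint64_t fin_ns;

	if (fin_freq == 0)
		return 0;

	fin_ns = NSEC_PER_SEC / fin_freq;

	return (uint64_t)cnt * fin_ns * (pre_div + 1);
}

struct pwm_readout {
	uint64_t period_ns;
	uint64_t duty_ns;
};

/* Derive period and duty cycle from the hi/lo counters read back. */
struct pwm_readout pwm_decode(unsigned long fin_freq, uint32_t pre_div,
			      uint32_t hi, uint32_t lo)
{
	struct pwm_readout out = { 0, 0 };

	if (lo == 0) {
		/* Output is constantly high: duty cycle equals period. */
		out.period_ns = cnt_to_ns(fin_freq, pre_div, hi);
		out.duty_ns = out.period_ns;
	} else if (lo >= hi) {
		out.period_ns = cnt_to_ns(fin_freq, pre_div, lo + hi);
		out.duty_ns = cnt_to_ns(fin_freq, pre_div, hi);
	}
	/* Any other combination is reported as 0/0, as in the patch. */

	return out;
}
```

With a 1 MHz input clock and pre_div = 0, each count is 1000 ns, so hi = 3 and lo = 7 decode to a 10000 ns period with a 3000 ns duty cycle.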

[PATCH v3 07/14] pwm: meson: add the meson_pwm_channel data to struct meson_pwm

2019-06-12 Thread Martin Blumenstingl

Make struct meson_pwm_channel accessible from struct meson_pwm.

PWM core has a limitation: per-channel data can only be set after
pwmchip_add() is called. However, pwmchip_add() internally calls
pwm_ops.get_state(). If pwm_ops.get_state() needs access to the
per-channel data it has to obtain it from struct pwm_chip and struct
pwm_device's hwpwm information.

Add a struct meson_pwm_channel for each PWM channel to struct meson_pwm
so the pwm_ops.get_state() callback can be implemented as it needs
access to the clock from struct meson_pwm_channel.

Reviewed-by: Neil Armstrong 
Signed-off-by: Martin Blumenstingl 
---
 drivers/pwm/pwm-meson.c | 25 ++---
 1 file changed, 10 insertions(+), 15 deletions(-)

diff --git a/drivers/pwm/pwm-meson.c b/drivers/pwm/pwm-meson.c
index d6eb4d04d5c9..a4ae3587a3ce 100644
--- a/drivers/pwm/pwm-meson.c
+++ b/drivers/pwm/pwm-meson.c
@@ -37,6 +37,8 @@
 #define MISC_B_EN  BIT(1)
 #define MISC_A_EN  BIT(0)
 
+#define MESON_NUM_PWMS 2
+
 static const unsigned int mux_reg_shifts[] = {
MISC_A_CLK_SEL_SHIFT,
MISC_B_CLK_SEL_SHIFT
@@ -62,6 +64,7 @@ struct meson_pwm_data {
 struct meson_pwm {
struct pwm_chip chip;
const struct meson_pwm_data *data;
+   struct meson_pwm_channel channels[MESON_NUM_PWMS];
void __iomem *base;
/*
 * Protects register (write) access to the REG_MISC_AB register
@@ -435,8 +438,7 @@ static const struct of_device_id meson_pwm_matches[] = {
 };
 MODULE_DEVICE_TABLE(of, meson_pwm_matches);
 
-static int meson_pwm_init_channels(struct meson_pwm *meson,
-  struct meson_pwm_channel *channels)
+static int meson_pwm_init_channels(struct meson_pwm *meson)
 {
struct device *dev = meson->chip.dev;
struct clk_init_data init;
@@ -445,7 +447,7 @@ static int meson_pwm_init_channels(struct meson_pwm *meson,
int err;
 
for (i = 0; i < meson->chip.npwm; i++) {
-   struct meson_pwm_channel *channel = &channels[i];
+   struct meson_pwm_channel *channel = &meson->channels[i];
 
snprintf(name, sizeof(name), "%s#mux%u", dev_name(dev), i);
 
@@ -480,18 +482,16 @@ static int meson_pwm_init_channels(struct meson_pwm 
*meson,
return 0;
 }
 
-static void meson_pwm_add_channels(struct meson_pwm *meson,
-  struct meson_pwm_channel *channels)
+static void meson_pwm_add_channels(struct meson_pwm *meson)
 {
unsigned int i;
 
for (i = 0; i < meson->chip.npwm; i++)
-   pwm_set_chip_data(&meson->chip.pwms[i], &channels[i]);
+   pwm_set_chip_data(&meson->chip.pwms[i], &meson->channels[i]);
 }
 
 static int meson_pwm_probe(struct platform_device *pdev)
 {
-   struct meson_pwm_channel *channels;
struct meson_pwm *meson;
struct resource *regs;
int err;
@@ -509,18 +509,13 @@ static int meson_pwm_probe(struct platform_device *pdev)
meson->chip.dev = &pdev->dev;
meson->chip.ops = &meson_pwm_ops;
meson->chip.base = -1;
-   meson->chip.npwm = 2;
+   meson->chip.npwm = MESON_NUM_PWMS;
meson->chip.of_xlate = of_pwm_xlate_with_flags;
meson->chip.of_pwm_n_cells = 3;
 
meson->data = of_device_get_match_data(&pdev->dev);
 
-   channels = devm_kcalloc(&pdev->dev, meson->chip.npwm,
-   sizeof(*channels), GFP_KERNEL);
-   if (!channels)
-   return -ENOMEM;
-
-   err = meson_pwm_init_channels(meson, channels);
+   err = meson_pwm_init_channels(meson);
if (err < 0)
return err;
 
@@ -530,7 +525,7 @@ static int meson_pwm_probe(struct platform_device *pdev)
return err;
}
 
-   meson_pwm_add_channels(meson, channels);
+   meson_pwm_add_channels(meson);
 
platform_set_drvdata(pdev, meson);
 
-- 
2.22.0

[PATCH v3 08/14] pwm: meson: add the per-channel register offsets and bits in a struct

2019-06-12 Thread Martin Blumenstingl

Introduce struct meson_pwm_channel_data which contains the per-channel
offsets for the PWM register and REG_MISC_AB bits. Replace the existing
switch (pwm->hwpwm) statements with an access to the new struct.

This simplifies the code and will make it easier to implement
pwm_ops.get_state() because the switch-case which selects all per-channel
registers and offsets (as previously implemented in meson_pwm_enable())
doesn't have to be duplicated.

No functional changes intended.

Reviewed-by: Neil Armstrong 
Signed-off-by: Martin Blumenstingl 
---
 drivers/pwm/pwm-meson.c | 90 -
 1 file changed, 34 insertions(+), 56 deletions(-)

diff --git a/drivers/pwm/pwm-meson.c b/drivers/pwm/pwm-meson.c
index a4ae3587a3ce..ac7e188155fd 100644
--- a/drivers/pwm/pwm-meson.c
+++ b/drivers/pwm/pwm-meson.c
@@ -39,9 +39,27 @@
 
 #define MESON_NUM_PWMS 2
 
-static const unsigned int mux_reg_shifts[] = {
-   MISC_A_CLK_SEL_SHIFT,
-   MISC_B_CLK_SEL_SHIFT
+static struct meson_pwm_channel_data {
+   u8  reg_offset;
+   u8  clk_sel_shift;
+   u8  clk_div_shift;
+   u32 clk_en_mask;
+   u32 pwm_en_mask;
+} meson_pwm_per_channel_data[MESON_NUM_PWMS] = {
+   {
+   .reg_offset = REG_PWM_A,
+   .clk_sel_shift  = MISC_A_CLK_SEL_SHIFT,
+   .clk_div_shift  = MISC_A_CLK_DIV_SHIFT,
+   .clk_en_mask= MISC_A_CLK_EN,
+   .pwm_en_mask= MISC_A_EN,
+   },
+   {
+   .reg_offset = REG_PWM_B,
+   .clk_sel_shift  = MISC_B_CLK_SEL_SHIFT,
+   .clk_div_shift  = MISC_B_CLK_DIV_SHIFT,
+   .clk_en_mask= MISC_B_CLK_EN,
+   .pwm_en_mask= MISC_B_EN,
+   }
 };
 
 struct meson_pwm_channel {
@@ -194,43 +212,26 @@ static int meson_pwm_calc(struct meson_pwm *meson, struct 
pwm_device *pwm,
 static void meson_pwm_enable(struct meson_pwm *meson, struct pwm_device *pwm)
 {
struct meson_pwm_channel *channel = pwm_get_chip_data(pwm);
-   u32 value, clk_shift, clk_enable, enable;
-   unsigned int offset;
+   struct meson_pwm_channel_data *channel_data;
unsigned long flags;
+   u32 value;
 
-   switch (pwm->hwpwm) {
-   case 0:
-   clk_shift = MISC_A_CLK_DIV_SHIFT;
-   clk_enable = MISC_A_CLK_EN;
-   enable = MISC_A_EN;
-   offset = REG_PWM_A;
-   break;
-
-   case 1:
-   clk_shift = MISC_B_CLK_DIV_SHIFT;
-   clk_enable = MISC_B_CLK_EN;
-   enable = MISC_B_EN;
-   offset = REG_PWM_B;
-   break;
-
-   default:
-   return;
-   }
+   channel_data = &meson_pwm_per_channel_data[pwm->hwpwm];
 
spin_lock_irqsave(&meson->lock, flags);
 
value = readl(meson->base + REG_MISC_AB);
-   value &= ~(MISC_CLK_DIV_MASK << clk_shift);
-   value |= channel->pre_div << clk_shift;
-   value |= clk_enable;
+   value &= ~(MISC_CLK_DIV_MASK << channel_data->clk_div_shift);
+   value |= channel->pre_div << channel_data->clk_div_shift;
+   value |= channel_data->clk_en_mask;
writel(value, meson->base + REG_MISC_AB);
 
value = FIELD_PREP(PWM_HIGH_MASK, channel->hi) |
FIELD_PREP(PWM_LOW_MASK, channel->lo);
-   writel(value, meson->base + offset);
+   writel(value, meson->base + channel_data->reg_offset);
 
value = readl(meson->base + REG_MISC_AB);
-   value |= enable;
+   value |= channel_data->pwm_en_mask;
writel(value, meson->base + REG_MISC_AB);
 
spin_unlock_irqrestore(&meson->lock, flags);
@@ -238,26 +239,13 @@ static void meson_pwm_enable(struct meson_pwm *meson, 
struct pwm_device *pwm)
 
 static void meson_pwm_disable(struct meson_pwm *meson, struct pwm_device *pwm)
 {
-   u32 value, enable;
unsigned long flags;
-
-   switch (pwm->hwpwm) {
-   case 0:
-   enable = MISC_A_EN;
-   break;
-
-   case 1:
-   enable = MISC_B_EN;
-   break;
-
-   default:
-   return;
-   }
+   u32 value;
 
spin_lock_irqsave(&meson->lock, flags);
 
value = readl(meson->base + REG_MISC_AB);
-   value &= ~enable;
+   value &= ~meson_pwm_per_channel_data[pwm->hwpwm].pwm_en_mask;
writel(value, meson->base + REG_MISC_AB);
 
spin_unlock_irqrestore(&meson->lock, flags);
@@ -309,18 +297,7 @@ static void meson_pwm_get_state(struct pwm_chip *chip, 
struct pwm_device *pwm,
if (!state)
return;
 
-   switch (pwm->hwpwm) {
-   case 0:
-   mask = MISC_A_EN;
-   break;
-
-   case 1:
-   mask = MISC_B_EN;
-   break;
-
-   default:
-   return;
-   }
+   mask = meson_pwm_per_channel_data[pwm->hwpwm].pwm_en_mask;
 
value = readl(meson->base +
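
As a rough userspace illustration of the table-driven lookup this patch describes, the sketch below replaces a per-channel switch with an indexed array. It is not the driver code: the register image is a plain integer, the two read-modify-write steps of meson_pwm_enable() are collapsed into one, and every ex_/EX_ name and bit position is a placeholder chosen for the example.

```c
#include <stdint.h>

#define EX_NUM_PWMS      2
#define EX_CLK_DIV_MASK  0x7fu  /* placeholder width of the divider field */

/* Per-channel constants; the values are illustrative placeholders. */
struct ex_channel_data {
    uint8_t  clk_div_shift;
    uint32_t clk_en_mask;
    uint32_t pwm_en_mask;
};

static const struct ex_channel_data ex_per_channel[EX_NUM_PWMS] = {
    { .clk_div_shift = 8,  .clk_en_mask = 1u << 15, .pwm_en_mask = 1u << 0 },
    { .clk_div_shift = 16, .clk_en_mask = 1u << 23, .pwm_en_mask = 1u << 1 },
};

/* Return the shared register value after enabling channel hwpwm. */
static uint32_t ex_enable(uint32_t misc, unsigned int hwpwm, uint8_t pre_div)
{
    const struct ex_channel_data *d = &ex_per_channel[hwpwm];

    /* Clear the old divider, then set the new divider and both enables. */
    misc &= ~(EX_CLK_DIV_MASK << d->clk_div_shift);
    misc |= (uint32_t)pre_div << d->clk_div_shift;
    misc |= d->clk_en_mask | d->pwm_en_mask;
    return misc;
}

/* Return the shared register value after disabling channel hwpwm. */
static uint32_t ex_disable(uint32_t misc, unsigned int hwpwm)
{
    return misc & ~ex_per_channel[hwpwm].pwm_en_mask;
}
```

The point of the layout is that adding a field (or a get_state() reader) only needs another array access, not another copy of the switch.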

[PATCH v3 05/14] pwm: meson: don't duplicate the polarity internally

2019-06-12 Thread Martin Blumenstingl

Let meson_pwm_calc() use the polarity from struct pwm_state directly.
This removes a level of indirection where meson_pwm_apply() first had to
set a driver-internal inverter mask which was then only used by
meson_pwm_calc().

Instead of adding the polarity as parameter to meson_pwm_calc() switch
to struct pwm_state directly to make it easier to see where the
parameters are actually coming from.

Reviewed-by: Neil Armstrong 
Signed-off-by: Martin Blumenstingl 
---
 drivers/pwm/pwm-meson.c | 23 ---
 1 file changed, 8 insertions(+), 15 deletions(-)

diff --git a/drivers/pwm/pwm-meson.c b/drivers/pwm/pwm-meson.c
index 84b28ba0f903..39ea119add7b 100644
--- a/drivers/pwm/pwm-meson.c
+++ b/drivers/pwm/pwm-meson.c
@@ -63,7 +63,6 @@ struct meson_pwm {
struct pwm_chip chip;
const struct meson_pwm_data *data;
void __iomem *base;
-   u8 inverter_mask;
/*
 * Protects register (write) access to the REG_MISC_AB register
 * that is shared between the two PWMs.
@@ -116,14 +115,17 @@ static void meson_pwm_free(struct pwm_chip *chip, struct 
pwm_device *pwm)
 }
 
 static int meson_pwm_calc(struct meson_pwm *meson,
- struct meson_pwm_channel *channel, unsigned int id,
- unsigned int duty, unsigned int period)
+ struct meson_pwm_channel *channel,
+ struct pwm_state *state)
 {
-   unsigned int pre_div, cnt, duty_cnt;
+   unsigned int duty, period, pre_div, cnt, duty_cnt;
unsigned long fin_freq = -1;
u64 fin_ps;
 
-   if (~(meson->inverter_mask >> id) & 0x1)
+   duty = state->duty_cycle;
+   period = state->period;
+
+   if (state->polarity == PWM_POLARITY_INVERSED)
duty = period - duty;
 
if (period == channel->state.period &&
@@ -278,15 +280,7 @@ static int meson_pwm_apply(struct pwm_chip *chip, struct 
pwm_device *pwm,
if (state->period != channel->state.period ||
state->duty_cycle != channel->state.duty_cycle ||
state->polarity != channel->state.polarity) {
-   if (state->polarity != channel->state.polarity) {
-   if (state->polarity == PWM_POLARITY_NORMAL)
-   meson->inverter_mask |= BIT(pwm->hwpwm);
-   else
-   meson->inverter_mask &= ~BIT(pwm->hwpwm);
-   }
-
-   err = meson_pwm_calc(meson, channel, pwm->hwpwm,
-state->duty_cycle, state->period);
+   err = meson_pwm_calc(meson, channel, state);
if (err < 0)
return err;
 
@@ -520,7 +514,6 @@ static int meson_pwm_probe(struct platform_device *pdev)
meson->chip.of_pwm_n_cells = 3;
 
meson->data = of_device_get_match_data(&pdev->dev);
-   meson->inverter_mask = BIT(meson->chip.npwm) - 1;
 
channels = devm_kcalloc(&pdev->dev, meson->chip.npwm,
sizeof(*channels), GFP_KERNEL);
-- 
2.22.0
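
For readers following along, the polarity handling that this patch moves into meson_pwm_calc() boils down to one subtraction. A minimal standalone sketch, assuming simplified stand-ins (struct ex_state, enum ex_polarity) for the kernel's struct pwm_state and its polarity enum:

```c
/* Simplified stand-ins for the kernel types; names are hypothetical. */
enum ex_polarity { EX_POLARITY_NORMAL, EX_POLARITY_INVERSED };

struct ex_state {
    unsigned int period;      /* in ns */
    unsigned int duty_cycle;  /* in ns */
    enum ex_polarity polarity;
};

/*
 * The hardware has no polarity bit, so an inverted signal is produced
 * by swapping the high and low portions of the period.
 */
static unsigned int ex_effective_duty(const struct ex_state *state)
{
    unsigned int duty = state->duty_cycle;

    if (state->polarity == EX_POLARITY_INVERSED)
        duty = state->period - duty;

    return duty;
}
```

Reading the polarity straight from the requested state means there is no second copy of it that could go stale.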

[PATCH v3 12/14] pwm: meson: don't cache struct pwm_state internally

2019-06-12 Thread Martin Blumenstingl

The PWM core already caches the "current struct pwm_state" as the
"current state of the hardware registers" inside struct pwm_device.

Drop the struct pwm_state from struct meson_pwm_channel in favour of the
struct pwm_state in struct pwm_device. While here also drop any checks
based on the pwm_state because the PWM core already takes care of this.

No functional changes intended.

Reviewed-by: Neil Armstrong 
Signed-off-by: Martin Blumenstingl 
---
 drivers/pwm/pwm-meson.c | 25 +
 1 file changed, 1 insertion(+), 24 deletions(-)

diff --git a/drivers/pwm/pwm-meson.c b/drivers/pwm/pwm-meson.c
index 010212166d5d..900d362ec3c9 100644
--- a/drivers/pwm/pwm-meson.c
+++ b/drivers/pwm/pwm-meson.c
@@ -68,8 +68,6 @@ struct meson_pwm_channel {
unsigned int lo;
u8 pre_div;
 
-   struct pwm_state state;
-
struct clk *clk_parent;
struct clk_mux mux;
struct clk *clk;
@@ -127,8 +125,6 @@ static int meson_pwm_request(struct pwm_chip *chip, struct 
pwm_device *pwm)
return err;
}
 
-   chip->ops->get_state(chip, pwm, &channel->state);
-
return pwm_set_chip_data(pwm, channel);
 }
 
@@ -153,10 +149,6 @@ static int meson_pwm_calc(struct meson_pwm *meson, struct 
pwm_device *pwm,
if (state->polarity == PWM_POLARITY_INVERSED)
duty = period - duty;
 
-   if (period == channel->state.period &&
-   duty == channel->state.duty_cycle)
-   return 0;
-
fin_freq = clk_get_rate(channel->clk);
if (fin_freq == 0) {
dev_err(meson->chip.dev, "invalid source clock frequency\n");
@@ -253,7 +245,6 @@ static void meson_pwm_disable(struct meson_pwm *meson, 
struct pwm_device *pwm)
 static int meson_pwm_apply(struct pwm_chip *chip, struct pwm_device *pwm,
   struct pwm_state *state)
 {
-   struct meson_pwm_channel *channel = pwm_get_chip_data(pwm);
struct meson_pwm *meson = to_meson_pwm(chip);
int err = 0;
 
@@ -262,26 +253,12 @@ static int meson_pwm_apply(struct pwm_chip *chip, struct 
pwm_device *pwm,
 
if (!state->enabled) {
meson_pwm_disable(meson, pwm);
-   channel->state.enabled = false;
-
-   return 0;
-   }
-
-   if (state->period != channel->state.period ||
-   state->duty_cycle != channel->state.duty_cycle ||
-   state->polarity != channel->state.polarity) {
+   } else {
err = meson_pwm_calc(meson, pwm, state);
if (err < 0)
return err;
 
-   channel->state.polarity = state->polarity;
-   channel->state.period = state->period;
-   channel->state.duty_cycle = state->duty_cycle;
-   }
-
-   if (state->enabled && !channel->state.enabled) {
meson_pwm_enable(meson, pwm);
-   channel->state.enabled = true;
}
 
return 0;
-- 
2.22.0

[PATCH v3 01/14] pwm: meson: unify the parameter list of meson_pwm_{enable,disable}

2019-06-12 Thread Martin Blumenstingl

This is a preparation for a future cleanup. Pass struct pwm_device
instead of passing the individual values required by each function as
these can be obtained for each struct pwm_device instance.

As a nice side-effect the driver now uses "switch (pwm->hwpwm)"
everywhere. Before some functions used "switch (id)" while others used
"switch (pwm->hwpwm)".

No functional changes.

Reviewed-by: Neil Armstrong 
Reviewed-by: Uwe Kleine-König 
Signed-off-by: Martin Blumenstingl 
---
 drivers/pwm/pwm-meson.c | 15 +++
 1 file changed, 7 insertions(+), 8 deletions(-)

diff --git a/drivers/pwm/pwm-meson.c b/drivers/pwm/pwm-meson.c
index 5fef7e925282..3fbbc4128ce8 100644
--- a/drivers/pwm/pwm-meson.c
+++ b/drivers/pwm/pwm-meson.c
@@ -183,15 +183,14 @@ static int meson_pwm_calc(struct meson_pwm *meson,
return 0;
 }
 
-static void meson_pwm_enable(struct meson_pwm *meson,
-struct meson_pwm_channel *channel,
-unsigned int id)
+static void meson_pwm_enable(struct meson_pwm *meson, struct pwm_device *pwm)
 {
+   struct meson_pwm_channel *channel = pwm_get_chip_data(pwm);
u32 value, clk_shift, clk_enable, enable;
unsigned int offset;
unsigned long flags;
 
-   switch (id) {
+   switch (pwm->hwpwm) {
case 0:
clk_shift = MISC_A_CLK_DIV_SHIFT;
clk_enable = MISC_A_CLK_EN;
@@ -228,12 +227,12 @@ static void meson_pwm_enable(struct meson_pwm *meson,
spin_unlock_irqrestore(&meson->lock, flags);
 }
 
-static void meson_pwm_disable(struct meson_pwm *meson, unsigned int id)
+static void meson_pwm_disable(struct meson_pwm *meson, struct pwm_device *pwm)
 {
u32 value, enable;
unsigned long flags;
 
-   switch (id) {
+   switch (pwm->hwpwm) {
case 0:
enable = MISC_A_EN;
break;
@@ -266,7 +265,7 @@ static int meson_pwm_apply(struct pwm_chip *chip, struct 
pwm_device *pwm,
return -EINVAL;
 
if (!state->enabled) {
-   meson_pwm_disable(meson, pwm->hwpwm);
+   meson_pwm_disable(meson, pwm);
channel->state.enabled = false;
 
return 0;
@@ -293,7 +292,7 @@ static int meson_pwm_apply(struct pwm_chip *chip, struct 
pwm_device *pwm,
}
 
if (state->enabled && !channel->state.enabled) {
-   meson_pwm_enable(meson, channel, pwm->hwpwm);
+   meson_pwm_enable(meson, pwm);
channel->state.enabled = true;
}
 
-- 
2.22.0

[PATCH v3 14/14] pwm: meson: add documentation to the driver

2019-06-12 Thread Martin Blumenstingl

Add links to the datasheet and a short summary of how the hardware works.
The goal is to make it easier for other developers to understand why the
pwm-meson driver is implemented the way it is.

Suggested-by: Uwe Kleine-König 
Co-authored-by: Neil Armstrong 
Reviewed-by: Neil Armstrong 
Signed-off-by: Martin Blumenstingl 
---
 drivers/pwm/pwm-meson.c | 22 ++
 1 file changed, 22 insertions(+)

diff --git a/drivers/pwm/pwm-meson.c b/drivers/pwm/pwm-meson.c
index bb48ba85f756..31259026484c 100644
--- a/drivers/pwm/pwm-meson.c
+++ b/drivers/pwm/pwm-meson.c
@@ -1,5 +1,27 @@
 // SPDX-License-Identifier: GPL-2.0 OR BSD-3-Clause
 /*
+ * PWM controller driver for Amlogic Meson SoCs.
+ *
+ * This PWM is only a set of Gates, Dividers and Counters:
+ * PWM output is achieved by calculating a clock that permits calculating
+ * two periods (low and high). The counter then has to be set to switch after
+ * N cycles for the first half period.
+ * The hardware has no "polarity" setting. This driver reverses the period
+ * cycles (the low length is inverted with the high length) for
+ * PWM_POLARITY_INVERSED. This means that .get_state cannot read the polarity
+ * from the hardware.
+ * Setting the duty cycle will disable and re-enable the PWM output.
+ * Disabling the PWM stops the output immediately (without waiting for the
+ * current period to complete first).
+ *
+ * The public S912 (GXM) datasheet contains some documentation for this PWM
+ * controller starting on page 543:
+ * 
https://dl.khadas.com/Hardware/VIM2/Datasheet/S912_Datasheet_V0.220170314publicversion-Wesion.pdf
+ * An updated version of this IP block is found in S922X (G12B) SoCs. The
+ * datasheet contains the description for this IP block revision starting at
+ * page 1084:
+ * 
https://dn.odroid.com/S922X/ODROID-N2/Datasheet/S922X_Public_Datasheet_V0.2.pdf
+ *
  * Copyright (c) 2016 BayLibre, SAS.
  * Author: Neil Armstrong 
  * Copyright (C) 2014 Amlogic, Inc.
-- 
2.22.0

[PATCH v3 10/14] pwm: meson: simplify the calculation of the pre-divider and count

2019-06-12 Thread Martin Blumenstingl

Replace the loop to calculate the pre-divider and count with two
separate div64_u64() calculations. This makes the code easier to read
and improves the precision.

Three example cases:
1) 32.768kHz LPO clock for the SDIO wifi chip on Khadas VIM
   clock input: 500MHz (FCLK_DIV4)
   period: 30518ns
   duty cycle: 15259ns
old algorithm: pre_div=0, cnt=15259
new algorithm: pre_div=0, cnt=15259
(no difference in calculated values)

2) PWM LED on Khadas VIM
   clock input: 24MHz (XTAL)
   period: 7812500ns
   duty cycle: 7812500ns
old algorithm: pre_div=2, cnt=62004
new algorithm: pre_div=2, cnt=62500
Using a scope (24MHz sampling rate) shows the actual difference:
- old: 7753000ns, off by -59500ns (0.7616%)
- new: 7815000ns, off by +2500ns (0.032%)

3) Theoretical case where pre_div is different
   clock input: 24MHz (XTAL)
   period: 2730624ns
   duty cycle: 1365312ns
old algorithm: pre_div=1, cnt=32768
new algorithm: pre_div=0, cnt=65534
Using a scope (24MHz sampling rate) shows the actual difference:
- old: 2731000ns
- new: 2731000ns
(my scope is not precise enough to measure the difference if there's
any)

Suggested-by: Uwe Kleine-König 
Acked-by: Uwe Kleine-König 
Reviewed-by: Neil Armstrong 
Signed-off-by: Martin Blumenstingl 
---
 drivers/pwm/pwm-meson.c | 25 ++---
 1 file changed, 10 insertions(+), 15 deletions(-)

diff --git a/drivers/pwm/pwm-meson.c b/drivers/pwm/pwm-meson.c
index 27915d6475e3..9afa1e5aaebf 100644
--- a/drivers/pwm/pwm-meson.c
+++ b/drivers/pwm/pwm-meson.c
@@ -12,6 +12,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -145,7 +146,6 @@ static int meson_pwm_calc(struct meson_pwm *meson, struct 
pwm_device *pwm,
struct meson_pwm_channel *channel = pwm_get_chip_data(pwm);
unsigned int duty, period, pre_div, cnt, duty_cnt;
unsigned long fin_freq = -1;
-   u64 fin_ps;
 
duty = state->duty_cycle;
period = state->period;
@@ -164,24 +164,19 @@ static int meson_pwm_calc(struct meson_pwm *meson, struct 
pwm_device *pwm,
}
 
dev_dbg(meson->chip.dev, "fin_freq: %lu Hz\n", fin_freq);
-   fin_ps = (u64)NSEC_PER_SEC * 1000;
-   do_div(fin_ps, fin_freq);
-
-   /* Calc pre_div with the period */
-   for (pre_div = 0; pre_div <= MISC_CLK_DIV_MASK; pre_div++) {
-   cnt = DIV_ROUND_CLOSEST_ULL((u64)period * 1000,
-   fin_ps * (pre_div + 1));
-   dev_dbg(meson->chip.dev, "fin_ps=%llu pre_div=%u cnt=%u\n",
-   fin_ps, pre_div, cnt);
-   if (cnt <= 0xffff)
-   break;
-   }
 
+   pre_div = div64_u64(fin_freq * (u64)period, NSEC_PER_SEC * 0xffffLL);
if (pre_div > MISC_CLK_DIV_MASK) {
dev_err(meson->chip.dev, "unable to get period pre_div\n");
return -EINVAL;
}
 
+   cnt = div64_u64(fin_freq * (u64)period, NSEC_PER_SEC * (pre_div + 1));
+   if (cnt > 0xffff) {
+   dev_err(meson->chip.dev, "unable to get period cnt\n");
+   return -EINVAL;
+   }
+
dev_dbg(meson->chip.dev, "period=%u pre_div=%u cnt=%u\n", period,
pre_div, cnt);
 
@@ -195,8 +190,8 @@ static int meson_pwm_calc(struct meson_pwm *meson, struct 
pwm_device *pwm,
channel->lo = cnt;
} else {
/* Then check is we can have the duty with the same pre_div */
-   duty_cnt = DIV_ROUND_CLOSEST_ULL((u64)duty * 1000,
-fin_ps * (pre_div + 1));
+   duty_cnt = div64_u64(fin_freq * (u64)duty,
+NSEC_PER_SEC * (pre_div + 1));
if (duty_cnt > 0xffff) {
dev_err(meson->chip.dev, "unable to get duty cycle\n");
return -EINVAL;
-- 
2.22.0
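
The two-division calculation from the commit message above can be checked against its three example cases with a small userspace sketch. This assumes a 16-bit counter and a 7-bit pre-divider field; the ex_/EX_ names are hypothetical, and plain 64-bit division stands in for div64_u64().

```c
#include <stdint.h>

#define EX_NSEC_PER_SEC  1000000000ULL
#define EX_CNT_MAX       0xffffULL  /* 16-bit counter */
#define EX_PRE_DIV_MAX   0x7fULL    /* assumed 7-bit divider field */

/*
 * Compute the pre-divider and period count for an input clock of
 * fin_freq Hz and a period in nanoseconds. Returns 0 on success and
 * -1 if the period cannot be represented.
 */
static int ex_calc(uint64_t fin_freq, uint64_t period_ns,
                   unsigned int *pre_div, unsigned int *cnt)
{
    uint64_t ticks_ns = fin_freq * period_ns;
    uint64_t p, c;

    /* Smallest divider for which the count fits into 16 bits. */
    p = ticks_ns / (EX_NSEC_PER_SEC * EX_CNT_MAX);
    if (p > EX_PRE_DIV_MAX)
        return -1;

    /* Number of divided clock cycles in one period. */
    c = ticks_ns / (EX_NSEC_PER_SEC * (p + 1));
    if (c > EX_CNT_MAX)
        return -1;

    *pre_div = (unsigned int)p;
    *cnt = (unsigned int)c;
    return 0;
}
```

With the inputs from the commit message (500MHz/30518ns, 24MHz/7812500ns, 24MHz/2730624ns) this reproduces the "new algorithm" values listed there.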

Re: [v2 PATCH] mm: thp: fix false negative of shmem vma's THP eligibility

2019-06-12 Thread Yang Shi





On 6/12/19 11:44 AM, Hugh Dickins wrote:

On Mon, 10 Jun 2019, Yang Shi wrote:

On 6/7/19 8:58 PM, Hugh Dickins wrote:

Yes, that is correct; and correctly placed. But a little more is needed:
see how mm/memory.c's transhuge_vma_suitable() will only allow a pmd to
be used instead of a pte if the vma offset and size permit. smaps should
not report a shmem vma as THPeligible if its offset or size prevent it.

And I see that should also be fixed on anon vmas: at present smaps
reports even a 4kB anon vma as THPeligible, which is not right.
Maybe a test like transhuge_vma_suitable() can be added into
transparent_hugepage_enabled(), to handle anon and shmem together.
I say "like transhuge_vma_suitable()", because that function needs
an address, which here you don't have.

Thanks for the reminder. Since we don't have an address, I suppose we just
need to check whether the vma's size is big enough, apart from the other
alignment checks.

And, I'm wondering whether we could reuse transhuge_vma_suitable() by passing
in an impossible address, i.e. -1 since it is not a valid userspace address.
It can be used as an indicator that this call is from THPeligible context.

Perhaps, but sounds like it will abuse and uglify transhuge_vma_suitable()
just for smaps. Would passing transhuge_vma_suitable() the address
 ((vma->vm_end & HPAGE_PMD_MASK) - HPAGE_PMD_SIZE)
give the correct answer in all cases?


Yes, it looks better.




The anon offset situation is interesting: usually anon vm_pgoff is
initialized to fit with its vm_start, so the anon offset check passes;
but I wonder what happens after mremap to a different address - does
transhuge_vma_suitable() then prevent the use of pmds where they could
actually be used? Not a Number#1 priority to investigate or fix here!
but a curiosity someone might want to look into.

Will mark on my TODO list.


Even with your changes
ShmemPmdMapped: 4096 kB
THPeligible:0
will easily be seen: THPeligible reflects whether a huge page can be
allocated and mapped by pmd in that vma; but if something else already
allocated the huge page earlier, it will be mapped by pmd in this vma
if offset and size allow, whatever THPeligible says. We could change
transhuge_vma_suitable() to force ptes in that case, but it would be
a silly change, just to make what smaps shows easier to explain.

Where did this come from? From the commit log? If so it is the example for
the wrong smap output. If that case really happens, I think we could document
it since THPeligible should just show the current status.

Please read again what I explained there: it's not necessarily an example
of wrong smaps output, it's reasonable smaps output for a reasonable case.

Yes, maybe Documentation/filesystems/proc.txt should explain "THPeligible"
a little better - "eligible for allocating THP pages" rather than just
"eligible for THP pages" would be good enough? we don't want to write
a book about the various cases.


Yes, I agree.



Oh, and the "THPeligible" output lines up very nicely there in proc.txt:
could the actual alignment of that 0 or 1 be fixed in smaps itself too?


Sure.

Thanks,
Yang



Thanks,
Hugh
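
To make the suggestion in this thread concrete, here is a userspace model of checking a vma by handing the last huge-page-aligned address below vm_end to a transhuge_vma_suitable()-like test. It is a sketch under assumptions: 4kB pages and 2MB huge pages are hard-coded, struct ex_vma is a hypothetical stand-in for struct vm_area_struct, the suitability check is simplified from the behaviour described above, and the low-address guard is an addition of this sketch to avoid unsigned wraparound.

```c
#include <stdbool.h>

#define EX_PAGE_SHIFT        12
#define EX_HPAGE_PMD_SIZE    (1UL << 21)
#define EX_HPAGE_PMD_MASK    (~(EX_HPAGE_PMD_SIZE - 1))
#define EX_HPAGE_INDEX_MASK  ((EX_HPAGE_PMD_SIZE >> EX_PAGE_SHIFT) - 1)

/* Hypothetical, minimal stand-in for struct vm_area_struct. */
struct ex_vma {
    unsigned long vm_start;
    unsigned long vm_end;
    unsigned long vm_pgoff;  /* file offset in pages */
};

/* Can a pmd-sized mapping be used at haddr inside this vma? */
static bool ex_vma_suitable(const struct ex_vma *vma, unsigned long haddr)
{
    /* The offset within the file must be congruent with the address. */
    if (((vma->vm_start >> EX_PAGE_SHIFT) & EX_HPAGE_INDEX_MASK) !=
        (vma->vm_pgoff & EX_HPAGE_INDEX_MASK))
        return false;

    /* The whole huge page must lie inside the vma. */
    if (haddr < vma->vm_start || haddr + EX_HPAGE_PMD_SIZE > vma->vm_end)
        return false;

    return true;
}

/* Address-free check: probe the last aligned huge page below vm_end. */
static bool ex_thp_eligible(const struct ex_vma *vma)
{
    unsigned long haddr;

    /* Sketch-only guard: the subtraction below would wrap otherwise. */
    if (vma->vm_end < EX_HPAGE_PMD_SIZE)
        return false;

    haddr = (vma->vm_end & EX_HPAGE_PMD_MASK) - EX_HPAGE_PMD_SIZE;
    return ex_vma_suitable(vma, haddr);
}
```

In this model a 4kB vma, a vma with a misaligned offset, and a 2MB vma that straddles a huge-page boundary are all reported as not eligible, while an aligned 4MB vma is.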

[PATCH v3 13/14] pwm: meson: add support PWM_POLARITY_INVERSED when disabling

2019-06-12 Thread Martin Blumenstingl

meson_pwm_apply() has to consider the PWM polarity when disabling the
output.
With enabled=false and polarity=PWM_POLARITY_NORMAL the output needs to
be LOW. The driver already supports this.
With enabled=false and polarity=PWM_POLARITY_INVERSED the output needs
to be HIGH. Implement this in the driver by internally enabling the
output with the same settings that we already use for "period == duty".

This fixes a PWM API violation which expects that the driver honors the
polarity also for enabled=false. Due to the IP block not supporting this
natively we only get "an as close as possible" to 100% HIGH signal (in
my test setup with input clock of 24MHz and measuring the output with a
logic analyzer at 24MHz sampling rate I got a duty cycle of 99.998475%
on a Khadas VIM).

Reviewed-by: Neil Armstrong 
Signed-off-by: Martin Blumenstingl 
---
 drivers/pwm/pwm-meson.c | 23 ++-
 1 file changed, 22 insertions(+), 1 deletion(-)

diff --git a/drivers/pwm/pwm-meson.c b/drivers/pwm/pwm-meson.c
index 900d362ec3c9..bb48ba85f756 100644
--- a/drivers/pwm/pwm-meson.c
+++ b/drivers/pwm/pwm-meson.c
@@ -245,6 +245,7 @@ static void meson_pwm_disable(struct meson_pwm *meson, 
struct pwm_device *pwm)
 static int meson_pwm_apply(struct pwm_chip *chip, struct pwm_device *pwm,
   struct pwm_state *state)
 {
+   struct meson_pwm_channel *channel = pwm_get_chip_data(pwm);
struct meson_pwm *meson = to_meson_pwm(chip);
int err = 0;
 
@@ -252,7 +253,27 @@ static int meson_pwm_apply(struct pwm_chip *chip, struct 
pwm_device *pwm,
return -EINVAL;
 
if (!state->enabled) {
-   meson_pwm_disable(meson, pwm);
+   if (state->polarity == PWM_POLARITY_INVERSED) {
+   /*
+* This IP block revision doesn't have an "always high"
+* setting which we can use for "inverted disabled".
+* Instead we achieve this using the same settings
+* that we use a pre_div of 0 (to get the shortest
+* possible duration for one "count") and
+* "period == duty_cycle". This results in a signal
+* which is LOW for one "count", while being HIGH for
+* the rest of the (so the signal is HIGH for slightly
+* less than 100% of the period, but this is the best
+* we can achieve).
+*/
+   channel->pre_div = 0;
+   channel->hi = ~0;
+   channel->lo = 0;
+
+   meson_pwm_enable(meson, pwm);
+   } else {
+   meson_pwm_disable(meson, pwm);
+   }
} else {
err = meson_pwm_calc(meson, pwm, state);
if (err < 0)
-- 
2.22.0

Re: [PATCH] locking/static_key: always define static_branch_deferred_inc

2019-06-12 Thread Jakub Kicinski

On Wed, 12 Jun 2019 15:44:09 -0400, Willem de Bruijn wrote:
> From: Willem de Bruijn 
> 
> This interface is currently only defined if CONFIG_JUMP_LABEL. Make it
> available also when jump labels are disabled.
> 
> Fixes: ad282a8117d50 ("locking/static_key: Add support for deferred static 
> branches")
> Signed-off-by: Willem de Bruijn 
> 
> ---
> 
> The original patch went into 5.2-rc1, but this interface is not yet
> used, so this could target either 5.2 or 5.3.

Can we drop the Fixes tag?  It's an ugly omission but not a bug fix.

Are you planning to switch clean_acked_data_enable() to the helper once
merged?

Thanks!

> diff --git a/include/linux/jump_label_ratelimit.h 
> b/include/linux/jump_label_ratelimit.h
> index 42710d5949ba..8c3ee291b2d8 100644
> --- a/include/linux/jump_label_ratelimit.h
> +++ b/include/linux/jump_label_ratelimit.h
> @@ -60,8 +60,6 @@ extern void jump_label_update_timeout(struct work_struct 
> *work);
>  0),  \
>   }
>  
> -#define static_branch_deferred_inc(x) static_branch_inc(&(x)->key)
> -
>  #else/* !CONFIG_JUMP_LABEL */
>  struct static_key_deferred {
>   struct static_key  key;
> @@ -95,4 +93,7 @@ jump_label_rate_limit(struct static_key_deferred *key,
>   STATIC_KEY_CHECK_USE(key);
>  }
>  #endif   /* CONFIG_JUMP_LABEL */
> +
> +#define static_branch_deferred_inc(x) static_branch_inc(&(x)->key)
> +
>  #endif   /* _LINUX_JUMP_LABEL_RATELIMIT_H */

Re: [RFC 00/10] Process-local memory allocations for hiding KVM secrets

2019-06-12 Thread Dave Hansen

On 6/12/19 10:08 AM, Marius Hillenbrand wrote:
> This patch series proposes to introduce a region for what we call
> process-local memory into the kernel's virtual address space. 

It might be fun to cc some x86 folks on this series.  They might have
some relevant opinions. ;)

A few high-level questions:

Why go to all this trouble to hide guest state like registers if all the
guest data itself is still mapped?

Where's the context-switching code?  Did I just miss it?

We've discussed having per-cpu page tables where a given PGD is only in
use from one CPU at a time.  I *think* this scheme still works in such a
case, it just adds one more PGD entry that would have to context-switched.

Re: [PATCH -next] x86/mm: fix an unused variable "tsk" warning

2019-06-12 Thread Borislav Petkov

On Wed, Jun 12, 2019 at 01:19:06PM -0500, Eric W. Biederman wrote:
> Since I am removing the tsk parameter from all of the synchronous signal
> sending functions, on all of the architectures it was easier to go
> through my own tree than -tip.

Yeah, I remember reading a mail about it...

> The removal of tsk from force_sig_fault is what caused the warning
> in do_sigbus.
> 
> My apologies I was a little slow in getting that patch added and
> generating work for other folks.

That's fine - now we know what the situation is.

Thx.

-- 
Regards/Gruss,
Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

Re: [PATCH v2 1/1] PCI/IOV: Fix incorrect cfg_size for VF > 0

2019-06-12 Thread Raslan, KarimAllah

On Wed, 2019-06-12 at 12:03 -0700, Raj, Ashok wrote:
> On Wed, Jun 12, 2019 at 12:58:17PM -0600, Alex Williamson wrote:
> > 
> > On Wed, 12 Jun 2019 11:41:36 -0700
> > sathyanarayanan kuppuswamy 
> > wrote:
> > 
> > > 
> > > On 6/12/19 11:19 AM, Alex Williamson wrote:
> > > > 
> > > > On Wed, 12 Jun 2019 10:06:47 -0700
> > > > sathyanarayanan.kuppusw...@linux.intel.com wrote:
> > > >  
> > > > > 
> > > > > From: Kuppuswamy Sathyanarayanan 
> > > > > 
> > > > > 
> > > > > Commit 975bb8b4dc93 ("PCI/IOV: Use VF0 cached config space size for
> > > > > other VFs") calculates and caches the cfg_size for VF0 device before
> > > > > initializing the pcie_cap of the device which results in using 
> > > > > incorrect
> > > > > cfg_size for all VF devices > 0. So set pcie_cap of the device before
> > > > > calculating the cfg_size of VF0 device.
> > > > > 
> > > > > Fixes: 975bb8b4dc93 ("PCI/IOV: Use VF0 cached config space size for
> > > > > other VFs")
> > > > > Cc: Ashok Raj 
> > > > > Suggested-by: Mike Campin 
> > > > > Signed-off-by: Kuppuswamy Sathyanarayanan 
> > > > > 
> > > > > ---
> > > > > 
> > > > > Changes since v1:
> > > > >   * Fixed a typo in commit message.
> > > > > 
> > > > >   drivers/pci/iov.c | 1 +
> > > > >   1 file changed, 1 insertion(+)
> > > > > 
> > > > > diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
> > > > > index 3aa115ed3a65..2869011c0e35 100644
> > > > > --- a/drivers/pci/iov.c
> > > > > +++ b/drivers/pci/iov.c
> > > > > @@ -160,6 +160,7 @@ int pci_iov_add_virtfn(struct pci_dev *dev, int 
> > > > > id)
> > > > >   virtfn->device = iov->vf_device;
> > > > >   virtfn->is_virtfn = 1;
> > > > >   virtfn->physfn = pci_dev_get(dev);
> > > > > + virtfn->pcie_cap = pci_find_capability(virtfn, PCI_CAP_ID_EXP);
> > > > >   
> > > > >   if (id == 0)
> > > > >   pci_read_vf_config_common(virtfn);  
> > > > Why not re-order until after we've setup pcie_cap?
> > > > 
> > > > https://lore.kernel.org/linux-pci/20190604143617.0a226...@x1.home/T/#  
> > > 
> > > pci_read_vf_config_common() also caches values for properties like 
> > > class, hdr_type, subsystem_vendor/device. These values are read/used in 
> > > pci_setup_device(). So if we can use cached values in 
> > > pci_setup_device(), we don't have to read them from registers twice for 
> > > each device.
> > 
> > Sorry, I missed that dependency, a bit too subtle.  It's still pretty
> > ugly that pci_setup_device()->set_pcie_port_type() is the canonical
> > location for setting pcie_cap and now we need to kludge it earlier.
> > What about the question in the self follow-up to my patch in the link
> > above, can we simply assume 4K config space on a VF?  Thanks,
> 
> There should be no issue simply reading them once? I don't know
> what that exact optimization saves, unless some broken VFs didn't
> actually expose all the capabilities in config space and this happens
> to workaround the problem.

The original patch was to save time when you have hundreds of VFs in the system 
and doing this for each one of them is just a waste of time.

> 
> + Karim
> 
> Cheers,
> Ashok




Re: possible deadlock in io_submit_one

2019-06-12 Thread Eric Biggers

Hi Bart and Christoph,

On Mon, Feb 04, 2019 at 06:03:04PM -0800, syzbot wrote:
> Hello,
> 
> syzbot found the following crash on:
> 
> HEAD commit:5eeb63359b1e Merge tag 'for-linus' of git://git.kernel.org..
> git tree:   upstream
> console output: https://syzkaller.appspot.com/x/log.txt?x=17906f64c0
> kernel config:  https://syzkaller.appspot.com/x/.config?x=2e0064f906afee10
> dashboard link: https://syzkaller.appspot.com/bug?extid=a3accb352f9c22041cfa
> compiler:   gcc (GCC) 9.0.0 20181231 (experimental)
> syz repro:  https://syzkaller.appspot.com/x/repro.syz?x=156479f8c0
> C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=128c75c4c0
> 
> IMPORTANT: if you fix the bug, please add the following tag to the commit:
> Reported-by: syzbot+a3accb352f9c22041...@syzkaller.appspotmail.com
> 
> =
> WARNING: SOFTIRQ-safe -> SOFTIRQ-unsafe lock order detected
> 5.0.0-rc4+ #56 Not tainted
> -
> syz-executor263/8874 [HC0[0]:SC0[0]:HE0:SE1] is trying to acquire:
> c469f622 (>fd_wqh){}, at: spin_lock
> include/linux/spinlock.h:329 [inline]
> c469f622 (>fd_wqh){}, at: aio_poll fs/aio.c:1772 [inline]
> c469f622 (>fd_wqh){}, at: __io_submit_one fs/aio.c:1875
> [inline]
> c469f622 (>fd_wqh){}, at: io_submit_one+0xedf/0x1cf0
> fs/aio.c:1908
> 
> and this task is already holding:
> 829de875 (&(>ctx_lock)->rlock){..-.}, at: spin_lock_irq
> include/linux/spinlock.h:354 [inline]
> 829de875 (&(>ctx_lock)->rlock){..-.}, at: aio_poll
> fs/aio.c:1771 [inline]
> 829de875 (&(>ctx_lock)->rlock){..-.}, at: __io_submit_one
> fs/aio.c:1875 [inline]
> 829de875 (&(>ctx_lock)->rlock){..-.}, at:
> io_submit_one+0xeb6/0x1cf0 fs/aio.c:1908
> which would create a new lock dependency:
>  (&(>ctx_lock)->rlock){..-.} -> (>fd_wqh){}
> 

This is still happening.  See
https://syzkaller.appspot.com/text?tag=CrashReport=129eb971a0 for a report
on Linus' tree from 5 days ago.

I see that a few months ago there was a commit

commit d3d6a18d7d351cbcc9b33dbedf710e65f8ce1595
Author: Bart Van Assche 
Date:   Fri Feb 8 16:59:49 2019 -0800

aio: Fix locking in aio_poll()

but apparently it didn't fully fix the problem.

- Eric

Re: [PATCH v3 3/4] backlight: pwm_bl: compute brightness of LED linearly to human eye.

2019-06-12 Thread Daniel Thompson

On Wed, Jun 12, 2019 at 12:26:42PM -0700, Matthias Kaehlcke wrote:
> Hi Daniel,
> 
> On Wed, Jun 12, 2019 at 12:03:25PM +0100, Daniel Thompson wrote:
> > On Tue, Jun 11, 2019 at 03:30:19PM -0700, Matthias Kaehlcke wrote:
> > > On Tue, Jun 11, 2019 at 09:55:30AM -0700, Brian Norris wrote:
> > > > On Tue, Jun 11, 2019 at 3:49 AM Daniel Thompson
> > > >  wrote:
> > > > > This is a long standing flaw in the backlight interfaces. AFAIK 
> > > > > generic
> > > > > userspaces end up with a (flawed) heuristic.
> > > > 
> > > > Bingo! Would be nice if we could start to fix this long-standing flaw.
> > > 
> > > Agreed!
> > > 
> > > What could a fix look like, a sysfs attribute? Would a boolean value
> > > like 'logarithmic_scale' or 'linear_scale' be enough or could more
> > > granularity be needed?
> > 
> > Certainly "linear" (this device will work more or less correctly if the
> > userspace applies perceptual curves). Not sure about logarithmic since
> > what is actually useful is something that is "perceptually linear"
> > (logarithmic is merely a way to approximate that).
> > 
> > I do wonder about a compatible-string-like description, ordered from
> > most detailed to least detailed. Thus for a PWM with the auto-generated
> > tables we'd see something like:
> > 
> > cie-1991,perceptual,non-linear
> > 
> > For something that is non-linear but we are not sure what its tables are
> > we can offer just "non-linear".
> 
> Thanks for the feedback!
> 
> It seems clear that we want a string for the added flexibility. I can
> work on a patch with the compatible string like description you
> suggested and we can discuss in the review if we want to go with that
> or prefer something else.

Great. The other important thing, if we do decide to go this route, is
that there must be some devices with multiple strings on day 1 (such as
the cie-1991 example above).

Whatever we say the ABI is, if we end up with established userspace
components that strcmp("linear", ...) and there are no early counter
examples then we could get stuck without the option to add more
precise tokens as we learn more.
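
To illustrate the concern, here is a minimal sketch of how a userspace
consumer could walk a most-detailed-to-least-detailed list and settle on the
first token it understands, instead of doing a single strcmp("linear", ...).
The function name and the token set are hypothetical; no such ABI exists yet.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical userspace helper: scan a comma-separated scale description,
 * ordered most-detailed first, and return the index (into known[]) of the
 * first token the caller understands, or -1 if none matches.
 */
static int pick_scale_token(const char *desc, const char *const known[],
                            size_t nknown)
{
    const char *p = desc;

    while (*p) {
        size_t len = strcspn(p, ",");   /* length of the current token */

        for (size_t i = 0; i < nknown; i++) {
            if (strlen(known[i]) == len && strncmp(p, known[i], len) == 0)
                return (int)i;
        }
        p += len;
        if (*p == ',')
            p++;                        /* skip the separator */
    }
    return -1;
}
```

A consumer written this way keeps working when a more precise token is later
put in front of the ones it already knows.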


> > > The new attribute could be optional (it only exists if explicitly
> > > specified by the driver) or be set to a default based on a heuristic
> > > if not specified and be 'fixed' on a case by case basis. The latter
> > > might violate "don't break userspace" though, so I'm not sure it's a
> > > good idea.
> > 
> > I think we should avoid any heuristic! There are several drivers and we
> > may not be able to work through all of them and make the correct
> > decision.
> 
> Agreed
> 
> > Instead one valid value for the sysfs should be "unknown" and this be
> > the default for drivers we have not analysed (this also makes it easy to
> > introduce change here).
> 
> An "unknown" value sounds good, it allows userspace to just do what it
> did/would have done before this attribute existed.
> 
> > We should only set the property to something else for drivers that have
> > been reviewed.
> > 
> > There could be a special case for pwm_bl.c in that I'm prepared to
> > assume that the hardware components downstream of the PWM have a
> > roughly linear response and that if the user provided tables that their
> > function is to provide a perceptually comfortable response.
> 
> Unfortunately this isn't universally true :(
> 
> At least several Chrome OS devices use a linear brightness scale and
> userspace does the transformation in the animated slider. A quick
> 'git grep -A10 brightness-levels arch' suggests that there are
> multiple other devices/platforms using a linear scale.

Good point.

Any clue whether the tables are "stupid" (e.g. offer a poor user experience
with notchy feeling backlight response) or whether they work because
some of the downstream componentry has a non-linear response?


> We could treat devices with a predefined brightness table as
> "unknown", unless there is a (new optional) DT property that indicates
> the type of the scale.

If we have an "unknown" and we don't know then I guess I just claimed
that's what we have to do for cases we don't understand.

For pwm_bl it would be easy to study the table and calculate how far from the
line the centre point is... although that brings heuristics back into
the picture, albeit more useful ones.
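
A rough sketch of what "how far from the line the centre point is" could
mean, assuming a monotonically increasing brightness-levels table. The
function name and the percent-of-range metric are made up for illustration
and are not taken from pwm_bl.

```c
#include <stddef.h>

/*
 * Hypothetical heuristic: compare the middle entry of a brightness table with
 * the value a straight line between the first and last entries would give.
 * Returns the deviation in percent of the table's full range: negative means
 * the table sags below the line (as a perceptual curve would), positive
 * means it bulges above. Returns 0 for tables too short or flat to judge.
 */
static int centre_deviation_percent(const unsigned int *levels, size_t n)
{
    if (n < 3 || levels[n - 1] <= levels[0])
        return 0;

    long first = levels[0];
    long last = levels[n - 1];
    size_t mid = (n - 1) / 2;
    /* value at index mid on the straight line from first to last */
    long on_line = first + (last - first) * (long)mid / (long)(n - 1);
    long actual = levels[mid];

    return (int)((actual - on_line) * 100 / (last - first));
}
```

Any threshold for calling a table "linear" on the basis of this number would
itself be a policy decision, which is the heuristic problem noted above.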

As I said... I'd be OK for the pwm_bl to take a few liberties because it
is so different from the fully fledged LED driver drivers but we don't
need to go crazy ;-)


Daniel.

Re: [PATCH v2 04/14] pwm: meson: change MISC_CLK_SEL_WIDTH to MISC_CLK_SEL_MASK

2019-06-12 Thread Martin Blumenstingl

Hi Uwe,

On Tue, Jun 11, 2019 at 6:33 PM Uwe Kleine-König
 wrote:
[...]
> > @@ -463,7 +463,7 @@ static int meson_pwm_init_channels(struct meson_pwm 
> > *meson,
> >
> >   channel->mux.reg = meson->base + REG_MISC_AB;
> >   channel->mux.shift = mux_reg_shifts[i];
> > - channel->mux.mask = BIT(MISC_CLK_SEL_WIDTH) - 1;
> > + channel->mux.mask = MISC_CLK_SEL_MASK;
> >   channel->mux.flags = 0;
> >   channel->mux.lock = >lock;
> >   channel->mux.table = NULL;
>
> IMHO clk_mux is ugly here. It could easily just take
>
> .mask = 3 << mux_reg_shifts[i],
in most cases that would be even nicer to read because it could be expressed as:
  .mask = GENMASK(5, 4)

so I like your idea in general
though I think it should not block this patch
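
For reference, a small userspace sketch showing that the two spellings agree
for a two-bit field at bits 5:4. GENMASK_U32 here is a simplified 32-bit
stand-in written for this example, not the kernel's GENMASK() itself, and
mux_field_mask() is a hypothetical helper.

```c
#include <stdint.h>

/* Simplified 32-bit stand-in for the kernel's GENMASK(h, l): bits h..l set. */
#define GENMASK_U32(h, l) \
    ((uint32_t)((~UINT32_C(0) >> (31 - (h))) & (~UINT32_C(0) << (l))))

/* Mask for a two-bit mux field that starts at bit position shift. */
static uint32_t mux_field_mask(unsigned int shift)
{
    return UINT32_C(3) << shift;
}
```

With shift == 4, mux_field_mask() and GENMASK_U32(5, 4) both give 0x30.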

[...]
> Apart from that, I wonder if the pwm-meson driver should better use
> clk_register_mux instead of open coding it. (Though there doesn't seem
> to exist a devm_ variant of it.)
I tried to use clk_register_mux in the past. it works but it's not as
nice to read as the open-coded variant because it takes 10 parameters.
I find it easier to read 13 separate lines compared to reading a
function call with 10 parameters


Martin

Re: infinite loop in read_hpet from ktime_get_boot_fast_ns

2019-06-12 Thread Arnd Bergmann

On Wed, Jun 12, 2019 at 7:55 PM Peter Zijlstra  wrote:
> On Wed, Jun 12, 2019 at 11:44:35AM +0200, Jason A. Donenfeld wrote:

> > But there's still the
> > issue of the 32-bit wraparound on the base implementation.
>
> If an architecture doesn't provide a sched_clock(), you're on a
> seriously handicapped arch. It wraps in ~500 days, and aside from
> changing jiffies_lock to a latch, I don't think we can do much about it.
>
> (the scheduler too expects sched_clock() to not wrap short of the u64
> and so having those machines online for 500 days will get you 'funny'
> results)
>
> AFAICT only: alpha, h8300, hexagon, m68knommu, nds32, nios2, openrisc
> are lacking any form of sched_clock(), the rest has it either natively
> or through sched_clock_register().

For completeness (as we already discussed on IRC), on many architectures
this depends on the clocksource driver: many (older) arm, mips, sh or
m68k implementations don't have sched_clock(). All the modern ones tend
to have one, but older ones may only support an interval timer tick that
cannot be read.
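
As a sanity check on the "~500 days" figure quoted above: assuming it refers
to a 32-bit tick counter incremented at HZ=100 (the thread does not spell
this out), the arithmetic can be sketched as follows. The helper name is
made up for this example.

```c
#include <stdint.h>

/*
 * Whole days until a 32-bit tick counter that increments hz times per
 * second wraps around. hz must be non-zero.
 */
static uint64_t days_until_u32_wrap(unsigned int hz)
{
    const uint64_t ticks = UINT64_C(1) << 32;   /* distinct counter values */

    return ticks / hz / 86400;                  /* ticks -> seconds -> days */
}
```

At 100 ticks per second this gives 497 days; at 1000 ticks per second it
drops to 49.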

Arnd

Re: [PATCH 08/15] x86/alternatives: Teach text_poke_bp() to emulate instructions

2019-06-12 Thread Nadav Amit

> On Jun 11, 2019, at 8:55 AM, Peter Zijlstra  wrote:
> 
> On Tue, Jun 11, 2019 at 11:22:54AM -0400, Steven Rostedt wrote:
>> On Tue, 11 Jun 2019 10:03:07 +0200
>> Peter Zijlstra  wrote:
>> 
>> 
>>> So what happens is that arch_prepare_optimized_kprobe() <-
>>> copy_optimized_instructions() copies however much of the instruction
>>> stream is required such that we can overwrite the instruction at @addr
>>> with a 5 byte jump.
>>> 
>>> arch_optimize_kprobe() then does the text_poke_bp() that replaces the
>>> instruction @addr with int3, copies the rel jump address and overwrites
>>> the int3 with jmp.
>>> 
>>> And I'm thinking the problem is with something like:
>>> 
>>> @addr: nop nop nop nop nop
>> 
>> What would work would be to:
>> 
>>  add breakpoint to first opcode.
>> 
>>  call synchronize_tasks();
>> 
>>  /* All tasks now hitting breakpoint and jumping over affected
>>  code */
>> 
>>  update the rest of the instructions.
>> 
>>  replace breakpoint with jmp.
>> 
>> One caveat is that the replaced instructions must not be a call
>> function. As if the call function calls schedule then it will
>> circumvent the synchronize_tasks(). It would be OK if that call is the
>> last of the instructions. But I doubt we modify anything more than a
>> call size anyway, so this should still work for all current instances.
> 
> Right, something like this could work (although I cannot currently find
> synchronize_tasks), but it would make the optprobe stuff fairly slow
> (iirc this sync_tasks() thing could be pretty horrible).

I have run into similar problems before.

I had two problematic scenarios. In the first case, I had a “call” in the
middle of the patched code-block, but this call was always followed by a
“jump” to the end of the potentially patched code-block, so I did not have
the problem.

In the second case, I had an indirect call (which is shorter than a direct
call) being patched into a direct call. In this case, I preceded the
indirect call with NOPs so indeed the indirect call was at the end of the
patched block.

In certain cases, if a shorter instruction should be potentially patched
into a longer one, the shorter one can be preceded by some prefixes. If
there are multiple REX prefixes, for instance, the CPU only uses the last
one, IIRC. This makes it possible to avoid synchronize_sched() when patching a
single instruction into another instruction with a different length.

Not sure how helpful this information is, but sharing - just in case.

[PATCH] locking/static_key: always define static_branch_deferred_inc

2019-06-12 Thread Willem de Bruijn

From: Willem de Bruijn 

This interface is currently only defined if CONFIG_JUMP_LABEL is set. Make
it available also when jump labels are disabled.

Fixes: ad282a8117d50 ("locking/static_key: Add support for deferred static 
branches")
Signed-off-by: Willem de Bruijn 

---

The original patch went into 5.2-rc1, but this interface is not yet
used, so this could target either 5.2 or 5.3.

---
 include/linux/jump_label_ratelimit.h | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/include/linux/jump_label_ratelimit.h 
b/include/linux/jump_label_ratelimit.h
index 42710d5949ba..8c3ee291b2d8 100644
--- a/include/linux/jump_label_ratelimit.h
+++ b/include/linux/jump_label_ratelimit.h
@@ -60,8 +60,6 @@ extern void jump_label_update_timeout(struct work_struct 
*work);
   0),  \
}
 
-#define static_branch_deferred_inc(x)  static_branch_inc(&(x)->key)
-
 #else  /* !CONFIG_JUMP_LABEL */
 struct static_key_deferred {
struct static_key  key;
@@ -95,4 +93,7 @@ jump_label_rate_limit(struct static_key_deferred *key,
STATIC_KEY_CHECK_USE(key);
 }
 #endif /* CONFIG_JUMP_LABEL */
+
+#define static_branch_deferred_inc(x)  static_branch_inc(&(x)->key)
+
 #endif /* _LINUX_JUMP_LABEL_RATELIMIT_H */
-- 
2.22.0.rc2.383.gf4fbbf30c2-goog

Re: [BISECTED REGRESSION] b43legacy broken on G4 PowerBook

2019-06-12 Thread Larry Finger


On 6/12/19 1:55 AM, Christoph Hellwig wrote:


Ooops, yes.  But I think we could just enable ZONE_DMA on 32-bit
powerpc.  Crude enablement hack below:

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 8c1c636308c8..1dd71a98b70c 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -372,7 +372,7 @@ config PPC_ADV_DEBUG_DAC_RANGE
  
  config ZONE_DMA

bool
-   default y if PPC_BOOK3E_64
+   default y
  
  config PGTABLE_LEVELS

int



With the patch for Kconfig above, and the original patch setting 
ARCH_ZONE_DMA_BITS to 30, everything works.


Do you have any ideas on what should trigger the change in ARCH_ZONE_DMA_BITS?
Should it be CONFIG_PPC32 defined, or perhaps CONFIG_G4_CPU defined?


Larry

Re: [PATCH -next] mm/hotplug: skip bad PFNs from pfn_to_online_page()

2019-06-12 Thread Dan Williams

On Wed, Jun 12, 2019 at 12:37 PM Dan Williams  wrote:
>
> On Wed, Jun 12, 2019 at 12:16 PM Qian Cai  wrote:
> >
> > The linux-next commit "mm/sparsemem: Add helpers track active portions
> > of a section at boot" [1] causes a crash below when the first kmemleak
> > scan kthread kicks in. This is because kmemleak_scan() calls
> > pfn_to_online_page() which calls pfn_valid_within() instead of
> > pfn_valid() on x86 due to CONFIG_HOLES_IN_ZONE=n.
> >
> > The commit [1] did add an additional check of pfn_section_valid() in
> > pfn_valid(), but forgot to add it in the above code path.
> >
> > page:ea0002748000 is uninitialized and poisoned
> > raw:    
> > raw:    
> > page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p))
> > [ cut here ]
> > kernel BUG at include/linux/mm.h:1084!
> > invalid opcode:  [#1] SMP DEBUG_PAGEALLOC KASAN PTI
> > CPU: 5 PID: 332 Comm: kmemleak Not tainted 5.2.0-rc4-next-20190612+ #6
> > Hardware name: Lenovo ThinkSystem SR530 -[7X07RCZ000]-/-[7X07RCZ000]-,
> > BIOS -[TEE113T-1.00]- 07/07/2017
> > RIP: 0010:kmemleak_scan+0x6df/0xad0
> > Call Trace:
> >  kmemleak_scan_thread+0x9f/0xc7
> >  kthread+0x1d2/0x1f0
> >  ret_from_fork+0x35/0x4
> >
> > [1] https://patchwork.kernel.org/patch/10977957/
> >
> > Signed-off-by: Qian Cai 
> > ---
> >  include/linux/memory_hotplug.h | 1 +
> >  1 file changed, 1 insertion(+)
> >
> > diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
> > index 0b8a5e5ef2da..f02be86077e3 100644
> > --- a/include/linux/memory_hotplug.h
> > +++ b/include/linux/memory_hotplug.h
> > @@ -28,6 +28,7 @@
> > unsigned long ___nr = pfn_to_section_nr(___pfn);   \
> >\
> > if (___nr < NR_MEM_SECTIONS && online_section_nr(___nr) && \
> > +   pfn_section_valid(__nr_to_section(___nr), pfn) &&  \
> > pfn_valid_within(___pfn))  \
> > ___page = pfn_to_page(___pfn); \
> > ___page;   \
>
> Looks ok to me:
>
> Acked-by: Dan Williams 
>
> ...but why is pfn_to_online_page() a multi-line macro instead of a
> static inline like all the helper routines it invokes?

I do need to send out a refreshed version of the sub-section patchset,
so I'll fold this in and give you a Reported-by credit.

Re: [PATCH -next] mm/hotplug: skip bad PFNs from pfn_to_online_page()

2019-06-12 Thread Dan Williams

On Wed, Jun 12, 2019 at 12:16 PM Qian Cai  wrote:
>
> The linux-next commit "mm/sparsemem: Add helpers track active portions
> of a section at boot" [1] causes a crash below when the first kmemleak
> scan kthread kicks in. This is because kmemleak_scan() calls
> pfn_to_online_page() which calls pfn_valid_within() instead of
> pfn_valid() on x86 due to CONFIG_HOLES_IN_ZONE=n.
>
> The commit [1] did add an additional check of pfn_section_valid() in
> pfn_valid(), but forgot to add it in the above code path.
>
> page:ea0002748000 is uninitialized and poisoned
> raw:    
> raw:    
> page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p))
> [ cut here ]
> kernel BUG at include/linux/mm.h:1084!
> invalid opcode:  [#1] SMP DEBUG_PAGEALLOC KASAN PTI
> CPU: 5 PID: 332 Comm: kmemleak Not tainted 5.2.0-rc4-next-20190612+ #6
> Hardware name: Lenovo ThinkSystem SR530 -[7X07RCZ000]-/-[7X07RCZ000]-,
> BIOS -[TEE113T-1.00]- 07/07/2017
> RIP: 0010:kmemleak_scan+0x6df/0xad0
> Call Trace:
>  kmemleak_scan_thread+0x9f/0xc7
>  kthread+0x1d2/0x1f0
>  ret_from_fork+0x35/0x4
>
> [1] https://patchwork.kernel.org/patch/10977957/
>
> Signed-off-by: Qian Cai 
> ---
>  include/linux/memory_hotplug.h | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
> index 0b8a5e5ef2da..f02be86077e3 100644
> --- a/include/linux/memory_hotplug.h
> +++ b/include/linux/memory_hotplug.h
> @@ -28,6 +28,7 @@
> unsigned long ___nr = pfn_to_section_nr(___pfn);   \
>\
> if (___nr < NR_MEM_SECTIONS && online_section_nr(___nr) && \
> +   pfn_section_valid(__nr_to_section(___nr), pfn) &&  \
> pfn_valid_within(___pfn))  \
> ___page = pfn_to_page(___pfn); \
> ___page;   \

Looks ok to me:

Acked-by: Dan Williams 

...but why is pfn_to_online_page() a multi-line macro instead of a
static inline like all the helper routines it invokes?
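
One general argument for a static inline over a macro, sketched below with
made-up names: a macro pastes its argument expression in wherever it is
used, so an argument with side effects can be evaluated more than once. The
added line in the quoted diff passes `pfn` rather than `___pfn`, which would
have that effect if `pfn` is the macro's own parameter. This is an
illustration of the hazard only, not the kernel code.

```c
/* Macro form: the argument expression is pasted in, so it runs twice. */
#define IN_RANGE_MACRO(x, lo, hi) ((x) >= (lo) && (x) < (hi))

/* Inline form: the argument is evaluated once, at the call site. */
static inline int in_range_inline(unsigned long x, unsigned long lo,
                                  unsigned long hi)
{
    return x >= lo && x < hi;
}

/* Number of times next_pfn() has been called. */
static unsigned long calls;

/* Helper with a visible side effect, used to count evaluations. */
static unsigned long next_pfn(void)
{
    return ++calls;
}
```

Passing next_pfn() to IN_RANGE_MACRO() bumps the counter twice, while
in_range_inline() bumps it once.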

[PATCH 7/8] sched,fair: refactor enqueue/dequeue_entity

2019-06-12 Thread Rik van Riel

Refactor enqueue_entity, dequeue_entity, and update_load_avg, in order
to split out the things we still want to happen at every level in the
cgroup hierarchy with a flat runqueue from the things we only need to
happen once.

No functional changes.

Signed-off-by: Rik van Riel 
---
 kernel/sched/fair.c | 65 +
 1 file changed, 42 insertions(+), 23 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 35153a89d5c5..c2baf3c8a879 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3481,17 +3481,17 @@ static void detach_entity_load_avg(struct cfs_rq 
*cfs_rq, struct sched_entity *s
 #define DO_ATTACH  0x4
 
 /* Update task and its cfs_rq load average */
-static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity 
*se, int flags)
+static inline bool update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity 
*se, int flags)
 {
u64 now = cfs_rq_clock_pelt(cfs_rq);
-   int decayed;
+   int decayed, updated = 0;
 
/*
 * Track task load average for carrying it to new CPU after migrated, 
and
 * track group sched_entity load average for task_h_load calc in 
migration
 */
if (se->avg.last_update_time && !(flags & SKIP_AGE_LOAD))
-   __update_load_avg_se(now, cfs_rq, se);
+   updated = __update_load_avg_se(now, cfs_rq, se);
 
decayed  = update_cfs_rq_load_avg(now, cfs_rq);
decayed |= propagate_entity_load_avg(se);
@@ -3510,6 +3510,8 @@ static inline void update_load_avg(struct cfs_rq *cfs_rq, 
struct sched_entity *s
 
} else if (decayed && (flags & UPDATE_TG))
update_tg_load_avg(cfs_rq, 0);
+
+   return decayed | updated;
 }
 
 #ifndef CONFIG_64BIT
@@ -3851,6 +3853,24 @@ static inline void check_schedstat_required(void)
  * CPU and an up-to-date min_vruntime on the destination CPU.
  */
 
+static bool
+enqueue_entity_groups(struct cfs_rq *cfs_rq, struct sched_entity *se, int 
flags)
+{
+   /*
+* When enqueuing a sched_entity, we must:
+*   - Update loads to have both entity and cfs_rq synced with now.
+*   - Add its load to cfs_rq->runnable_avg
+*   - For group_entity, update its weight to reflect the new share of
+* its group cfs_rq
+*   - Add its new weight to cfs_rq->load.weight
+*/
+   if (!update_load_avg(cfs_rq, se, UPDATE_TG | DO_ATTACH))
+   return false;
+
+   update_cfs_group(se);
+   return true;
+}
+
 static void
 enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 {
@@ -3875,16 +3895,6 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct 
sched_entity *se, int flags)
if (renorm && !curr)
se->vruntime += cfs_rq->min_vruntime;
 
-   /*
-* When enqueuing a sched_entity, we must:
-*   - Update loads to have both entity and cfs_rq synced with now.
-*   - Add its load to cfs_rq->runnable_avg
-*   - For group_entity, update its weight to reflect the new share of
-* its group cfs_rq
-*   - Add its new weight to cfs_rq->load.weight
-*/
-   update_load_avg(cfs_rq, se, UPDATE_TG | DO_ATTACH);
-   update_cfs_group(se);
enqueue_runnable_load_avg(cfs_rq, se);
account_entity_enqueue(cfs_rq, se);
 
@@ -3951,14 +3961,9 @@ static void clear_buddies(struct cfs_rq *cfs_rq, struct 
sched_entity *se)
 
 static __always_inline void return_cfs_rq_runtime(struct cfs_rq *cfs_rq);
 
-static void
-dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
+static bool 
+dequeue_entity_groups(struct cfs_rq *cfs_rq, struct sched_entity *se, int 
flags)
 {
-   /*
-* Update run-time statistics of the 'current'.
-*/
-   update_curr(cfs_rq);
-
/*
 * When dequeuing a sched_entity, we must:
 *   - Update loads to have both entity and cfs_rq synced with now.
@@ -3967,7 +3972,21 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct 
sched_entity *se, int flags)
 *   - For group entity, update its weight to reflect the new share
 * of its group cfs_rq.
 */
-   update_load_avg(cfs_rq, se, UPDATE_TG);
+   if (!update_load_avg(cfs_rq, se, UPDATE_TG))
+   return false;
+   update_cfs_group(se);
+
+   return true;
+}
+
+static void
+dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
+{
+   /*
+* Update run-time statistics of the 'current'.
+*/
+   update_curr(cfs_rq);
+
dequeue_runnable_load_avg(cfs_rq, se);
 
update_stats_dequeue(cfs_rq, se, flags);
@@ -3991,8 +4010,6 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity 
*se, int flags)
/* return excess runtime on last dequeue */
return_cfs_rq_runtime(cfs_rq);
 
-   update_cfs_group(se);
-
/*
 * Now advance min_vruntime if @se was the entity holding

[PATCH 6/8] sched,cfs: fix zero length timeslice calculation

2019-06-12 Thread Rik van Riel

The way the time slice length is currently calculated, not only do high
priority tasks get longer time slices than low priority tasks, but due
to fixed point math, low priority tasks could end up with a zero length
time slice. This can lead to cache thrashing and other inefficiencies.

Simplify the logic a little bit, and cap the minimum time slice length
to sysctl_sched_min_granularity.

Tasks that end up getting a time slice length too long for their relative
priority will simply end up having their vruntime advanced much faster than
other tasks, resulting in them receiving time slices less frequently.

Signed-off-by: Rik van Riel 
---
 kernel/sched/fair.c | 25 -
 1 file changed, 8 insertions(+), 17 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c6ede2ecc935..35153a89d5c5 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -670,22 +670,6 @@ static inline u64 calc_delta_fair(u64 delta, struct 
sched_entity *se)
return delta;
 }
 
-/*
- * The idea is to set a period in which each task runs once.
- *
- * When there are too many tasks (sched_nr_latency) we have to stretch
- * this period because otherwise the slices get too small.
- *
- * p = (nr <= nl) ? l : l*nr/nl
- */
-static u64 __sched_period(unsigned long nr_running)
-{
-   if (unlikely(nr_running > sched_nr_latency))
-   return nr_running * sysctl_sched_min_granularity;
-   else
-   return sysctl_sched_latency;
-}
-
 /*
  * We calculate the wall-time slice from the period by taking a part
  * proportional to the weight.
@@ -694,7 +678,7 @@ static u64 __sched_period(unsigned long nr_running)
  */
 static u64 sched_slice(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
-   u64 slice = __sched_period(cfs_rq->nr_running + !se->on_rq);
+   u64 slice = sysctl_sched_latency;
 
for_each_sched_entity(se) {
struct load_weight *load;
@@ -711,6 +695,13 @@ static u64 sched_slice(struct cfs_rq *cfs_rq, struct 
sched_entity *se)
}
slice = __calc_delta(slice, se->load.weight, load);
}
+
+   /*
+* To avoid cache thrashing, run at least sysctl_sched_min_granularity.
+* The vruntime of a low priority task advances faster; those tasks
+* will simply get time slices less frequently.
+*/
+   slice = max_t(u64, slice, sysctl_sched_min_granularity);
return slice;
 }
 
-- 
2.20.1
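
A simplified userspace sketch of the clamping idea in the patch above. It
uses plain integer division in place of the kernel's __calc_delta()
fixed-point math, and the function and parameter names are invented for
this example.

```c
#include <stdint.h>

/*
 * Sketch: give an entity a share of latency_ns proportional to its weight,
 * but never less than min_gran_ns. total_weight must be non-zero.
 */
static uint64_t slice_ns(uint64_t latency_ns, uint64_t min_gran_ns,
                         uint64_t weight, uint64_t total_weight)
{
    uint64_t slice = latency_ns * weight / total_weight;

    /* Cap the minimum so that low-weight entities still get a usable slice. */
    return slice > min_gran_ns ? slice : min_gran_ns;
}
```

With an illustrative 6 ms latency and 0.75 ms minimum, an entity holding half
the total weight gets 3 ms, while one with weight 15 out of 88776 is raised
from about 1 us to the 0.75 ms floor.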

[PATCH 4/8] sched,fair: remove cfs rqs from leaf_cfs_rq_list bottom up

2019-06-12 Thread Rik van Riel

Reducing the overhead of the CPU controller is achieved by not walking
all the sched_entities every time a task is enqueued or dequeued.

One of the things being checked every single time is whether the cfs_rq
is on the rq->leaf_cfs_rq_list.

By only removing a cfs_rq from the list once it no longer has children
on the list, we can avoid walking the sched_entity hierarchy if the bottom
cfs_rq is on the list, once the runqueues have been flattened.

Signed-off-by: Rik van Riel 
---
 kernel/sched/fair.c  | 17 +
 kernel/sched/sched.h |  1 +
 2 files changed, 18 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index aebd43d74468..dcc521d251e3 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -285,6 +285,13 @@ static inline bool list_add_leaf_cfs_rq(struct cfs_rq 
*cfs_rq)
 
cfs_rq->on_list = 1;
 
+   /*
+* If the tmp_alone_branch cursor was moved, it means a child cfs_rq
+* is already on the list ahead of us.
+*/
+   if (rq->tmp_alone_branch != >leaf_cfs_rq_list)
+   cfs_rq->children_on_list++;
+
/*
 * Ensure we either appear before our parent (if already
 * enqueued) or force our parent to appear after us when it is
@@ -310,6 +317,7 @@ static inline bool list_add_leaf_cfs_rq(struct cfs_rq 
*cfs_rq)
 * list.
 */
rq->tmp_alone_branch = >leaf_cfs_rq_list;
+   cfs_rq->tg->parent->cfs_rq[cpu]->children_on_list++;
return true;
}
 
@@ -358,6 +366,11 @@ static inline void list_del_leaf_cfs_rq(struct cfs_rq 
*cfs_rq)
if (rq->tmp_alone_branch == _rq->leaf_cfs_rq_list)
rq->tmp_alone_branch = cfs_rq->leaf_cfs_rq_list.prev;
 
+   if (cfs_rq->tg->parent) {
+   int cpu = cpu_of(rq);
+   cfs_rq->tg->parent->cfs_rq[cpu]->children_on_list--;
+   }
+
list_del_rcu(_rq->leaf_cfs_rq_list);
cfs_rq->on_list = 0;
}
@@ -7688,6 +7701,10 @@ static inline bool cfs_rq_is_decayed(struct cfs_rq 
*cfs_rq)
if (cfs_rq->avg.util_sum)
return false;
 
+   /* Remove decayed parents once their decayed children are gone. */
+   if (cfs_rq->children_on_list)
+   return false;
+
return true;
 }
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 5be14cee61f9..18494b1a9bac 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -556,6 +556,7 @@ struct cfs_rq {
 * This list is used during load balance.
 */
int on_list;
+   int children_on_list;
struct list_headleaf_cfs_rq_list;
struct task_group   *tg;/* group that "owns" this runqueue */
 
-- 
2.20.1
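
The bookkeeping idea in the patch above, reduced to a toy model: a parent
stays on the list while any child is still on it. The struct and function
names are hypothetical and the single `load` field stands in for the several
load sums that cfs_rq_is_decayed() checks.

```c
#include <stdbool.h>
#include <stddef.h>

struct node {
    struct node *parent;
    bool on_list;
    int children_on_list;   /* children currently on the list */
    unsigned long load;     /* stand-in for the decaying load sums */
};

/* Put a node on the list and tell its parent about it. */
static void node_add(struct node *n)
{
    if (n->on_list)
        return;
    n->on_list = true;
    if (n->parent)
        n->parent->children_on_list++;
}

/* A node counts as decayed only once its children have left the list. */
static bool node_is_decayed(const struct node *n)
{
    return n->load == 0 && n->children_on_list == 0;
}

/* Remove a decayed node; returns false if it has to stay. */
static bool node_try_del(struct node *n)
{
    if (!n->on_list || !node_is_decayed(n))
        return false;
    n->on_list = false;
    if (n->parent)
        n->parent->children_on_list--;
    return true;
}
```

In this model removal naturally proceeds bottom up: the parent can only go
after its last listed child has gone.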

[PATCH 8/8] sched,fair: flatten hierarchical runqueues

2019-06-12 Thread Rik van Riel

Flatten the hierarchical runqueues into just the per CPU rq.cfs runqueue.

Iteration of the sched_entity hierarchy is rate limited to once per jiffy
per sched_entity, which is a smaller change than it seems, because load
average adjustments were already rate limited to once per jiffy before this
patch series.

This patch breaks CONFIG_CFS_BANDWIDTH. The plan for that is to park tasks
from throttled cgroups onto their cgroup runqueues, and slowly (using the
GENTLE_FAIR_SLEEPERS) wake them back up, in vruntime order, once the cgroup
gets unthrottled, to prevent thundering herd issues.

Signed-off-by: Rik van Riel 
---
 include/linux/sched.h |   2 +
 kernel/sched/fair.c   | 478 +-
 kernel/sched/pelt.c   |   6 +-
 kernel/sched/pelt.h   |   2 +-
 kernel/sched/sched.h  |   2 +-
 5 files changed, 194 insertions(+), 296 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index f5bb6948e40c..05ed40b304dc 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -454,6 +454,8 @@ struct sched_entity {
 #ifdef CONFIG_FAIR_GROUP_SCHED
int depth;
unsigned long   enqueued_h_load;
+   unsigned long   enqueued_h_weight;
+   struct load_weight  h_load;
struct sched_entity *parent;
/* rq on which this entity is (to be) queued: */
struct cfs_rq   *cfs_rq;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c2baf3c8a879..29bdfbd4dc2e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -242,6 +242,9 @@ static u64 __calc_delta(u64 delta_exec, unsigned long 
weight, struct load_weight
 
 
 const struct sched_class fair_sched_class;
+static unsigned long task_se_h_weight(struct sched_entity *se);
+static unsigned long task_se_h_load(struct sched_entity *se);
+static unsigned long task_h_load(struct task_struct *p);
 
 /**
  * CFS operations on generic schedulable entities:
@@ -395,7 +398,6 @@ static inline void assert_list_leaf_cfs_rq(struct rq *rq)
list_for_each_entry_safe(cfs_rq, pos, >leaf_cfs_rq_list,\
 leaf_cfs_rq_list)
 
-/* Do the two (enqueued) entities belong to the same group ? */
 static inline struct cfs_rq *
 is_same_group(struct sched_entity *se, struct sched_entity *pse)
 {
@@ -410,6 +412,11 @@ static inline struct sched_entity *parent_entity(struct 
sched_entity *se)
return se->parent;
 }
 
+static inline bool task_se_in_cgroup(struct sched_entity *se)
+{
+   return parent_entity(se);
+}
+
 static void
 find_matching_se(struct sched_entity **se, struct sched_entity **pse)
 {
@@ -442,6 +449,19 @@ find_matching_se(struct sched_entity **se, struct 
sched_entity **pse)
}
 }
 
+/* Add the cgroup cfs_rqs to the list, for update_blocked_averages */
+static void enqueue_entity_cfs_rqs(struct sched_entity *se)
+{
+   SCHED_WARN_ON(!entity_is_task(se));
+
+   for_each_sched_entity(se) {
+   struct cfs_rq *cfs_rq = group_cfs_rq_of_parent(se);
+
+   if (list_add_leaf_cfs_rq(cfs_rq))
+   break;
+   }
+}
+
 #else  /* !CONFIG_FAIR_GROUP_SCHED */
 
 static inline struct task_struct *task_of(struct sched_entity *se)
@@ -492,6 +512,11 @@ static inline struct sched_entity *parent_entity(struct 
sched_entity *se)
return NULL;
 }
 
+static inline bool task_se_in_cgroup(struct sched_entity *se)
+{
+   return false;
+}
+
 static inline void
 find_matching_se(struct sched_entity **se, struct sched_entity **pse)
 {
@@ -664,8 +689,14 @@ int sched_proc_update_handler(struct ctl_table *table, int 
write,
  */
 static inline u64 calc_delta_fair(u64 delta, struct sched_entity *se)
 {
-   if (unlikely(se->load.weight != NICE_0_LOAD))
+   if (task_se_in_cgroup(se)) {
+   unsigned long h_load = task_se_h_load(se);
+   if (h_load != se->h_load.weight)
+   update_load_set(>h_load, h_load);
+   delta = __calc_delta(delta, NICE_0_LOAD, >h_load);
+   } else if (unlikely(se->load.weight != NICE_0_LOAD)) {
delta = __calc_delta(delta, NICE_0_LOAD, >load);
+   }
 
return delta;
 }
@@ -679,22 +710,16 @@ static inline u64 calc_delta_fair(u64 delta, struct 
sched_entity *se)
 static u64 sched_slice(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
u64 slice = sysctl_sched_latency;
+   struct load_weight *load = _rq->load;
+   struct load_weight lw;
 
-   for_each_sched_entity(se) {
-   struct load_weight *load;
-   struct load_weight lw;
+   if (unlikely(!se->on_rq)) {
+   lw = cfs_rq->load;
 
-   cfs_rq = cfs_rq_of(se);
-   load = _rq->load;
-
-   if (unlikely(!se->on_rq)) {
-   lw = cfs_rq->load;
-
-

[PATCH 1/8] sched: introduce task_se_h_load helper

2019-06-12 Thread Rik van Riel

Sometimes the hierarchical load of a sched_entity needs to be calculated.
Split out task_h_load into a task_se_h_load that takes a sched_entity pointer
as its argument, and a task_h_load wrapper that calls task_se_h_load.

No functional changes.

Signed-off-by: Rik van Riel 
---
 kernel/sched/fair.c | 17 ++---
 1 file changed, 14 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f35930f5e528..df624f7a68e7 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -706,6 +706,7 @@ static u64 sched_vslice(struct cfs_rq *cfs_rq, struct 
sched_entity *se)
 #ifdef CONFIG_SMP
 
 static int select_idle_sibling(struct task_struct *p, int prev_cpu, int cpu);
+static unsigned long task_se_h_load(struct sched_entity *se);
 static unsigned long task_h_load(struct task_struct *p);
 static unsigned long capacity_of(int cpu);
 
@@ -7833,14 +7834,19 @@ static void update_cfs_rq_h_load(struct cfs_rq *cfs_rq)
}
 }
 
-static unsigned long task_h_load(struct task_struct *p)
+static unsigned long task_se_h_load(struct sched_entity *se)
 {
-   struct cfs_rq *cfs_rq = task_cfs_rq(p);
+   struct cfs_rq *cfs_rq = cfs_rq_of(se);
 
update_cfs_rq_h_load(cfs_rq);
-   return div64_ul(p->se.avg.load_avg * cfs_rq->h_load,
+   return div64_ul(se->avg.load_avg * cfs_rq->h_load,
cfs_rq_load_avg(cfs_rq) + 1);
 }
+
+static unsigned long task_h_load(struct task_struct *p)
+{
+   return task_se_h_load(>se);
+}
 #else
 static inline void update_blocked_averages(int cpu)
 {
@@ -7865,6 +7871,11 @@ static inline void update_blocked_averages(int cpu)
rq_unlock_irqrestore(rq, );
 }
 
+static unsigned long task_se_h_load(struct sched_entity *se)
+{
+   return se->avg.load_avg;
+}
+
 static unsigned long task_h_load(struct task_struct *p)
 {
return p->se.avg.load_avg;
-- 
2.20.1
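
The arithmetic in task_se_h_load() above, pulled out into a standalone
sketch. The function and parameter names are invented here; the "+ 1" in the
divisor mirrors the diff and avoids a division by zero.

```c
#include <stdint.h>

/*
 * Sketch: scale an entity's load average by its runqueue's hierarchical
 * load, relative to the runqueue's own load average.
 */
static unsigned long h_load_share(unsigned long se_load_avg,
                                  unsigned long rq_h_load,
                                  unsigned long rq_load_avg)
{
    uint64_t num = (uint64_t)se_load_avg * rq_h_load;

    return (unsigned long)(num / ((uint64_t)rq_load_avg + 1));
}
```

For example, an entity with load 100 on a runqueue with h_load 300 and load
average 199 ends up with a hierarchical load of 150.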

[PATCH 5/8] sched,cfs: use explicit cfs_rq of parent se helper

2019-06-12 Thread Rik van Riel

Use an explicit "cfs_rq of parent sched_entity" helper in a few
strategic places, where cfs_rq_of(se) may no longer point at the
right runqueue once we flatten the hierarchical cgroup runqueues.

No functional change.

Signed-off-by: Rik van Riel 
---
 kernel/sched/fair.c | 17 +
 1 file changed, 13 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index dcc521d251e3..c6ede2ecc935 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -275,6 +275,15 @@ static inline struct cfs_rq *group_cfs_rq(struct 
sched_entity *grp)
return grp->my_q;
 }
 
+/* runqueue owned by the parent entity */
+static inline struct cfs_rq *group_cfs_rq_of_parent(struct sched_entity *se)
+{
+   if (se->parent)
+   return group_cfs_rq(se->parent);
+
+   return _rq_of(se)->rq->cfs;
+}
+
 static inline bool list_add_leaf_cfs_rq(struct cfs_rq *cfs_rq)
 {
struct rq *rq = rq_of(cfs_rq);
@@ -3298,7 +3307,7 @@ static inline int propagate_entity_load_avg(struct 
sched_entity *se)
 
gcfs_rq->propagate = 0;
 
-   cfs_rq = cfs_rq_of(se);
+   cfs_rq = group_cfs_rq_of_parent(se);
 
add_tg_cfs_propagate(cfs_rq, gcfs_rq->prop_runnable_sum);
 
@@ -7779,7 +7788,7 @@ static void update_cfs_rq_h_load(struct cfs_rq *cfs_rq)
 
WRITE_ONCE(cfs_rq->h_load_next, NULL);
for_each_sched_entity(se) {
-   cfs_rq = cfs_rq_of(se);
+   cfs_rq = group_cfs_rq_of_parent(se);
WRITE_ONCE(cfs_rq->h_load_next, se);
if (cfs_rq->last_h_load_update == now)
break;
@@ -7802,7 +7811,7 @@ static void update_cfs_rq_h_load(struct cfs_rq *cfs_rq)
 
 static unsigned long task_se_h_load(struct sched_entity *se)
 {
-   struct cfs_rq *cfs_rq = cfs_rq_of(se);
+   struct cfs_rq *cfs_rq = group_cfs_rq_of_parent(se);
 
update_cfs_rq_h_load(cfs_rq);
return div64_ul(se->avg.load_avg * cfs_rq->h_load,
@@ -10159,7 +10168,7 @@ static void task_tick_fair(struct rq *rq, struct 
task_struct *curr, int queued)
struct sched_entity *se = &curr->se;
 
for_each_sched_entity(se) {
-   cfs_rq = cfs_rq_of(se);
+   cfs_rq = group_cfs_rq_of_parent(se);
entity_tick(cfs_rq, se, queued);
}
 
-- 
2.20.1
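
The helper introduced in this patch can be pictured with a toy model. The
structures below are simplified stand-ins invented for the example (the real
sched_entity reaches its rq through cfs_rq_of(se), not through a direct
pointer); the sketch only shows the "parent's runqueue, else the CPU's root
cfs_rq" decision.

```c
#include <stddef.h>

struct toy_cfs_rq { int id; };

/* Per-CPU runqueue holding the root cfs_rq. */
struct toy_rq { struct toy_cfs_rq cfs; };

struct toy_se {
	struct toy_se *parent;   /* NULL for a top-level entity */
	struct toy_cfs_rq *my_q; /* runqueue owned by a group entity */
	struct toy_rq *rq;       /* CPU runqueue (shortcut for the toy model) */
};

/* Runqueue owned by the parent entity, or the CPU's root cfs_rq. */
static struct toy_cfs_rq *toy_cfs_rq_of_parent(struct toy_se *se)
{
	if (se->parent)
		return se->parent->my_q;

	return &se->rq->cfs;
}
```

In this model a task inside a group resolves to the group's own runqueue,
while the group entity itself resolves to the CPU's root runqueue, regardless
of where either entity is actually queued.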

[PATCH 3/8] sched,fair: redefine runnable_load_avg as the sum of task_h_load

2019-06-12 Thread Rik van Riel

The runnable_load magic is used to quickly propagate information about
runnable tasks up the hierarchy of runqueues. When switching to a flat
runqueue, that no longer works.

Redefine the CPU cfs_rq runnable_load_avg to be the sum of task_h_loads
of the runnable tasks. This provides enough information to the load
balancer.

The runnable_load_avg of the cgroup cfs_rqs does not appear to be
used for anything, so don't bother calculating those.

This removes one of the things that the code currently traverses the
cgroup hierarchy for, and getting rid of it brings us one step closer
to a flat runqueue for the CPU controller.

Signed-off-by: Rik van Riel 
---
 include/linux/sched.h |   3 +-
 kernel/sched/core.c   |   2 -
 kernel/sched/debug.c  |   1 +
 kernel/sched/fair.c   | 125 +-
 kernel/sched/pelt.c   |  49 ++---
 kernel/sched/sched.h  |   6 --
 6 files changed, 55 insertions(+), 131 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 11837410690f..f5bb6948e40c 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -391,7 +391,6 @@ struct util_est {
 struct sched_avg {
u64 last_update_time;
u64 load_sum;
-   u64 runnable_load_sum;
u32 util_sum;
u32 period_contrib;
unsigned long   load_avg;
@@ -439,7 +438,6 @@ struct sched_statistics {
 struct sched_entity {
/* For load-balancing: */
struct load_weight  load;
-   unsigned long   runnable_weight;
struct rb_node  run_node;
struct list_headgroup_node;
unsigned inton_rq;
@@ -455,6 +453,7 @@ struct sched_entity {
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
int depth;
+   unsigned long   enqueued_h_load;
struct sched_entity *parent;
/* rq on which this entity is (to be) queued: */
struct cfs_rq   *cfs_rq;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 874c427742a9..fbd96900f715 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -744,7 +744,6 @@ static void set_load_weight(struct task_struct *p, bool 
update_load)
if (task_has_idle_policy(p)) {
load->weight = scale_load(WEIGHT_IDLEPRIO);
load->inv_weight = WMULT_IDLEPRIO;
-   p->se.runnable_weight = load->weight;
return;
}
 
@@ -757,7 +756,6 @@ static void set_load_weight(struct task_struct *p, bool 
update_load)
} else {
load->weight = scale_load(sched_prio_to_weight[prio]);
load->inv_weight = sched_prio_to_wmult[prio];
-   p->se.runnable_weight = load->weight;
}
 }
 
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index aab4640d66c5..d06e7436d148 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -965,6 +965,7 @@ void proc_sched_show_task(struct task_struct *p, struct 
pid_namespace *ns,
P(se.avg.load_avg);
P(se.avg.runnable_load_avg);
P(se.avg.util_avg);
+   P(se.enqueued_h_load);
P(se.avg.last_update_time);
P(se.avg.util_est.ewma);
P(se.avg.util_est.enqueued);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index df624f7a68e7..aebd43d74468 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -724,9 +724,7 @@ void init_entity_runnable_average(struct sched_entity *se)
 * nothing has been attached to the task group yet.
 */
if (entity_is_task(se))
-   sa->runnable_load_avg = sa->load_avg = 
scale_load_down(se->load.weight);
-
-   se->runnable_weight = se->load.weight;
+   sa->load_avg = scale_load_down(se->load.weight);
 
/* when this task enqueue'ed, it will contribute to its cfs_rq's 
load_avg */
 }
@@ -2767,20 +2765,39 @@ account_entity_dequeue(struct cfs_rq *cfs_rq, struct 
sched_entity *se)
 static inline void
 enqueue_runnable_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
-   cfs_rq->runnable_weight += se->runnable_weight;
+   if (entity_is_task(se)) {
+   struct cfs_rq *cpu_cfs_rq = &cfs_rq->rq->cfs;
+   se->enqueued_h_load = task_se_h_load(se);
 
-   cfs_rq->avg.runnable_load_avg += se->avg.runnable_load_avg;
-   cfs_rq->avg.runnable_load_sum += se_runnable(se) * 
se->avg.runnable_load_sum;
+   cpu_cfs_rq->avg.runnable_load_avg += se->enqueued_h_load;
+   }
 }
 
 static inline void
 dequeue_runnable_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
-   cfs_rq->runnable_weight -= se->runnable_weight;
+   if (entity_is_task(se)) {
+   struct cfs_rq *cpu_cfs_rq = &cfs_rq->rq->cfs;
+

[PATCH 2/8] sched: change /proc/sched_debug fields

2019-06-12 Thread Rik van Riel

Remove some fields from /proc/sched_debug that are removed from
sched_entity in a subsequent patch, and add h_load, which comes in
very handy to debug CPU controller weight distribution.

Signed-off-by: Rik van Riel 
---
 kernel/sched/debug.c | 11 ++-
 1 file changed, 2 insertions(+), 9 deletions(-)

diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 678bfb9bd87f..aab4640d66c5 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -419,11 +419,9 @@ static void print_cfs_group_stats(struct seq_file *m, int 
cpu, struct task_group
}
 
P(se->load.weight);
-   P(se->runnable_weight);
 #ifdef CONFIG_SMP
P(se->avg.load_avg);
P(se->avg.util_avg);
-   P(se->avg.runnable_load_avg);
 #endif
 
 #undef PN_SCHEDSTAT
@@ -541,7 +539,6 @@ void print_cfs_rq(struct seq_file *m, int cpu, struct 
cfs_rq *cfs_rq)
SEQ_printf(m, "  .%-30s: %d\n", "nr_running", cfs_rq->nr_running);
SEQ_printf(m, "  .%-30s: %ld\n", "load", cfs_rq->load.weight);
 #ifdef CONFIG_SMP
-   SEQ_printf(m, "  .%-30s: %ld\n", "runnable_weight", 
cfs_rq->runnable_weight);
SEQ_printf(m, "  .%-30s: %lu\n", "load_avg",
cfs_rq->avg.load_avg);
SEQ_printf(m, "  .%-30s: %lu\n", "runnable_load_avg",
@@ -550,17 +547,15 @@ void print_cfs_rq(struct seq_file *m, int cpu, struct 
cfs_rq *cfs_rq)
cfs_rq->avg.util_avg);
SEQ_printf(m, "  .%-30s: %u\n", "util_est_enqueued",
cfs_rq->avg.util_est.enqueued);
-   SEQ_printf(m, "  .%-30s: %ld\n", "removed.load_avg",
-   cfs_rq->removed.load_avg);
SEQ_printf(m, "  .%-30s: %ld\n", "removed.util_avg",
cfs_rq->removed.util_avg);
-   SEQ_printf(m, "  .%-30s: %ld\n", "removed.runnable_sum",
-   cfs_rq->removed.runnable_sum);
 #ifdef CONFIG_FAIR_GROUP_SCHED
SEQ_printf(m, "  .%-30s: %lu\n", "tg_load_avg_contrib",
cfs_rq->tg_load_avg_contrib);
SEQ_printf(m, "  .%-30s: %ld\n", "tg_load_avg",
atomic_long_read(&cfs_rq->tg->load_avg));
+   SEQ_printf(m, "  .%-30s: %lu\n", "h_load",
+   cfs_rq->h_load);
 #endif
 #endif
 #ifdef CONFIG_CFS_BANDWIDTH
@@ -964,10 +959,8 @@ void proc_sched_show_task(struct task_struct *p, struct 
pid_namespace *ns,
   "nr_involuntary_switches", (long long)p->nivcsw);
 
P(se.load.weight);
-   P(se.runnable_weight);
 #ifdef CONFIG_SMP
P(se.avg.load_sum);
-   P(se.avg.runnable_load_sum);
P(se.avg.util_sum);
P(se.avg.load_avg);
P(se.avg.runnable_load_avg);
-- 
2.20.1

[RFC] sched,cfs: flatten CPU controller runqueues

2019-06-12 Thread Rik van Riel

The current implementation of the CPU controller uses hierarchical
runqueues, where on wakeup a task is enqueued on its group's runqueue,
the group is enqueued on the runqueue of the group above it, etc.

This adds a fairly large amount of overhead for workloads that
do a lot of wakeups a second, especially given that the default systemd
hierarchy is 2 or 3 levels deep.

This patch series is an attempt at reducing that overhead, by placing
all the tasks on the same runqueue, and scaling the task priority by
the priority of the group, which is calculated periodically.
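
As a rough sketch of what "scaling the task priority by the priority of the
group" means, the code below walks a toy group hierarchy and multiplies out
each level's share. This is an assumption-laden illustration, not the
series' implementation: the series uses a periodically updated h_load, and
the toy_* names, the fixed-point scale, and the requirement that
parent_total is non-zero are all invented for the example.

```c
#include <stddef.h>
#include <stdint.h>

#define TOY_SCALE 1024u /* fixed-point representation of "100% of the CPU" */

struct toy_group {
	const struct toy_group *parent; /* NULL when attached to the root */
	uint64_t weight;       /* this group's weight on its parent's queue */
	uint64_t parent_total; /* total weight queued at the parent level (> 0) */
};

/* Fraction of the CPU a group is entitled to, scaled by TOY_SCALE. */
static uint64_t toy_group_share(const struct toy_group *g)
{
	uint64_t share = TOY_SCALE;

	for (; g; g = g->parent)
		share = share * g->weight / g->parent_total;

	return share;
}

/* Weight the task would carry on a single flat runqueue. */
static uint64_t toy_flat_weight(uint64_t task_weight,
				const struct toy_group *g)
{
	return task_weight * toy_group_share(g) / TOY_SCALE;
}
```

A task of weight 1024 in a group holding a quarter of its parent, where the
parent holds half of the CPU, would carry a flat weight of 1024 / 8 = 128.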

This patch series still has a number of TODO items:
- Clean up the code, and fix compilation without CONFIG_FAIR_GROUP_SCHED.
- Remove some more now unused code.
- Figure out a regression with schbench, where the p99 latency goes up
  before the system is fully overloaded. I suspect wakeup_preempt_entity()
  and wakeup_gran() because they now use the task_h_load instead of the
  unscaled load to figure out whether a task should be preempted.
- Reimplement CONFIG_CFS_BANDWIDTH.

Plan for the CONFIG_CFS_BANDWIDTH reimplementation:
- When a cgroup gets throttled, mark the cgroup and its children
  as throttled.
- When pick_next_entity finds a task that is on a throttled cgroup,
  stash it on the cgroup runqueue (which is not used for runnable
  tasks any more). Leave the vruntime unchanged, and adjust that
  runqueue's vruntime to be that of the left-most task.
- When a cgroup gets unthrottled, and has tasks on it, place it on
  a vruntime ordered heap separate from the main runqueue.
- Have pick_next_task_fair grab one task off that heap every time it
  is called, and the min vruntime of that heap is lower than the
  vruntime of the CPU's cfs_rq (or the CPU has no other runnable tasks).
- Place that selected task on the CPU's cfs_rq, renormalizing its
  vruntime with the GENTLE_FAIR_SLEEPERS logic. That should help
  interleave the already runnable tasks with the recently unthrottled
  group, and prevent thundering herd issues.
- If the group gets throttled again before all of its tasks have had a chance
  to run, vruntime sorting ensures all the tasks in the throttled cgroup
  get a chance to run over time.
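
The "grab one task off that heap" condition in the plan above could look
roughly like the predicate below. None of this exists in the series yet; the
function and parameter names are hypothetical, and the signed-difference
comparison is borrowed from how CFS compares vruntimes so that counter
wraparound does not break the ordering.

```c
#include <stdbool.h>
#include <stdint.h>

/* Wraparound-tolerant "a is before b" for 64-bit vruntime counters. */
static bool toy_vruntime_before(uint64_t a, uint64_t b)
{
	return (int64_t)(a - b) < 0;
}

/*
 * Should the next pick take one task from the heap of recently
 * unthrottled tasks instead of from the CPU's own cfs_rq?
 */
static bool toy_should_pull_unthrottled(bool heap_empty,
					uint64_t heap_min_vruntime,
					uint64_t rq_vruntime,
					unsigned int rq_nr_running)
{
	if (heap_empty)
		return false;

	/* Nothing else to run: always take from the heap. */
	if (rq_nr_running == 0)
		return true;

	return toy_vruntime_before(heap_min_vruntime, rq_vruntime);
}
```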

This patch applies on top of what was Linus's current tree when I last
rebased it:
2c1212de6f97 ("Merge tag 'spdx-5.2-rc2' of 
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core")

 include/linux/sched.h |5 
 kernel/sched/core.c   |2 
 kernel/sched/debug.c  |   12 
 kernel/sched/fair.c   |  744 +-
 kernel/sched/pelt.c   |   55 +--
 kernel/sched/pelt.h   |2 
 kernel/sched/sched.h  |9 
 7 files changed, 346 insertions(+), 483 deletions(-)

[PATCH net-next v2 0/1] stmmac: honor the GPIO flags for the PHY reset GPIO

2019-06-12 Thread Martin Blumenstingl

Recent Amlogic SoCs (G12A which includes S905X2 and S905D2 as well as
G12B which includes S922X) use GPIOZ_14 or GPIOZ_15 for the PHY reset
line. These GPIOs are special because they are marked as "3.3V input
tolerant open drain (OD) pins" which means they can only drive the pin
output LOW (to reset the PHY) or to switch to input mode (to take the
PHY out of reset).
The GPIO subsystem already supports this with the GPIO_OPEN_DRAIN and
GPIO_OPEN_SOURCE flags in the devicetree bindings.
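
For readers less familiar with open-drain lines, the toy model below shows
how a logical value combines with the active-low and open-drain flags. It is
a simplified sketch of the behaviour described above, not gpiolib code; the
TOY_* constants and toy_set_value() are made up for the example.

```c
#include <stdbool.h>

#define TOY_ACTIVE_LOW (1u << 0)
#define TOY_OPEN_DRAIN (1u << 1)

enum toy_pin_action {
	TOY_DRIVE_LOW,
	TOY_DRIVE_HIGH,
	TOY_SWITCH_TO_INPUT, /* high impedance; a pull-up sets the level */
};

/* Translate a logical value into what happens on the pin. */
static enum toy_pin_action toy_set_value(unsigned int flags, bool logical)
{
	/* Active-low lines invert the logical value to get the wire level. */
	bool physical_high = (flags & TOY_ACTIVE_LOW) ? !logical : logical;

	if (!physical_high)
		return TOY_DRIVE_LOW;

	/* An open-drain pin cannot drive high; it releases the line instead. */
	if (flags & TOY_OPEN_DRAIN)
		return TOY_SWITCH_TO_INPUT;

	return TOY_DRIVE_HIGH;
}
```

With both flags set, asserting the reset (logical 1) drives the pin low and
deasserting it (logical 0) switches the pin to input, which is the only pair
of states these GPIOZ pins support.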

The goal of this series is to add support for these special GPIOs in
stmmac (even though the "snps,reset-gpio" binding is deprecated).

My test-cases were:
- X96 Max: snps,reset-gpio = < GPIOZ_15 0> with and without
   snps,reset-active-low before these patches. The PHY was
   not detected.
- X96 Max: snps,reset-gpio = < GPIOZ_15
  (GPIO_ACTIVE_LOW | GPIO_OPEN_DRAIN)>.
   The PHY is now detected correctly
- Meson8b EC100: snps,reset-gpio = < GPIOH_4 0> with
 snps,reset-active-low. Before and after these
 patches the PHY is detected correctly.
- Meson8b EC100: snps,reset-gpio = < GPIOH_4 0> without
 snps,reset-active-low. Before and after these
 patches the PHY is not detected (this is expected
 because we need to set the output LOW to take the
 PHY out of reset).
- Meson8b EC100: snps,reset-gpio = < GPIOH_4 GPIO_ACTIVE_LOW>
 but without snps,reset-active-low. Before these
 patches the PHY was not detected. With these patches
 the PHY is now detected correctly.


Changes since RFC v1 at [0]:
- dropped all patches except the main patch which changes
  stmmac_mdio_reset to use GPIO descriptors (I will send the cleanup
  patches in a separate series once this patch is merged)
- drop the active_low field from struct stmmac_mdio_bus_data
- added Linus Walleij's Reviewed-by (thank you!)


DEPENDENCIES:
This has a runtime dependency on the preparation patch [0] from
Linus W.'s GPIO tree. Without that dependency the
snps,reset-active-low property (which quite a few .dts files use)
will be ignored.
Linus created an immutable branch which can be pulled into net-next:
git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-gpio.git
ib-snps-reset-gpio
gitweb for this immutable branch: [2]


[0] https://patchwork.kernel.org/cover/10983801/
[1] https://patchwork.ozlabs.org/cover/1113217/
[2] 
https://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-gpio.git/log/?h=ib-snps-reset-gpio


Martin Blumenstingl (1):
  net: stmmac: use GPIO descriptors in stmmac_mdio_reset

 .../net/ethernet/stmicro/stmmac/stmmac_mdio.c | 27 +--
 include/linux/stmmac.h|  2 +-
 2 files changed, 14 insertions(+), 15 deletions(-)

-- 
2.22.0

[PATCH net-next v2 1/1] net: stmmac: use GPIO descriptors in stmmac_mdio_reset

2019-06-12 Thread Martin Blumenstingl

Switch stmmac_mdio_reset to use GPIO descriptors. GPIO core handles the
"snps,reset-gpio" for GPIO descriptors so we don't need to take care of
it inside the driver anymore.

The advantage of this is that we now preserve the GPIO flags which are
passed via devicetree. This is required on some newer Amlogic boards
which use an Open Drain pin for the reset GPIO. This pin can only output
a LOW signal or switch to input mode but it cannot output a HIGH signal.
There are already devicetree bindings for these special cases and GPIO
core already takes care of them but only if we use GPIO descriptors
instead of GPIO numbers.

Signed-off-by: Martin Blumenstingl 
Reviewed-by: Linus Walleij 
---
 .../net/ethernet/stmicro/stmmac/stmmac_mdio.c | 27 +--
 include/linux/stmmac.h|  2 +-
 2 files changed, 14 insertions(+), 15 deletions(-)

diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_mdio.c 
b/drivers/net/ethernet/stmicro/stmmac/stmmac_mdio.c
index 093a223fe408..f1c39dd048e7 100644
--- a/drivers/net/ethernet/stmicro/stmmac/stmmac_mdio.c
+++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_mdio.c
@@ -20,11 +20,11 @@
   Maintainer: Giuseppe Cavallaro 
 
***/
 
+#include 
 #include 
 #include 
 #include 
 #include 
-#include 
 #include 
 #include 
 #include 
@@ -251,37 +251,36 @@ int stmmac_mdio_reset(struct mii_bus *bus)
 
 #ifdef CONFIG_OF
if (priv->device->of_node) {
+   struct gpio_desc *reset_gpio;
+
if (data->reset_gpio < 0) {
struct device_node *np = priv->device->of_node;
 
if (!np)
return 0;
 
-   data->reset_gpio = of_get_named_gpio(np,
-   "snps,reset-gpio", 0);
-   if (data->reset_gpio < 0)
-   return 0;
+   reset_gpio = devm_gpiod_get_optional(priv->device,
+"snps,reset",
+GPIOD_OUT_LOW);
+   if (IS_ERR(reset_gpio))
+   return PTR_ERR(reset_gpio);
 
-   data->active_low = of_property_read_bool(np,
-   "snps,reset-active-low");
of_property_read_u32_array(np,
"snps,reset-delays-us", data->delays, 3);
+   } else {
+   reset_gpio = gpio_to_desc(data->reset_gpio);
 
-   if (devm_gpio_request(priv->device, data->reset_gpio,
- "mdio-reset"))
-   return 0;
+   gpiod_direction_output(reset_gpio, 0);
}
 
-   gpio_direction_output(data->reset_gpio,
- data->active_low ? 1 : 0);
if (data->delays[0])
msleep(DIV_ROUND_UP(data->delays[0], 1000));
 
-   gpio_set_value(data->reset_gpio, data->active_low ? 0 : 1);
+   gpiod_set_value_cansleep(reset_gpio, 1);
if (data->delays[1])
msleep(DIV_ROUND_UP(data->delays[1], 1000));
 
-   gpio_set_value(data->reset_gpio, data->active_low ? 1 : 0);
+   gpiod_set_value_cansleep(reset_gpio, 0);
if (data->delays[2])
msleep(DIV_ROUND_UP(data->delays[2], 1000));
}
diff --git a/include/linux/stmmac.h b/include/linux/stmmac.h
index 4335bd771ce5..816edb545592 100644
--- a/include/linux/stmmac.h
+++ b/include/linux/stmmac.h
@@ -97,7 +97,7 @@ struct stmmac_mdio_bus_data {
int *irqs;
int probed_phy_irq;
 #ifdef CONFIG_OF
-   int reset_gpio, active_low;
+   int reset_gpio;
u32 delays[3];
 #endif
 };
-- 
2.22.0
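
The reset sequence in the patch above boils down to three logical values
(0, 1, 0), each followed by an optional sleep whose length comes from
"snps,reset-delays-us" rounded up to whole milliseconds. The sketch below
reproduces only that bookkeeping; the toy_* names are invented and nothing
here touches real GPIOs. A sleep_ms of 0 corresponds to the msleep() call
being skipped.

```c
#include <stdint.h>

/* Same rounding as the kernel's DIV_ROUND_UP() macro. */
#define TOY_DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

struct toy_reset_step {
	int logical_value; /* value written to the reset line */
	uint32_t sleep_ms; /* sleep after writing it; 0 means no sleep */
};

static uint32_t toy_delay_ms(uint32_t delay_us)
{
	return TOY_DIV_ROUND_UP(delay_us, 1000u);
}

/* Build the deassert / assert / deassert plan from the three delays. */
static void toy_reset_plan(const uint32_t delays_us[3],
			   struct toy_reset_step plan[3])
{
	static const int values[3] = { 0, 1, 0 };
	int i;

	for (i = 0; i < 3; i++) {
		plan[i].logical_value = values[i];
		plan[i].sleep_ms = toy_delay_ms(delays_us[i]);
	}
}
```

For delays of { 0, 10000, 1500 } microseconds the plan sleeps for 0, 10 and
2 milliseconds respectively.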

Re: [RFC PATCH v1 2/3] LSM/x86/sgx: Implement SGX specific hooks in SELinux

2019-06-12 Thread Andy Lutomirski

On Tue, Jun 11, 2019 at 3:02 PM Sean Christopherson
 wrote:
>
> On Tue, Jun 11, 2019 at 09:40:25AM -0400, Stephen Smalley wrote:
> > I haven't looked at this code closely, but it feels like a lot of
> > SGX-specific logic embedded into SELinux that will have to be repeated or
> > reused for every security module.  Does SGX not track this state itself?
>
> SGX does track equivalent state.
>
> There are three proposals on the table (I think):

Sounds about right.  I've been playing with #1 and #2 (as text, not
code), and I'll post my latest thoughts on it below.  But first, I
should mention that I think we've gotten a bit too caught up on
SELinux-y terminology like "EXECMOD" and "EXECMEM", which is relevant
since the kernel has very little visibility into what the enclave is
doing.  Instead, I think we should think about the relevant
permissions more like this:

a) "execute code from a particular source, e.g. a file"
b) "execute code supplied from arbitrary memory outside the enclave"
c) "execute code generated within the enclave"
d) "possess WX enclave memory"

I think that any sensible policy that allows (b) should allow (a).
Similarly, any policy that allows (d) should allow (c).   I don't see
any particular need for the kernel to go out of its way to ensure
these relationships, though.

We could plausibly also distinguish "execute measured code", although
I think that the details of defining and implementing this, especially
with SGX2, could be nastier than we want to deal with.  A minimal
approach that mostly ignores SGX2 would be to have another permission
"execute code supplied from outside the enclave that was not
measured".  This permission would be required on top of (a) or (b),
depending on where that code comes from.

If we want to map these to existing SELinux terms, we could use
EXECUTE for (a), EXECMOD for (c), and EXECMEM for (d). (b) seems to
also map to EXECMOD or EXECMEM depending on exactly how it happens,
and I'm not sure this makes all that much sense.

>
>   1. Require userspace to explicitly specify (maximal) enclave page
>  permissions at build time.  The enclave page permissions are provided
>  to, and checked by, LSMs at enclave build time.
>
>  Pros: Low-complexity kernel implementation, straightforward auditing
>  Cons: Sullies the SGX UAPI to some extent, may increase complexity of
>SGX2 enclave loaders.

In my notes, this works like this.  This is similar, but not
identical, to what Sean has been sending out.

EADD takes flags: ALLOW_READ, ALLOW_WRITE, ALLOW_EXEC.  It calls a new hook:

  int security_enclave_load(struct vm_area_struct *source, unsigned int flags);

(Sean passed in the secinfo protection too, but I think we agreed
that this could be omitted.)  This hook will fail if ALLOW_EXEC is
requested and the LSM doesn't consider the source VMA to be
executable.  Privileges (a) and (b) are implemented here.

Optionally, we can enforce noexec here.

The future EAUG ioctl takes the same flags, but it doesn't call
security_enclave_load().  (As Cedric noted, the actual user API for EAUG
is not settled, but I don't think it makes much difference here.)

EINIT takes a sigstruct pointer.  SGX calls a new hook:

  unsigned int security_enclave_init(struct sigstruct *sigstruct,
struct vm_area_struct *source, unsigned int flags);

This hook can return -EPERM.  Otherwise it returns 0 or a combination of
flags DENY_WX and DENY_X_IF_ALLOW_WRITE.  The driver saves this value.
These represent permissions (c) and (d).

If we want to have a permission for "execute code supplied from
outside the enclave that was not measured", we could have a flag like
HAS_UNMEASURED_ALLOW_EXEC_PAGE that the LSM could consider.

mmap() and mprotect() enforce the following rules:

 - Deny if a PROT_ flag is requested but the corresponding ALLOW_ flag
   is not set for all pages in question.

 - Deny if PROT_WRITE, PROT_EXEC, and DENY_WX are all set.

 - Deny if PROT_EXEC, ALLOW_WRITE, and DENY_X_IF_ALLOW_WRITE are all set.

mprotect() and mmap() do *not* call SGX-specific LSM hooks to ask for
permission, although they can optionally call an LSM hook if they hit one of
the -EPERM cases for auditing purposes.
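
To make the three mmap()/mprotect() rules above easier to check, here they
are as a single per-page predicate. This is only a sketch of the rules as
written in this mail: the TOY_* flag values are arbitrary (chosen so that
each PROT_ bit lines up with its ALLOW_ bit), and a real implementation
would apply the check to every page in the requested range.

```c
#include <stdbool.h>

/* Hypothetical bit values; PROT_x and ALLOW_x deliberately coincide. */
#define TOY_PROT_READ   0x1u
#define TOY_PROT_WRITE  0x2u
#define TOY_PROT_EXEC   0x4u

#define TOY_ALLOW_READ  0x1u
#define TOY_ALLOW_WRITE 0x2u
#define TOY_ALLOW_EXEC  0x4u

/* Per-enclave flags as returned by the EINIT hook in this proposal. */
#define TOY_DENY_WX               0x1u
#define TOY_DENY_X_IF_ALLOW_WRITE 0x2u

static bool toy_enclave_may_map(unsigned int prot, unsigned int page_allow,
				unsigned int encl_deny)
{
	/* Rule 1: every requested PROT_ bit needs its ALLOW_ bit. */
	if (prot & ~page_allow)
		return false;

	/* Rule 2: no simultaneous write+exec mapping under DENY_WX. */
	if ((prot & TOY_PROT_WRITE) && (prot & TOY_PROT_EXEC) &&
	    (encl_deny & TOY_DENY_WX))
		return false;

	/* Rule 3: no exec on a page that may ever be written. */
	if ((prot & TOY_PROT_EXEC) && (page_allow & TOY_ALLOW_WRITE) &&
	    (encl_deny & TOY_DENY_X_IF_ALLOW_WRITE))
		return false;

	return true;
}
```

Note that rule 3 looks at the page's ALLOW_WRITE bit rather than at the
requested protection, so a read+exec mapping can be denied even though it
never asks for write access.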

I think this model works quite well in an SGX1 world.  The main thing
that makes me uneasy about this model is that, in SGX2, it requires
that an SGX2-compatible enclave loader must pre-declare to the kernel
whether it intends for its dynamically allocated memory to be
ALLOW_EXEC.  If ALLOW_EXEC is set but not actually needed, it will
still fail if DENY_X_IF_ALLOW_WRITE ends up being set.  The other
version below does not have this limitation.

>
>   2. Pre-check LSM permissions and dynamically track mappings to enclave
>  pages, e.g. add an SGX mprotect() hook to restrict W->X and WX
>  based on the pre-checked permissions.
>
>  Pros: Does not impact SGX UAPI, medium kernel complexity
>  Cons: Auditing is complex/weird, requires taking enclave-specific
>

Re: [RESEND PATCH v1 1/5] of/platform: Speed up of_find_device_by_node()

2019-06-12 Thread Saravana Kannan

On Wed, Jun 12, 2019 at 11:19 AM Rob Herring  wrote:
>
> On Wed, Jun 12, 2019 at 11:08 AM Greg Kroah-Hartman
>  wrote:
> >
> > On Wed, Jun 12, 2019 at 10:53:09AM -0600, Rob Herring wrote:
> > > On Wed, Jun 12, 2019 at 8:22 AM Greg Kroah-Hartman
> > >  wrote:
> > > >
> > > > On Wed, Jun 12, 2019 at 07:53:39AM -0600, Rob Herring wrote:
> > > > > On Tue, Jun 11, 2019 at 3:52 PM Sandeep Patil  
> > > > > wrote:
> > > > > >
> > > > > > On Tue, Jun 11, 2019 at 01:56:25PM -0700, 'Saravana Kannan' via 
> > > > > > kernel-team wrote:
> > > > > > > On Tue, Jun 11, 2019 at 8:18 AM Frank Rowand 
> > > > > > >  wrote:
> > > > > > > >
> > > > > > > > Hi Saravana,
> > > > > > > >
> > > > > > > > On 6/10/19 10:36 AM, Rob Herring wrote:
> > > > > > > > > Why are you resending this rather than replying to Frank's 
> > > > > > > > > last
> > > > > > > > > comments on the original?
> > > > > > > >
> > > > > > > > Adding on a different aspect...  The independent replies from 
> > > > > > > > three different
> > > > > > > > maintainers (Rob, Mark, myself) pointed out architectural 
> > > > > > > > issues with the
> > > > > > > > patch series.  There were also some implementation issues 
> > > > > > > > brought out.
> > > > > > > > (Although I refrained from bringing up most of my 
> > > > > > > > implementation issues
> > > > > > > > as they are not relevant until architecture issues are 
> > > > > > > > resolved.)
> > > > > > >
> > > > > > > Right, I'm not too worried about the implementation issues before 
> > > > > > > we
> > > > > > > settle on the architectural issues. Those are easy to fix.
> > > > > > >
> > > > > > > Honestly, the main points that the maintainers raised are:
> > > > > > > 1) This is a configuration property and not describing the device.
> > > > > > > Just use the implicit dependencies coming from existing bindings.
> > > > > > >
> > > > > > > I gave a bunch of reasons for why I think it isn't an OS 
> > > > > > > configuration
> > > > > > > property. But even if that's not something the maintainers can 
> > > > > > > agree
> > > > > > > to, I gave a concrete example (cyclic dependencies between clock
> > > > > > > provider hardware) where the implicit dependencies would prevent 
> > > > > > > one
> > > > > > > of the devices from probing till the end of time. So even if the
> > > > > > > maintainers don't agree we should always look at "depends-on" to
> > > > > > > decide the dependencies, we still need some means to override the
> > > > > > > implicit dependencies where they don't match the real dependency. 
> > > > > > > Can
> > > > > > > we use depends-on as an override when the implicit dependencies 
> > > > > > > aren't
> > > > > > > correct?
> > > > > > >
> > > > > > > 2) This doesn't need to be solved because this is just optimizing
> > > > > > > probing or saving power ("we should get rid of this auto 
> > > > > > > disabling"):
> > > > > > >
> > > > > > > I explained why this patch series is not just about optimizing 
> > > > > > > probe
> > > > > > > ordering or saving power. And why we can't ignore auto disabling
> > > > > > > (because it's more than just auto disabling). The kernel is 
> > > > > > > currently
> > > > > > > broken when trying to use modules in ARM SoCs (probably in other
> > > > > > > systems/archs too, but I can't speak for those).
> > > > > > >
> > > > > > > 3) Concerns about backwards compatibility
> > > > > > >
> > > > > > > I pointed out why the current scheme (depends-on being the only 
> > > > > > > source
> > > > > > > of dependency) doesn't break compatibility. And if we go with
> > > > > > > "depends-on" as an override what we could do to keep backwards
> > > > > > > compatibility. Happy to hear more thoughts or discuss options.
> > > > > > >
> > > > > > > 4) How the "sync_state" would work for a device that supplies 
> > > > > > > multiple
> > > > > > > functionalities but a limited driver.
> > > > > >
> > > > > > 
> > > > > > To be clear, all of the above are _real_ problems that stop us from
> > > > > > efficiently loading device drivers as modules for Android.
> > > > > >
> > > > > > So, if 'depends-on' doesn't seem like the right approach and "going 
> > > > > > back to
> > > > > > the drawing board" is the ask, could you please point us in the 
> > > > > > right
> > > > > > direction?
> > > > >
> > > > > Use the dependencies which are already there in DT. That's clocks,
> > > > > pinctrl, regulators, interrupts, gpio at a minimum. I'm simply not
> > > > > going to accept duplicating all those dependencies in DT. The downside
> > > > > for the kernel is you have to address these one by one and can't have
> > > > > a generic property the driver core code can parse. After that's in
> > > > > place, then maybe we can consider handling any additional dependencies
> > > > > not already captured in DT. Once all that is in place, we can probably
> > > > > sort device and/or driver lists to optimize the probe order (maybe the
> > > > > driver core already does

Re: [RFC PATCH v2 2/5] x86/sgx: Require userspace to define enclave pages' protection bits

2019-06-12 Thread Jarkko Sakkinen

On Mon, Jun 10, 2019 at 11:17:44AM -0700, Sean Christopherson wrote:
> On Mon, Jun 10, 2019 at 08:45:06PM +0300, Jarkko Sakkinen wrote:
> > On Mon, Jun 10, 2019 at 09:15:33AM -0700, Sean Christopherson wrote:
> > > > 'flags' should be renamed as 'secinfo_flags_mask' even if the name is
> > > > longish. It would use the same values as the SECINFO flags. The field in
> > > > struct sgx_encl_page should have the same name. That would express
> > > > exactly relation between SECINFO and the new field. I would have never
> > > > asked on last iteration why SECINFO is not enough with a better naming.
> > > 
> > > No, these flags do not impact the EPCM protections in any way.  Userspace
> > > can extend the EPCM protections without going through the kernel.  The
> > > protection flags for an enclave page impact VMA/PTE protection bits.
> > > 
> > > IMO, it is best to treat the EPCM as being completely separate from the
> > > kernel's EPC management.
> > 
> > It is a clumsy API if permissions are not taken in the same format for
> > everything. There is no reason not to do it. The way mprotect() callback
> > just interprets the field is as VMA permissions.
> 
> They are two entirely different things.  The explicit protection bits are
> consumed by the kernel, while SECINFO.flags is consumed by the CPU.  The
> intent is to have the protection flags be analogous to mprotect(), the
> fact that they have a similar/identical format to SECINFO is irrelevant.
> 
> Calling the field secinfo_flags_mask is straight up wrong on SGX2, as 
> userspace can use EMODPE to set SECINFO after the page is added.  It's
> also wrong on SGX1 when adding TCS pages since SECINFO.RWX bits for TCS
> pages are forced to zero by hardware.

The new variable expresses the limits within which the kernel will
co-operate with the enclave. It is way more descriptive than 'flags'.

> > It would also be more future-proof just to have a mask covering all bits
> > of the SECINFO flags field.
> 
> This simply doesn't work, e.g. the PENDING, MODIFIED and PR flags in the
> SECINFO are read-only from a software perspective.

It is easy to validate reserved bits from a SECINFO struct.

/Jarkko

[PATCHv4 05/28] timens: Introduce CLOCK_BOOTTIME offset

2019-06-12 Thread Dmitry Safonov

From: Andrei Vagin 

Add boottime virtualisation for the time namespace.
Introduce timespec for boottime clock into timens offsets and wire
clock_gettime() syscall.

Signed-off-by: Andrei Vagin 
Co-developed-by: Dmitry Safonov 
Signed-off-by: Dmitry Safonov 
---
 include/linux/time_namespace.h | 9 +
 include/linux/timens_offsets.h | 1 +
 kernel/time/alarmtimer.c   | 1 +
 kernel/time/posix-stubs.c  | 1 +
 kernel/time/posix-timers.c | 1 +
 5 files changed, 13 insertions(+)

diff --git a/include/linux/time_namespace.h b/include/linux/time_namespace.h
index 81d0c989df3c..1dda8af6b9fe 100644
--- a/include/linux/time_namespace.h
+++ b/include/linux/time_namespace.h
@@ -48,6 +48,14 @@ static inline void timens_add_monotonic(struct timespec64 
*ts)
 *ts = timespec64_add(*ts, ns_offsets->monotonic);
 }
 
+static inline void timens_add_boottime(struct timespec64 *ts)
+{
+struct timens_offsets *ns_offsets = current->nsproxy->time_ns->offsets;
+
+if (ns_offsets)
+*ts = timespec64_add(*ts, ns_offsets->boottime);
+}
+
 #else
 static inline struct time_namespace *get_time_ns(struct time_namespace *ns)
 {
@@ -73,6 +81,7 @@ static inline int timens_on_fork(struct nsproxy *nsproxy, 
struct task_struct *ts
 }
 
 static inline void timens_add_monotonic(struct timespec64 *ts) {}
+static inline void timens_add_boottime(struct timespec64 *ts) {}
 #endif
 
 #endif /* _LINUX_TIMENS_H */
diff --git a/include/linux/timens_offsets.h b/include/linux/timens_offsets.h
index eaac2c82be5c..e93aabaa5e45 100644
--- a/include/linux/timens_offsets.h
+++ b/include/linux/timens_offsets.h
@@ -4,6 +4,7 @@
 
 struct timens_offsets {
struct timespec64 monotonic;
+   struct timespec64 boottime;
 };
 
 #endif
diff --git a/kernel/time/alarmtimer.c b/kernel/time/alarmtimer.c
index 68a163c8b4f2..6346e6ee0d32 100644
--- a/kernel/time/alarmtimer.c
+++ b/kernel/time/alarmtimer.c
@@ -26,6 +26,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "posix-timers.h"
 
diff --git a/kernel/time/posix-stubs.c b/kernel/time/posix-stubs.c
index 17c67e0aecd8..edaf075d1ee4 100644
--- a/kernel/time/posix-stubs.c
+++ b/kernel/time/posix-stubs.c
@@ -82,6 +82,7 @@ int do_clock_gettime(clockid_t which_clock, struct timespec64 
*tp)
break;
case CLOCK_BOOTTIME:
ktime_get_boottime_ts64(tp);
+   timens_add_boottime(tp);
break;
default:
return -EINVAL;
diff --git a/kernel/time/posix-timers.c b/kernel/time/posix-timers.c
index 52098f6ad596..573942ae2629 100644
--- a/kernel/time/posix-timers.c
+++ b/kernel/time/posix-timers.c
@@ -240,6 +240,7 @@ static int posix_get_coarse_res(const clockid_t 
which_clock, struct timespec64 *
 int posix_get_boottime_timespec(const clockid_t which_clock, struct timespec64 
*tp)
 {
ktime_get_boottime_ts64(tp);
+   timens_add_boottime(tp);
return 0;
 }
 
-- 
2.22.0
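
What timens_add_boottime() does to a clock reading can be sketched in
userspace as below. The toy_* types and functions are stand-ins written for
this example (the kernel uses struct timespec64 and timespec64_add()); both
inputs are assumed to be normalized, i.e. tv_nsec lies in [0, 1e9).

```c
#include <stddef.h>
#include <stdint.h>

#define TOY_NSEC_PER_SEC 1000000000L

struct toy_timespec64 {
	int64_t tv_sec;
	long tv_nsec;
};

/* Add two normalized timespecs and carry nanoseconds into seconds. */
static struct toy_timespec64 toy_timespec_add(struct toy_timespec64 a,
					      struct toy_timespec64 b)
{
	struct toy_timespec64 sum = {
		.tv_sec = a.tv_sec + b.tv_sec,
		.tv_nsec = a.tv_nsec + b.tv_nsec,
	};

	if (sum.tv_nsec >= TOY_NSEC_PER_SEC) {
		sum.tv_sec++;
		sum.tv_nsec -= TOY_NSEC_PER_SEC;
	}

	return sum;
}

/* Apply the namespace's boottime offset, if the namespace has one. */
static void toy_timens_add_boottime(struct toy_timespec64 *ts,
				    const struct toy_timespec64 *offset)
{
	if (offset)
		*ts = toy_timespec_add(*ts, *offset);
}
```

A NULL offset models a task with no timens offsets, in which case the
reading is returned unchanged.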

[PATCHv4 15/28] x86/vdso: Add offsets page in vvar

2019-06-12 Thread Dmitry Safonov

From: Andrei Vagin 

As modern applications fetch time from VDSO without entering the kernel,
the offsets must be made available to userspace code inside a time namespace.

A page for timens offsets is allocated on time namespace construction.
Put that page into VVAR for tasks inside timens and zero page for
host processes.

As VDSO code is already optimized as much as possible in terms of speed,
any new if-condition in VDSO code is undesirable; the goal is to provide
two .so(s), as was originally suggested by Andy and Thomas:
- for host tasks with optimized-out clk_to_ns() without any penalty
- for processes inside timens with clk_to_ns()
For this purpose, define clk_to_ns() under CONFIG_TIME_NS.

To eliminate any performance regression, clk_to_ns() will be called
under a static_branch in follow-up patches that add support for
patching the vdso.

VDSO mappings are platform-specific, so add a Kconfig dependency for the arch.
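
For illustration only, the offset arithmetic that clk_to_ns() performs can be
modelled in userspace roughly as below. This is a sketch, not the vdso code:
`struct ts` and `ts_add_offset()` are made-up names, and it assumes the offset's
tv_nsec lies strictly between -NSEC_PER_SEC and NSEC_PER_SEC so that a single
carry or borrow is enough.

```c
#include <assert.h>
#include <stdint.h>

#define NSEC_PER_SEC 1000000000L

/* Hypothetical userspace stand-in for a kernel timespec. */
struct ts {
	int64_t tv_sec;
	long    tv_nsec;
};

/*
 * Add a per-namespace offset to a clock reading and renormalise
 * tv_nsec into [0, NSEC_PER_SEC).
 */
static void ts_add_offset(struct ts *t, const struct ts *off)
{
	/* Assumption: the offset's nanosecond part is within one second. */
	assert(off->tv_nsec > -NSEC_PER_SEC && off->tv_nsec < NSEC_PER_SEC);

	t->tv_sec  += off->tv_sec;
	t->tv_nsec += off->tv_nsec;

	if (t->tv_nsec >= NSEC_PER_SEC) {	/* carry into seconds */
		t->tv_nsec -= NSEC_PER_SEC;
		t->tv_sec++;
	}
	if (t->tv_nsec < 0) {			/* borrow from seconds */
		t->tv_nsec += NSEC_PER_SEC;
		t->tv_sec--;
	}
}
```

A host task would never reach this code path; only the timens copy of the
vdso is meant to pay for the addition.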

Signed-off-by: Andrei Vagin 
Co-developed-by: Dmitry Safonov 
Signed-off-by: Dmitry Safonov 
---
 arch/Kconfig  |  5 
 arch/x86/Kconfig  |  1 +
 arch/x86/entry/vdso/vclock_gettime.c  | 43 +++
 arch/x86/entry/vdso/vdso-layout.lds.S |  9 +-
 arch/x86/entry/vdso/vdso2c.c  |  3 ++
 arch/x86/entry/vdso/vma.c | 12 
 arch/x86/include/asm/vdso.h   |  1 +
 init/Kconfig  |  1 +
 8 files changed, 74 insertions(+), 1 deletion(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index c47b328eada0..503a4113dc6c 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -734,6 +734,11 @@ config HAVE_ARCH_NVRAM_OPS
 config ISA_BUS_API
def_bool ISA
 
+config ARCH_HAS_VDSO_TIME_NS
+   bool
+   help
+VDSO can add time-ns offsets without entering kernel.
+
 #
 # ABI hall of shame
 #
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 2bbbd4d1ba31..da70b320eb09 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -79,6 +79,7 @@ config X86
select ARCH_HAS_STRICT_MODULE_RWX
select ARCH_HAS_SYNC_CORE_BEFORE_USERMODE
select ARCH_HAS_UBSAN_SANITIZE_ALL
+   select ARCH_HAS_VDSO_TIME_NS
select ARCH_HAS_ZONE_DEVICE if X86_64
select ARCH_HAVE_NMI_SAFE_CMPXCHG
select ARCH_MIGHT_HAVE_ACPI_PDC if ACPI
diff --git a/arch/x86/entry/vdso/vclock_gettime.c 
b/arch/x86/entry/vdso/vclock_gettime.c
index 0f82a70c7682..e2d93628c0dd 100644
--- a/arch/x86/entry/vdso/vclock_gettime.c
+++ b/arch/x86/entry/vdso/vclock_gettime.c
@@ -21,6 +21,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #define gtod (&VVAR(vsyscall_gtod_data))
 
@@ -38,6 +39,11 @@ extern u8 hvclock_page[PAGE_SIZE]
__attribute__((visibility("hidden")));
 #endif
 
+#ifdef CONFIG_TIME_NS
+extern u8 timens_page
+   __attribute__((visibility("hidden")));
+#endif
+
 #ifndef BUILD_VDSO32
 
 notrace static long vdso_fallback_gettime(long clock, struct timespec *ts)
@@ -139,6 +145,39 @@ notrace static inline u64 vgetcyc(int mode)
return U64_MAX;
 }
 
+#ifdef CONFIG_TIME_NS
+notrace static __always_inline void clk_to_ns(clockid_t clk, struct timespec 
*ts)
+{
+   struct timens_offsets *timens = (struct timens_offsets *) &timens_page;
+   struct timespec64 *offset64;
+
+   switch (clk) {
+   case CLOCK_MONOTONIC:
+   case CLOCK_MONOTONIC_COARSE:
+   case CLOCK_MONOTONIC_RAW:
+   offset64 = &timens->monotonic;
+   break;
+   case CLOCK_BOOTTIME:
+   offset64 = &timens->boottime;
+   break;
+   default:
+   return;
+   }
+
+   ts->tv_nsec += offset64->tv_nsec;
+   ts->tv_sec += offset64->tv_sec;
+   if (ts->tv_nsec >= NSEC_PER_SEC) {
+   ts->tv_nsec -= NSEC_PER_SEC;
+   ts->tv_sec++;
+   }
+   if (ts->tv_nsec < 0) {
+   ts->tv_nsec += NSEC_PER_SEC;
+   ts->tv_sec--;
+   }
+}
+#else
+notrace static __always_inline void clk_to_ns(clockid_t clk, struct timespec 
*ts) {}
+#endif
+
 notrace static int do_hres(clockid_t clk, struct timespec *ts)
 {
struct vgtod_ts *base = &gtod->basetime[clk];
@@ -165,6 +204,8 @@ notrace static int do_hres(clockid_t clk, struct timespec 
*ts)
ts->tv_sec = sec + __iter_div_u64_rem(ns, NSEC_PER_SEC, &ns);
ts->tv_nsec = ns;
 
+   clk_to_ns(clk, ts);
+
return 0;
 }
 
@@ -178,6 +219,8 @@ notrace static void do_coarse(clockid_t clk, struct 
timespec *ts)
ts->tv_sec = base->sec;
ts->tv_nsec = base->nsec;
} while (unlikely(gtod_read_retry(gtod, seq)));
+
+   clk_to_ns(clk, ts);
 }
 
 notrace int __vdso_clock_gettime(clockid_t clock, struct timespec *ts)
diff --git a/arch/x86/entry/vdso/vdso-layout.lds.S 
b/arch/x86/entry/vdso/vdso-layout.lds.S
index 93c6dc7812d0..ba216527e59f 100644
--- a/arch/x86/entry/vdso/vdso-layout.lds.S
+++ b/arch/x86/entry/vdso/vdso-layout.lds.S
@@ -7,6 +7,12 @@
  * This script controls its layout.
  */
 
+#ifdef

[PATCHv4 03/28] posix-clocks: add another call back to return clock time in ktime_t

2019-06-12 Thread Dmitry Safonov

From: Andrei Vagin 

The callsite in common_timer_get() already has a comment:
/*
 * The timespec64 based conversion is suboptimal, but it's not
 * worth to implement yet another callback.
 */
kc->clock_get(timr->it_clock, &ts64);
now = timespec64_to_ktime(ts64);

Now we are going to add time namespaces and we need to be able to get:
* the clock value in a task's time namespace, to return it from the
  clock_gettime syscall.
* the clock value in the root time namespace, to use it in
  common_timer_get().

This is another reason why we need a separate callback that returns the
clock value in ktime_t.
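
As a rough userspace model of the two-callback idea (not the kernel's
struct k_clock): every name below is hypothetical, ktime_t is approximated by
a signed 64-bit nanosecond count, and the host clock and namespace offset are
fixed numbers chosen only to make the difference visible.

```c
#include <assert.h>
#include <stdint.h>

struct ts64 { int64_t tv_sec; long tv_nsec; };	/* stand-in for timespec64 */
typedef int64_t kt_t;				/* stand-in for ktime_t, in ns */

/* Hypothetical clock description carrying both callbacks. */
struct clock_ops {
	int  (*get_timespec)(struct ts64 *tp);	/* namespace view, for clock_gettime() */
	kt_t (*get_ktime)(void);		/* root-namespace view, for timer code */
};

static kt_t host_now_ns  = 5000000000LL;	/* pretend host clock: 5 s */
static kt_t ns_offset_ns = 2500000000LL;	/* pretend namespace offset: 2.5 s */

/* Root view: no namespace offset applied. */
static kt_t demo_get_ktime(void)
{
	return host_now_ns;
}

/* Namespace view: host clock plus the per-namespace offset. */
static int demo_get_timespec(struct ts64 *tp)
{
	kt_t t = host_now_ns + ns_offset_ns;

	assert(t >= 0);
	tp->tv_sec  = t / 1000000000LL;
	tp->tv_nsec = (long)(t % 1000000000LL);
	return 0;
}

static const struct clock_ops demo_clock = {
	.get_timespec = demo_get_timespec,
	.get_ktime    = demo_get_ktime,
};
```

The point of keeping both is that timer expiry is always computed against
the root view, while only the value handed back to userspace is shifted.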

Suggested-by: Thomas Gleixner 
Signed-off-by: Andrei Vagin 
Co-developed-by: Dmitry Safonov 
Signed-off-by: Dmitry Safonov 
---
 include/linux/posix-timers.h   |  3 ++
 kernel/time/alarmtimer.c   | 24 ++---
 kernel/time/posix-clock.c  |  8 ++---
 kernel/time/posix-cpu-timers.c | 32 +-
 kernel/time/posix-timers.c | 61 ++
 kernel/time/posix-timers.h |  7 ++--
 6 files changed, 87 insertions(+), 48 deletions(-)

diff --git a/include/linux/posix-timers.h b/include/linux/posix-timers.h
index b20798fc5191..fe13ab265213 100644
--- a/include/linux/posix-timers.h
+++ b/include/linux/posix-timers.h
@@ -127,4 +127,7 @@ void set_process_cpu_timer(struct task_struct *task, 
unsigned int clock_idx,
 void update_rlimit_cpu(struct task_struct *task, unsigned long rlim_new);
 
 void posixtimer_rearm(struct kernel_siginfo *info);
+
+int posix_get_timespec(clockid_t which_clock, struct timespec64 *tp);
+int posix_get_boottime_timespec(const clockid_t which_clock, struct timespec64 
*tp);
 #endif
diff --git a/kernel/time/alarmtimer.c b/kernel/time/alarmtimer.c
index 0519a8805aab..68a163c8b4f2 100644
--- a/kernel/time/alarmtimer.c
+++ b/kernel/time/alarmtimer.c
@@ -43,6 +43,8 @@ static struct alarm_base {
spinlock_t  lock;
struct timerqueue_head  timerqueue;
ktime_t (*gettime)(void);
+   int (*get_timespec)(const clockid_t which_clock,
+   struct timespec64 *tp);
clockid_t   base_clockid;
 } alarm_bases[ALARM_NUMTYPE];
 
@@ -645,21 +647,30 @@ static int alarm_clock_getres(const clockid_t 
which_clock, struct timespec64 *tp
 }
 
 /**
- * alarm_clock_get - posix clock_get interface
+ * alarm_clock_get_timespec - posix clock_get_timespec interface
  * @which_clock: clockid
  * @tp: timespec to fill.
  *
  * Provides the underlying alarm base time.
  */
-static int alarm_clock_get(clockid_t which_clock, struct timespec64 *tp)
+static int alarm_clock_get_timespec(clockid_t which_clock, struct timespec64 
*tp)
 {
struct alarm_base *base = &alarm_bases[clock2alarm(which_clock)];
 
if (!alarmtimer_get_rtcdev())
return -EINVAL;
 
-   *tp = ktime_to_timespec64(base->gettime());
-   return 0;
+   return base->get_timespec(base->base_clockid, tp);
+}
+
+static ktime_t alarm_clock_get_ktime(clockid_t which_clock)
+{
+   struct alarm_base *base = &alarm_bases[clock2alarm(which_clock)];
+
+   if (!alarmtimer_get_rtcdev())
+   return -EINVAL;
+
+   return base->gettime();
 }
 
 /**
@@ -825,7 +836,8 @@ static int alarm_timer_nsleep(const clockid_t which_clock, 
int flags,
 
 const struct k_clock alarm_clock = {
.clock_getres   = alarm_clock_getres,
-   .clock_get  = alarm_clock_get,
+   .clock_get_ktime= alarm_clock_get_ktime,
+   .clock_get_timespec = alarm_clock_get_timespec,
.timer_create   = alarm_timer_create,
.timer_set  = common_timer_set,
.timer_del  = common_timer_del,
@@ -870,8 +882,10 @@ static int __init alarmtimer_init(void)
/* Initialize alarm bases */
alarm_bases[ALARM_REALTIME].base_clockid = CLOCK_REALTIME;
alarm_bases[ALARM_REALTIME].gettime = &ktime_get_real;
+   alarm_bases[ALARM_REALTIME].get_timespec = posix_get_timespec;
alarm_bases[ALARM_BOOTTIME].base_clockid = CLOCK_BOOTTIME;
alarm_bases[ALARM_BOOTTIME].gettime = &ktime_get_boottime;
+   alarm_bases[ALARM_BOOTTIME].get_timespec = posix_get_boottime_timespec;
for (i = 0; i < ALARM_NUMTYPE; i++) {
timerqueue_init_head(&alarm_bases[i].timerqueue);
spin_lock_init(&alarm_bases[i].lock);
diff --git a/kernel/time/posix-clock.c b/kernel/time/posix-clock.c
index ec960bb939fd..c8f9c9b1cd82 100644
--- a/kernel/time/posix-clock.c
+++ b/kernel/time/posix-clock.c
@@ -315,8 +315,8 @@ static int pc_clock_settime(clockid_t id, const struct 
timespec64 *ts)
 }
 
 const struct k_clock clock_posix_dynamic = {
-   .clock_getres   = pc_clock_getres,
-   .clock_set  = pc_clock_settime,
-   .clock_get  = pc_clock_gettime,
-   .clock_adj  = pc_clock_adjtime,
+   .clock_getres   =

[PATCHv4 01/28] ns: Introduce Time Namespace

2019-06-12 Thread Dmitry Safonov

From: Andrei Vagin 

Time Namespace isolates clock values.

The kernel provides access to several clocks: CLOCK_REALTIME,
CLOCK_MONOTONIC, CLOCK_BOOTTIME, etc.

CLOCK_REALTIME
  System-wide clock that measures real (i.e., wall-clock) time.

CLOCK_MONOTONIC
  Clock that cannot be set and represents monotonic time since
  some unspecified starting point.

CLOCK_BOOTTIME
  Identical to CLOCK_MONOTONIC, except it also includes any time
  that the system is suspended.

For many users, the time namespace means the ability to change the date and
time in a container (CLOCK_REALTIME).

But in the context of checkpoint/restore functionality, the monotonic and
boottime clocks become interesting. Both clocks are monotonic with
unspecified starting points. These clocks are widely used to measure time
slices and set timers. After restoring or migrating processes, we have to
guarantee that they never go backward. In an ideal case, the behavior of
these clocks should be the same as when a whole system is
suspended. All this means that we need to be able to set the CLOCK_MONOTONIC
and CLOCK_BOOTTIME clocks, which can be done by adding per-namespace
offsets for clocks.
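
To make the checkpoint/restore reasoning concrete, here is a small userspace
sketch of how a restore tool might pick an offset. It is only a model: clock
values are plain signed nanosecond counts, and `restore_offset()` and
`ns_clock()` are hypothetical helpers, not kernel or CRIU interfaces.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model: clock readings and offsets as signed nanoseconds. */
typedef int64_t ns_t;

/*
 * Offset that makes the namespace clock continue from the value saved at
 * checkpoint time, given the current reading of the destination host.
 */
static ns_t restore_offset(ns_t saved_ns, ns_t host_now_ns)
{
	/* Monotonic-style clocks are never negative. */
	assert(saved_ns >= 0 && host_now_ns >= 0);
	return saved_ns - host_now_ns;
}

/* What a task inside the namespace would observe. */
static ns_t ns_clock(ns_t host_now_ns, ns_t offset_ns)
{
	return host_now_ns + offset_ns;
}
```

With the offset fixed before any task enters the namespace, the restored
clock resumes at the saved value and keeps advancing with the host clock, so
it never appears to go backward even if the new host booted more recently.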

A time namespace is similar to a pid namespace in the way it is
created: the unshare(CLONE_NEWTIME) system call creates a new time namespace,
but doesn't set it for the current process. Then all children of
the process will be born in the new time namespace, or a process can
use the setns() system call to join a namespace.

This scheme allows setting clock offsets for a namespace, before any
processes appear in it.

All available clone flags have been used, so CLONE_NEWTIME uses the
highest bit of CSIGNAL. It means that we can use it with the unshare
system call only. Right now, this works for us, because time namespace
offsets can be set only when a new time namespace is not populated. In
the future, we will have the clone3 system call [1], which will allow
using the CSIGNAL mask for clone flags.

[1]: https://lkml.kernel.org/r/20190604160944.4058-1-christ...@brauner.io

Link: https://criu.org/Time_namespace
Link: https://lists.openvz.org/pipermail/criu/2018-June/041504.html
Signed-off-by: Andrei Vagin 
Co-developed-by: Dmitry Safonov 
Signed-off-by: Dmitry Safonov 
---
 MAINTAINERS|   2 +
 fs/proc/namespaces.c   |   4 +
 include/linux/nsproxy.h|   2 +
 include/linux/proc_ns.h|   2 +
 include/linux/time_namespace.h |  69 +++
 include/linux/user_namespace.h |   1 +
 include/uapi/linux/sched.h |   5 +
 init/Kconfig   |   7 ++
 kernel/Makefile|   1 +
 kernel/fork.c  |  29 -
 kernel/nsproxy.c   |  41 +--
 kernel/time_namespace.c| 215 +
 12 files changed, 367 insertions(+), 11 deletions(-)
 create mode 100644 include/linux/time_namespace.h
 create mode 100644 kernel/time_namespace.c

diff --git a/MAINTAINERS b/MAINTAINERS
index 57f496cff999..323ab92b963b 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -12589,6 +12589,8 @@ T:  git 
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git timers/core
 S: Maintained
 F: fs/timerfd.c
 F: include/linux/timer*
+F: include/linux/time_namespace.h
+F: kernel/time_namespace.c
 F: kernel/time/*timer*
 
 POWER MANAGEMENT CORE
diff --git a/fs/proc/namespaces.c b/fs/proc/namespaces.c
index dd2b35f78b09..8b5c720fe5d7 100644
--- a/fs/proc/namespaces.c
+++ b/fs/proc/namespaces.c
@@ -33,6 +33,10 @@ static const struct proc_ns_operations *ns_entries[] = {
 #ifdef CONFIG_CGROUPS
&cgroupns_operations,
 #endif
+#ifdef CONFIG_TIME_NS
+   &timens_operations,
+   &timens_for_children_operations,
+#endif
 };
 
 static const char *proc_ns_get_link(struct dentry *dentry,
diff --git a/include/linux/nsproxy.h b/include/linux/nsproxy.h
index 2ae1b1a4d84d..074f395b9ad2 100644
--- a/include/linux/nsproxy.h
+++ b/include/linux/nsproxy.h
@@ -35,6 +35,8 @@ struct nsproxy {
struct mnt_namespace *mnt_ns;
struct pid_namespace *pid_ns_for_children;
struct net   *net_ns;
+   struct time_namespace *time_ns;
+   struct time_namespace *time_ns_for_children;
struct cgroup_namespace *cgroup_ns;
 };
 extern struct nsproxy init_nsproxy;
diff --git a/include/linux/proc_ns.h b/include/linux/proc_ns.h
index d31cb6215905..3e6f332da465 100644
--- a/include/linux/proc_ns.h
+++ b/include/linux/proc_ns.h
@@ -32,6 +32,8 @@ extern const struct proc_ns_operations 
pidns_for_children_operations;
 extern const struct proc_ns_operations userns_operations;
 extern const struct proc_ns_operations mntns_operations;
 extern const struct proc_ns_operations cgroupns_operations;
+extern const struct proc_ns_operations timens_operations;
+extern const struct proc_ns_operations timens_for_children_operations;
 
 /*
  * We always define these enumerators
diff --git a/include/linux/time_namespace.h b/include/linux/time_namespace.h
new file

[PATCHv4 12/28] x86/vdso/Makefile: Add vobjs32

2019-06-12 Thread Dmitry Safonov

Treat ia32/i386 objects as an array, the same as for 64-bit vdso objects.
This prepares the ground for avoiding code duplication when the timens
vdso is introduced.

Co-developed-by: Andrei Vagin 
Signed-off-by: Andrei Vagin 
Signed-off-by: Dmitry Safonov 
---
 arch/x86/entry/vdso/Makefile | 15 +--
 1 file changed, 5 insertions(+), 10 deletions(-)

diff --git a/arch/x86/entry/vdso/Makefile b/arch/x86/entry/vdso/Makefile
index 42fe42e82baf..b58d34120fd8 100644
--- a/arch/x86/entry/vdso/Makefile
+++ b/arch/x86/entry/vdso/Makefile
@@ -18,6 +18,8 @@ VDSO32-$(CONFIG_IA32_EMULATION)   := y
 
 # files to link into the vdso
 vobjs-y := vdso-note.o vclock_gettime.o vgetcpu.o
+vobjs32-y := vdso32/note.o vdso32/system_call.o vdso32/sigreturn.o
+vobjs32-y += vdso32/vclock_gettime.o
 
 # files to link into kernel
 obj-y  += vma.o
@@ -31,10 +33,12 @@ vdso_img-$(VDSO32-y)+= 32
 obj-$(VDSO32-y)+= vdso32-setup.o
 
 vobjs := $(foreach F,$(vobjs-y),$(obj)/$F)
+vobjs32 := $(foreach F,$(vobjs32-y),$(obj)/$F)
 
 $(obj)/vdso.o: $(obj)/vdso.so
 
 targets += vdso.lds $(vobjs-y)
+targets += vdso32/vdso32.lds $(vobjs32-y)
 
 # Build the vDSO image C files and link them in.
 vdso_img_objs := $(vdso_img-y:%=vdso-image-%.o)
@@ -125,10 +129,6 @@ $(obj)/vdsox32.so.dbg: $(obj)/vdsox32.lds $(vobjx32s) FORCE
 CPPFLAGS_vdso32.lds = $(CPPFLAGS_vdso.lds)
 VDSO_LDFLAGS_vdso32.lds = -m elf_i386 -soname linux-gate.so.1
 
-targets += vdso32/vdso32.lds
-targets += vdso32/note.o vdso32/system_call.o vdso32/sigreturn.o
-targets += vdso32/vclock_gettime.o
-
 KBUILD_AFLAGS_32 := $(filter-out -m64,$(KBUILD_AFLAGS)) -DBUILD_VDSO
 $(obj)/vdso32.so.dbg: KBUILD_AFLAGS = $(KBUILD_AFLAGS_32)
 $(obj)/vdso32.so.dbg: asflags-$(CONFIG_X86_64) += -m32
@@ -153,12 +153,7 @@ endif
 
 $(obj)/vdso32.so.dbg: KBUILD_CFLAGS = $(KBUILD_CFLAGS_32)
 
-$(obj)/vdso32.so.dbg: FORCE \
- $(obj)/vdso32/vdso32.lds \
- $(obj)/vdso32/vclock_gettime.o \
- $(obj)/vdso32/note.o \
- $(obj)/vdso32/system_call.o \
- $(obj)/vdso32/sigreturn.o
+$(obj)/vdso32.so.dbg: $(obj)/vdso32/vdso32.lds $(vobjs32) FORCE
$(call if_changed,vdso)
 
 #
-- 
2.22.0

[PATCHv4 09/28] timens: Shift /proc/uptime

2019-06-12 Thread Dmitry Safonov

Respect the boottime offset inside a time namespace for /proc/uptime.

Co-developed-by: Andrei Vagin 
Signed-off-by: Andrei Vagin 
Signed-off-by: Dmitry Safonov 
---
 fs/proc/uptime.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/fs/proc/uptime.c b/fs/proc/uptime.c
index a4c2791ab70b..5a1b228964fb 100644
--- a/fs/proc/uptime.c
+++ b/fs/proc/uptime.c
@@ -5,6 +5,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 
 static int uptime_proc_show(struct seq_file *m, void *v)
@@ -20,6 +21,8 @@ static int uptime_proc_show(struct seq_file *m, void *v)
nsec += (__force u64) kcpustat_cpu(i).cpustat[CPUTIME_IDLE];
 
ktime_get_boottime_ts64(&uptime);
+   timens_add_boottime(&uptime);
+
idle.tv_sec = div_u64_rem(nsec, NSEC_PER_SEC, &rem);
idle.tv_nsec = rem;
seq_printf(m, "%lu.%02lu %lu.%02lu\n",
-- 
2.22.0

[PATCHv4 06/28] timerfd/timens: Take into account ns clock offsets

2019-06-12 Thread Dmitry Safonov

From: Andrei Vagin 

Make timerfd respect timens offsets.
Provide a helper timens_ktime_to_host() that is useful to wire up
timens to different kernel subsystems.
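
The conversion such a helper has to do for absolute expiry times can be
sketched in userspace as follows. This is an approximation under stated
assumptions: ktime_t is modelled as int64_t nanoseconds, `KT_MAX` stands in
for KTIME_MAX, and `ns_to_host()` is a made-up name rather than the function
added by this patch.

```c
#include <assert.h>
#include <stdint.h>

#define KT_MAX INT64_MAX	/* stand-in for KTIME_MAX */

/*
 * Convert an absolute expiry from namespace time to host time:
 * host = tim - off, clamped to the range [0, KT_MAX].
 */
static int64_t ns_to_host(int64_t tim, int64_t off)
{
	assert(tim >= 0);

	if (tim < off)				/* would be before host time zero */
		return 0;
	if (off < 0 && tim > KT_MAX + off)	/* tim - off would overflow */
		return KT_MAX;
	return tim - off;
}
```

Relative timers need no such conversion, because an interval is the same
length in every time namespace; only absolute values are shifted.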

Signed-off-by: Andrei Vagin 
Co-developed-by: Dmitry Safonov 
Signed-off-by: Dmitry Safonov 
---
 fs/timerfd.c   |  3 +++
 include/linux/time_namespace.h | 18 ++
 kernel/time_namespace.c| 27 +++
 3 files changed, 48 insertions(+)

diff --git a/fs/timerfd.c b/fs/timerfd.c
index 6a6fc8aa1de7..9b0c2f65e7e8 100644
--- a/fs/timerfd.c
+++ b/fs/timerfd.c
@@ -26,6 +26,7 @@
 #include 
 #include 
 #include 
+#include 
 
 struct timerfd_ctx {
union {
@@ -196,6 +197,8 @@ static int timerfd_setup(struct timerfd_ctx *ctx, int flags,
}
 
if (texp != 0) {
+   if (flags & TFD_TIMER_ABSTIME)
+   texp = timens_ktime_to_host(clockid, texp);
if (isalarm(ctx)) {
if (flags & TFD_TIMER_ABSTIME)
alarm_start(&ctx->t.alarm, texp);
diff --git a/include/linux/time_namespace.h b/include/linux/time_namespace.h
index 1dda8af6b9fe..d32b55fad953 100644
--- a/include/linux/time_namespace.h
+++ b/include/linux/time_namespace.h
@@ -56,6 +56,19 @@ static inline void timens_add_boottime(struct timespec64 *ts)
 *ts = timespec64_add(*ts, ns_offsets->boottime);
 }
 
+ktime_t do_timens_ktime_to_host(clockid_t clockid, ktime_t tim,
+   struct timens_offsets *offsets);
+static inline ktime_t timens_ktime_to_host(clockid_t clockid, ktime_t tim)
+{
+   struct timens_offsets *offsets = current->nsproxy->time_ns->offsets;
+
+   if (!offsets) /* fast-path for the root time namespace */
+  return tim;
+
+   return do_timens_ktime_to_host(clockid, tim, offsets);
+}
+
+
 #else
 static inline struct time_namespace *get_time_ns(struct time_namespace *ns)
 {
@@ -82,6 +95,11 @@ static inline int timens_on_fork(struct nsproxy *nsproxy, 
struct task_struct *ts
 
 static inline void timens_add_monotonic(struct timespec64 *ts) {}
 static inline void timens_add_boottime(struct timespec64 *ts) {}
+
+static inline ktime_t timens_ktime_to_host(clockid_t clockid, ktime_t tim)
+{
+   return tim;
+}
 #endif
 
 #endif /* _LINUX_TIMENS_H */
diff --git a/kernel/time_namespace.c b/kernel/time_namespace.c
index 4828447721ec..b3cffdf2635c 100644
--- a/kernel/time_namespace.c
+++ b/kernel/time_namespace.c
@@ -15,6 +15,33 @@
 #include 
 #include 
 
+ktime_t do_timens_ktime_to_host(clockid_t clockid, ktime_t tim, struct 
timens_offsets *ns_offsets)
+{
+   ktime_t koff;
+
+   switch (clockid) {
+   case CLOCK_MONOTONIC:
+   koff = timespec64_to_ktime(ns_offsets->monotonic);
+   break;
+   case CLOCK_BOOTTIME:
+   case CLOCK_BOOTTIME_ALARM:
+   koff = timespec64_to_ktime(ns_offsets->boottime);
+   break;
+   default:
+   return tim;
+   }
+
+   /* tim - off has to be in [0, KTIME_MAX) */
+   if (tim < koff)
+   tim = 0;
+   else if (KTIME_MAX - tim < -koff)
+   tim = KTIME_MAX;
+   else
+   tim = ktime_sub(tim, koff);
+
+   return tim;
+}
+
 static struct ucounts *inc_time_namespaces(struct user_namespace *ns)
 {
return inc_ucount(ns, current_euid(), UCOUNT_TIME_NAMESPACES);
-- 
2.22.0

[PATCHv4 16/28] x86/vdso: Allocate timens vdso

2019-06-12 Thread Dmitry Safonov

As discussed on the timens RFC, adding a new conditional branch
`if (inside_time_ns)` to the VDSO for all processes is undesirable.
It would add a penalty for everybody, as the branch predictor may mispredict
the jump. Also, instruction cache lines are wasted on cmp/jmp.

Those effects of introducing a time namespace are very much unwanted,
bearing in mind how much work has been spent on micro-optimising the
vdso code.

The proposal is to allocate a second copy of the vdso code, with the timens
code dynamically patched out (disabled by static_branch) at boot time.

Allocate another vdso and copy the original code.

Co-developed-by: Andrei Vagin 
Signed-off-by: Andrei Vagin 
Signed-off-by: Dmitry Safonov 
---
 arch/x86/entry/vdso/vdso2c.h |   2 +-
 arch/x86/entry/vdso/vma.c| 113 +--
 arch/x86/include/asm/vdso.h  |   9 +--
 3 files changed, 114 insertions(+), 10 deletions(-)

diff --git a/arch/x86/entry/vdso/vdso2c.h b/arch/x86/entry/vdso/vdso2c.h
index 7556bb70ed8b..885b988aea19 100644
--- a/arch/x86/entry/vdso/vdso2c.h
+++ b/arch/x86/entry/vdso/vdso2c.h
@@ -157,7 +157,7 @@ static void BITSFUNC(go)(void *raw_addr, size_t raw_len,
}
fprintf(outfile, "\n};\n\n");
 
-   fprintf(outfile, "const struct vdso_image %s = {\n", image_name);
+   fprintf(outfile, "struct vdso_image %s __ro_after_init = {\n", 
image_name);
fprintf(outfile, "\t.text = raw_data,\n");
fprintf(outfile, "\t.size = %lu,\n", mapping_size);
if (alt_sec) {
diff --git a/arch/x86/entry/vdso/vma.c b/arch/x86/entry/vdso/vma.c
index 8a7f4cfe1fad..cc06c6b70167 100644
--- a/arch/x86/entry/vdso/vma.c
+++ b/arch/x86/entry/vdso/vma.c
@@ -30,26 +30,128 @@
 unsigned int __read_mostly vdso64_enabled = 1;
 #endif
 
-void __init init_vdso_image(const struct vdso_image *image)
+void __init init_vdso_image(struct vdso_image *image)
 {
BUG_ON(image->size % PAGE_SIZE != 0);
 
apply_alternatives((struct alt_instr *)(image->text + image->alt),
   (struct alt_instr *)(image->text + image->alt +
image->alt_len));
+#ifdef CONFIG_TIME_NS
+   image->text_timens = vmalloc_32(image->size);
+   if (WARN_ON(image->text_timens == NULL))
+   return;
+
+   memcpy(image->text_timens, image->text, image->size);
+#endif
 }
 
 struct linux_binprm;
 
+#ifdef CONFIG_TIME_NS
+static inline struct timens_offsets *current_timens_offsets(void)
+{
+   return current->nsproxy->time_ns->offsets;
+}
+
+static int vdso_check_timens(struct vm_area_struct *vma, bool *in_timens)
+{
+   struct task_struct *tsk;
+
+   if (likely(vma->vm_mm == current->mm)) {
+   *in_timens = !!current_timens_offsets();
+   return 0;
+   }
+
+   /*
+* .fault() handler can be called over remote process through
+* interfaces like /proc/$pid/mem or process_vm_{readv,writev}()
+* Considering such access to vdso as a slow-path.
+*/
+
+#ifdef CONFIG_MEMCG
+   rcu_read_lock();
+
+   tsk = rcu_dereference(vma->vm_mm->owner);
+   if (tsk) {
+   task_lock(tsk);
+   /*
+* Shouldn't happen: nsproxy is unset in exit_mm().
+* Before that exit_mm() holds mmap_sem to set (mm = NULL).
+* It's impossible to have a fault in task without mm
+* and mmap_sem is taken during the fault.
+*/
+   if (WARN_ON_ONCE(tsk->nsproxy == NULL)) {
+   task_unlock(tsk);
+   rcu_read_unlock();
+   return -EIO;
+   }
+   *in_timens = !!tsk->nsproxy->time_ns->offsets;
+   task_unlock(tsk);
+   rcu_read_unlock();
+   return 0;
+   }
+   rcu_read_unlock();
+#endif
+
+   read_lock(&tasklist_lock);
+   for_each_process(tsk) {
+   struct task_struct *c;
+
+   if (tsk->flags & PF_KTHREAD)
+   continue;
+   for_each_thread(tsk, c) {
+   if (c->mm == vma->vm_mm)
+   goto found;
+   if (c->mm)
+   break;
+   }
+   }
+   read_unlock(&tasklist_lock);
+   return -ESRCH;
+
+found:
+   task_lock(tsk);
+   read_unlock(&tasklist_lock);
+   *in_timens = !!tsk->nsproxy->time_ns->offsets;
+   task_unlock(tsk);
+
+   return 0;
+}
+#else /* CONFIG_TIME_NS */
+static inline int vdso_check_timens(struct vm_area_struct *vma, bool 
*in_timens)
+{
+   *in_timens = false;
+   return 0;
+}
+static inline struct timens_offsets *current_timens_offsets(void)
+{
+   return NULL;
+}
+#endif /* CONFIG_TIME_NS */
+
 static vm_fault_t vdso_fault(const struct vm_special_mapping *sm,
  struct vm_area_struct *vma, struct vm_fault *vmf)
 {
const struct vdso_image *image =

[PATCHv4 10/28] x86/vdso2c: Correct err messages on file opening

2019-06-12 Thread Dmitry Safonov

err() message in main() is misleading: it should print `outfilename`,
which is argv[3], not argv[2].

Correct error messages to be more precise about what failed and for
which file.

Co-developed-by: Andrei Vagin 
Signed-off-by: Andrei Vagin 
Signed-off-by: Dmitry Safonov 
---
 arch/x86/entry/vdso/vdso2c.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/entry/vdso/vdso2c.c b/arch/x86/entry/vdso/vdso2c.c
index 3a4d8d4d39f8..ce67370d14e5 100644
--- a/arch/x86/entry/vdso/vdso2c.c
+++ b/arch/x86/entry/vdso/vdso2c.c
@@ -184,7 +184,7 @@ static void map_input(const char *name, void **addr, size_t 
*len, int prot)
 
int fd = open(name, O_RDONLY);
if (fd == -1)
-   err(1, "%s", name);
+   err(1, "open(%s)", name);
 
tmp_len = lseek(fd, 0, SEEK_END);
if (tmp_len == (off_t)-1)
@@ -237,7 +237,7 @@ int main(int argc, char **argv)
outfilename = argv[3];
outfile = fopen(outfilename, "w");
if (!outfile)
-   err(1, "%s", argv[2]);
+   err(1, "fopen(%s)", outfilename);
 
go(raw_addr, raw_len, stripped_addr, stripped_len, outfile, name);
 
-- 
2.22.0

[PATCHv4 07/28] posix-timers/timens: Take into account clock offsets

2019-06-12 Thread Dmitry Safonov

From: Andrei Vagin 

Wire timer_settime() syscall into time namespace virtualization.

Signed-off-by: Andrei Vagin 
Co-developed-by: Dmitry Safonov 
Signed-off-by: Dmitry Safonov 
---
 kernel/time/posix-timers.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/kernel/time/posix-timers.c b/kernel/time/posix-timers.c
index 573942ae2629..dba77ee48e74 100644
--- a/kernel/time/posix-timers.c
+++ b/kernel/time/posix-timers.c
@@ -857,6 +857,8 @@ int common_timer_set(struct k_itimer *timr, int flags,
 
timr->it_interval = timespec64_to_ktime(new_setting->it_interval);
expires = timespec64_to_ktime(new_setting->it_value);
+   if (flags & TIMER_ABSTIME)
+   expires = timens_ktime_to_host(timr->it_clock, expires);
sigev_none = timr->it_sigev_notify == SIGEV_NONE;
 
kc->timer_arm(timr, expires, flags & TIMER_ABSTIME, sigev_none);
-- 
2.22.0

[PATCHv4 08/28] timens/kernel: Take into account timens clock offsets in clock_nanosleep

2019-06-12 Thread Dmitry Safonov

From: Andrei Vagin 

Wire up clock_nanosleep() to timens offsets.

Signed-off-by: Andrei Vagin 
Co-developed-by: Dmitry Safonov 
Signed-off-by: Dmitry Safonov 
---
 include/linux/hrtimer.h|  2 +-
 kernel/time/alarmtimer.c   |  2 ++
 kernel/time/hrtimer.c  |  8 
 kernel/time/posix-stubs.c  | 12 ++--
 kernel/time/posix-timers.c | 19 ---
 5 files changed, 33 insertions(+), 10 deletions(-)

diff --git a/include/linux/hrtimer.h b/include/linux/hrtimer.h
index 2e8957eac4d4..5a3b3e17d0e8 100644
--- a/include/linux/hrtimer.h
+++ b/include/linux/hrtimer.h
@@ -473,7 +473,7 @@ static inline u64 hrtimer_forward_now(struct hrtimer *timer,
 /* Precise sleep: */
 
 extern int nanosleep_copyout(struct restart_block *, struct timespec64 *);
-extern long hrtimer_nanosleep(const struct timespec64 *rqtp,
+extern long hrtimer_nanosleep(ktime_t rqtp,
  const enum hrtimer_mode mode,
  const clockid_t clockid);
 
diff --git a/kernel/time/alarmtimer.c b/kernel/time/alarmtimer.c
index 6346e6ee0d32..f1f42df179d0 100644
--- a/kernel/time/alarmtimer.c
+++ b/kernel/time/alarmtimer.c
@@ -819,6 +819,8 @@ static int alarm_timer_nsleep(const clockid_t which_clock, 
int flags,
ktime_t now = alarm_bases[type].gettime();
 
exp = ktime_add_safe(now, exp);
+   } else {
+   exp = timens_ktime_to_host(which_clock, exp);
}
 
ret = alarmtimer_do_nsleep(&alarm, exp, type);
diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c
index 41dfff23c1f9..b245f6ff9c8f 100644
--- a/kernel/time/hrtimer.c
+++ b/kernel/time/hrtimer.c
@@ -1716,7 +1716,7 @@ static long __sched hrtimer_nanosleep_restart(struct 
restart_block *restart)
return ret;
 }
 
-long hrtimer_nanosleep(const struct timespec64 *rqtp,
+long hrtimer_nanosleep(ktime_t rqtp,
   const enum hrtimer_mode mode, const clockid_t clockid)
 {
struct restart_block *restart;
@@ -1729,7 +1729,7 @@ long hrtimer_nanosleep(const struct timespec64 *rqtp,
slack = 0;
 
hrtimer_init_on_stack(&t.timer, clockid, mode);
-   hrtimer_set_expires_range_ns(&t.timer, timespec64_to_ktime(*rqtp), 
slack);
+   hrtimer_set_expires_range_ns(&t.timer, rqtp, slack);
ret = do_nanosleep(&t, mode);
if (ret != -ERESTART_RESTARTBLOCK)
goto out;
@@ -1764,7 +1764,7 @@ SYSCALL_DEFINE2(nanosleep, struct __kernel_timespec 
__user *, rqtp,
 
current->restart_block.nanosleep.type = rmtp ? TT_NATIVE : TT_NONE;
current->restart_block.nanosleep.rmtp = rmtp;
-   return hrtimer_nanosleep(&tu, HRTIMER_MODE_REL, CLOCK_MONOTONIC);
+   return hrtimer_nanosleep(timespec64_to_ktime(tu), HRTIMER_MODE_REL, 
CLOCK_MONOTONIC);
 }
 
 #endif
@@ -1784,7 +1784,7 @@ SYSCALL_DEFINE2(nanosleep_time32, struct old_timespec32 
__user *, rqtp,
 
current->restart_block.nanosleep.type = rmtp ? TT_COMPAT : TT_NONE;
current->restart_block.nanosleep.compat_rmtp = rmtp;
-   return hrtimer_nanosleep(&tu, HRTIMER_MODE_REL, CLOCK_MONOTONIC);
+   return hrtimer_nanosleep(timespec64_to_ktime(tu), HRTIMER_MODE_REL, 
CLOCK_MONOTONIC);
 }
 #endif
 
diff --git a/kernel/time/posix-stubs.c b/kernel/time/posix-stubs.c
index edaf075d1ee4..4ee0dc180866 100644
--- a/kernel/time/posix-stubs.c
+++ b/kernel/time/posix-stubs.c
@@ -129,6 +129,7 @@ SYSCALL_DEFINE4(clock_nanosleep, const clockid_t, 
which_clock, int, flags,
struct __kernel_timespec __user *, rmtp)
 {
struct timespec64 t;
+   ktime_t texp;
 
switch (which_clock) {
case CLOCK_REALTIME:
@@ -147,7 +148,10 @@ SYSCALL_DEFINE4(clock_nanosleep, const clockid_t, 
which_clock, int, flags,
rmtp = NULL;
current->restart_block.nanosleep.type = rmtp ? TT_NATIVE : TT_NONE;
current->restart_block.nanosleep.rmtp = rmtp;
-   return hrtimer_nanosleep(&t, flags & TIMER_ABSTIME ?
+   texp = timespec64_to_ktime(t);
+   if (flags & TIMER_ABSTIME)
+   texp = timens_ktime_to_host(which_clock, texp);
+   return hrtimer_nanosleep(texp, flags & TIMER_ABSTIME ?
 HRTIMER_MODE_ABS : HRTIMER_MODE_REL,
 which_clock);
 }
@@ -215,6 +219,7 @@ SYSCALL_DEFINE4(clock_nanosleep_time32, clockid_t, 
which_clock, int, flags,
struct old_timespec32 __user *, rmtp)
 {
struct timespec64 t;
+   ktime_t texp;
 
switch (which_clock) {
case CLOCK_REALTIME:
@@ -233,7 +238,10 @@ SYSCALL_DEFINE4(clock_nanosleep_time32, clockid_t, 
which_clock, int, flags,
rmtp = NULL;
current->restart_block.nanosleep.type = rmtp ? TT_COMPAT : TT_NONE;
current->restart_block.nanosleep.compat_rmtp = rmtp;
-   return hrtimer_nanosleep(&t, flags & TIMER_ABSTIME ?
+   texp = timespec64_to_ktime(t);
+   if (flags & TIMER_ABSTIME)
+   texp = timens_ktime_to_host(which_clock, texp);
+

Re: [PATCH v3 3/4] backlight: pwm_bl: compute brightness of LED linearly to human eye.

2019-06-12 Thread Matthias Kaehlcke

Hi Daniel,

On Wed, Jun 12, 2019 at 12:03:25PM +0100, Daniel Thompson wrote:
> On Tue, Jun 11, 2019 at 03:30:19PM -0700, Matthias Kaehlcke wrote:
> > On Tue, Jun 11, 2019 at 09:55:30AM -0700, Brian Norris wrote:
> > > On Tue, Jun 11, 2019 at 3:49 AM Daniel Thompson
> > >  wrote:
> > > > This is a long standing flaw in the backlight interfaces. AFAIK generic
> > > > userspaces end up with a (flawed) heuristic.
> > > 
> > > Bingo! Would be nice if we could start to fix this long-standing flaw.
> > 
> > Agreed!
> > 
> > How could a fix look like, a sysfs attribute? Would a boolean value
> > like 'logarithmic_scale' or 'linear_scale' be enough or could more
> > granularity be needed?
> 
> Certainly "linear" (this device will work more or less correctly if the
> userspace applies perceptual curves). Not sure about logarithmic since
> what is actually useful is something that is "perceptually linear"
> (logarithmic is merely a way to approximate that).
> 
> I do wonder about a compatible string like most-detailed to
> least-detailed description. This for a PWM with the auto-generated
> tables we'd see something like:
> 
> cie-1991,perceptual,non-linear
> 
> For something that is non-linear but we are not sure what its tables are
> we can offer just "non-linear".

Thanks for the feedback!

It seems clear that we want a string for the added flexibility. I can
work on a patch with the compatible string like description you
suggested and we can discuss in the review if we want to go with that
or prefer something else.

> > The new attribute could be optional (it only exists if explicitly
> > specified by the driver) or be set to a default based on a heuristic
> > if not specified and be 'fixed' on a case by case basis. The latter
> > might violate "don't break userspace" though, so I'm not sure it's a
> > good idea.
> 
> I think we should avoid any heuristic! There are several drivers and we
> may not be able to work through all of them and make the correct
> decision.

Agreed

> Instead one valid value for the sysfs should be "unknown" and this be
> the default for drivers we have not analysed (this also makes it easy to
> introduce change here).

An "unknown" value sounds good, it allows userspace to just do what it
did/would have done before this attribute existed.

> We should only set the property to something else for drivers that have
> been reviewed.
> 
> There could be a special case for pwm_bl.c in that I'm prepared to
> assume that the hardware components downstream of the PWM have a
> roughly linear response and that if the user provided tables that their
> function is to provide a perceptually comfortable response.

Unfortunately this isn't universally true :(

At least several Chrome OS devices use a linear brightness scale and
userspace does the transformation in the animated slider. A quick
'git grep -A10 brightness-levels arch' suggests that there are
multiple other devices/platforms using a linear scale.

We could treat devices with a predefined brightness table as
"unknown", unless there is a (new optional) DT property that indicates
the type of the scale.
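
To make the proposal a bit more concrete, here is a minimal sketch of how
the scale type could map to the strings a sysfs attribute reports, and how
userspace could fall back to "unknown" for anything it does not recognise.
The enum, the function names and the exact strings are assumptions for
illustration only; none of them is an existing backlight ABI.

```c
#include <string.h>

/* Hypothetical scale types; illustrative only, not an existing ABI. */
enum bl_scale {
	BL_SCALE_UNKNOWN,	/* driver not reviewed: userspace keeps its old behaviour */
	BL_SCALE_LINEAR,	/* userspace should apply a perceptual curve */
	BL_SCALE_NON_LINEAR,	/* driver or tables already shape the response */
};

/* String that a sysfs attribute could report for each scale type. */
const char *bl_scale_to_str(enum bl_scale scale)
{
	switch (scale) {
	case BL_SCALE_LINEAR:
		return "linear";
	case BL_SCALE_NON_LINEAR:
		return "non-linear";
	default:
		return "unknown";
	}
}

/* Userspace side: any string that is not recognised is treated as unknown. */
enum bl_scale bl_scale_from_str(const char *s)
{
	if (!strcmp(s, "linear"))
		return BL_SCALE_LINEAR;
	if (!strcmp(s, "non-linear"))
		return BL_SCALE_NON_LINEAR;
	return BL_SCALE_UNKNOWN;
}
```

With this shape, a more detailed value added later would simply read as
"unknown" to older userspace, which then keeps doing what it did before.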

Cheers

Matthias

[PATCHv4 17/28] x86/vdso: Switch image on setns()/unshare()/clone()

2019-06-12 Thread Dmitry Safonov

As discussed on the timens RFC, adding a new conditional branch
`if (inside_time_ns)` to the VDSO for all processes is undesirable.
It adds a penalty for everybody, as the branch predictor may mispredict
the jump. It also wastes instruction cache lines on cmp/jmp.

These effects of introducing time namespaces are very much unwanted,
bearing in mind how much work has been spent on micro-optimising the
vdso code.

To address these problems, there are two versions of the VDSO's .so:
one for host tasks (without any penalty) and one for processes inside a
time namespace, with clk_to_ns() that subtracts offsets from host's time.

Whenever a user does setns()/unshare() or clone() with CLONE_TIMENS,
change the VDSO image in mm and zap the existing VVAR/VDSO page tables.
They will be re-faulted with the corresponding image and VVAR offsets.

Co-developed-by: Andrei Vagin 
Signed-off-by: Andrei Vagin 
Signed-off-by: Dmitry Safonov 
---
 arch/x86/entry/vdso/vma.c   | 28 
 arch/x86/include/asm/vdso.h |  1 +
 kernel/time_namespace.c | 11 +++
 3 files changed, 40 insertions(+)

diff --git a/arch/x86/entry/vdso/vma.c b/arch/x86/entry/vdso/vma.c
index cc06c6b70167..3ed5bf4932af 100644
--- a/arch/x86/entry/vdso/vma.c
+++ b/arch/x86/entry/vdso/vma.c
@@ -25,6 +25,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #if defined(CONFIG_X86_64)
 unsigned int __read_mostly vdso64_enabled = 1;
@@ -266,6 +267,33 @@ static const struct vm_special_mapping vvar_mapping = {
.mremap = vvar_mremap,
 };
 
+#ifdef CONFIG_TIME_NS
+int vdso_join_timens(struct task_struct *task)
+{
+   struct mm_struct *mm = task->mm;
+   struct vm_area_struct *vma;
+
+   if (down_write_killable(&mm->mmap_sem))
+   return -EINTR;
+
+   for (vma = mm->mmap; vma; vma = vma->vm_next) {
+   unsigned long size = vma->vm_end - vma->vm_start;
+
+   if (vma_is_special_mapping(vma, &vvar_mapping) ||
+   vma_is_special_mapping(vma, &vdso_mapping))
+   zap_page_range(vma, vma->vm_start, size);
+   }
+
+   up_write(&mm->mmap_sem);
+   return 0;
+}
+#else /* CONFIG_TIME_NS */
+int vdso_join_timens(struct task_struct *task)
+{
+   return -ENXIO;
+}
+#endif
+
 /*
  * Add vdso and vvar mappings to current process.
  * @image  - blob to map
diff --git a/arch/x86/include/asm/vdso.h b/arch/x86/include/asm/vdso.h
index 03f468c63a24..ccf89dedd04f 100644
--- a/arch/x86/include/asm/vdso.h
+++ b/arch/x86/include/asm/vdso.h
@@ -45,6 +45,7 @@ extern struct vdso_image vdso_image_32;
 extern void __init init_vdso_image(struct vdso_image *image);
 
 extern int map_vdso_once(const struct vdso_image *image, unsigned long addr);
+extern int vdso_join_timens(struct task_struct *task);
 
 #endif /* __ASSEMBLER__ */
 
diff --git a/kernel/time_namespace.c b/kernel/time_namespace.c
index b3cffdf2635c..2a2cab14ac29 100644
--- a/kernel/time_namespace.c
+++ b/kernel/time_namespace.c
@@ -14,6 +14,7 @@
 #include 
 #include 
 #include 
+#include 
 
 ktime_t do_timens_ktime_to_host(clockid_t clockid, ktime_t tim, struct 
timens_offsets *ns_offsets)
 {
@@ -182,11 +183,16 @@ static void timens_put(struct ns_common *ns)
 static int timens_install(struct nsproxy *nsproxy, struct ns_common *new)
 {
struct time_namespace *ns = to_time_ns(new);
+   int ret;
 
if (!ns_capable(ns->user_ns, CAP_SYS_ADMIN) ||
!ns_capable(current_user_ns(), CAP_SYS_ADMIN))
return -EPERM;
 
+   ret = vdso_join_timens(current);
+   if (ret)
+   return ret;
+
get_time_ns(ns);
get_time_ns(ns);
put_time_ns(nsproxy->time_ns);
@@ -201,10 +207,15 @@ int timens_on_fork(struct nsproxy *nsproxy, struct 
task_struct *tsk)
 {
struct ns_common *nsc = &nsproxy->time_ns_for_children->ns;
struct time_namespace *ns = to_time_ns(nsc);
+   int ret;
 
if (nsproxy->time_ns == nsproxy->time_ns_for_children)
return 0;
 
+   ret = vdso_join_timens(tsk);
+   if (ret)
+   return ret;
+
get_time_ns(ns);
put_time_ns(nsproxy->time_ns);
nsproxy->time_ns = ns;
-- 
2.22.0

[PATCHv4 14/28] x86/vdso: Rename vdso_image {.data=>.text}

2019-06-12 Thread Dmitry Safonov

To avoid any confusion with VVAR.

Co-developed-by: Andrei Vagin 
Signed-off-by: Andrei Vagin 
Signed-off-by: Dmitry Safonov 
---
 arch/x86/entry/vdso/vdso2c.h | 2 +-
 arch/x86/entry/vdso/vma.c| 6 +++---
 arch/x86/include/asm/vdso.h  | 2 +-
 3 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/arch/x86/entry/vdso/vdso2c.h b/arch/x86/entry/vdso/vdso2c.h
index 80be339ee93e..7556bb70ed8b 100644
--- a/arch/x86/entry/vdso/vdso2c.h
+++ b/arch/x86/entry/vdso/vdso2c.h
@@ -158,7 +158,7 @@ static void BITSFUNC(go)(void *raw_addr, size_t raw_len,
fprintf(outfile, "\n};\n\n");
 
fprintf(outfile, "const struct vdso_image %s = {\n", image_name);
-   fprintf(outfile, "\t.data = raw_data,\n");
+   fprintf(outfile, "\t.text = raw_data,\n");
fprintf(outfile, "\t.size = %lu,\n", mapping_size);
if (alt_sec) {
fprintf(outfile, "\t.alt = %lu,\n",
diff --git a/arch/x86/entry/vdso/vma.c b/arch/x86/entry/vdso/vma.c
index d2b421233ba5..c30a33b2963b 100644
--- a/arch/x86/entry/vdso/vma.c
+++ b/arch/x86/entry/vdso/vma.c
@@ -32,8 +32,8 @@ void __init init_vdso_image(const struct vdso_image *image)
 {
BUG_ON(image->size % PAGE_SIZE != 0);
 
-   apply_alternatives((struct alt_instr *)(image->data + image->alt),
-  (struct alt_instr *)(image->data + image->alt +
+   apply_alternatives((struct alt_instr *)(image->text + image->alt),
+  (struct alt_instr *)(image->text + image->alt +
image->alt_len));
 }
 
@@ -47,7 +47,7 @@ static vm_fault_t vdso_fault(const struct vm_special_mapping 
*sm,
if (!image || (vmf->pgoff << PAGE_SHIFT) >= image->size)
return VM_FAULT_SIGBUS;
 
-   vmf->page = virt_to_page(image->data + (vmf->pgoff << PAGE_SHIFT));
+   vmf->page = virt_to_page(image->text + (vmf->pgoff << PAGE_SHIFT));
get_page(vmf->page);
return 0;
 }
diff --git a/arch/x86/include/asm/vdso.h b/arch/x86/include/asm/vdso.h
index 230474e2ddb5..dffdc12cc7d6 100644
--- a/arch/x86/include/asm/vdso.h
+++ b/arch/x86/include/asm/vdso.h
@@ -11,7 +11,7 @@
 #include 
 
 struct vdso_image {
-   void *data;
+   void *text;
unsigned long size;   /* Always a multiple of PAGE_SIZE */
 
unsigned long alt, alt_len;
-- 
2.22.0

[PATCHv4 21/28] selftest/timens: Add Time Namespace test for supported clocks

2019-06-12 Thread Dmitry Safonov

A test to check that all supported clocks work on host and inside
a new time namespace. Use both ways to get time: through VDSO and
by entering the kernel with implicit syscall.

Introduce a new timens directory in selftests framework for
the next timens tests.

Co-developed-by: Andrei Vagin 
Signed-off-by: Andrei Vagin 
Signed-off-by: Dmitry Safonov 
---
 tools/testing/selftests/Makefile  |   1 +
 tools/testing/selftests/timens/.gitignore |   1 +
 tools/testing/selftests/timens/Makefile   |   5 +
 tools/testing/selftests/timens/config |   1 +
 tools/testing/selftests/timens/log.h  |  26 +++
 tools/testing/selftests/timens/timens.c   | 188 ++
 tools/testing/selftests/timens/timens.h   |  63 
 7 files changed, 285 insertions(+)
 create mode 100644 tools/testing/selftests/timens/.gitignore
 create mode 100644 tools/testing/selftests/timens/Makefile
 create mode 100644 tools/testing/selftests/timens/config
 create mode 100644 tools/testing/selftests/timens/log.h
 create mode 100644 tools/testing/selftests/timens/timens.c
 create mode 100644 tools/testing/selftests/timens/timens.h

diff --git a/tools/testing/selftests/Makefile b/tools/testing/selftests/Makefile
index 9781ca79794a..f71a59632192 100644
--- a/tools/testing/selftests/Makefile
+++ b/tools/testing/selftests/Makefile
@@ -47,6 +47,7 @@ TARGETS += splice
 TARGETS += static_keys
 TARGETS += sync
 TARGETS += sysctl
+TARGETS += timens
 ifneq (1, $(quicktest))
 TARGETS += timers
 endif
diff --git a/tools/testing/selftests/timens/.gitignore 
b/tools/testing/selftests/timens/.gitignore
new file mode 100644
index ..27a693229ce1
--- /dev/null
+++ b/tools/testing/selftests/timens/.gitignore
@@ -0,0 +1 @@
+timens
diff --git a/tools/testing/selftests/timens/Makefile 
b/tools/testing/selftests/timens/Makefile
new file mode 100644
index ..b877efb78974
--- /dev/null
+++ b/tools/testing/selftests/timens/Makefile
@@ -0,0 +1,5 @@
+TEST_GEN_PROGS := timens
+
+CFLAGS := -Wall -Werror
+
+include ../lib.mk
diff --git a/tools/testing/selftests/timens/config 
b/tools/testing/selftests/timens/config
new file mode 100644
index ..4480620f6f49
--- /dev/null
+++ b/tools/testing/selftests/timens/config
@@ -0,0 +1 @@
+CONFIG_TIME_NS=y
diff --git a/tools/testing/selftests/timens/log.h 
b/tools/testing/selftests/timens/log.h
new file mode 100644
index ..db64df2a8483
--- /dev/null
+++ b/tools/testing/selftests/timens/log.h
@@ -0,0 +1,26 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#ifndef __SELFTEST_TIMENS_LOG_H__
+#define __SELFTEST_TIMENS_LOG_H__
+
+#define pr_msg(fmt, lvl, ...)  \
+   ksft_print_msg("[%s] (%s:%d)\t" fmt "\n",   \
+   lvl, __FILE__, __LINE__, ##__VA_ARGS__)
+
+#define pr_p(func, fmt, ...)   func(fmt ": %m", ##__VA_ARGS__)
+
+#define pr_err(fmt, ...)   \
+   ({  \
+   ksft_test_result_error(fmt "\n", ##__VA_ARGS__);    \
+   -1; \
+   })
+
+#define pr_fail(fmt, ...)  \
+   ({  \
+   ksft_test_result_fail(fmt, ##__VA_ARGS__);  \
+   -1; \
+   })
+
+#define pr_perror(fmt, ...)pr_p(pr_err, fmt, ##__VA_ARGS__)
+
+#endif
diff --git a/tools/testing/selftests/timens/timens.c 
b/tools/testing/selftests/timens/timens.c
new file mode 100644
index ..407e7a97882f
--- /dev/null
+++ b/tools/testing/selftests/timens/timens.c
@@ -0,0 +1,188 @@
+// SPDX-License-Identifier: GPL-2.0
+#define _GNU_SOURCE
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "log.h"
+#include "timens.h"
+
+/*
+ * Test shouldn't be run for a day, so add 10 days to child
+ * time and check parent's time to be in the same day.
+ */
+#define DAY_IN_SEC (60*60*24)
+#define TEN_DAYS_IN_SEC (10*DAY_IN_SEC)
+
+#define ARRAY_SIZE(arr) (sizeof(arr) / sizeof((arr)[0]))
+
+#define CLOCK_TYPES\
+   ct(CLOCK_BOOTTIME, -1), \
+   ct(CLOCK_BOOTTIME_ALARM, 1),\
+   ct(CLOCK_MONOTONIC, -1),\
+   ct(CLOCK_MONOTONIC_COARSE, 1),  \
+   ct(CLOCK_MONOTONIC_RAW, 1), \
+
+
+struct test_clock {
+   clockid_t id;
+   char *name;
+   /*
+* off_id is -1 if a clock has own offset, or it contains an index
+* which contains a right offset of this clock.
+*/
+

[PATCHv4 20/28] timens/fs/proc: Introduce /proc/pid/timens_offsets

2019-06-12 Thread Dmitry Safonov

From: Andrei Vagin 

API to set time namespace offsets for child processes, e.g.:
echo "clockid off_sec off_nsec" > /proc/self/timens_offsets
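
For illustration, the line format can be checked from userspace before it
is written to the file. The sketch below mirrors the parsing rules visible
in the patch (three fields, nanoseconds below one second); the struct and
function names are hypothetical and are not part of the patch.

```c
#include <stdio.h>

#define NSEC_PER_SEC 1000000000UL

/* Illustrative userspace mirror of one "clockid off_sec off_nsec" entry. */
struct timens_offset_line {
	unsigned int clockid;
	long long sec;
	unsigned long nsec;
};

/* Returns 0 on success, -1 if the line is malformed or nsec is out of range. */
int parse_offset_line(const char *line, struct timens_offset_line *out)
{
	if (sscanf(line, "%u %lld %lu",
		   &out->clockid, &out->sec, &out->nsec) != 3)
		return -1;
	if (out->nsec >= NSEC_PER_SEC)
		return -1;
	return 0;
}
```

A line such as "1 864000 0" would then describe a ten day offset for
clock id 1, while "1 10" or an nsec field of 1000000000 is rejected.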

Signed-off-by: Andrei Vagin 
Co-developed-by: Dmitry Safonov 
Signed-off-by: Dmitry Safonov 
---
 fs/proc/base.c |  95 ++
 include/linux/time_namespace.h |  10 
 kernel/time_namespace.c| 104 +
 3 files changed, 209 insertions(+)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index 9c8ca6cd3ce4..6a96b0543f69 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -94,6 +94,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include "internal.h"
 #include "fd.h"
@@ -1516,6 +1517,97 @@ static const struct file_operations 
proc_pid_sched_autogroup_operations = {
 
 #endif /* CONFIG_SCHED_AUTOGROUP */
 
+#ifdef CONFIG_TIME_NS
+static int timens_offsets_show(struct seq_file *m, void *v)
+{
+   struct task_struct *p;
+
+   p = get_proc_task(file_inode(m->file));
+   if (!p)
+   return -ESRCH;
+   proc_timens_show_offsets(p, m);
+
+   put_task_struct(p);
+
+   return 0;
+}
+
+static ssize_t
+timens_offsets_write(struct file *file, const char __user *buf,
+   size_t count, loff_t *ppos)
+{
+   struct inode *inode = file_inode(file);
+   struct proc_timens_offset offsets[2];
+   char *kbuf = NULL, *pos, *next_line;
+   struct task_struct *p;
+   int ret, noffsets;
+
+   /* Only allow < page size writes at the beginning of the file */
+   if ((*ppos != 0) || (count >= PAGE_SIZE))
+   return -EINVAL;
+
+   /* Slurp in the user data */
+   kbuf = memdup_user_nul(buf, count);
+   if (IS_ERR(kbuf))
+   return PTR_ERR(kbuf);
+
+   /* Parse the user data */
+   ret = -EINVAL;
+   noffsets = 0;
+   for (pos = kbuf; pos; pos = next_line) {
+   struct proc_timens_offset *off = &offsets[noffsets];
+   int err;
+
+   /* Find the end of line and ensure we don't look past it */
+   next_line = strchr(pos, '\n');
+   if (next_line) {
+   *next_line = '\0';
+   next_line++;
+   if (*next_line == '\0')
+   next_line = NULL;
+   }
+
+   err = sscanf(pos, "%u %lld %lu", &off->clockid,
+   &off->val.tv_sec, &off->val.tv_nsec);
+   if (err != 3 || off->val.tv_nsec >= NSEC_PER_SEC)
+   goto out;
+   noffsets++;
+   if (noffsets == ARRAY_SIZE(offsets)) {
+   if (next_line)
+   count = next_line - kbuf;
+   break;
+   }
+   }
+
+   ret = -ESRCH;
+   p = get_proc_task(inode);
+   if (!p)
+   goto out;
+   ret = proc_timens_set_offset(file, p, offsets, noffsets);
+   put_task_struct(p);
+   if (ret)
+   goto out;
+
+   ret = count;
+out:
+   kfree(kbuf);
+   return ret;
+}
+
+static int timens_offsets_open(struct inode *inode, struct file *filp)
+{
+   return single_open(filp, timens_offsets_show, inode);
+}
+
+static const struct file_operations proc_timens_offsets_operations = {
+   .open   = timens_offsets_open,
+   .read   = seq_read,
+   .write  = timens_offsets_write,
+   .llseek = seq_lseek,
+   .release= single_release,
+};
+#endif /* CONFIG_TIME_NS */
+
 static ssize_t comm_write(struct file *file, const char __user *buf,
size_t count, loff_t *offset)
 {
@@ -2982,6 +3074,9 @@ static const struct pid_entry tgid_base_stuff[] = {
 #endif
 #ifdef CONFIG_SCHED_AUTOGROUP
REG("autogroup",  S_IRUGO|S_IWUSR, proc_pid_sched_autogroup_operations),
+#endif
+#ifdef CONFIG_TIME_NS
+   REG("timens_offsets",  S_IRUGO|S_IWUSR, proc_timens_offsets_operations),
 #endif
REG("comm",  S_IRUGO|S_IWUSR, proc_pid_set_comm_operations),
 #ifdef CONFIG_HAVE_ARCH_TRACEHOOK
diff --git a/include/linux/time_namespace.h b/include/linux/time_namespace.h
index d32b55fad953..8cd16dfea42d 100644
--- a/include/linux/time_namespace.h
+++ b/include/linux/time_namespace.h
@@ -40,6 +40,16 @@ static inline void put_time_ns(struct time_namespace *ns)
kref_put(&ns->kref, free_time_ns);
 }
 
+extern void proc_timens_show_offsets(struct task_struct *p, struct seq_file 
*m);
+
+struct proc_timens_offset {
+   int clockid;
+   struct timespec64 val;
+};
+
+extern int proc_timens_set_offset(struct file *file, struct task_struct *p,
+   struct proc_timens_offset *offsets, int n);
+
 static inline void timens_add_monotonic(struct timespec64 *ts)
 {
 struct timens_offsets *ns_offsets = current->nsproxy->time_ns->offsets;
diff --git a/kernel/time_namespace.c b/kernel/time_namespace.c
index

[PATCHv4 23/28] selftest/timens: Add a test for clock_nanosleep()

2019-06-12 Thread Dmitry Safonov

From: Andrei Vagin 

Check that clock_nanosleep() takes into account clock offsets.
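
The check boils down to measuring how long the sleep really took, in
milliseconds, using the same clock. A minimal standalone sketch of that
arithmetic follows; the helper name is made up here and it takes both
timestamps explicitly instead of reading the clock itself.

```c
#include <time.h>

/* Milliseconds from *start to *end; both must come from the same clock. */
long long timespec_diff_ms(const struct timespec *start,
			   const struct timespec *end)
{
	long long secs = end->tv_sec - start->tv_sec;
	long long nsecs = end->tv_nsec - start->tv_nsec;

	if (nsecs < 0) {	/* borrow one second */
		secs--;
		nsecs += 1000000000LL;
	}
	return secs * 1000 + nsecs / 1000000;
}
```

For a two second sleep the result is expected to land close to 2000,
whatever offset the time namespace applies to the clock.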

Signed-off-by: Andrei Vagin 
Co-developed-by: Dmitry Safonov 
Signed-off-by: Dmitry Safonov 
---
 tools/testing/selftests/timens/.gitignore |   1 +
 tools/testing/selftests/timens/Makefile   |   2 +-
 .../selftests/timens/clock_nanosleep.c| 100 ++
 3 files changed, 102 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/timens/clock_nanosleep.c

diff --git a/tools/testing/selftests/timens/.gitignore 
b/tools/testing/selftests/timens/.gitignore
index b609f6ee9fb9..9b6c8ddac2c8 100644
--- a/tools/testing/selftests/timens/.gitignore
+++ b/tools/testing/selftests/timens/.gitignore
@@ -1,2 +1,3 @@
+clock_nanosleep
 timens
 timerfd
diff --git a/tools/testing/selftests/timens/Makefile 
b/tools/testing/selftests/timens/Makefile
index 66b90cd28e5c..76a1dc891184 100644
--- a/tools/testing/selftests/timens/Makefile
+++ b/tools/testing/selftests/timens/Makefile
@@ -1,4 +1,4 @@
-TEST_GEN_PROGS := timens timerfd
+TEST_GEN_PROGS := timens timerfd clock_nanosleep
 
 CFLAGS := -Wall -Werror
 
diff --git a/tools/testing/selftests/timens/clock_nanosleep.c 
b/tools/testing/selftests/timens/clock_nanosleep.c
new file mode 100644
index ..dfd4e3429c75
--- /dev/null
+++ b/tools/testing/selftests/timens/clock_nanosleep.c
@@ -0,0 +1,100 @@
+// SPDX-License-Identifier: GPL-2.0
+#define _GNU_SOURCE
+#include 
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "log.h"
+#include "timens.h"
+
+static long long get_elapsed_time(int clockid, struct timespec *start)
+{
+   struct timespec curr;
+   long long secs, nsecs;
+
+   if (clock_gettime(clockid, &curr) == -1)
+   return pr_perror("clock_gettime");
+
+   secs = curr.tv_sec - start->tv_sec;
+   nsecs = curr.tv_nsec - start->tv_nsec;
+   if (nsecs < 0) {
+   secs--;
+   nsecs += 1000000000;
+   }
+   if (nsecs > 1000000000) {
+   secs++;
+   nsecs -= 1000000000;
+   }
+   return secs * 1000 + nsecs / 1000000;
+}
+
+int run_test(int clockid)
+{
+   long long elapsed;
+   int i;
+
+   for (i = 0; i < 2; i++) {
+   struct timespec now = {};
+   struct timespec start;
+
+   if (clock_gettime(clockid, &start) == -1)
+   return pr_perror("clock_gettime");
+
+
+   if (i == 1) {
+   now.tv_sec = start.tv_sec;
+   now.tv_nsec = start.tv_nsec;
+   }
+
+   now.tv_sec += 2;
+   clock_nanosleep(clockid, i ? TIMER_ABSTIME : 0, &now, NULL);
+
+   elapsed = get_elapsed_time(clockid, &start);
+   if (elapsed < 1900 || elapsed > 2100) {
+   pr_fail("clockid: %d abs: %d elapsed: %lld\n",
+   clockid, i, elapsed);
+   return 1;
+   }
+   ksft_test_result_pass("clockid: %d abs:%d\n", clockid, i);
+   }
+
+   return 0;
+}
+
+int main(int argc, char *argv[])
+{
+   int ret, nsfd;
+
+   nscheck();
+
+   if (unshare(CLONE_NEWTIME))
+   return pr_perror("unshare");
+
+   if (_settime(CLOCK_MONOTONIC, 7 * 24 * 3600))
+   return 1;
+   if (_settime(CLOCK_BOOTTIME, 9 * 24 * 3600))
+   return 1;
+
+   nsfd = open("/proc/self/ns/time_for_children", O_RDONLY);
+   if (nsfd < 0)
+   return pr_perror("Unable to open timens_for_children");
+
+   if (setns(nsfd, CLONE_NEWTIME))
+   return pr_perror("Unable to set timens");
+
+   ret = 0;
+   ret |= run_test(CLOCK_MONOTONIC);
+   ret |= run_test(CLOCK_BOOTTIME_ALARM);
+
+   if (ret)
+   ksft_exit_fail();
+   ksft_exit_pass();
+   return ret;
+}
+
-- 
2.22.0

Re: [PATCH v3 1/2] KVM: LAPIC: Optimize timer latency consider world switch time

2019-06-12 Thread Radim Krčmář

2019-06-12 21:22+0200, Radim Krčmář:
> 2019-06-12 08:14-0700, Sean Christopherson:
> > On Wed, Jun 12, 2019 at 05:40:18PM +0800, Wanpeng Li wrote:
> > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > > @@ -145,6 +145,12 @@ module_param(tsc_tolerance_ppm, uint, S_IRUGO | 
> > > S_IWUSR);
> > >  static int __read_mostly lapic_timer_advance_ns = -1;
> > >  module_param(lapic_timer_advance_ns, int, S_IRUGO | S_IWUSR);
> > >  
> > > +/*
> > > + * lapic timer vmentry advance (tscdeadline mode only) in nanoseconds.
> > > + */
> > > +u32 __read_mostly vmentry_advance_ns = 300;
> > 
> > Enabling this by default makes me nervous, e.g. nothing guarantees that
> > future versions of KVM and/or CPUs will continue to have 300ns of overhead
> > between wait_lapic_expire() and VM-Enter.
> > 
> > If we want it enabled by default so that it gets tested, the default
> > value should be extremely conservative, e.g. set the default to a small
> > percentage (25%?) of the latency of VM-Enter itself on modern CPUs,
> > VM-Enter latency being the min between VMLAUNCH and VMLOAD+VMRUN+VMSAVE.
> 
> I share the sentiment.  We definitely must not enter the guest before
> the deadline has expired and CPUs are approaching 5 GHz (in turbo), so
> 300 ns would be too much even today.
> 
> I wrote a simple testcase for rough timing and there are 267 cycles
> (111 ns @ 2.4 GHz) between doing rdtsc() right after
> kvm_wait_lapic_expire() [1] and doing rdtsc() in the guest as soon as
> possible (see the attached kvm-unit-test).

I forgot to attach it, pasting here as a patch for kvm-unit-tests.

---
diff --git a/x86/Makefile.common b/x86/Makefile.common
index e612dbe..ceed648 100644
--- a/x86/Makefile.common
+++ b/x86/Makefile.common
@@ -58,7 +58,7 @@ tests-common = $(TEST_DIR)/vmexit.flat $(TEST_DIR)/tsc.flat \
$(TEST_DIR)/init.flat $(TEST_DIR)/smap.flat \
$(TEST_DIR)/hyperv_synic.flat $(TEST_DIR)/hyperv_stimer.flat \
$(TEST_DIR)/hyperv_connections.flat \
-   $(TEST_DIR)/umip.flat
+   $(TEST_DIR)/umip.flat $(TEST_DIR)/vmentry_latency.flat
 
 ifdef API
 tests-api = api/api-sample api/dirty-log api/dirty-log-perf
diff --git a/x86/vmentry_latency.c b/x86/vmentry_latency.c
new file mode 100644
index 000..3859f09
--- /dev/null
+++ b/x86/vmentry_latency.c
@@ -0,0 +1,45 @@
+#include "x86/vm.h"
+
+static u64 get_last_hypervisor_tsc_delta(void)
+{
+   u64 a = 0, b, c, d;
+   u64 tsc;
+
+   /*
+* The first vmcall is there to force a vm exit just before measuring.
+*/
+   asm volatile ("vmcall" : "+a"(a), "=b"(b), "=c"(c), "=d"(d));
+
+   tsc = rdtsc();
+
+   /*
+* The second hypercall recovers the value that was stored when vm
+* entering to execute the rdtsc()
+*/
+   a = 11;
+   asm volatile ("vmcall" : "+a"(a), "=b"(b), "=c"(c), "=d"(d));
+
+   return tsc - a;
+}
+
+static void vmentry_latency(void)
+{
+   unsigned i = 100;
+   u64 min = -1;
+
+   while (i--) {
+   u64 latency = get_last_hypervisor_tsc_delta();
+   if (latency < min)
+   min = latency;
+   }
+
+   printf("vm entry latency is %"PRIu64" TSC cycles\n", min);
+}
+
+int main(void)
+{
+   setup_vm();
+   vmentry_latency();
+
+   return 0;
+}
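
For reference, the cycles-to-nanoseconds arithmetic behind the figures
quoted above (267 cycles at 2.4 GHz, 300 ns at 5 GHz) can be sketched as
below. This assumes a constant-rate TSC whose frequency is known in kHz;
the function names are illustrative and are not KVM or kvm-unit-tests
helpers. The intermediate product can overflow for very large inputs, which
does not matter at these magnitudes.

```c
#include <stdint.h>

/* Convert TSC cycles to nanoseconds for a TSC running at tsc_khz. */
uint64_t tsc_cycles_to_ns(uint64_t cycles, uint64_t tsc_khz)
{
	return cycles * 1000000ULL / tsc_khz;
}

/* Convert nanoseconds to TSC cycles for a TSC running at tsc_khz. */
uint64_t ns_to_tsc_cycles(uint64_t ns, uint64_t tsc_khz)
{
	return ns * tsc_khz / 1000000ULL;
}
```

With tsc_khz = 2400000, 267 cycles come out as 111 ns; at a hypothetical
5 GHz, an advance of 300 ns corresponds to 1500 cycles.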

[PATCHv4 26/28] x86/vdso: Align VDSO functions by CPU L1 cache line

2019-06-12 Thread Dmitry Safonov

From: Andrei Vagin 

After performance testing the VDSO patches, a noticeable 20% regression
was found on the gettime_perf selftest with a cold cache.
As it turns out, before the introduction of time namespaces the VDSO
functions happened to be aligned to cache lines, but adding new code to
adjust for the timens offset inside a namespace created a small shift,
and the vdso functions became unaligned with respect to cache lines.

Align the vdso functions with a gcc option to fix the performance drop.

Copying the resulting numbers from the cover letter:

Hot CPU cache (more gettime_perf.c cycles is better):
         | before    | CONFIG_TIME_NS=n | host      | inside timens
---------|-----------|------------------|-----------|--------------
cycles   | 139887013 | 139453003        | 139899785 | 128792458
diff (%) | 100       | 99.7             | 100       | 92

Cold cache (fewer TSC ticks per gettime_perf_cold.c cycle is better):
         | before | CONFIG_TIME_NS=n | host  | inside timens
---------|--------|------------------|-------|--------------
tsc      | 6748   | 6718             | 6862  | 12682
diff (%) | 100    | 99.6             | 101.7 | 188

Measured on Intel(R) Core(TM) i5-6300U CPU @ 2.40GHz
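
As a rough illustration of the alignment arithmetic involved: a cache line
shift of 6 corresponds to 64-byte lines, and a function that starts on a
line boundary touches fewer lines than one that straddles a boundary. The
helpers below are a sketch with made-up names. Note that, as far as I know,
GCC's -falign-functions=n takes a byte count rather than a shift, so it may
be worth double-checking which of the two the Makefile change below is
meant to pass.

```c
#include <stdint.h>

/* Cache line size in bytes for a given shift, e.g. shift 6 -> 64 bytes. */
uint64_t cache_line_bytes(unsigned int shift)
{
	return (uint64_t)1 << shift;
}

/* Round addr up to the next multiple of align (a power of two). */
uint64_t align_up(uint64_t addr, uint64_t align)
{
	return (addr + align - 1) & ~(align - 1);
}

/* Non-zero if [addr, addr + len) fits inside a single cache line. */
int fits_in_one_line(uint64_t addr, uint64_t len, unsigned int shift)
{
	return len != 0 && (addr >> shift) == ((addr + len - 1) >> shift);
}
```

For example, a 32-byte function at 0x1030 spans two 64-byte lines, while
the same function moved up to align_up(0x1030, 64) = 0x1040 fits in one.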

Co-developed-by: Dmitry Safonov 
Signed-off-by: Andrei Vagin 
Signed-off-by: Dmitry Safonov 
---
 arch/x86/entry/vdso/Makefile | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/x86/entry/vdso/Makefile b/arch/x86/entry/vdso/Makefile
index b58d34120fd8..c7bfd62d1fc3 100644
--- a/arch/x86/entry/vdso/Makefile
+++ b/arch/x86/entry/vdso/Makefile
@@ -4,6 +4,7 @@
 #
 
 KBUILD_CFLAGS += $(DISABLE_LTO)
+KBUILD_CFLAGS += -falign-functions=$(CONFIG_X86_L1_CACHE_SHIFT)
 KASAN_SANITIZE := n
 UBSAN_SANITIZE := n
 OBJECT_FILES_NON_STANDARD  := y
-- 
2.22.0

[PATCHv4 25/28] selftest/timens: Add timer offsets test

2019-06-12 Thread Dmitry Safonov

From: Andrei Vagin 

Check that timer_create() takes into account clock offsets.

Signed-off-by: Andrei Vagin 
Co-developed-by: Dmitry Safonov 
Signed-off-by: Dmitry Safonov 
---
 tools/testing/selftests/timens/.gitignore |   1 +
 tools/testing/selftests/timens/Makefile   |   3 +-
 tools/testing/selftests/timens/timer.c| 116 ++
 3 files changed, 119 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/timens/timer.c

diff --git a/tools/testing/selftests/timens/.gitignore 
b/tools/testing/selftests/timens/.gitignore
index 94ffdd9cead7..3b7eda8f35ce 100644
--- a/tools/testing/selftests/timens/.gitignore
+++ b/tools/testing/selftests/timens/.gitignore
@@ -1,4 +1,5 @@
 clock_nanosleep
 procfs
 timens
+timer
 timerfd
diff --git a/tools/testing/selftests/timens/Makefile 
b/tools/testing/selftests/timens/Makefile
index f96f50d1fef8..ae1ffd24cc43 100644
--- a/tools/testing/selftests/timens/Makefile
+++ b/tools/testing/selftests/timens/Makefile
@@ -1,5 +1,6 @@
-TEST_GEN_PROGS := timens timerfd clock_nanosleep procfs
+TEST_GEN_PROGS := timens timerfd timer clock_nanosleep procfs
 
 CFLAGS := -Wall -Werror
+LDFLAGS := -lrt
 
 include ../lib.mk
diff --git a/tools/testing/selftests/timens/timer.c 
b/tools/testing/selftests/timens/timer.c
new file mode 100644
index ..6e33cd54d397
--- /dev/null
+++ b/tools/testing/selftests/timens/timer.c
@@ -0,0 +1,116 @@
+// SPDX-License-Identifier: GPL-2.0
+#define _GNU_SOURCE
+#include 
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "log.h"
+#include "timens.h"
+
+int run_test(int clockid, struct timespec now)
+{
+   struct itimerspec new_value;
+   long long elapsed;
+   timer_t fd;
+   int i;
+
+   for (i = 0; i < 2; i++) {
+   struct sigevent sevp = {.sigev_notify = SIGEV_NONE};
+   int flags = 0;
+
+   new_value.it_value.tv_sec = 3600;
+   new_value.it_value.tv_nsec = 0;
+   new_value.it_interval.tv_sec = 1;
+   new_value.it_interval.tv_nsec = 0;
+
+   if (i == 1) {
+   new_value.it_value.tv_sec += now.tv_sec;
+   new_value.it_value.tv_nsec += now.tv_nsec;
+   }
+
+   if (timer_create(clockid, &sevp, &fd) == -1)
+   return pr_perror("timerfd_create");
+
+   if (i == 1)
+   flags |= TIMER_ABSTIME;
+   if (timer_settime(fd, flags, &new_value, NULL) == -1)
+   return pr_perror("timerfd_settime");
+
+   if (timer_gettime(fd, &new_value) == -1)
+   return pr_perror("timerfd_gettime");
+
+   elapsed = new_value.it_value.tv_sec;
+   if (abs(elapsed - 3600) > 60) {
+   ksft_test_result_fail("clockid: %d elapsed: %lld\n",
+ clockid, elapsed);
+   return 1;
+   }
+   }
+
+   ksft_test_result_pass("clockid=%d\n", clockid);
+
+   return 0;
+}
+
+int main(int argc, char *argv[])
+{
+   int ret, status, len, fd;
+   char buf[4096];
+   pid_t pid;
+   struct timespec btime_now, mtime_now;
+
+   nscheck();
+
+   clock_gettime(CLOCK_MONOTONIC, &mtime_now);
+   clock_gettime(CLOCK_BOOTTIME, &btime_now);
+
+   if (unshare(CLONE_NEWTIME))
+   return pr_perror("unshare");
+
+   len = snprintf(buf, sizeof(buf), "%d %d 0\n%d %d 0",
+   CLOCK_MONOTONIC, 70 * 24 * 3600,
+   CLOCK_BOOTTIME, 9 * 24 * 3600);
+   fd = open("/proc/self/timens_offsets", O_WRONLY);
+   if (fd < 0)
+   return pr_perror("/proc/self/timens_offsets");
+
+   if (write(fd, buf, len) != len)
+   return pr_perror("/proc/self/timens_offsets");
+
+   close(fd);
+   mtime_now.tv_sec += 70 * 24 * 3600;
+   btime_now.tv_sec += 9 * 24 * 3600;
+
+   pid = fork();
+   if (pid < 0)
+   return pr_perror("Unable to fork");
+   if (pid == 0) {
+   ret = 0;
+   ret |= run_test(CLOCK_BOOTTIME, btime_now);
+   ret |= run_test(CLOCK_MONOTONIC, mtime_now);
+   ret |= run_test(CLOCK_BOOTTIME_ALARM, btime_now);
+
+   if (ret)
+   ksft_exit_fail();
+   ksft_exit_pass();
+   return ret;
+   }
+
+   if (waitpid(pid, &status, 0) != pid)
+   return pr_perror("Unable to wait the child process");
+
+   if (WIFEXITED(status))
+   return WEXITSTATUS(status);
+
+   return 1;
+}
+
-- 
2.22.0

[PATCHv4 24/28] selftest/timens: Add procfs selftest

2019-06-12 Thread Dmitry Safonov

Check that /proc/uptime is correct inside a new time namespace.
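
For reference, the first field of /proc/uptime has the form
"seconds.hundredths", so the fractional part has to be scaled before it can
be stored as nanoseconds. A minimal sketch of such a conversion is shown
below; the helper name is hypothetical and it parses a string buffer rather
than the file itself.

```c
#include <stdio.h>
#include <time.h>

/* Parse the first field of /proc/uptime ("seconds.hundredths") into *ts. */
int parse_uptime(const char *buf, struct timespec *ts)
{
	unsigned long sec, centisec;

	if (sscanf(buf, "%lu.%2lu", &sec, &centisec) != 2)
		return -1;

	ts->tv_sec = (time_t)sec;
	ts->tv_nsec = (long)centisec * 10000000L;	/* 1/100 s = 10,000,000 ns */
	return 0;
}
```

The test itself only compares whole seconds, so this level of precision is
not strictly needed there; the sketch just makes the units explicit.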

Co-developed-by: Andrei Vagin 
Signed-off-by: Andrei Vagin 
Signed-off-by: Dmitry Safonov 
---
 tools/testing/selftests/timens/.gitignore |   1 +
 tools/testing/selftests/timens/Makefile   |   2 +-
 tools/testing/selftests/timens/procfs.c   | 142 ++
 3 files changed, 144 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/timens/procfs.c

diff --git a/tools/testing/selftests/timens/.gitignore 
b/tools/testing/selftests/timens/.gitignore
index 9b6c8ddac2c8..94ffdd9cead7 100644
--- a/tools/testing/selftests/timens/.gitignore
+++ b/tools/testing/selftests/timens/.gitignore
@@ -1,3 +1,4 @@
 clock_nanosleep
+procfs
 timens
 timerfd
diff --git a/tools/testing/selftests/timens/Makefile 
b/tools/testing/selftests/timens/Makefile
index 76a1dc891184..f96f50d1fef8 100644
--- a/tools/testing/selftests/timens/Makefile
+++ b/tools/testing/selftests/timens/Makefile
@@ -1,4 +1,4 @@
-TEST_GEN_PROGS := timens timerfd clock_nanosleep
+TEST_GEN_PROGS := timens timerfd clock_nanosleep procfs
 
 CFLAGS := -Wall -Werror
 
diff --git a/tools/testing/selftests/timens/procfs.c 
b/tools/testing/selftests/timens/procfs.c
new file mode 100644
index ..89a24c134510
--- /dev/null
+++ b/tools/testing/selftests/timens/procfs.c
@@ -0,0 +1,142 @@
+// SPDX-License-Identifier: GPL-2.0
+#define _GNU_SOURCE
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "log.h"
+#include "timens.h"
+
+/*
+ * Test shouldn't be run for a day, so add 10 days to child
+ * time and check parent's time to be in the same day.
+ */
+#define MAX_TEST_TIME_SEC  (60*5)
+#define DAY_IN_SEC (60*60*24)
+#define TEN_DAYS_IN_SEC (10*DAY_IN_SEC)
+
+#define ARRAY_SIZE(arr) (sizeof(arr) / sizeof((arr)[0]))
+
+static int child_ns, parent_ns;
+
+static int switch_ns(int fd)
+{
+   if (setns(fd, CLONE_NEWTIME))
+   return pr_perror("setns()");
+
+   return 0;
+}
+
+static int init_namespaces(void)
+{
+   char path[] = "/proc/self/ns/time_for_children";
+   struct stat st1, st2;
+
+   parent_ns = open(path, O_RDONLY);
+   if (parent_ns <= 0)
+   return pr_perror("Unable to open %s", path);
+
+   if (fstat(parent_ns, &st1))
+   return pr_perror("Unable to stat the parent timens");
+
+   if (unshare(CLONE_NEWTIME))
+   return pr_perror("Can't unshare() timens");
+
+   child_ns = open(path, O_RDONLY);
+   if (child_ns <= 0)
+   return pr_perror("Unable to open %s", path);
+
+   if (fstat(child_ns, &st2))
+   return pr_perror("Unable to stat the timens");
+
+   if (st1.st_ino == st2.st_ino)
+   return pr_err("The same child_ns after CLONE_NEWTIME");
+
+   if (_settime(CLOCK_BOOTTIME, TEN_DAYS_IN_SEC))
+   return -1;
+
+   return 0;
+}
+
+static int read_proc_uptime(struct timespec *uptime)
+{
+   unsigned long up_sec, up_nsec;
+   FILE *proc;
+
+   proc = fopen("/proc/uptime", "r");
+   if (proc == NULL) {
+   pr_perror("Unable to open /proc/uptime");
+   return -1;
+   }
+
+   if (fscanf(proc, "%lu.%02lu", &up_sec, &up_nsec) != 2) {
+   if (errno) {
+   pr_perror("fscanf");
+   return -errno;
+   }
+   pr_err("failed to parse /proc/uptime");
+   return -1;
+   }
+   fclose(proc);
+
+   uptime->tv_sec = up_sec;
+   uptime->tv_nsec = up_nsec;
+   return 0;
+}
+
+static int check_uptime(void)
+{
+   struct timespec uptime_new, uptime_old;
+   time_t uptime_expected;
+   double prec = MAX_TEST_TIME_SEC;
+
+   if (switch_ns(parent_ns))
+   return pr_err("switch_ns(%d)", parent_ns);
+
+   if (read_proc_uptime(&uptime_old))
+   return 1;
+
+   if (switch_ns(child_ns))
+   return pr_err("switch_ns(%d)", child_ns);
+
+   if (read_proc_uptime(&uptime_new))
+   return 1;
+
+   uptime_expected = uptime_old.tv_sec + TEN_DAYS_IN_SEC;
+   if (fabs(difftime(uptime_new.tv_sec, uptime_expected)) > prec) {
+   pr_fail("uptime in /proc/uptime: old %ld, new %ld [%ld]",
+   uptime_old.tv_sec, uptime_new.tv_sec,
+   uptime_old.tv_sec + TEN_DAYS_IN_SEC);
+   return 1;
+   }
+
+   ksft_test_result_pass("Passed for /proc/uptime\n");
+   return 0;
+}
+
+int main(int argc, char *argv[])
+{
+   int ret = 0;
+
+   nscheck();
+
+   if (init_namespaces())
+   return 1;
+
+   ret |= check_uptime();
+
+   if (ret)
+   ksft_exit_fail();
+   ksft_exit_pass();
+   return ret;
+}
-- 
2.22.0

[PATCHv4 27/28] selftests: Add a simple perf test for clock_gettime()

2019-06-12 Thread Dmitry Safonov

From: Andrei Vagin 

Signed-off-by: Andrei Vagin 
Co-developed-by: Dmitry Safonov 
Signed-off-by: Dmitry Safonov 
---
 tools/testing/selftests/timens/.gitignore |  2 +
 tools/testing/selftests/timens/Makefile   |  8 +-
 tools/testing/selftests/timens/gettime_perf.c | 74 +++
 .../selftests/timens/gettime_perf_cold.c  | 63 
 4 files changed, 146 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/timens/gettime_perf.c
 create mode 100644 tools/testing/selftests/timens/gettime_perf_cold.c

diff --git a/tools/testing/selftests/timens/.gitignore 
b/tools/testing/selftests/timens/.gitignore
index 3b7eda8f35ce..16292e4d08a5 100644
--- a/tools/testing/selftests/timens/.gitignore
+++ b/tools/testing/selftests/timens/.gitignore
@@ -1,4 +1,6 @@
 clock_nanosleep
+gettime_perf
+gettime_perf_cold
 procfs
 timens
 timer
diff --git a/tools/testing/selftests/timens/Makefile 
b/tools/testing/selftests/timens/Makefile
index ae1ffd24cc43..ef65bf96b55c 100644
--- a/tools/testing/selftests/timens/Makefile
+++ b/tools/testing/selftests/timens/Makefile
@@ -1,4 +1,10 @@
-TEST_GEN_PROGS := timens timerfd timer clock_nanosleep procfs
+TEST_GEN_PROGS := timens timerfd timer clock_nanosleep procfs gettime_perf
+
+uname_M := $(shell uname -m 2>/dev/null || echo not)
+ARCH ?= $(shell echo $(uname_M) | sed -e s/i.86/i386/)
+ifeq ($(ARCH),x86_64)
+TEST_GEN_PROGS += gettime_perf_cold
+endif
 
 CFLAGS := -Wall -Werror
 LDFLAGS := -lrt
diff --git a/tools/testing/selftests/timens/gettime_perf.c 
b/tools/testing/selftests/timens/gettime_perf.c
new file mode 100644
index ..510d77a941d9
--- /dev/null
+++ b/tools/testing/selftests/timens/gettime_perf.c
@@ -0,0 +1,74 @@
+// SPDX-License-Identifier: GPL-2.0
+#define _GNU_SOURCE
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "log.h"
+#include "timens.h"
+
+//#define TEST_SYSCALL
+
+static void test(clock_t clockid, char *clockstr, bool in_ns)
+{
+   struct timespec tp, start;
+   long i = 0;
+   const int timeout = 3;
+
+#ifndef TEST_SYSCALL
+   clock_gettime(clockid, &start);
+#else
+   syscall(__NR_clock_gettime, clockid, &start);
+#endif
+   tp = start;
+   for (tp = start; start.tv_sec + timeout > tp.tv_sec ||
+(start.tv_sec + timeout == tp.tv_sec &&
+ start.tv_nsec > tp.tv_nsec); i++) {
+#ifndef TEST_SYSCALL
+   clock_gettime(clockid, &tp);
+#else
+   syscall(__NR_clock_gettime, clockid, &tp);
+#endif
+   }
+
+   ksft_test_result_pass("%s:\tclock: %10s\tcycles:\t%10ld\n",
+ in_ns ? "ns" : "host", clockstr, i);
+}
+
+int main(int argc, char *argv[])
+{
+   time_t offset = 10;
+   int nsfd;
+
+   test(CLOCK_MONOTONIC, "monotonic", false);
+   test(CLOCK_BOOTTIME, "boottime", false);
+
+   nscheck();
+
+   if (unshare(CLONE_NEWTIME))
+   return pr_perror("Can't unshare() timens");
+
+   nsfd = open("/proc/self/ns/time_for_children", O_RDONLY);
+   if (nsfd < 0)
+   return pr_perror("Can't open a time namespace");
+
+   if (_settime(CLOCK_MONOTONIC, offset))
+   return 1;
+   if (_settime(CLOCK_BOOTTIME, offset))
+   return 1;
+
+   if (setns(nsfd, CLONE_NEWTIME))
+   return pr_perror("setns");
+
+   test(CLOCK_MONOTONIC, "monotonic", true);
+   test(CLOCK_BOOTTIME, "boottime", true);
+
+   ksft_exit_pass();
+   return 0;
+}
diff --git a/tools/testing/selftests/timens/gettime_perf_cold.c 
b/tools/testing/selftests/timens/gettime_perf_cold.c
new file mode 100644
index ..f72db8a4c903
--- /dev/null
+++ b/tools/testing/selftests/timens/gettime_perf_cold.c
@@ -0,0 +1,63 @@
+// SPDX-License-Identifier: GPL-2.0
+#define _GNU_SOURCE
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "log.h"
+#include "timens.h"
+
+static __inline__ unsigned long long rdtsc(void)
+{
+   unsigned hi, lo;
+
+   __asm__ __volatile__ ("rdtsc" : "=a"(lo), "=d"(hi));
+   return ((unsigned long long) lo) | (((unsigned long long)hi) << 32);
+}
+
+static void test(clock_t clockid, char *clockstr)
+{
+   struct timespec tp;
+   long long s, e;
+
+   s = rdtsc();
+   clock_gettime(clockid, &tp);
+   e = rdtsc();
+   printf("%lld\n", e - s);
+   return;
+}
+
+int main(int argc, char **argv)
+{
+   time_t offset = 10;
+   int nsfd;
+
+   if (argc == 1) {
+   test(CLOCK_MONOTONIC, "monotonic");
+   return 0;
+   }
+   nscheck();
+
+   if (unshare(CLONE_NEWTIME))
+   return pr_perror("Can't unshare() timens");
+
+   nsfd = open("/proc/self/ns/time_for_children", O_RDONLY);
+   if (nsfd < 0)
+   return pr_perror("Can't open a time namespace");
+
+   if

[PATCHv4 28/28] selftest/timens: Check that a right vdso is mapped after fork and exec

2019-06-12 Thread Dmitry Safonov

From: Andrei Vagin 

Signed-off-by: Andrei Vagin 
Co-developed-by: Dmitry Safonov 
Signed-off-by: Dmitry Safonov 
---
 tools/testing/selftests/timens/.gitignore |  1 +
 tools/testing/selftests/timens/Makefile   |  2 +-
 tools/testing/selftests/timens/exec.c | 91 +++
 3 files changed, 93 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/timens/exec.c

diff --git a/tools/testing/selftests/timens/.gitignore 
b/tools/testing/selftests/timens/.gitignore
index 16292e4d08a5..789f21e81028 100644
--- a/tools/testing/selftests/timens/.gitignore
+++ b/tools/testing/selftests/timens/.gitignore
@@ -1,4 +1,5 @@
 clock_nanosleep
+exec
 gettime_perf
 gettime_perf_cold
 procfs
diff --git a/tools/testing/selftests/timens/Makefile 
b/tools/testing/selftests/timens/Makefile
index ef65bf96b55c..9e0edf354906 100644
--- a/tools/testing/selftests/timens/Makefile
+++ b/tools/testing/selftests/timens/Makefile
@@ -1,4 +1,4 @@
-TEST_GEN_PROGS := timens timerfd timer clock_nanosleep procfs gettime_perf
+TEST_GEN_PROGS := timens timerfd timer clock_nanosleep procfs gettime_perf exec
 
 uname_M := $(shell uname -m 2>/dev/null || echo not)
 ARCH ?= $(shell echo $(uname_M) | sed -e s/i.86/i386/)
diff --git a/tools/testing/selftests/timens/exec.c 
b/tools/testing/selftests/timens/exec.c
new file mode 100644
index ..b3a05c41e202
--- /dev/null
+++ b/tools/testing/selftests/timens/exec.c
@@ -0,0 +1,91 @@
+// SPDX-License-Identifier: GPL-2.0
+#define _GNU_SOURCE
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "log.h"
+#include "timens.h"
+
+#define OFFSET (36000)
+
+int main(int argc, char *argv[])
+{
+   struct timespec now, tst;
+   int status, i;
+   pid_t pid;
+
+   if (argc > 1) {
+   if (sscanf(argv[1], "%ld", &now.tv_sec) != 1)
+   return pr_perror("sscanf");
+
+   for (i = 0; i < 2; i++) {
+   _gettime(CLOCK_MONOTONIC, &tst, i);
+   if (abs(tst.tv_sec - now.tv_sec) > 5)
+   return pr_fail("%ld %ld\n", now.tv_sec, 
tst.tv_sec);
+   }
+   }
+
+   nscheck();
+
+   clock_gettime(CLOCK_MONOTONIC, &now);
+
+   if (unshare(CLONE_NEWTIME))
+   return pr_perror("Can't unshare() timens");
+
+   if (_settime(CLOCK_MONOTONIC, OFFSET))
+   return 1;
+
+   for (i = 0; i < 2; i++) {
+   _gettime(CLOCK_MONOTONIC, &tst, i);
+   if (abs(tst.tv_sec - now.tv_sec) > 5)
+   return pr_fail("%ld %ld\n",
+   now.tv_sec, tst.tv_sec);
+   }
+
+   if (argc > 1)
+   return 0;
+
+   pid = fork();
+   if (pid < 0)
+   return pr_perror("fork");
+
+   if (pid == 0) {
+   char now_str[64];
+   char *cargv[] = {"exec", now_str, NULL};
+   char *cenv[] = {NULL};
+
+   /* Check that a child process is in the new timens. */
+   for (i = 0; i < 2; i++) {
+   _gettime(CLOCK_MONOTONIC, &tst, i);
+   if (abs(tst.tv_sec - now.tv_sec - OFFSET) > 5)
+   return pr_fail("%ld %ld\n",
+   now.tv_sec + OFFSET, 
tst.tv_sec);
+   }
+
+   /* Check that a proper vdso will be mapped after execve. */
+   snprintf(now_str, sizeof(now_str), "%ld", now.tv_sec + OFFSET);
+   execve("/proc/self/exe", cargv, cenv);
+   return pr_perror("execve");
+   }
+
+   if (waitpid(pid, &status, 0) != pid)
+   return pr_perror("waitpid");
+
+   if (status)
+   ksft_exit_fail();
+
+   ksft_test_result_pass("exec\n");
+   ksft_exit_pass();
+   return 0;
+}
-- 
2.22.0

[PATCHv4 18/28] vdso: introduce timens_static_branch

2019-06-12 Thread Dmitry Safonov

From: Andrei Vagin 

As discussed on the timens RFC, adding a new conditional branch
`if (inside_time_ns)` to the VDSO for all processes is undesirable.

To address this, there are two versions of the VDSO's .so:
one for host tasks (without any penalty) and one for processes inside a
time namespace, with clk_to_ns() that subtracts offsets from host's time.

This patch introduces timens_static_branch(), which is similar to
static_branch_unlikely.

The timens code in vdso looks like this:

   if (timens_static_branch()) {
   clk_to_ns(clk, ts);
   }

The version of vdso which is compiled from sources will never execute
clk_to_ns(). And then we can patch the 'no-op' in the straight-line
codepath with a 'jump' instruction to the out-of-line true branch and
get the timens version of the vdso library.

While cooking the patch, an alternative approach was considered:
omit the no-ops and memcpy() the following asm ret sequence in place of
a function call: https://github.com/0x7f454c46/linux/commit/4cc0180f6d65
Bearing in mind possible issues with different toolchains, the usual
static_branch() approach was chosen.

Signed-off-by: Andrei Vagin 
Co-developed-by: Dmitry Safonov 
Signed-off-by: Dmitry Safonov 
---
 arch/x86/entry/vdso/vclock_gettime.c  |  9 +--
 arch/x86/entry/vdso/vdso-layout.lds.S |  1 +
 arch/x86/entry/vdso/vdso2c.h  | 11 +++-
 arch/x86/entry/vdso/vma.c | 37 ++-
 arch/x86/include/asm/jump_label.h | 14 ++
 arch/x86/include/asm/vdso.h   |  1 +
 include/linux/jump_label.h|  5 
 7 files changed, 69 insertions(+), 9 deletions(-)

diff --git a/arch/x86/entry/vdso/vclock_gettime.c 
b/arch/x86/entry/vdso/vclock_gettime.c
index e2d93628c0dd..21b7153cf2b0 100644
--- a/arch/x86/entry/vdso/vclock_gettime.c
+++ b/arch/x86/entry/vdso/vclock_gettime.c
@@ -18,6 +18,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -174,8 +175,10 @@ notrace static __always_inline void clk_to_ns(clockid_t 
clk, struct timespec *ts
ts->tv_sec--;
}
 }
+#define _timens_static_branch_unlikely timens_static_branch_unlikely
 #else
 notrace static __always_inline void clk_to_ns(clockid_t clk, struct timespec 
*ts) {}
+notrace static __always_inline bool _timens_static_branch_unlikely(void) { 
return false; }
 #endif
 
 notrace static int do_hres(clockid_t clk, struct timespec *ts)
@@ -204,7 +207,8 @@ notrace static int do_hres(clockid_t clk, struct timespec 
*ts)
ts->tv_sec = sec + __iter_div_u64_rem(ns, NSEC_PER_SEC, &ns);
ts->tv_nsec = ns;
 
-   clk_to_ns(clk, ts);
+   if (_timens_static_branch_unlikely())
+   clk_to_ns(clk, ts);
 
return 0;
 }
@@ -220,7 +224,8 @@ notrace static void do_coarse(clockid_t clk, struct 
timespec *ts)
ts->tv_nsec = base->nsec;
} while (unlikely(gtod_read_retry(gtod, seq)));
 
-   clk_to_ns(clk, ts);
+   if (_timens_static_branch_unlikely())
+   clk_to_ns(clk, ts);
 }
 
 notrace int __vdso_clock_gettime(clockid_t clock, struct timespec *ts)
diff --git a/arch/x86/entry/vdso/vdso-layout.lds.S 
b/arch/x86/entry/vdso/vdso-layout.lds.S
index ba216527e59f..69dbe4821aa5 100644
--- a/arch/x86/entry/vdso/vdso-layout.lds.S
+++ b/arch/x86/entry/vdso/vdso-layout.lds.S
@@ -45,6 +45,7 @@ SECTIONS
.gnu.version: { *(.gnu.version) }
.gnu.version_d  : { *(.gnu.version_d) }
.gnu.version_r  : { *(.gnu.version_r) }
+   __jump_table: { *(__jump_table) }   :text
 
.dynamic: { *(.dynamic) }   :text   :dynamic
 
diff --git a/arch/x86/entry/vdso/vdso2c.h b/arch/x86/entry/vdso/vdso2c.h
index 885b988aea19..392031258315 100644
--- a/arch/x86/entry/vdso/vdso2c.h
+++ b/arch/x86/entry/vdso/vdso2c.h
@@ -16,7 +16,7 @@ static void BITSFUNC(go)(void *raw_addr, size_t raw_len,
unsigned int i, syms_nr;
unsigned long j;
ELF(Shdr) *symtab_hdr = NULL, *strtab_hdr, *secstrings_hdr,
-   *alt_sec = NULL;
+   *alt_sec = NULL, *jump_table_sec = NULL;
ELF(Dyn) *dyn = 0, *dyn_end = 0;
const char *secstrings;
INT_BITS syms[NSYMS] = {};
@@ -78,6 +78,9 @@ static void BITSFUNC(go)(void *raw_addr, size_t raw_len,
if (!strcmp(secstrings + GET_LE(&sh->sh_name),
".altinstructions"))
alt_sec = sh;
+   if (!strcmp(secstrings + GET_LE(&sh->sh_name),
+   "__jump_table"))
+   jump_table_sec  = sh;
}
 
if (!symtab_hdr)
@@ -166,6 +169,12 @@ static void BITSFUNC(go)(void *raw_addr, size_t raw_len,
fprintf(outfile, "\t.alt_len = %lu,\n",
(unsigned long)GET_LE(&alt_sec->sh_size));
}
+   if (jump_table_sec) {
+   fprintf(outfile, "\t.jump_table = %lu,\n",
+   (unsigned

[PATCHv4 22/28] selftest/timens: Add a test for timerfd

2019-06-12 Thread Dmitry Safonov

From: Andrei Vagin 

Check that timerfd_create() takes into account clock offsets.

Signed-off-by: Andrei Vagin 
Co-developed-by: Dmitry Safonov 
Signed-off-by: Dmitry Safonov 
---
 tools/testing/selftests/timens/.gitignore |   1 +
 tools/testing/selftests/timens/Makefile   |   2 +-
 tools/testing/selftests/timens/timerfd.c  | 127 ++
 3 files changed, 129 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/timens/timerfd.c

diff --git a/tools/testing/selftests/timens/.gitignore 
b/tools/testing/selftests/timens/.gitignore
index 27a693229ce1..b609f6ee9fb9 100644
--- a/tools/testing/selftests/timens/.gitignore
+++ b/tools/testing/selftests/timens/.gitignore
@@ -1 +1,2 @@
 timens
+timerfd
diff --git a/tools/testing/selftests/timens/Makefile 
b/tools/testing/selftests/timens/Makefile
index b877efb78974..66b90cd28e5c 100644
--- a/tools/testing/selftests/timens/Makefile
+++ b/tools/testing/selftests/timens/Makefile
@@ -1,4 +1,4 @@
-TEST_GEN_PROGS := timens
+TEST_GEN_PROGS := timens timerfd
 
 CFLAGS := -Wall -Werror
 
diff --git a/tools/testing/selftests/timens/timerfd.c 
b/tools/testing/selftests/timens/timerfd.c
new file mode 100644
index ..c9816db4fe79
--- /dev/null
+++ b/tools/testing/selftests/timens/timerfd.c
@@ -0,0 +1,127 @@
+// SPDX-License-Identifier: GPL-2.0
+#define _GNU_SOURCE
+#include 
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "log.h"
+#include "timens.h"
+
+static int tclock_gettime(clock_t clockid, struct timespec *now)
+{
+   if (clockid == CLOCK_BOOTTIME_ALARM)
+   clockid = CLOCK_BOOTTIME;
+   return clock_gettime(clockid, now);
+}
+
+int run_test(int clockid, struct timespec now)
+{
+   struct itimerspec new_value;
+   long long elapsed;
+   int fd, i;
+
+   if (tclock_gettime(clockid, &now))
+   return pr_perror("clock_gettime");
+
+   for (i = 0; i < 2; i++) {
+   int flags = 0;
+
+   new_value.it_value.tv_sec = 3600;
+   new_value.it_value.tv_nsec = 0;
+   new_value.it_interval.tv_sec = 1;
+   new_value.it_interval.tv_nsec = 0;
+
+   if (i == 1) {
+   new_value.it_value.tv_sec += now.tv_sec;
+   new_value.it_value.tv_nsec += now.tv_nsec;
+   }
+
+   fd = timerfd_create(clockid, 0);
+   if (fd == -1)
+   return pr_perror("timerfd_create");
+
+   if (i == 1)
+   flags |= TFD_TIMER_ABSTIME;
+
+   if (timerfd_settime(fd, flags, &new_value, NULL))
+   return pr_perror("timerfd_settime");
+
+   if (timerfd_gettime(fd, &new_value))
+   return pr_perror("timerfd_gettime");
+
+   elapsed = new_value.it_value.tv_sec;
+   if (abs(elapsed - 3600) > 60) {
+   ksft_test_result_fail("clockid: %d elapsed: %lld\n",
+ clockid, elapsed);
+   return 1;
+   }
+
+   close(fd);
+   }
+
+   ksft_test_result_pass("clockid=%d\n", clockid);
+
+   return 0;
+}
+
+int main(int argc, char *argv[])
+{
+   int ret, status, len, fd;
+   char buf[4096];
+   pid_t pid;
+   struct timespec btime_now, mtime_now;
+
+   nscheck();
+
+   clock_gettime(CLOCK_MONOTONIC, &mtime_now);
+   clock_gettime(CLOCK_BOOTTIME, &btime_now);
+
+   if (unshare(CLONE_NEWTIME))
+   return pr_perror("unshare");
+
+   len = snprintf(buf, sizeof(buf), "%d %d 0\n%d %d 0",
+   CLOCK_MONOTONIC, 70 * 24 * 3600,
+   CLOCK_BOOTTIME, 9 * 24 * 3600);
+   fd = open("/proc/self/timens_offsets", O_WRONLY);
+   if (fd < 0)
+   return pr_perror("/proc/self/timens_offsets");
+
+   if (write(fd, buf, len) != len)
+   return pr_perror("/proc/self/timens_offsets");
+
+   close(fd);
+   mtime_now.tv_sec += 70 * 24 * 3600;
+   btime_now.tv_sec += 9 * 24 * 3600;
+
+   pid = fork();
+   if (pid < 0)
+   return pr_perror("Unable to fork");
+   if (pid == 0) {
+   ret = 0;
+   ret |= run_test(CLOCK_BOOTTIME, btime_now);
+   ret |= run_test(CLOCK_MONOTONIC, mtime_now);
+   ret |= run_test(CLOCK_BOOTTIME_ALARM, btime_now);
+
+   if (ret)
+   ksft_exit_fail();
+   ksft_exit_pass();
+   return ret;
+   }
+
+   if (waitpid(pid, &status, 0) != pid)
+   return pr_perror("Unable to wait the child process");
+
+   if (WIFEXITED(status))
+   return WEXITSTATUS(status);
+
+   return 1;
+}
+
-- 
2.22.0

[PATCHv4 02/28] timens: Add timens_offsets

2019-06-12 Thread Dmitry Safonov

From: Andrei Vagin 

Introduce offsets for time namespace. They will contain an adjustment
needed to convert clocks to/from host's.

Allocate one page for each time namespace that will be premapped into
userspace among vvar pages.

Signed-off-by: Andrei Vagin 
Co-developed-by: Dmitry Safonov 
Signed-off-by: Dmitry Safonov 
---
 MAINTAINERS|  1 +
 include/linux/time_namespace.h |  1 +
 include/linux/timens_offsets.h |  8 
 kernel/time_namespace.c| 14 --
 4 files changed, 22 insertions(+), 2 deletions(-)
 create mode 100644 include/linux/timens_offsets.h

diff --git a/MAINTAINERS b/MAINTAINERS
index 323ab92b963b..bf55aec42f2d 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -12590,6 +12590,7 @@ S:  Maintained
 F: fs/timerfd.c
 F: include/linux/timer*
 F: include/linux/time_namespace.h
+F: include/linux/timens_offsets.h
 F: kernel/time_namespace.c
 F: kernel/time/*timer*
 
diff --git a/include/linux/time_namespace.h b/include/linux/time_namespace.h
index 9507ed7072fe..b6985aa87479 100644
--- a/include/linux/time_namespace.h
+++ b/include/linux/time_namespace.h
@@ -8,6 +8,7 @@
 #include 
 #include 
 #include 
+#include 
 
 struct user_namespace;
 extern struct user_namespace init_user_ns;
diff --git a/include/linux/timens_offsets.h b/include/linux/timens_offsets.h
new file mode 100644
index ..7d7cb68ea778
--- /dev/null
+++ b/include/linux/timens_offsets.h
@@ -0,0 +1,8 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_TIME_OFFSETS_H
+#define _LINUX_TIME_OFFSETS_H
+
+struct timens_offsets {
+};
+
+#endif
diff --git a/kernel/time_namespace.c b/kernel/time_namespace.c
index 8c600df9771d..4828447721ec 100644
--- a/kernel/time_namespace.c
+++ b/kernel/time_namespace.c
@@ -13,6 +13,7 @@
 #include 
 #include 
 #include 
+#include 
 
 static struct ucounts *inc_time_namespaces(struct user_namespace *ns)
 {
@@ -46,6 +47,7 @@ static struct time_namespace *clone_time_ns(struct 
user_namespace *user_ns,
 {
struct time_namespace *ns;
struct ucounts *ucounts;
+   struct page *page;
int err;
 
err = -ENOSPC;
@@ -58,15 +60,22 @@ static struct time_namespace *clone_time_ns(struct 
user_namespace *user_ns,
if (!ns)
goto fail_dec;
 
+   page = alloc_page(GFP_KERNEL | __GFP_ZERO);
+   if (!page)
+   goto fail_free;
+   ns->offsets = page_address(page);
+   BUILD_BUG_ON(sizeof(*ns->offsets) > PAGE_SIZE);
+
err = ns_alloc_inum(&ns->ns);
if (err)
-   goto fail_free;
+   goto fail_page;
 
ns->ucounts = ucounts;
ns->ns.ops = &timens_operations;
ns->user_ns = get_user_ns(user_ns);
return ns;
-
+fail_page:
+   free_page((unsigned long)ns->offsets);
 fail_free:
kfree(ns);
 fail_dec:
@@ -94,6 +103,7 @@ void free_time_ns(struct kref *kref)
struct time_namespace *ns;
 
ns = container_of(kref, struct time_namespace, kref);
+   free_page((unsigned long)ns->offsets);
dec_time_namespaces(ns->ucounts);
put_user_ns(ns->user_ns);
ns_free_inum(&ns->ns);
-- 
2.22.0

[PATCHv4 19/28] timens: Add align for timens_offsets

2019-06-12 Thread Dmitry Safonov

Align offsets so that time namespace will work for ia32 applications on
x86_64 host.

Co-developed-by: Andrei Vagin 
Signed-off-by: Andrei Vagin 
Signed-off-by: Dmitry Safonov 
---
 include/linux/timens_offsets.h | 12 ++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/include/linux/timens_offsets.h b/include/linux/timens_offsets.h
index e93aabaa5e45..05da1b0563ce 100644
--- a/include/linux/timens_offsets.h
+++ b/include/linux/timens_offsets.h
@@ -2,9 +2,17 @@
 #ifndef _LINUX_TIME_OFFSETS_H
 #define _LINUX_TIME_OFFSETS_H
 
+/*
+ * Time offsets need align as they're placed on VVAR page,
+ * which is used by x86_64 and ia32 VDSO code.
+ * On ia32 offset::tv_sec (u64) has align(4), so re-align offsets
+ * to the same positions as 64-bit offsets.
+ * On 64-bit big-endian systems VDSO should convert to timespec64
+ * to timespec because of a padding occurring between the fields.
+ */
 struct timens_offsets {
-   struct timespec64 monotonic;
-   struct timespec64 boottime;
+   struct timespec64 monotonic __aligned(8);
+   struct timespec64 boottime __aligned(8);
 };
 
 #endif
-- 
2.22.0

[PATCHv4 00/28] kernel: Introduce Time Namespace

2019-06-12 Thread Dmitry Safonov

Discussions around time namespaces have been going on for a long time.
The first attempt to implement one was made in 2006 by Jeff Dike. Since
then, the topic has appeared on and off in various discussions.

There are two main use cases for time namespaces:
1. change date and time inside a container;
2. adjust clocks for a container restored from a checkpoint.

“It seems like this might be one of the last major obstacles keeping
migration from being used in production systems, given that not all
containers and connections can be migrated as long as a time dependency
is capable of messing it up.” (by github.com/dav-ell)

The kernel provides access to several clocks: CLOCK_REALTIME,
CLOCK_MONOTONIC, CLOCK_BOOTTIME. The last two clocks are monotonic, but
their start points are not defined and differ from system to
system. When a container is migrated from one node to another, all
clocks have to be restored into consistent states; in other words, they
have to continue running from the same points where they were
dumped.

The main idea of this patch set is adding per-namespace offsets for
system clocks. When a process in a non-root time namespace requests
time of a clock, a namespace offset is added to the current value of
this clock and the sum is returned.

All offsets are placed on a separate page; this allows us to map it as
part of VVAR into user processes and use offsets from VDSO calls.

Currently, offsets are implemented for the CLOCK_MONOTONIC and
CLOCK_BOOTTIME clocks.

v4 Changes:

* CLONE_NEWTIME is now an unshare()-only flag (CLONE_PIDFD took the previous value)
* Addressing Jann Horn's feedback - we don't allow CLONE_THREAD or
  CLONE_VM together with CLONE_NEWTIME (thanks for spotting!)
* Addressing issues found by Thomas - removed unmaintainable CLOCK_TIMENS
  and introduced another call back into k_clock to get ktime instead
  of getting timespec and converting it (Patch 03)
* Renaming timens_offsets members to omit _offset postfix
  (thanks Cyrill for the suggestion)
* Suggestions, renaming and making code more maintainable from Thomas's
  feedback (thanks much!)
* Fixing out-of-bounds and other issues in procfs file (kudos Jann Horn)
* vdso_fault() can be called on a remote task by /proc/$pid/mem or
  process_vm_readv() - addressed by adding a slow-path with searching
  for owner's namespace (thanks for spotting this unobvious issue, Jann)
* Other nits by Jann Horn

v3: Major changes:

* Simplify two VDSO images by using static_branch() in vclock_gettime()
  Removes unwanted conflicts with generic VDSO movement patches and
  simplifies things by dropping too invasive linker magic.
  As an alternative to static_branch() we tested an attempt to introduce
  home-made dynamic patching called retcalls:
  https://github.com/0x7f454c46/linux/commit/4cc0180f6d65
  Considering some theoretical problems with toolchains, we decided to go
  with the long-established, well-tested nop-patching in static_branch(),
  although a backend for relative code had to be provided.

* address Thomas' comments.
* add sanity checks for offsets:
  - the current clock time in a namespace has to be in [0, KTIME_MAX / 2).
KTIME_MAX is divided by two here to be sure that the KTIME_MAX limit
is still unreachable.
Link: https://lkml.org/lkml/2018/9/19/950
Link: https://lkml.org/lkml/2019/2/5/867

v2: There are two major changes:

* Two versions of the VDSO library to avoid a performance penalty for
  host tasks outside time namespace (as suggested by Andy and Thomas).

  As it has been discussed on timens RFC, adding a new conditional branch
  `if (inside_time_ns)` on VDSO for all processes is undesirable.
  It will add a penalty for everybody as branch predictor may mispredict
  the jump. Also there are instruction cache lines wasted on cmp/jmp.

  Those effects of introducing time namespace are very much unwanted,
  bearing in mind how much work has been spent on micro-optimising the
  VDSO code.

  Addressing those problems, there are two versions of VDSO's .so:
  for host tasks (without any penalty) and for processes inside of time
  namespace with clk_to_ns() that subtracts offsets from host's time.


* Allow setting clock offsets for a namespace only before any processes
  appear in it.

  Now a time namespace looks similar to a pid namespace in the way it is
  created: the unshare(CLONE_NEWTIME) system call creates a new time
  namespace, but doesn't move the current process into it. Then all
  children of the process will be born in the new time namespace, or a
  process can use the setns() system call to join a namespace.

  This scheme allows creating a new time namespace, setting clock offsets
  and then populating the namespace with processes.

Our performance measurements show that the overhead of VDSO's
clock_gettime() in a child time namespace is about 8% with a hot CPU
cache and about 90% with a cold CPU cache. There is no performance
regression for host processes outside time namespace on those tests.

We wrote two small benchmarks. The first one
