Re: [PATCH v3] arm64: v8.4: Support for new floating point multiplication instructions

2018-01-04 Thread Greg KH
On Fri, Jan 05, 2018 at 09:22:54AM +0800, gengdongjiu wrote:
> Hi will/catalin
> 
> On 2017/12/13 18:09, Suzuki K Poulose wrote:
> > On 13/12/17 10:13, Dongjiu Geng wrote:
> >> ARM v8.4 extensions add new neon instructions for performing a
> >> multiplication of each FP16 element of one vector with the corresponding
> >> FP16 element of a second vector, and to add or subtract this without an
> >> intermediate rounding to the corresponding FP32 element in a third vector.
> >>
> >> This patch detects this feature and let the userspace know about it via a
> >> HWCAP bit and MRS emulation.
> >>
> >> Cc: Dave Martin 
> >> Cc: Suzuki K Poulose 
> >> Signed-off-by: Dongjiu Geng 
> >> Reviewed-by: Dave Martin 
> > 
> > Looks good to me.
> > 
> > Reviewed-by: Suzuki K Poulose 
> 
>  sorry to disturb you. Reminder, hope this patch can be applied to Linux 
> 4.15-rc7.

New features should not be going into 4.15-rc, that should be a 4.16-rc1
thing, right?

thanks,

greg k-h
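
For readers following the thread, here is a minimal userspace sketch of how a HWCAP bit like the one this patch exposes might be tested. The quoted text does not name the bit, so SKETCH_HWCAP_FP16_FML and its value are placeholders invented for illustration, and sketch_hwcap_has() is a hypothetical helper, not something from the patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Placeholder bit, invented for this sketch; not taken from the patch. */
#define SKETCH_HWCAP_FP16_FML (1UL << 23)

/* True when every bit of `feature` is set in the capability mask. */
static inline bool sketch_hwcap_has(unsigned long hwcaps, unsigned long feature)
{
    assert(feature != 0); /* an empty feature mask is a caller bug */
    return (hwcaps & feature) == feature;
}
```

On a Linux system the mask would typically come from getauxval(AT_HWCAP) in <sys/auxv.h>, and the real constant from the kernel's uapi hwcap header once the feature is merged.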


Re: [PATCH 4.4 00/37] 4.4.110-stable review

2018-01-04 Thread Greg Kroah-Hartman
On Thu, Jan 04, 2018 at 03:00:29PM -0700, Shuah Khan wrote:
> On 01/03/2018 01:11 PM, Greg Kroah-Hartman wrote:
> > This is the start of the stable review cycle for the 4.4.110 release.
> > There are 37 patches in this series, all will be posted as a response
> > to this one.  If anyone has any issues with these being applied, please
> > let me know.
> > 
> > Responses should be made by Fri Jan  5 19:50:38 UTC 2018.
> > Anything received after that time might be too late.
> > 
> > The whole patch series can be found in one patch at:
> > kernel.org/pub/linux/kernel/v4.x/stable-review/patch-4.4.110-rc1.gz
> > or in the git tree and branch at:
> >   git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git 
> > linux-4.4.y
> > and the diffstat can be found below.
> > 
> > thanks,
> > 
> > greg k-h
> > 
> 
> Based on the email threads, I expected to see issues, however,
> compiled and booted on my test system. No dmesg regressions.

Hey, you got lucky :)

Thanks for testing all of these and letting me know.

greg k-h


Re: [PATCH 4.14 00/14] 4.14.12-stable review

2018-01-04 Thread Greg Kroah-Hartman
On Thu, Jan 04, 2018 at 04:12:31PM -0800, Kevin Hilman wrote:
> kernelci.org bot  writes:
> 
> > stable-rc/linux-4.14.y boot: 118 boots: 4 failed, 113 passed with 1 offline 
> > (v4.14.11-15-g732141e47ee6)
> >
> > Full Boot Summary: 
> > https://kernelci.org/boot/all/job/stable-rc/branch/linux-4.14.y/kernel/v4.14.11-15-g732141e47ee6/
> > Full Build Summary: 
> > https://kernelci.org/build/stable-rc/branch/linux-4.14.y/kernel/v4.14.11-15-g732141e47ee6/
> >
> > Tree: stable-rc
> > Branch: linux-4.14.y
> > Git Describe: v4.14.11-15-g732141e47ee6
> > Git Commit: 732141e47ee614d70aeb8ad828a977ad19447e87
> > Git URL: 
> > http://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git
> > Tested: 68 unique boards, 23 SoC families, 16 builds out of 185
> >
> > Boot Regressions Detected:
> 
> TL;DR;  All is well.

Thanks for the summary of all of these, and for your continued testing.

greg k-h


Re: [PATCH] tty: fix data race in n_tty_receive_buf_common

2018-01-04 Thread Kohli, Gaurav

Hi Alan,

> Can you make that code available otherwise it's impossible to see
> what the problem might be.

https://source.codeaurora.org/quic/la/kernel/msm-4.9/tree/drivers/tty/serial?h=msm-4.9

As discussed, there does not seem to be a problem there, as we are getting
print requests even when the port seems to be closed.

tty_ldisc_lock(tty, 5 * HZ);
tty_ldisc_setup(tty);
tty_ldisc_unlock(tty);

But with the above lock, there is a chance that flush_to_ldisc runs
first and acquires a lock in tty_ldisc_ref itself. So this may fail.
I am not sure here; please correct me if I am missing something.

So can't we simply return from flush_to_ldisc when we know disc_data
is not valid, like we are already doing for tty and ldisc?

if (tty->disc_data == NULL) {
    tty_ldisc_deref(disc);
    return;
}

Regards
Gaurav
-- Qualcomm India Private Limited, on behalf of Qualcomm Innovation 
Center, Inc. is a member of the Code Aurora Forum, a Linux Foundation 
Collaborative Project.
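
The guard proposed above can be pictured with a small userspace model. This is only a sketch under stated assumptions: the sketch_* types and helpers are hypothetical stand-ins for the kernel's tty and ldisc objects, and the model shows only the reference bookkeeping of the early return, not the locking or the race itself.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel's ldisc and tty objects. */
struct sketch_ldisc {
    int users;          /* outstanding references */
};

struct sketch_tty {
    struct sketch_ldisc *ldisc;
    void *disc_data;    /* NULL until the line discipline is set up */
    int delivered;      /* buffers handed to the line discipline */
};

/* Take a reference, or return NULL when no ldisc is attached. */
static inline struct sketch_ldisc *sketch_ldisc_ref(struct sketch_tty *tty)
{
    if (tty->ldisc)
        tty->ldisc->users++;
    return tty->ldisc;
}

/* Drop a reference taken by sketch_ldisc_ref(). */
static inline void sketch_ldisc_deref(struct sketch_ldisc *ld)
{
    assert(ld->users > 0);
    ld->users--;
}

/* Flush path with the proposed guard in place. */
static inline void sketch_flush_to_ldisc(struct sketch_tty *tty)
{
    struct sketch_ldisc *disc = sketch_ldisc_ref(tty);

    if (disc == NULL)
        return;
    if (tty->disc_data == NULL) {
        /* Nothing to deliver to yet: release the reference and bail out. */
        sketch_ldisc_deref(disc);
        return;
    }
    tty->delivered++;
    sketch_ldisc_deref(disc);
}
```

The point of the model is that the early return still drops the reference it took, so the reference count stays balanced whether or not disc_data is valid.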


Crypto Fixes for 4.15

2018-01-04 Thread Herbert Xu
Hi Linus: 

This push fixes the following issues:

- Racy use of ctx->rcvused in af_alg.
- algif_aead crash in chacha20poly1305.
- Freeing bogus pointer in pcrypt.
- Build error on MIPS in mpi.
- Memory leak in inside-secure.
- Memory overwrite in inside-secure.
- NULL pointer dereference in inside-secure.
- State corruption in inside-secure.
- Build error without CRYPTO_GF128MUL in chelsio.
- Use after free in n2.


Please pull from

git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6.git linus


Antoine Ténart (3):
  crypto: inside-secure - free requests even if their handling failed
  crypto: inside-secure - fix request allocations in invalidation path
  crypto: inside-secure - do not use areq->result for partial results

Arnd Bergmann (1):
  crypto: chelsio - select CRYPTO_GF128MUL

Eric Biggers (2):
  crypto: chacha20poly1305 - validate the digest size
  crypto: pcrypt - fix freeing pcrypt instances

James Hogan (1):
  lib/mpi: Fix umul_ppmm() for MIPS64r6

Jan Engelhardt (1):
  crypto: n2 - cure use after free

Jonathan Cameron (1):
  crypto: af_alg - Fix race around ctx->rcvused by making it atomic_t

Ofer Heifetz (1):
  crypto: inside-secure - per request invalidation

 crypto/af_alg.c                                |    4 +-
 crypto/algif_aead.c                            |    2 +-
 crypto/algif_skcipher.c                        |    2 +-
 crypto/chacha20poly1305.c                      |    6 +-
 crypto/pcrypt.c                                |   19 ++---
 drivers/crypto/chelsio/Kconfig                 |    1 +
 drivers/crypto/inside-secure/safexcel.c        |    1 +
 drivers/crypto/inside-secure/safexcel_cipher.c |   85 --
 drivers/crypto/inside-secure/safexcel_hash.c   |   89 +---
 drivers/crypto/n2_core.c                       |    3 +
 include/crypto/if_alg.h                        |    5 +-
 lib/mpi/longlong.h                             |   18 -
 12 files changed, 173 insertions(+), 62 deletions(-)

Thanks,
-- 
Email: Herbert Xu 
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
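
As an illustration of the first item in the list above (a counter shared between contexts made atomic), here is a minimal userspace sketch using C11 atomics. It is not the af_alg code: struct sketch_ctx, its fields other than the rcvused name, and the helper functions are assumptions made up for this example, and the kernel uses atomic_t rather than <stdatomic.h>.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical context; only the idea of an atomic rcvused counter is borrowed. */
struct sketch_ctx {
    atomic_uint rcvused;    /* bytes currently accounted as in use */
    unsigned int rcvbuf;    /* fixed receive budget */
};

static inline void sketch_ctx_init(struct sketch_ctx *ctx, unsigned int budget)
{
    atomic_init(&ctx->rcvused, 0);
    ctx->rcvbuf = budget;
}

/* Account for `len` bytes; safe against concurrent callers. */
static inline void sketch_charge(struct sketch_ctx *ctx, unsigned int len)
{
    atomic_fetch_add(&ctx->rcvused, len);
}

/* Release `len` bytes previously charged. */
static inline void sketch_uncharge(struct sketch_ctx *ctx, unsigned int len)
{
    assert(atomic_load(&ctx->rcvused) >= len); /* debug-only sanity check */
    atomic_fetch_sub(&ctx->rcvused, len);
}

/* Remaining budget, clamped at zero. */
static inline unsigned int sketch_space_left(struct sketch_ctx *ctx)
{
    unsigned int used = atomic_load(&ctx->rcvused);

    return used >= ctx->rcvbuf ? 0 : ctx->rcvbuf - used;
}
```

With a plain integer, two contexts doing a read-modify-write on the counter at the same time can lose an update; the atomic add and subtract make each update indivisible.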


Re: [GIT PULL 2/3] SOC: Keystone SOC update for 4.16

2018-01-04 Thread Olof Johansson
On Wed, Dec 27, 2017 at 06:07:51PM -0800, Santosh Shilimkar wrote:
> The following changes since commit 4fbd8d194f06c8a3fd2af1ce560ddb31f7ec8323:
> 
>   Linux 4.15-rc1 (2017-11-26 16:01:47 -0800)
> 
> are available in the git repository at:
> 
>   git://git.kernel.org/pub/scm/linux/kernel/git/ssantosh/linux-keystone.git 
> tags/keystone_driver_soc_for_4.16
> 
> for you to fetch changes up to aefc5818553680c50c9f6840e47c01b80edd9b3a:
> 
>   soc: ti: fix max dup length for kstrndup (2017-12-16 14:45:33 -0800)
> 
> 
> SOC: Keystone Soc driver updates for 4.16
> 
>  - TI EMIF-SRAM driver
>  - TI SCI print format fix
>  - Navigator strndup length fix
> 
> 
> Arnd Bergmann (1):
>   memory: ti-emif-sram: remove unused variable
> 
> Dave Gerlach (2):
>   Documentation: dt: Update ti,emif bindings
>   memory: ti-emif-sram: introduce relocatable suspend/resume handlers
> 
> Ma Shimiao (1):
>   soc: ti: fix max dup length for kstrndup
> 
> Nishanth Menon (1):
>   firmware: ti_sci: Use %zu for size_t print format
> 
>  .../bindings/memory-controllers/ti/emif.txt|  17 +-
>  drivers/firmware/ti_sci.c  |   4 +-
>  drivers/memory/Kconfig |  10 +
>  drivers/memory/Makefile|   8 +
>  drivers/memory/Makefile.asm-offsets|   5 +
>  drivers/memory/emif-asm-offsets.c  |  92 ++
>  drivers/memory/emif.h  |  17 ++
>  drivers/memory/ti-emif-pm.c| 324 
>  drivers/memory/ti-emif-sram-pm.S   | 334 
> +
>  drivers/soc/ti/knav_qmss_queue.c   |   4 +-
>  include/linux/ti-emif-sram.h   |  69 +

Based on the contents, I merged this into next/drivers instead of next/soc.


-Olof
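
A short note on the "%zu for size_t" item in the pull request above: size_t has a different width on 32-bit and 64-bit builds, so formats such as %d or %lu only match on some of them, while %zu matches everywhere. The sketch below shows the idea in plain C; sketch_format_size() and the message text are made up for illustration and are not taken from ti_sci.c.

```c
#include <assert.h>
#include <stdio.h>

/* Format a size_t with %zu, which matches size_t regardless of word size. */
static inline int sketch_format_size(char *buf, size_t buflen, size_t value)
{
    assert(buf != NULL && buflen > 0);
    return snprintf(buf, buflen, "message size %zu", value);
}
```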


Re: [GIT PULL] arm64: dts: uniphier: UniPhier DT updates (64bit) for v4.16

2018-01-04 Thread Olof Johansson
On Fri, Dec 29, 2017 at 10:35:38PM +0900, Masahiro Yamada wrote:
> Hi Arnd, Olof,
> 
> Here are UniPhier DT (64bit) updates for the v4.16 merge window.
> Please pull!
> 
> 
> The following changes since commit 50c4c4e268a2d7a3e58ebb698ac74da0de40ae36:
> 
>   Linux 4.15-rc3 (2017-12-10 17:56:26 -0800)
> 
> are available in the git repository at:
> 
>   git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-uniphier.git
> tags/uniphier-dt64-v4.16
> 
> for you to fetch changes up to dbdae8474e08fc1194102bef95dc96db435c15da:
> 
>   arm64: dts: uniphier: enable more serial ports for PXs3 ref board
> (2017-12-29 22:03:26 +0900)
> 
> 
> UniPhier ARM64 SoC DT updates for v4.16
> 
> - clean up gpios properties by macro
> - add GPIO hog for PXs3 reference node
> - add has-transaction-translator property to generic-ehci nodes
> - enable more serial ports for PXs3 reference node

Merged, thanks!


-Olof


Re: [GIT PULL] ARM: at91: drivers for 4.16

2018-01-04 Thread Olof Johansson
On Sun, Dec 31, 2017 at 04:34:42PM +0100, Alexandre Belloni wrote:
> Arnd, Olof,
> 
> A single harmless change for this pull request. I hope you'll enjoy this
> New Year's Eve.
> 
> The following changes since commit 4fbd8d194f06c8a3fd2af1ce560ddb31f7ec8323:
> 
>   Linux 4.15-rc1 (2017-11-26 16:01:47 -0800)
> 
> are available in the Git repository at:
> 
>   git://git.kernel.org/pub/scm/linux/kernel/git/abelloni/linux.git 
> tags/at91-ab-4.16-drivers
> 
> for you to fetch changes up to 1203839290f151b84f5e54165d6d039e9514b236:
> 
>   pcmcia: at91_cf: Use PTR_ERR_OR_ZERO() (2017-11-29 21:58:58 +0100)
> 
> 
> drivers for 4.16
> 
>  - use PTR_ERR_OR_ZERO where relevant in at91_cf

Merged, thanks.


-Olof
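
For context on the single change in this pull request: PTR_ERR_OR_ZERO() folds the common "if the pointer encodes an error, return that error, otherwise return 0" tail of a function into one expression. The following is a userspace model of that convention, written as a sketch; the sketch_* names are hypothetical and the kernel's own definitions live in include/linux/err.h.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Error codes are encoded in the top SKETCH_MAX_ERRNO addresses. */
#define SKETCH_MAX_ERRNO 4095

/* Encode a negative errno value as a pointer. */
static inline void *sketch_err_ptr(long err)
{
    assert(err < 0 && err >= -SKETCH_MAX_ERRNO);
    return (void *)(intptr_t)err;
}

/* Non-zero when the pointer falls in the error range. */
static inline int sketch_is_err(const void *ptr)
{
    return (uintptr_t)ptr >= (uintptr_t)-SKETCH_MAX_ERRNO;
}

/* Recover the errno value from an error pointer. */
static inline long sketch_ptr_err(const void *ptr)
{
    return (long)(intptr_t)ptr;
}

/* Replaces: if (sketch_is_err(p)) return sketch_ptr_err(p); return 0; */
static inline int sketch_ptr_err_or_zero(const void *ptr)
{
    return sketch_is_err(ptr) ? (int)sketch_ptr_err(ptr) : 0;
}
```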


Re: [GIT PULL] ARM: dts: uniphier: UniPhier DT updates for v4.16

2018-01-04 Thread Olof Johansson
Hi!

On Fri, Dec 29, 2017 at 10:32:24PM +0900, Masahiro Yamada wrote:
> Hi Arnd, Olof,
> 
> Here are UniPhier DT (32bit) updates for the v4.16 merge window.
> Please pull!
> 
> 
> The following changes since commit 50c4c4e268a2d7a3e58ebb698ac74da0de40ae36:
> 
>   Linux 4.15-rc3 (2017-12-10 17:56:26 -0800)
> 
> are available in the git repository at:
> 
>   git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-uniphier.git
> tags/uniphier-dt-v4.16

Tiny tiny nit: It makes our life a little easier if you don't linewrap the
URL+tag, since it's easier to copy-paste without a linebreak.

> for you to fetch changes up to 6fa9b0255099fcd289f7e3857714532843044c76:
> 
>   ARM: dts: uniphier: add has-transaction-translator property to usb
> node for LD4, sLD8 and Pro4 (2017-12-27 23:59:37 +0900)
> 
> 
> UniPhier ARM SoC DT updates for v4.16
> 
> - clean up gpios properties by macro
> - add efuse nodes
> - add has-transaction-translator property to generic-ehci nodes
> 
> 
> Keiji Hayashibara (1):
>   ARM: dts: uniphier: add efuse node for UniPhier 32bit SoC
> 
> Kunihiko Hayashi (1):
>   ARM: dts: uniphier: add has-transaction-translator property to
> usb node for LD4, sLD8 and Pro4

Another small nit: This patch subject is a bit on the long side. Try to
keep it to ~60 characters if you can.

Merged the branch.


Thanks!

-Olof
> 
> Masahiro Yamada (1):
>   ARM: dts: uniphier: use macros in dt-bindings header
> 
>  arch/arm/boot/dts/uniphier-ld4-ref.dts  |  2 +-
>  arch/arm/boot/dts/uniphier-ld4.dtsi | 23 +
>  arch/arm/boot/dts/uniphier-ld6b-ref.dts |  2 +-
>  arch/arm/boot/dts/uniphier-pro4-ref.dts |  2 +-
>  arch/arm/boot/dts/uniphier-pro4.dtsi| 27 +++
>  arch/arm/boot/dts/uniphier-pro5.dtsi| 33 
>  arch/arm/boot/dts/uniphier-pxs2.dtsi| 19 ++
>  arch/arm/boot/dts/uniphier-sld8-ref.dts |  2 +-
>  arch/arm/boot/dts/uniphier-sld8.dtsi| 23 +
>  9 files changed, 129 insertions(+), 4 deletions(-)
> 
> 
> -- 
> Best Regards
> Masahiro Yamada


Re: [GIT PULL 3/3] ARM: Keystone config update for 4.16

2018-01-04 Thread Olof Johansson
On Wed, Dec 27, 2017 at 06:07:52PM -0800, Santosh Shilimkar wrote:
> Also had patch to sync up multi-v7 config but because of conflicts
> in next, have to drop it. Will send that post merge window separately
> 
> The following changes since commit 4fbd8d194f06c8a3fd2af1ce560ddb31f7ec8323:
> 
>   Linux 4.15-rc1 (2017-11-26 16:01:47 -0800)
> 
> are available in the git repository at:
> 
>   git://git.kernel.org/pub/scm/linux/kernel/git/ssantosh/linux-keystone.git 
> tags/keystone_config_for_4.16
> 
> for you to fetch changes up to 10f06c70f337494fc2fec623542186fec80fc395:
> 
>   ARM: configs: keystone_defconfig: Enable few peripheral drivers (2017-12-02 
> 19:34:36 -0800)
> 
> 
> ARM: Keystone configs for 4.16
> 
>   - Enable QSPI
>   - Enable LEDs
>   - Enable GPIO-decoder
> 
> 
> Vignesh R (1):
>   ARM: configs: keystone_defconfig: Enable few peripheral drivers

Merged, thanks.


-Olof


Re: [GIT PULL] ARM: at91: DT for 4.16

2018-01-04 Thread Olof Johansson
On Sun, Dec 31, 2017 at 04:11:27PM +0100, Alexandre Belloni wrote:
> Arnd, Olof,
> 
> This is the at91 DT pull request. The bulk of it is the switch to the
> new TCB bindings that were acked a long time ago. These changes are
> compatible with the current driver and taking them now will allow for a
> smooth transition.
> 
> The following changes since commit 4fbd8d194f06c8a3fd2af1ce560ddb31f7ec8323:
> 
>   Linux 4.15-rc1 (2017-11-26 16:01:47 -0800)
> 
> are available in the Git repository at:
> 
>   git://git.kernel.org/pub/scm/linux/kernel/git/abelloni/linux.git 
> tags/at91-ab-4.16-dt
> 
> for you to fetch changes up to 34a7fc3147bcc14127d941f228ce3b1737e66381:
> 
>   ARM: dts: at91: sama5d2_ptc_ek: use TCB0 as timers (2017-12-31 15:50:20 
> +0100)

Merged, thanks!


-Olof



Re: [GIT PULL 1/3] ARM: Keystone DTS for 4.16

2018-01-04 Thread Olof Johansson
On Wed, Dec 27, 2017 at 06:07:50PM -0800, Santosh Shilimkar wrote:
> The following changes since commit 4fbd8d194f06c8a3fd2af1ce560ddb31f7ec8323:
> 
>   Linux 4.15-rc1 (2017-11-26 16:01:47 -0800)
> 
> are available in the git repository at:
> 
>   git://git.kernel.org/pub/scm/linux/kernel/git/ssantosh/linux-keystone.git 
> tags/keystone_dts_for_4.16
> 
> for you to fetch changes up to 4fe85b0cdd06f8fef2631923799bdc95380badb5:
> 
>   ARM: dts: keystone-k2l-clocks: Add missing unit name to clock nodes that 
> have regs (2017-12-16 14:36:57 -0800)
> 
> 
> ARM: Keystone DTS update for 4.16
> 
>  - Enable GPIO bank2 for K2L
>  - Enable QSPI for K2G & K2G-EVM
>  - Enable UART1/2 for K2G & K2G-EVM
>  - Enable peripherals for K2G-ICE
>  - Fix C1 and C2 DTS warnings

Merged, thanks.


-Olof



[PATCH V3] nvme-pci: fix NULL pointer reference in nvme_alloc_ns

2018-01-04 Thread Jianchao Wang
When the I/O queue setup or tagset allocation fails, ctrl.tagset
is NULL. But the scan work is still queued and executed, and a
panic follows from the NULL pointer dereference of ctrl.tagset.

To fix this, add a new ctrl state NVME_CTRL_ADMIN_ONLY to indicate
that only the admin queue is live. When there are no I/O queues or
tagset allocation fails, the ctrl enters this state and scan work is
not started. Async event work and the nvme dev ioctl remain available,
which helps with further investigation and recovery.

V3:
 - s/NVME_CTRL_ADMIN_LIVE/NVME_CTRL_ADMIN_ONLY/
 - s/BUG_ON/WARN_ON_ONCE/
 - Other misc code changes
V2:
 - Based on Sagi's suggestion, add new state NVME_CTRL_ADMIN_LIVE.
 - Change patch name and comment.

Suggested-by: Sagi Grimberg 
Signed-off-by: Jianchao Wang 
---
 drivers/nvme/host/core.c | 25 ++---
 drivers/nvme/host/nvme.h |  1 +
 drivers/nvme/host/pci.c  | 30 +-
 3 files changed, 44 insertions(+), 12 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 1e46e60..a614cd7 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -232,6 +232,15 @@ bool nvme_change_ctrl_state(struct nvme_ctrl *ctrl,
 
old_state = ctrl->state;
switch (new_state) {
+   case NVME_CTRL_ADMIN_ONLY:
+   switch (old_state) {
+   case NVME_CTRL_RESETTING:
+   changed = true;
+   /* FALLTHRU */
+   default:
+   break;
+   }
+   break;
case NVME_CTRL_LIVE:
switch (old_state) {
case NVME_CTRL_NEW:
@@ -247,6 +256,7 @@ bool nvme_change_ctrl_state(struct nvme_ctrl *ctrl,
switch (old_state) {
case NVME_CTRL_NEW:
case NVME_CTRL_LIVE:
+   case NVME_CTRL_ADMIN_ONLY:
changed = true;
/* FALLTHRU */
default:
@@ -266,6 +276,7 @@ bool nvme_change_ctrl_state(struct nvme_ctrl *ctrl,
case NVME_CTRL_DELETING:
switch (old_state) {
case NVME_CTRL_LIVE:
+   case NVME_CTRL_ADMIN_ONLY:
case NVME_CTRL_RESETTING:
case NVME_CTRL_RECONNECTING:
changed = true;
@@ -2337,8 +2348,14 @@ static int nvme_dev_open(struct inode *inode, struct file *file)
struct nvme_ctrl *ctrl =
container_of(inode->i_cdev, struct nvme_ctrl, cdev);
 
-   if (ctrl->state != NVME_CTRL_LIVE)
+   switch(ctrl->state) {
+   case NVME_CTRL_LIVE:
+   case NVME_CTRL_ADMIN_ONLY:
+   break;
+   default:
return -EWOULDBLOCK;
+   }
+
file->private_data = ctrl;
return 0;
 }
@@ -2602,6 +2619,7 @@ static ssize_t nvme_sysfs_show_state(struct device *dev,
static const char *const state_name[] = {
[NVME_CTRL_NEW] = "new",
[NVME_CTRL_LIVE]= "live",
+   [NVME_CTRL_ADMIN_ONLY]  = "only-admin",
[NVME_CTRL_RESETTING]   = "resetting",
[NVME_CTRL_RECONNECTING]= "reconnecting",
[NVME_CTRL_DELETING]= "deleting",
@@ -3074,6 +3092,8 @@ static void nvme_scan_work(struct work_struct *work)
if (ctrl->state != NVME_CTRL_LIVE)
return;
 
+   WARN_ON_ONCE(!ctrl->tagset);
+
if (nvme_identify_ctrl(ctrl, &id))
return;
 
@@ -3094,8 +3114,7 @@ static void nvme_scan_work(struct work_struct *work)
 void nvme_queue_scan(struct nvme_ctrl *ctrl)
 {
/*
-* Do not queue new scan work when a controller is reset during
-* removal.
+* Only new queue scan work when admin and IO queues are both alive
 */
if (ctrl->state == NVME_CTRL_LIVE)
queue_work(nvme_wq, &ctrl->scan_work);
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index ea1aa52..eecf71c 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -119,6 +119,7 @@ static inline struct nvme_request *nvme_req(struct request *req)
 enum nvme_ctrl_state {
NVME_CTRL_NEW,
NVME_CTRL_LIVE,
+   NVME_CTRL_ADMIN_ONLY,/* Only admin queue live */
NVME_CTRL_RESETTING,
NVME_CTRL_RECONNECTING,
NVME_CTRL_DELETING,
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index f5800c3..e758c5a 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -2035,13 +2035,12 @@ static void nvme_disable_io_queues(struct nvme_dev *dev, int queues)
 }
 
 /*
- * Return: error value if an error occurred setting up the queues or calling
- * Identify Device.  0 if these succeeded, even if adding some of the
- * namespaces failed.  At the moment, these failures are silent.  TBD which
- * failures should be reported.
+ * return error value only when tagset allocation failed
  */
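
For readers following the hunks above, here is a minimal user-space sketch of the gating the patch describes: a controller that only got its admin queue up may still be opened through the char device, but never has scan work queued. Everything prefixed toy_/TOY_ is hypothetical and only mirrors the transitions visible in the quoted hunks (entering the admin-only state from RESETTING, and entering DELETING); it is not the driver code.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy mirror of the controller states listed in the patch. */
enum toy_state {
	TOY_NEW,
	TOY_LIVE,
	TOY_ADMIN_ONLY,		/* only the admin queue is live */
	TOY_RESETTING,
	TOY_RECONNECTING,
	TOY_DELETING,
};

struct toy_ctrl {
	enum toy_state state;
	void *tagset;		/* stays NULL when IO queue setup failed */
	int scans_queued;	/* stands in for queue_work() calls */
};

/* Only the two targets visible in the quoted hunks are modelled. */
static bool toy_change_state(struct toy_ctrl *c, enum toy_state new_state)
{
	bool changed = false;

	switch (new_state) {
	case TOY_ADMIN_ONLY:
		if (c->state == TOY_RESETTING)
			changed = true;
		break;
	case TOY_DELETING:
		switch (c->state) {
		case TOY_LIVE:
		case TOY_ADMIN_ONLY:
		case TOY_RESETTING:
		case TOY_RECONNECTING:
			changed = true;
			break;
		default:
			break;
		}
		break;
	default:
		/* Other target states are left out of this sketch. */
		break;
	}

	if (changed)
		c->state = new_state;
	return changed;
}

/* Scan work is queued only when admin and IO queues are both live. */
static void toy_queue_scan(struct toy_ctrl *c)
{
	if (c->state == TOY_LIVE)
		c->scans_queued++;
}

/* The char device stays usable in the admin-only state. */
static int toy_dev_open(const struct toy_ctrl *c)
{
	switch (c->state) {
	case TOY_LIVE:
	case TOY_ADMIN_ONLY:
		return 0;
	default:
		return -EWOULDBLOCK;
	}
}
```

With a controller in TOY_RESETTING and a NULL tagset, moving to TOY_ADMIN_ONLY succeeds, toy_queue_scan() leaves scans_queued untouched, and toy_dev_open() still returns 0. That is the combination the commit message asks for.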

Re: [PATCH net-next] net: tracepoint: adding new tracepoint arguments in inet_sock_set_state

2018-01-04 Thread Song Liu

> On Jan 4, 2018, at 10:42 PM, Yafang Shao  wrote:
> 
> sk->sk_protocol and sk->sk_family are exposed as tracepoint arguments.
> Then we can conveniently use these two arguments to do the filter.
> 
> Suggested-by: Brendan Gregg 
> Signed-off-by: Yafang Shao 
> ---
> include/trace/events/sock.h | 24 ++--
> net/ipv4/af_inet.c  |  6 --
> 2 files changed, 22 insertions(+), 8 deletions(-)
> 
> diff --git a/include/trace/events/sock.h b/include/trace/events/sock.h
> index 3537c5f..c7df70f 100644
> --- a/include/trace/events/sock.h
> +++ b/include/trace/events/sock.h
> @@ -11,7 +11,11 @@
> #include 
> #include 
> 
> -/* The protocol traced by sock_set_state */
> +#define family_names \
> + EM(AF_INET) \
> + EMe(AF_INET6)
> +
> +/* The protocol traced by inet_sock_set_state */
> #define inet_protocol_names   \
>   EM(IPPROTO_TCP) \
>   EM(IPPROTO_DCCP)\
> @@ -37,6 +41,7 @@
> #define EM(a)   TRACE_DEFINE_ENUM(a);
> #define EMe(a)  TRACE_DEFINE_ENUM(a);
> 
> +family_names
> inet_protocol_names
> tcp_state_names
> 
> @@ -45,6 +50,9 @@
> #define EM(a)   { a, #a },
> #define EMe(a)  { a, #a }
> 
> +#define show_family_name(val)\
> + __print_symbolic(val, family_names)
> +
> #define show_inet_protocol_name(val)\
>   __print_symbolic(val, inet_protocol_names)
> 
> @@ -108,9 +116,10 @@
> 
> TRACE_EVENT(inet_sock_set_state,
> 
> - TP_PROTO(const struct sock *sk, const int oldstate, const int newstate),
> + TP_PROTO(const struct sock *sk, const int family, const int protocol,
> + const int oldstate, const int newstate),

Are there cases where we need a protocol and/or family that differs from
sk->sk_protocol/sk->sk_family? If not, I don't think we need to change
TP_PROTO.

Thanks,
Song

> 
> - TP_ARGS(sk, oldstate, newstate),
> + TP_ARGS(sk, family, protocol, oldstate, newstate),
> 
>   TP_STRUCT__entry(
>   __field(const void *, skaddr)
> @@ -118,6 +127,7 @@
>   __field(int, newstate)
>   __field(__u16, sport)
>   __field(__u16, dport)
> + __field(__u16, family)
>   __field(__u8, protocol)
>   __array(__u8, saddr, 4)
>   __array(__u8, daddr, 4)
> @@ -133,8 +143,9 @@
>   __entry->skaddr = sk;
>   __entry->oldstate = oldstate;
>   __entry->newstate = newstate;
> + __entry->family = family;
> + __entry->protocol = protocol;
> 
> - __entry->protocol = sk->sk_protocol;
>   __entry->sport = ntohs(inet->inet_sport);
>   __entry->dport = ntohs(inet->inet_dport);
> 
> @@ -145,7 +156,7 @@
>   *p32 =  inet->inet_daddr;
> 
> #if IS_ENABLED(CONFIG_IPV6)
> - if (sk->sk_family == AF_INET6) {
> + if (family == AF_INET6) {
>   pin6 = (struct in6_addr *)__entry->saddr_v6;
>   *pin6 = sk->sk_v6_rcv_saddr;
>   pin6 = (struct in6_addr *)__entry->daddr_v6;
> @@ -160,7 +171,8 @@
>   }
>   ),
> 
> - TP_printk("protocol=%s sport=%hu dport=%hu saddr=%pI4 daddr=%pI4 saddrv6=%pI6c daddrv6=%pI6c oldstate=%s newstate=%s",
> + TP_printk("family=%s protocol=%s sport=%hu dport=%hu saddr=%pI4 daddr=%pI4 saddrv6=%pI6c daddrv6=%pI6c oldstate=%s newstate=%s",
> + show_family_name(__entry->family),
>   show_inet_protocol_name(__entry->protocol),
>   __entry->sport, __entry->dport,
>   __entry->saddr, __entry->daddr,
> diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
> index bab98a4..1d52796 100644
> --- a/net/ipv4/af_inet.c
> +++ b/net/ipv4/af_inet.c
> @@ -1223,14 +1223,16 @@ int inet_sk_rebuild_header(struct sock *sk)
> 
> void inet_sk_set_state(struct sock *sk, int state)
> {
> - trace_inet_sock_set_state(sk, sk->sk_state, state);
> + trace_inet_sock_set_state(sk, sk->sk_family, sk->sk_protocol,
> + sk->sk_state, state);
>   sk->sk_state = state;
> }
> EXPORT_SYMBOL(inet_sk_set_state);
> 
> void inet_sk_state_store(struct sock *sk, int newstate)
> {
> - trace_inet_sock_set_state(sk, sk->sk_state, newstate);
> + trace_inet_sock_set_state(sk, sk->sk_family, sk->sk_protocol,
> + sk->sk_state, newstate);
>   smp_store_release(&sk->sk_state, newstate);
> }
> 
> --
> 1.8.3.1
> 
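
A rough user-space sketch of what the question above seems to suggest: keep the three-argument prototype and read family and protocol from sk while the entry is recorded, so the new fields still end up in the trace record. The toy_ types and the TOY_ constants are hypothetical stand-ins, not the kernel's struct sock or the TRACE_EVENT machinery.

```c
#include <stdint.h>

/* Hypothetical stand-ins for the values used in the patch. */
#define TOY_AF_INET6	10
#define TOY_IPPROTO_TCP	6

struct toy_sock {
	uint16_t sk_family;
	uint8_t sk_protocol;
	int sk_state;
};

/* Mirrors the fields the patch adds to TP_STRUCT__entry. */
struct toy_entry {
	const void *skaddr;
	int oldstate;
	int newstate;
	uint16_t family;
	uint8_t protocol;
};

/*
 * Same (sk, oldstate, newstate) arguments as before the patch;
 * family and protocol are derived from sk instead of being passed in.
 */
static void toy_record(struct toy_entry *e, const struct toy_sock *sk,
		       int oldstate, int newstate)
{
	e->skaddr = sk;
	e->oldstate = oldstate;
	e->newstate = newstate;
	e->family = sk->sk_family;
	e->protocol = sk->sk_protocol;
}
```

Callers would stay as they were, passing only sk, the old state and the new state. Whether that is enough for the filtering use case in the patch description is the open question in this reply.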



[PATCH -next] um: vector: fix missing unlock on error in vector_net_open()

2018-01-04 Thread Wei Yongjun
Add the missing unlock before return from function vector_net_open()
in the error handling case.

Fixes: ad1f62ab2bd4 ("High Performance UML Vector Network Driver")
Signed-off-by: Wei Yongjun 
---
 arch/um/drivers/vector_kern.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/arch/um/drivers/vector_kern.c b/arch/um/drivers/vector_kern.c
index d1d5301..bb83a2d 100644
--- a/arch/um/drivers/vector_kern.c
+++ b/arch/um/drivers/vector_kern.c
@@ -1156,8 +1156,10 @@ static int vector_net_open(struct net_device *dev)
struct vector_device *vdevice;
 
spin_lock_irqsave(&vp->lock, flags);
-   if (vp->opened)
+   if (vp->opened) {
+   spin_unlock_irqrestore(&vp->lock, flags);
return -ENXIO;
+   }
vp->opened = true;
spin_unlock_irqrestore(&vp->lock, flags);
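
The same pattern in a self-contained user-space sketch, with a C11 atomic_flag standing in for the spinlock. The toy_ names are hypothetical; only the shape of the error path matches the patch: every return taken while the lock is held has to release it first.

```c
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>

struct toy_vec {
	atomic_flag lock;	/* toy spinlock, clear means unlocked */
	bool opened;
};

static void toy_lock(struct toy_vec *vp)
{
	/* Spin until the flag was previously clear. */
	while (atomic_flag_test_and_set(&vp->lock))
		;
}

static void toy_unlock(struct toy_vec *vp)
{
	atomic_flag_clear(&vp->lock);
}

static int toy_open(struct toy_vec *vp)
{
	toy_lock(vp);
	if (vp->opened) {
		/* Without this unlock the next toy_lock() would spin forever. */
		toy_unlock(vp);
		return -ENXIO;
	}
	vp->opened = true;
	toy_unlock(vp);
	return 0;
}
```

A second toy_open() on the same device returns -ENXIO and leaves the lock free, which is the behaviour the fix restores.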



Re: Avoid speculative indirect calls in kernel

2018-01-04 Thread Willy Tarreau
On Thu, Jan 04, 2018 at 10:57:19PM -0800, Dave Hansen wrote:
> On 01/04/2018 10:49 PM, Willy Tarreau wrote:
> > On Fri, Jan 05, 2018 at 01:54:13AM +0100, Thomas Gleixner wrote:
> >> On Thu, 4 Jan 2018, Jon Masters wrote:
> >>> P.S. I've an internal document where I've been tracking "nice to haves"
> >>> for later, and one of them is whether it makes sense to tag binaries as
> >>> "trusted" (e.g. extended attribute, label, whatever). It was something I
> >>> wanted to bring up at some point as potentially worth considering.
> >> Scratch that. There is no such thing as a trusted binary.
> > I disagree with you on this Thomas. "trusted" means "we agree to share the
> > risk this binary takes because it's critical to our service". When you
> > build a load balancing appliance on which 100% of the service is assured
> > by a single executable and the rest is just config management, you'd better
> > trust that process.
> 
> So you want to run this "one binary" as fast as possible and without
> mitigations in place?  But, you want mitigations *available* on that
> system at the same time?  For what?  If there's only one binary, why not
> just disable the mitigations entirely?

I'm not fond of running the mitigations, but given that a few sysops can
connect to the machine to collect stats or counters, I think it would be
better to ensure these people can't happily play with the exploits to
dump stuff they shouldn't have access to. It's even easier to understand
on a database or key-value server for example, where you may expect the
highest performance the CPU can bring for a specific process and the rest
can be mitigated and will never ever notice any performance impact at all.

That's why I was saying in another thread that it would be nice over the
long term if we could 1) make the mitigation dynamic, and 2) make it
possible for an admin to disable it for certain processes/programs.

Don't get me wrong, I'm perfectly aware that it's far from being simple
and for now we need to get a reliable mitigation. I'm just saying that
the performance impact is a huge loss for certain use cases and that
once things settle down we should start to work on ways to recover what
was lost.

Regards,
Willy


Re: [PATCH] of: Use SPDX license tag for DT files

2018-01-04 Thread Philippe Ombredanne
On Fri, Jan 5, 2018 at 12:05 AM, Rob Herring  wrote:
> Convert remaining DT files to use SPDX-License-Identifier tags.
>
> Cc: Benjamin Herrenschmidt 
> Cc: Guennadi Liakhovetski 
> Cc: Paul Mackerras 
> Cc: Pantelis Antoniou 
> Signed-off-by: Rob Herring 
> ---
>  drivers/of/Kconfig  |  1 +
>  drivers/of/address.c|  2 +-
>  drivers/of/base.c   |  6 +-
>  drivers/of/device.c |  1 +
>  drivers/of/dynamic.c|  1 +
>  drivers/of/fdt.c|  5 +
>  drivers/of/fdt_address.c|  6 +-
>  drivers/of/irq.c|  6 +-
>  drivers/of/kobj.c   |  2 +-
>  drivers/of/of_numa.c| 13 +
>  drivers/of/of_private.h |  6 +-
>  drivers/of/of_reserved_mem.c|  6 +-
>  drivers/of/overlay.c|  5 +
>  drivers/of/pdt.c|  6 +-
>  drivers/of/platform.c   |  7 +--
>  drivers/of/property.c   |  6 +-
>  drivers/of/resolver.c   |  5 +
>  drivers/of/unittest-data/overlay_bad_symbol.dts |  1 +
>  include/linux/of.h  |  6 +-
>  include/linux/of_dma.h  |  5 +
>  include/linux/of_fdt.h  |  5 +
>  include/linux/of_gpio.h |  6 +-
>  include/linux/of_graph.h|  5 +
>  include/linux/of_pdt.h  |  6 +-
>  include/linux/of_platform.h |  7 +--
>  25 files changed, 25 insertions(+), 100 deletions(-)
>
> diff --git a/drivers/of/Kconfig b/drivers/of/Kconfig
> index c2b6c11d29d1..572942c3cb15 100644
> --- a/drivers/of/Kconfig
> +++ b/drivers/of/Kconfig
> @@ -1,3 +1,4 @@
> +# SPDX-License-Identifier: GPL-2.0
>  config DTC
> bool
>
> diff --git a/drivers/of/address.c b/drivers/of/address.c
> index 8591afbdfe99..b48b68c4a7a9 100644
> --- a/drivers/of/address.c
> +++ b/drivers/of/address.c
> @@ -1,4 +1,4 @@
> -
> +// SPDX-License-Identifier: GPL-2.0
>  #define pr_fmt(fmt)"OF: " fmt
>
>  #include 
> diff --git a/drivers/of/base.c b/drivers/of/base.c
> index 26618ba8f92a..dd0b4201f1cc 100644
> --- a/drivers/of/base.c
> +++ b/drivers/of/base.c
> @@ -1,3 +1,4 @@
> +// SPDX-License-Identifier: GPL-2.0+
>  /*
>   * Procedures for creating, accessing and interpreting the device tree.
>   *
> @@ -11,11 +12,6 @@
>   *
>   *  Reconsolidated from arch/x/kernel/prom.c by Stephen Rothwell and
>   *  Grant Likely.
> - *
> - *  This program is free software; you can redistribute it and/or
> - *  modify it under the terms of the GNU General Public License
> - *  as published by the Free Software Foundation; either version
> - *  2 of the License, or (at your option) any later version.
>   */
>
>  #define pr_fmt(fmt)"OF: " fmt
> diff --git a/drivers/of/device.c b/drivers/of/device.c
> index 25bddf9c9fe1..064c818105bd 100644
> --- a/drivers/of/device.c
> +++ b/drivers/of/device.c
> @@ -1,3 +1,4 @@
> +// SPDX-License-Identifier: GPL-2.0
>  #include 
>  #include 
>  #include 
> diff --git a/drivers/of/dynamic.c b/drivers/of/dynamic.c
> index ab988d88704d..7bb33d22b4e2 100644
> --- a/drivers/of/dynamic.c
> +++ b/drivers/of/dynamic.c
> @@ -1,3 +1,4 @@
> +// SPDX-License-Identifier: GPL-2.0
>  /*
>   * Support for dynamic device trees.
>   *
> diff --git a/drivers/of/fdt.c b/drivers/of/fdt.c
> index 4675e5ac4d11..7db5353a24c0 100644
> --- a/drivers/of/fdt.c
> +++ b/drivers/of/fdt.c
> @@ -1,12 +1,9 @@
> +// SPDX-License-Identifier: GPL-2.0
>  /*
>   * Functions for working with the Flattened Device Tree data format
>   *
>   * Copyright 2009 Benjamin Herrenschmidt, IBM Corp
>   * b...@kernel.crashing.org
> - *
> - * This program is free software; you can redistribute it and/or
> - * modify it under the terms of the GNU General Public License
> - * version 2 as published by the Free Software Foundation.
>   */
>
>  #define pr_fmt(fmt)"OF: fdt: " fmt
> diff --git a/drivers/of/fdt_address.c b/drivers/of/fdt_address.c
> index 843a542dac7d..1dc15ab78b10 100644
> --- a/drivers/of/fdt_address.c
> +++ b/drivers/of/fdt_address.c
> @@ -1,3 +1,4 @@
> +// SPDX-License-Identifier: GPL-2.0+
>  /*
>   * FDT Address translation based on u-boot fdt_support.c which in turn was
>   * based on the kernel unflattened DT address translation code.
> @@ -6,11 +7,6 @@
>   * Gerald Van Baren, Custom IDEAS, vanba...@cideas.com
>   *
>   * Copyright 2010-2011 Freescale Semiconductor, Inc.
> - *
> - * This program is free software; you can redistribute it 

Re: [f2fs-dev] [PATCH 1/2] f2fs: show precise # of blocks that user/root can use

2018-01-04 Thread Yunlong Song

NACK

man statfs shows:

struct statfs {
    ...
    fsblkcnt_t   f_bfree;   /* free blocks in fs */
    fsblkcnt_t   f_bavail;  /* free blocks available to
                               unprivileged user */
    ...
}

f_bfree is the number of free blocks in the fs, so buf->f_bfree should be

buf->f_bfree = user_block_count - valid_user_blocks(sbi) + ovp_count;

On 2018/1/4 2:58, Jaegeuk Kim wrote:

Let's show precise # of blocks that user/root can use through bavail and bfree
respectively.

Signed-off-by: Jaegeuk Kim 
---
  fs/f2fs/super.c | 4 ++--
  1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/f2fs/super.c b/fs/f2fs/super.c
index 0a820ba55b10..4c1c99cf54ef 100644
--- a/fs/f2fs/super.c
+++ b/fs/f2fs/super.c
@@ -1005,9 +1005,9 @@ static int f2fs_statfs(struct dentry *dentry, struct kstatfs *buf)
 	buf->f_bsize = sbi->blocksize;
 
 	buf->f_blocks = total_count - start_count;
-	buf->f_bfree = user_block_count - valid_user_blocks(sbi) + ovp_count;
-	buf->f_bavail = user_block_count - valid_user_blocks(sbi) -
+	buf->f_bfree = user_block_count - valid_user_blocks(sbi) -
 						sbi->current_reserved_blocks;
+	buf->f_bavail = buf->f_bfree;
 
 	avail_node_count = sbi->total_node_count - sbi->nquota_files -
 						F2FS_RESERVED_NODE_NUM;


--
Thanks,
Yunlong Song
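
To make the disagreement concrete, here is a small sketch that evaluates both variants of the two assignments quoted above on made-up counters. The toy_ structs and the sample numbers are hypothetical; only the two formulas come from the quoted diff.

```c
#include <stdint.h>

/* Hypothetical snapshot of the counters used in f2fs_statfs(). */
struct toy_counts {
	uint64_t user_block_count;
	uint64_t valid_user_blocks;
	uint64_t ovp_count;
	uint64_t current_reserved_blocks;
};

struct toy_statfs {
	uint64_t f_bfree;
	uint64_t f_bavail;
};

/* The "-" lines: f_bfree includes overprovisioned blocks. */
static struct toy_statfs toy_statfs_old(const struct toy_counts *c)
{
	struct toy_statfs s;

	s.f_bfree = c->user_block_count - c->valid_user_blocks + c->ovp_count;
	s.f_bavail = c->user_block_count - c->valid_user_blocks -
		     c->current_reserved_blocks;
	return s;
}

/* The "+" lines: both fields report the same value. */
static struct toy_statfs toy_statfs_new(const struct toy_counts *c)
{
	struct toy_statfs s;

	s.f_bfree = c->user_block_count - c->valid_user_blocks -
		    c->current_reserved_blocks;
	s.f_bavail = s.f_bfree;
	return s;
}
```

With user_block_count = 1000, valid_user_blocks = 400, ovp_count = 50 and current_reserved_blocks = 20, the old code reports f_bfree = 650 and f_bavail = 580, while the patched code reports 580 for both. The NACK argues that f_bfree should keep the first meaning, free blocks in the filesystem, as the man page describes.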




Re: [PATCH] [v3] x86/doc: add PTI description

2018-01-04 Thread Randy Dunlap
On 01/04/18 21:38, Dave Hansen wrote:

> +Page Table Management
> +=====================
> +
> +When PTI is enabled, the kernel manages two sets of page tables.
> +The first set is very similar to the single set which is present in
> +kernels without PTI.  This includes a complete mapping of userspace
> +that the kernel can use for things like copy_to_user().
> +
> +Although _complete_, the user portion of the kernel page tables is
> +crippled by setting the NX bit in the top level.  This ensures
> +that any missed kernel->user CR3 switch will immediately crash
> +userspace upon executing its first instruction.
> +
> +The userspace page tables map only the kernel data needed to enter
> +and exit the kernel.  This data is entirely contained in the 'struct
> +cpu_entry_area' structure which is placed in the fixmap which gives
> +each CPU's copy of the area has a compile-time-fixed virtual
> +address.

drop /has/ above.

> +
> +For new userspace mappings, the kernel makes the entries in its
> +page tables like normal.  The only difference is when the kernel
> +makes entries in the top (PGD) level.  In addition to setting the
> +entry in the main kernel PGD, a copy of the entry is made in the
> +userspace page tables' PGD.

-- 
~Randy


Re: Avoid speculative indirect calls in kernel

2018-01-04 Thread Dave Hansen
On 01/04/2018 10:49 PM, Willy Tarreau wrote:
> On Fri, Jan 05, 2018 at 01:54:13AM +0100, Thomas Gleixner wrote:
>> On Thu, 4 Jan 2018, Jon Masters wrote:
>>> P.S. I've an internal document where I've been tracking "nice to haves"
>>> for later, and one of them is whether it makes sense to tag binaries as
>>> "trusted" (e.g. extended attribute, label, whatever). It was something I
>>> wanted to bring up at some point as potentially worth considering.
>> Scratch that. There is no such thing as a trusted binary.
> I disagree with you on this Thomas. "trusted" means "we agree to share the
> risk this binary takes because it's critical to our service". When you
> build a load balancing appliance on which 100% of the service is assured
> by a single executable and the rest is just config management, you'd better
> trust that process.

So you want to run this "one binary" as fast as possible and without
mitigations in place?  But, you want mitigations *available* on that
system at the same time?  For what?  If there's only one binary, why not
just disable the mitigations entirely?


Re: Avoid speculative indirect calls in kernel

2018-01-04 Thread Willy Tarreau
On Fri, Jan 05, 2018 at 01:54:13AM +0100, Thomas Gleixner wrote:
> On Thu, 4 Jan 2018, Jon Masters wrote:
> > P.S. I've an internal document where I've been tracking "nice to haves"
> > for later, and one of them is whether it makes sense to tag binaries as
> > "trusted" (e.g. extended attribute, label, whatever). It was something I
> > wanted to bring up at some point as potentially worth considering.
> 
> Scratch that. There is no such thing as a trusted binary.

I disagree with you on this Thomas. "trusted" means "we agree to share the
risk this binary takes because it's critical to our service". When you
build a load balancing appliance on which 100% of the service is assured
by a single executable and the rest is just config management, you'd better
trust that process. If the binary or process cannot be trusted, the product
is dead anyway. It doesn't mean the binary is safe. It just means that for
the product there's nothing worse than its compromise or failure. And
when it suffers from the performance impact of workarounds supposed to
protect the whole device against this process' possible abuses, you
easily see how the situation becomes ridiculous.

We need to still think about performance a lot. There's already an ongoing
trend of kernel bypass mechanisms in the wild for performance reasons, and
the new increase of syscall costs will necessarily amplify this willingness
to avoid the kernel. I personally don't want to see the kernel being reduced
to booting and executing SSH to manage the machines.

Willy



Re: [PATCH 04/12] pci-p2p: Clear ACS P2P flags for all client devices

2018-01-04 Thread Jerome Glisse
On Thu, Jan 04, 2018 at 08:33:00PM -0700, Alex Williamson wrote:
> On Thu, 4 Jan 2018 17:00:47 -0700
> Logan Gunthorpe  wrote:
> 
> > On 04/01/18 03:35 PM, Alex Williamson wrote:
> > > Yep, flipping these ACS bits invalidates any IOMMU groups that depend
> > > on the isolation of that downstream port and I suspect also any peers
> > > within the same PCI slot of that port and their downstream devices.  The
> > > entire sub-hierarchy grouping needs to be re-evaluated.  This
> > > potentially affects running devices that depend on that isolation, so
> > > I'm not sure how that happens dynamically.  A boot option might be
> > > easier.  Thanks,  
> > 
> > I don't see how this is the case in current kernel code. It appears to 
> > only enable ACS globally if the IOMMU requests it.
> 
> IOMMU groups don't exist unless the IOMMU is enabled and x86 and ARM
> both request ACS be enabled if an IOMMU is present, so I'm not sure
> what you're getting at here.  Also, in reply to your other email, if
> the IOMMU is enabled, every device handled by the IOMMU is a member of
> an IOMMU group, see struct device.iommu_group.  There's an
> iommu_group_get() accessor to get a reference to it.
>  
> > I also don't see how turning off ACS isolation for a specific device is 
> > going to hurt anything. The IOMMU should still be able to keep going on 
> > unaware that anything has changed. The only worry is that a security 
> > hole may now be created if a user was relying on the isolation between 
> > two devices that are in different VMs or something. However, if a user 
> > was relying on this, they probably shouldn't have turned on P2P in the 
> > first place.
> 
> That's exactly what IOMMU groups represent, the smallest set of devices
> which have DMA isolation from other devices.  By poking this hole, the
> IOMMU group is invalid.  We cannot turn off ACS only for a specific
> device, in order to enable p2p it needs to be disabled at every
> downstream port between the devices where we want to enable p2p.
> Depending on the topology, that could mean we're also enabling p2p for
> unrelated devices.  Those unrelated devices might be in active use and
> the p2p IOVAs now have a different destination which is no longer IOMMU
> translated.
>  
> > We started with a fairly unintelligent choice to simply disable ACS on 
> > any kernel that had CONFIG_PCI_P2P set. However, this did not seem like 
> > a good idea going forward. Instead, we now selectively disable the ACS 
> > bit only on the downstream ports that are involved in P2P transactions. 
> > This seems like the safest choice and still allows people to (carefully) 
> > use P2P adjacent to other devices that need to be isolated.
> 
> I don't see that the code is doing much checking that adjacent devices
> are also affected by the p2p change and of course the IOMMU group is
> entirely invalid once the p2p holes start getting poked.
> 
> > I don't think anyone wants another boot option that must be set in order 
> > to use this functionality (and only some hardware would require this). 
> > That's just a huge pain for users.
> 
> No, but nor do we need IOMMU groups that no longer represent what
> they're intended to describe or runtime, unchecked routing changes
> through the topology for devices that might already be using
> conflicting IOVA ranges.  Maybe soft hotplugs are another possibility,
> designate a sub-hierarchy to be removed and re-scanned with ACS
> disabled.  Otherwise it seems like disabling and re-enabling ACS needs
> to also handle merging and splitting groups dynamically.  Thanks,
> 

Dumb question: can we map a PCI BAR address of one device into the
IOMMU page table of another device, i.e. like we would DMA map a
regular system page?

In my view it would be much better to go down that path if it is at
all possible from a hardware point of view (I am not sure where to dig
in the specification to answer the question above).

Cheers,
Jérôme


Re: [PATCH 1/2] Move kfree_call_rcu() to slab_common.c

2018-01-04 Thread Joe Perches
On Thu, 2018-01-04 at 16:07 -0800, Matthew Wilcox wrote:
> On Thu, Jan 04, 2018 at 03:47:32PM -0800, Paul E. McKenney wrote:
> > I was under the impression that typeof did not actually evaluate its
> > argument, but rather only returned its type.  And there are a few macros
> > with this pattern in mainline.
> > 
> > Or am I confused about what typeof does?
> 
> I think checkpatch is confused by the '*' in the typeof argument:
> 
> $ git diff |./scripts/checkpatch.pl --strict
> CHECK: Macro argument reuse 'ptr' - possible side-effects?
> #29: FILE: include/linux/rcupdate.h:896:
> +#define kfree_rcu(ptr, rcu_head)\
> + __kfree_rcu(&((ptr)->rcu_head), offsetof(typeof(*(ptr)), rcu_head))
> 
> If one removes the '*', the warning goes away.
> 
> I'm no perlista, but Joe, would this regexp modification make sense?
> 
> +++ b/scripts/checkpatch.pl
> @@ -4957,7 +4957,7 @@ sub process {
> next if ($arg =~ /\.\.\./);
> next if ($arg =~ /^type$/i);
> my $tmp_stmt = $define_stmt;
> -   $tmp_stmt =~ s/\b(typeof|__typeof__|__builtin\w+|typecheck\s*\(\s*$Type\s*,|\#+)\s*\(*\s*$arg\s*\)*\b//g;
> +   $tmp_stmt =~ s/\b(typeof|__typeof__|__builtin\w+|typecheck\s*\(\s*$Type\s*,|\#+)\s*\(*\**\(*\s*$arg\s*\)*\b//g;

I suppose ideally it'd be more like

$tmp_stmt =~ s/\b(?:typeof|__typeof__|__builtin\w+|typecheck\s*\(\s*$Type\s*,|\#+)\s*\(*(?:\s*\*\s*)*\s*\(*\s*$arg\s*\)*\b//g;

This adds ?: at the start so the group does not capture, and
(?:\s*\*\s*)* to match any number of '*' with any surrounding
whitespace.


Re: mmotm 2018-01-04-16-19 uploaded

2018-01-04 Thread Anshuman Khandual
On 01/05/2018 05:50 AM, a...@linux-foundation.org wrote:
> The mm-of-the-moment snapshot 2018-01-04-16-19 has been uploaded to
> 
>http://www.ozlabs.org/~akpm/mmotm/
> 
> mmotm-readme.txt says
> 
> README for mm-of-the-moment:
> 
> http://www.ozlabs.org/~akpm/mmotm/
> 
> This is a snapshot of my -mm patch queue.  Uploaded at random hopefully
> more than once a week.
> 
> You will need quilt to apply these patches to the latest Linus release (4.x
> or 4.x-rcY).  The series file is in broken-out.tar.gz and is duplicated in
> http://ozlabs.org/~akpm/mmotm/series
> 
> The file broken-out.tar.gz contains two datestamp files: .DATE and
> .DATE-yyyy-mm-dd-hh-mm-ss.  Both contain the string yyyy-mm-dd-hh-mm-ss,
> followed by the base kernel version against which this patch series is to
> be applied.
> 
> This tree is partially included in linux-next.  To see which patches are
> included in linux-next, consult the `series' file.  Only the patches
> within the #NEXT_PATCHES_START/#NEXT_PATCHES_END markers are included in
> linux-next.
> 
> A git tree which contains the memory management portion of this tree is
> maintained at git://git.kernel.org/pub/scm/linux/kernel/git/mhocko/mm.git

It seems this latest snapshot, mmotm-2018-01-04-16-19, has not been
pushed to this git tree. I could not fetch it, nor does it show up at
the http link below.

https://git.kernel.org/pub/scm/linux/kernel/git/mhocko/mm.git

The last one, mmotm-2017-12-22-17-55, seems to have a regression on
powerpc with respect to ELF loading of binaries (see below). It seems to
be related to the recent MAP_FIXED_SAFE (or MAP_FIXED_NOREPLACE, as seen
now in the code). IIUC (I have not been following the series over the
last month), MAP_FIXED_NOREPLACE will fail an allocation request if the
hint address cannot be reserved, instead of changing existing mappings.
Is it possible that ELF loading needs to be fixed at a higher level to
deal with these new possible mmap() failures caused by
MAP_FIXED_NOREPLACE?

[   22.448068] 9060 (hostname): Uhuuh, elf segment at 1002 requested but the memory is mapped already
[   22.450135] 9063 (sed): Uhuuh, elf segment at 1003 requested but the memory is mapped already
[   22.456484] 9066 (hostname): Uhuuh, elf segment at 1002 requested but the memory is mapped already
[   22.458171] 9069 (sed): Uhuuh, elf segment at 1003 requested but the memory is mapped already
[   22.505341] 9078 (sed): Uhuuh, elf segment at 1003 requested but the memory is mapped already
[   22.506961] 9081 (sed): Uhuuh, elf segment at 1003 requested but the memory is mapped already
[   22.508736] 9084 (sed): Uhuuh, elf segment at 1003 requested but the memory is mapped already
[   22.510589] 9087 (sed): Uhuuh, elf segment at 1003 requested but the memory is mapped already
[   22.512442] 9090 (sed): Uhuuh, elf segment at 1003 requested but the memory is mapped already
[   22.514685] 9093 (sed): Uhuuh, elf segment at 1003 requested but the memory is mapped already
[   22.565793] 9103 (sed): Uhuuh, elf segment at 1003 requested but the memory is mapped already
[   22.567874] 9106 (sed): Uhuuh, elf segment at 1003 requested but the memory is mapped already
[  123.469490] 9173 (fprintd): Uhuuh, elf segment at 1002 requested but the memory is mapped already
[  137.468372] 9182 (hostname): Uhuuh, elf segment at 1002 requested but the memory is mapped already
[  137.644647] 9205 (pkg-config): Uhuuh, elf segment at 1002 requested but the memory is mapped already
[  137.811893] 9219 (sed): Uhuuh, elf segment at 1003 requested but the memory is mapped already
[  164.739135] 9232 (less): Uhuuh, elf segment at 1004 requested but the memory is mapped already



[PATCH net-next] net: tracepoint: adding new tracepoint arguments in inet_sock_set_state

2018-01-04 Thread Yafang Shao
sk->sk_protocol and sk->sk_family are exposed as tracepoint arguments.
Then we can conveniently use these two arguments to do the filter.

Suggested-by: Brendan Gregg 
Signed-off-by: Yafang Shao 
---
 include/trace/events/sock.h | 24 ++--
 net/ipv4/af_inet.c  |  6 --
 2 files changed, 22 insertions(+), 8 deletions(-)

diff --git a/include/trace/events/sock.h b/include/trace/events/sock.h
index 3537c5f..c7df70f 100644
--- a/include/trace/events/sock.h
+++ b/include/trace/events/sock.h
@@ -11,7 +11,11 @@
 #include 
 #include 

-/* The protocol traced by sock_set_state */
+#define family_names   \
+   EM(AF_INET) \
+   EMe(AF_INET6)
+
+/* The protocol traced by inet_sock_set_state */
 #define inet_protocol_names\
EM(IPPROTO_TCP) \
EM(IPPROTO_DCCP)\
@@ -37,6 +41,7 @@
 #define EM(a)   TRACE_DEFINE_ENUM(a);
 #define EMe(a)  TRACE_DEFINE_ENUM(a);

+family_names
 inet_protocol_names
 tcp_state_names

@@ -45,6 +50,9 @@
 #define EM(a)   { a, #a },
 #define EMe(a)  { a, #a }

+#define show_family_name(val)  \
+   __print_symbolic(val, family_names)
+
 #define show_inet_protocol_name(val)\
__print_symbolic(val, inet_protocol_names)

@@ -108,9 +116,10 @@

 TRACE_EVENT(inet_sock_set_state,

-   TP_PROTO(const struct sock *sk, const int oldstate, const int newstate),
+   TP_PROTO(const struct sock *sk, const int family, const int protocol,
+   const int oldstate, const int newstate),

-   TP_ARGS(sk, oldstate, newstate),
+   TP_ARGS(sk, family, protocol, oldstate, newstate),

TP_STRUCT__entry(
__field(const void *, skaddr)
@@ -118,6 +127,7 @@
__field(int, newstate)
__field(__u16, sport)
__field(__u16, dport)
+   __field(__u16, family)
__field(__u8, protocol)
__array(__u8, saddr, 4)
__array(__u8, daddr, 4)
@@ -133,8 +143,9 @@
__entry->skaddr = sk;
__entry->oldstate = oldstate;
__entry->newstate = newstate;
+   __entry->family = family;
+   __entry->protocol = protocol;

-   __entry->protocol = sk->sk_protocol;
__entry->sport = ntohs(inet->inet_sport);
__entry->dport = ntohs(inet->inet_dport);

@@ -145,7 +156,7 @@
*p32 =  inet->inet_daddr;

 #if IS_ENABLED(CONFIG_IPV6)
-   if (sk->sk_family == AF_INET6) {
+   if (family == AF_INET6) {
pin6 = (struct in6_addr *)__entry->saddr_v6;
*pin6 = sk->sk_v6_rcv_saddr;
pin6 = (struct in6_addr *)__entry->daddr_v6;
@@ -160,7 +171,8 @@
}
),

-   TP_printk("protocol=%s sport=%hu dport=%hu saddr=%pI4 daddr=%pI4 saddrv6=%pI6c daddrv6=%pI6c oldstate=%s newstate=%s",
+   TP_printk("family=%s protocol=%s sport=%hu dport=%hu saddr=%pI4 daddr=%pI4 saddrv6=%pI6c daddrv6=%pI6c oldstate=%s newstate=%s",
+   show_family_name(__entry->family),
show_inet_protocol_name(__entry->protocol),
__entry->sport, __entry->dport,
__entry->saddr, __entry->daddr,
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index bab98a4..1d52796 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -1223,14 +1223,16 @@ int inet_sk_rebuild_header(struct sock *sk)

 void inet_sk_set_state(struct sock *sk, int state)
 {
-   trace_inet_sock_set_state(sk, sk->sk_state, state);
+   trace_inet_sock_set_state(sk, sk->sk_family, sk->sk_protocol,
+   sk->sk_state, state);
sk->sk_state = state;
 }
 EXPORT_SYMBOL(inet_sk_set_state);

 void inet_sk_state_store(struct sock *sk, int newstate)
 {
-   trace_inet_sock_set_state(sk, sk->sk_state, newstate);
+   trace_inet_sock_set_state(sk, sk->sk_family, sk->sk_protocol,
+   sk->sk_state, newstate);
smp_store_release(&sk->sk_state, newstate);
 }

--
1.8.3.1



[PATCH net-next] net: tracepoint: adding new tracepoint arguments in inet_sock_set_state

2018-01-04 Thread Yafang Shao
sk->sk_protocol and sk->sk_family are exposed as tracepoint arguments.
Then we can conveniently use these two arguments to do the filter.

Suggested-by: Brendan Gregg 
Signed-off-by: Yafang Shao 
---
 include/trace/events/sock.h | 24 ++--
 net/ipv4/af_inet.c  |  6 --
 2 files changed, 22 insertions(+), 8 deletions(-)

diff --git a/include/trace/events/sock.h b/include/trace/events/sock.h
index 3537c5f..c7df70f 100644
--- a/include/trace/events/sock.h
+++ b/include/trace/events/sock.h
@@ -11,7 +11,11 @@
 #include 
 #include 

-/* The protocol traced by sock_set_state */
+#define family_names   \
+   EM(AF_INET) \
+   EMe(AF_INET6)
+
+/* The protocol traced by inet_sock_set_state */
 #define inet_protocol_names\
EM(IPPROTO_TCP) \
EM(IPPROTO_DCCP)\
@@ -37,6 +41,7 @@
 #define EM(a)   TRACE_DEFINE_ENUM(a);
 #define EMe(a)  TRACE_DEFINE_ENUM(a);

+family_names
 inet_protocol_names
 tcp_state_names

@@ -45,6 +50,9 @@
 #define EM(a)   { a, #a },
 #define EMe(a)  { a, #a }

+#define show_family_name(val)  \
+   __print_symbolic(val, family_names)
+
 #define show_inet_protocol_name(val)\
__print_symbolic(val, inet_protocol_names)

@@ -108,9 +116,10 @@

 TRACE_EVENT(inet_sock_set_state,

-   TP_PROTO(const struct sock *sk, const int oldstate, const int newstate),
+   TP_PROTO(const struct sock *sk, const int family, const int protocol,
+   const int oldstate, const int newstate),

-   TP_ARGS(sk, oldstate, newstate),
+   TP_ARGS(sk, family, protocol, oldstate, newstate),

TP_STRUCT__entry(
__field(const void *, skaddr)
@@ -118,6 +127,7 @@
__field(int, newstate)
__field(__u16, sport)
__field(__u16, dport)
+   __field(__u16, family)
__field(__u8, protocol)
__array(__u8, saddr, 4)
__array(__u8, daddr, 4)
@@ -133,8 +143,9 @@
__entry->skaddr = sk;
__entry->oldstate = oldstate;
__entry->newstate = newstate;
+   __entry->family = family;
+   __entry->protocol = protocol;

-   __entry->protocol = sk->sk_protocol;
__entry->sport = ntohs(inet->inet_sport);
__entry->dport = ntohs(inet->inet_dport);

@@ -145,7 +156,7 @@
*p32 =  inet->inet_daddr;

 #if IS_ENABLED(CONFIG_IPV6)
-   if (sk->sk_family == AF_INET6) {
+   if (family == AF_INET6) {
pin6 = (struct in6_addr *)__entry->saddr_v6;
*pin6 = sk->sk_v6_rcv_saddr;
pin6 = (struct in6_addr *)__entry->daddr_v6;
@@ -160,7 +171,8 @@
}
),

-   TP_printk("protocol=%s sport=%hu dport=%hu saddr=%pI4 daddr=%pI4 saddrv6=%pI6c daddrv6=%pI6c oldstate=%s newstate=%s",
+   TP_printk("family=%s protocol=%s sport=%hu dport=%hu saddr=%pI4 daddr=%pI4 saddrv6=%pI6c daddrv6=%pI6c oldstate=%s newstate=%s",
+   show_family_name(__entry->family),
show_inet_protocol_name(__entry->protocol),
__entry->sport, __entry->dport,
__entry->saddr, __entry->daddr,
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index bab98a4..1d52796 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -1223,14 +1223,16 @@ int inet_sk_rebuild_header(struct sock *sk)

 void inet_sk_set_state(struct sock *sk, int state)
 {
-   trace_inet_sock_set_state(sk, sk->sk_state, state);
+   trace_inet_sock_set_state(sk, sk->sk_family, sk->sk_protocol,
+   sk->sk_state, state);
sk->sk_state = state;
 }
 EXPORT_SYMBOL(inet_sk_set_state);

 void inet_sk_state_store(struct sock *sk, int newstate)
 {
-   trace_inet_sock_set_state(sk, sk->sk_state, newstate);
+   trace_inet_sock_set_state(sk, sk->sk_family, sk->sk_protocol,
+   sk->sk_state, newstate);
smp_store_release(&sk->sk_state, newstate);
 }

--
1.8.3.1
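Editor's note: the EM()/EMe() idiom used in the header above can be hard to follow from the diff alone. The following is a minimal userspace sketch of the same list-macro trick, not the kernel's implementation. `family_count`, `struct sym` and the lookup body of `show_family_name()` are hypothetical stand-ins (the real header expands the list through TRACE_DEFINE_ENUM() and __print_symbolic()), and the address-family values come from the userspace `<sys/socket.h>`.

```c
#include <stddef.h>
#include <sys/socket.h>   /* AF_INET, AF_INET6 */

/* One list of names, expanded several times with different EM()/EMe(). */
#define family_names \
	EM(AF_INET)  \
	EMe(AF_INET6)

/* First expansion: count the entries (stand-in for TRACE_DEFINE_ENUM). */
#define EM(a)  + 1
#define EMe(a) + 1
enum { family_count = 0 family_names };
#undef EM
#undef EMe

/* Second expansion: build { value, "name" } pairs. EMe() marks the last
 * entry, so it carries no trailing comma. */
struct sym { int val; const char *name; };
#define EM(a)  { a, #a },
#define EMe(a) { a, #a }
static const struct sym family_syms[] = { family_names };
#undef EM
#undef EMe

/* Hypothetical stand-in for __print_symbolic(): linear lookup by value.
 * Returning "UNKNOWN" for unlisted values is a simplification of this sketch. */
static const char *show_family_name(int val)
{
	for (size_t i = 0; i < (size_t)family_count; i++)
		if (family_syms[i].val == val)
			return family_syms[i].name;
	return "UNKNOWN";
}
```

With this in place, `show_family_name(AF_INET6)` yields the string "AF_INET6", which is the kind of symbolic output the new `family=%s` field in TP_printk() relies on.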



Re: [Ocfs2-devel] [PATCH v2] ocfs2: try a blocking lock before return AOP_TRUNCATED_PAGE

2018-01-04 Thread Gang He
Hi Andrew,

Happy new year.
Could you help to pick up this patch? It fixes an old patch, 
1cce4df04f37.
Without this patch, some multi-node test cases trigger 
softlockup problems
and can also make the HA communication daemon (e.g. corosync) time out, so that the node has 
to be fenced.

Thanks
Gang  


>>> 

> 
> On 17/12/28 15:48, Gang He wrote:
>> If we can't get the inode lock immediately in
>> ocfs2_inode_lock_with_page() when reading a page, we should not
>> return directly, since this leads to a softlockup problem
>> when the kernel is built without CONFIG_PREEMPT.
>> The fix is to take a blocking lock and immediately unlock it before
>> returning. This avoids wasting CPU on repeated retries,
>> improves fairness in acquiring the lock among multiple nodes, and
>> increases efficiency when the same file is modified frequently from
>> multiple nodes.
>> The softlockup crash (with /proc/sys/kernel/softlockup_panic set to 1)
>> looks like this:
>> CPU: 0 PID: 885 Comm: multi_mmap Tainted: G L 4.12.14-6.1-default #1
>> Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
>> Call Trace:
>>   
>>   dump_stack+0x5c/0x82
>>   panic+0xd5/0x21e
>>   watchdog_timer_fn+0x208/0x210
>>   ? watchdog_park_threads+0x70/0x70
>>   __hrtimer_run_queues+0xcc/0x200
>>   hrtimer_interrupt+0xa6/0x1f0
>>   smp_apic_timer_interrupt+0x34/0x50
>>   apic_timer_interrupt+0x96/0xa0
>>   
>>  RIP: 0010:unlock_page+0x17/0x30
>>  RSP: :af154080bc88 EFLAGS: 0246 ORIG_RAX: ff10
>>  RAX: dead0100 RBX: f21e009f5300 RCX: 0004
>>  RDX: dead00ff RSI: 0202 RDI: f21e009f5300
>>  RBP:  R08:  R09: af154080bb00
>>  R10: af154080bc30 R11: 0040 R12: 993749a39518
>>  R13:  R14: f21e009f5300 R15: f21e009f5300
>>   ocfs2_inode_lock_with_page+0x25/0x30 [ocfs2]
>>   ocfs2_readpage+0x41/0x2d0 [ocfs2]
>>   ? pagecache_get_page+0x30/0x200
>>   filemap_fault+0x12b/0x5c0
>>   ? recalc_sigpending+0x17/0x50
>>   ? __set_task_blocked+0x28/0x70
>>   ? __set_current_blocked+0x3d/0x60
>>   ocfs2_fault+0x29/0xb0 [ocfs2]
>>   __do_fault+0x1a/0xa0
>>   __handle_mm_fault+0xbe8/0x1090
>>   handle_mm_fault+0xaa/0x1f0
>>   __do_page_fault+0x235/0x4b0
>>   trace_do_page_fault+0x3c/0x110
>>   async_page_fault+0x28/0x30
>>  RIP: 0033:0x7fa75ded638e
>>  RSP: 002b:7ffd6657db18 EFLAGS: 00010287
>>  RAX: 55c7662fb700 RBX: 0001 RCX: 55c7662fb700
>>  RDX: 1770 RSI: 7fa75e909000 RDI: 55c7662fb700
>>  RBP: 0003 R08: 000e R09: 
>>  R10: 0483 R11: 7fa75ded61b0 R12: 7fa75e90a770
>>  R13: 000e R14: 1770 R15: 
>> 
>> As for performance, the test time is reduced and CPU utilization
>> decreases; the detailed data follows.
>> I ran the multi_mmap test case from the ocfs2-test package on a three-node cluster.
>> Before applying this patch:
>>   PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
>>  2754 ocfs2te+  20   0  170248   6980   4856 D 80.73 0.341   0:18.71 multi_mmap
>>  1505 root      rt   0      36 123060  97224 S 2.658 6.015   0:01.44 corosync
>>     5 root      20   0       0      0      0 S 1.329 0.000   0:00.19 kworker/u8:0
>>    95 root      20   0       0      0      0 S 1.329 0.000   0:00.25 kworker/u8:1
>>  2728 root      20   0       0      0      0 S 0.997 0.000   0:00.24 jbd2/sda1-33
>>  2721 root      20   0       0      0      0 S 0.664 0.000   0:00.07 ocfs2dc-3C8CFD4
>>  2750 ocfs2te+  20   0  142976   4652   3532 S 0.664 0.227   0:00.28 mpirun
>> 
>> ocfs2test@tb-node2:~>multiple_run.sh -i ens3 -k ~/linux-4.4.21-69.tar.gz -o 
>> ~/ocfs2mullog -C hacluster -s pcmk -n tb-node2,tb-node1,tb-node3 -d 
>> /dev/sda1 -b 4096 -c 32768 -t multi_mmap /mnt/shared
>> Tests with "-b 4096 -C 32768"
>> Thu Dec 28 14:44:52 CST 2017
>> multi_mmap..Passed.
>> Runtime 783 seconds.
>> 
>> After applying this patch:
>>   PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
>>  2508 ocfs2te+  20   0  170248   6804   4680 R 54.00 0.333   0:55.37 multi_mmap
>>   155 root      20   0       0      0      0 S 2.667 0.000   0:01.20 kworker/u8:3
>>    95 root      20   0       0      0      0 S 2.000 0.000   0:01.58 kworker/u8:1
>>  2504 ocfs2te+  20   0  142976   4604   3480 R 1.667 0.225   0:01.65 mpirun
>>     5 root      20   0       0      0      0 S 1.000 0.000   0:01.36 kworker/u8:0
>>  2482 root      20   0       0      0      0 S 1.000 0.000   0:00.86 jbd2/sda1-33
>>   299 root       0 -20       0      0      0 S 0.333 0.000   0:00.13 kworker/2:1H
>>   335 root       0 -20       0      0      0 S 0.333 0.000   0:00.17 kworker/1:1H
>>   535 root  

Re: [PATCH 05/23] x86, kaiser: unmap kernel from userspace page tables (core patch)

2018-01-04 Thread Dave Hansen
On 01/04/2018 10:16 PM, Yisheng Xie wrote:
> BTW, we have just reported a bug caused by kaiser[1], which looks like
> caused by SMEP. Could you please help to have a look?
> 
> [1] https://lkml.org/lkml/2018/1/5/3

Please report that to your kernel vendor.  Your EFI page tables have the
NX bit set on the low addresses.  There have been a bunch of iterations
of this, but you need to make sure that the EFI kernel mappings don't
get _PAGE_NX set on them.  Look at what __pti_set_user_pgd() does in
mainline.
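
Editor's note: the check Dave refers to can be paraphrased as a small bit test. The sketch below is a userspace model of the rule as I understand it from mainline, not the kernel source: only entries that are both present and user-accessible get NX in the kernel's copy, so kernel-only mappings in the low half (such as EFI runtime mappings) stay executable. The SKETCH_* names and `kernel_copy_of_pgd()` are hypothetical; the bit positions are the architectural x86-64 ones (P = bit 0, U/S = bit 2, NX = bit 63).

```c
#include <stdbool.h>
#include <stdint.h>

/* x86-64 page-table entry flag bits (architectural positions). */
#define SKETCH_PAGE_PRESENT (UINT64_C(1) << 0)
#define SKETCH_PAGE_USER    (UINT64_C(1) << 2)
#define SKETCH_PAGE_NX      (UINT64_C(1) << 63)

/*
 * Model of the decision for a PGD entry that covers the userspace half
 * of the address space. Returns the value the kernel-side copy would
 * hold. nx_supported models "the CPU supports NX". Assumption: this
 * mirrors the intent of the mainline helper, not its exact code.
 */
static uint64_t kernel_copy_of_pgd(uint64_t pgd, bool nx_supported)
{
	const uint64_t user_present = SKETCH_PAGE_USER | SKETCH_PAGE_PRESENT;

	/* Poison only real user mappings; leave kernel-only ones alone. */
	if ((pgd & user_present) == user_present && nx_supported)
		pgd |= SKETCH_PAGE_NX;
	return pgd;
}
```

An entry with only the present bit set (the EFI-like case) comes back unchanged, while a present, user-accessible entry comes back with bit 63 set.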


[PATCH] f2fs: implement cgroup writeback support

2018-01-04 Thread Yufen Yu
Cgroup writeback requires explicit support from the filesystem.
f2fs's data and node writeback IOs go through __write_data_page,
which sets up fio for submitting IOs. So, we add io_wbc to fio,
associate bios with blkcg by invoking wbc_init_bio(), and
account for issued IOs via wbc_account_io().
In addition, f2fs_fill_super() is updated to set SB_I_CGROUPWB.

Meta writeback IOs are left alone by this patch and will always be
attributed to the root cgroup.

The results show that f2fs can throttle writeback nicely for
data writing and file creating.

Signed-off-by: Yufen Yu 
---
 fs/f2fs/data.c  | 11 +--
 fs/f2fs/f2fs.h  |  1 +
 fs/f2fs/node.c  |  1 +
 fs/f2fs/super.c |  1 +
 4 files changed, 12 insertions(+), 2 deletions(-)

diff --git a/fs/f2fs/data.c b/fs/f2fs/data.c
index 516fa0d..402df03 100644
--- a/fs/f2fs/data.c
+++ b/fs/f2fs/data.c
@@ -169,6 +169,7 @@ static bool __same_bdev(struct f2fs_sb_info *sbi,
  * Low-level block read/write IO operations.
  */
 static struct bio *__bio_alloc(struct f2fs_sb_info *sbi, block_t blk_addr,
+   struct writeback_control *wbc,
int npages, bool is_read)
 {
struct bio *bio;
@@ -178,6 +179,8 @@ static struct bio *__bio_alloc(struct f2fs_sb_info *sbi, block_t blk_addr,
f2fs_target_device(sbi, blk_addr, bio);
bio->bi_end_io = is_read ? f2fs_read_end_io : f2fs_write_end_io;
bio->bi_private = is_read ? NULL : sbi;
+   if (wbc)
+   wbc_init_bio(wbc, bio);
 
return bio;
 }
@@ -373,7 +376,8 @@ int f2fs_submit_page_bio(struct f2fs_io_info *fio)
f2fs_trace_ios(fio, 0);
 
/* Allocate a new bio */
-   bio = __bio_alloc(fio->sbi, fio->new_blkaddr, 1, is_read_io(fio->op));
+   bio = __bio_alloc(fio->sbi, fio->new_blkaddr, fio->io_wbc,
+   1, is_read_io(fio->op));
 
if (bio_add_page(bio, page, PAGE_SIZE, 0) < PAGE_SIZE) {
bio_put(bio);
@@ -435,7 +439,7 @@ int f2fs_submit_page_write(struct f2fs_io_info *fio)
dec_page_count(sbi, WB_DATA_TYPE(bio_page));
goto out_fail;
}
-   io->bio = __bio_alloc(sbi, fio->new_blkaddr,
+   io->bio = __bio_alloc(sbi, fio->new_blkaddr, fio->io_wbc,
BIO_MAX_PAGES, false);
io->fio = *fio;
}
@@ -443,6 +447,8 @@ int f2fs_submit_page_write(struct f2fs_io_info *fio)
if (bio_add_page(io->bio, bio_page, PAGE_SIZE, 0) < PAGE_SIZE) {
__submit_merged_bio(io);
goto alloc_new;
+   } else if (fio->io_wbc) {
+   wbc_account_io(fio->io_wbc, bio_page, PAGE_SIZE);
}
 
io->last_block_in_bio = fio->new_blkaddr;
@@ -1508,6 +1514,7 @@ static int __write_data_page(struct page *page, bool *submitted,
.submitted = false,
.need_lock = LOCK_RETRY,
.io_type = io_type,
+   .io_wbc = wbc,
};
 
trace_f2fs_writepage(page, DATA);
diff --git a/fs/f2fs/f2fs.h b/fs/f2fs/f2fs.h
index 6abf26c..4887dde 100644
--- a/fs/f2fs/f2fs.h
+++ b/fs/f2fs/f2fs.h
@@ -957,6 +957,7 @@ struct f2fs_io_info {
int need_lock;  /* indicate we need to lock cp_rwsem */
bool in_list;   /* indicate fio is in io_list */
enum iostat_type io_type;   /* io type */
+   struct writeback_control *io_wbc; /* writeback control */
 };
 
 #define is_read_io(rw) ((rw) == READ)
diff --git a/fs/f2fs/node.c b/fs/f2fs/node.c
index d332275..e4f8bb0 100644
--- a/fs/f2fs/node.c
+++ b/fs/f2fs/node.c
@@ -1336,6 +1336,7 @@ static int __write_node_page(struct page *page, bool atomic, bool *submitted,
.encrypted_page = NULL,
.submitted = false,
.io_type = io_type,
+   .io_wbc = wbc,
};
 
trace_f2fs_writepage(page, NODE);
diff --git a/fs/f2fs/super.c b/fs/f2fs/super.c
index 708155d..deeba98 100644
--- a/fs/f2fs/super.c
+++ b/fs/f2fs/super.c
@@ -2475,6 +2475,7 @@ static int f2fs_fill_super(struct super_block *sb, void *data, int silent)
sb->s_flags = (sb->s_flags & ~SB_POSIXACL) |
(test_opt(sbi, POSIX_ACL) ? SB_POSIXACL : 0);
memcpy(&sb->s_uuid, raw_super->uuid, sizeof(raw_super->uuid));
+   sb->s_iflags |= SB_I_CGROUPWB;
 
/* init f2fs-specific super block info */
sbi->valid_super_block = valid_super_block;
-- 
2.9.5
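
Editor's note: the flow described in the commit message above (an optional writeback context carried in fio, attached to the bio when present, and charged per page added) can be modelled in a few lines of userspace C. Everything here is a hypothetical stand-in, not kernel code: the *_model types replace struct writeback_control, struct bio and struct f2fs_io_info, and the two functions only play the roles that wbc_init_bio() and wbc_account_io() play in the patch.

```c
#include <stddef.h>

#define MODEL_PAGE_SIZE 4096u

/* Hypothetical stand-ins for the kernel structures touched by the patch. */
struct wbc_model { size_t accounted_bytes; };        /* writeback_control */
struct bio_model { const struct wbc_model *owner; }; /* bio               */
struct fio_model { struct wbc_model *io_wbc; };      /* f2fs_io_info      */

/* Role of __bio_alloc() + wbc_init_bio(): associate the bio with the
 * writeback context only when one was passed in. */
static void bio_alloc_model(struct bio_model *bio, struct wbc_model *wbc)
{
	bio->owner = NULL;          /* unowned: falls to the root cgroup */
	if (wbc)
		bio->owner = wbc;
}

/* Role of the wbc_account_io() call after a page was added to the bio. */
static void account_page_model(struct fio_model *fio)
{
	if (fio->io_wbc)
		fio->io_wbc->accounted_bytes += MODEL_PAGE_SIZE;
}
```

In this model a data or node write sets `io_wbc` and is charged page by page, while a meta write leaves `io_wbc` NULL and is never charged, which matches the statement that meta IO stays attributed to the root cgroup.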



Re: [linux-sunxi] Re: [PATCH 06/11] dt-bindings: display: sun4i-drm: Add A83T HDMI pipeline

2018-01-04 Thread Jernej Škrabec
Hi,

On Friday, 5 January 2018 at 03:49:09 CET, Icenowy Zheng wrote:
> On 5 January 2018 at 2:52:10 AM GMT+08:00, Maxime Ripard wrote:
> >On Wed, Jan 03, 2018 at 10:32:26PM +0100, Jernej Škrabec wrote:
> >> Hi Rob,
> >> 
> >> On Wednesday, 3 January 2018 at 21:21:54 CET, Rob Herring wrote:
> >> > On Sat, Dec 30, 2017 at 10:01:58PM +0100, Jernej Skrabec wrote:
> >> > > This commit adds all necessary compatibles and descriptions needed to
> >> > > implement A83T HDMI pipeline.
> >> > > 
> >> > > Mixer is already properly described, so only compatible is added.
> >> > > 
> >> > > However, A83T TCON1, which is connected to HDMI, doesn't have channel 0,
> >> > > contrary to all TCONs currently described. Because of that, TCON
> >> > > documentation is extended.
> >> > > 
> >> > > A83T features Synopsys DW HDMI controller with a custom PHY which looks
> >> > > like Synopsys Gen2 PHY with few additions. Since there is no
> >> > > documentation, needed properties were found out through experimentation
> >> > > and reading BSP code.
> >> > > 
> >> > > At the end, example is added for newer SoCs, which features DE2 and DW
> >> > > HDMI.
> >> > > 
> >> > > Signed-off-by: Jernej Skrabec 
> >> > > ---
> >> > > 
> >> > >  .../bindings/display/sunxi/sun4i-drm.txt   | 188 -
> >> > >  1 file changed, 181 insertions(+), 7 deletions(-)
> >> > > diff --git a/Documentation/devicetree/bindings/display/sunxi/sun4i-drm.txt b/Documentation/devicetree/bindings/display/sunxi/sun4i-drm.txt
> >> > > index 9f073af4c711..3eca258096a5 100644
> >> > > --- a/Documentation/devicetree/bindings/display/sunxi/sun4i-drm.txt
> >> > > +++ b/Documentation/devicetree/bindings/display/sunxi/sun4i-drm.txt
> >> > > @@ -64,6 +64,40 @@ Required properties:
> >> > >  first port should be the input endpoint. The second should be the
> >> > >  output, usually to an HDMI connector.
> >> > > 
> >> > > +DWC HDMI TX Encoder
> >> > > +-
> >> > > +
> >> > > +The HDMI transmitter is a Synopsys DesignWare HDMI 1.4 TX controller IP
> >> > > +with Allwinner's own PHY IP. It supports audio and video outputs and CEC.
> >> > > +
> >> > > +These DT bindings follow the Synopsys DWC HDMI TX bindings defined in
> >> > > +Documentation/devicetree/bindings/display/bridge/dw_hdmi.txt with the
> >> > > +following device-specific properties.
> >> > > +
> >> > > +Required properties:
> >> > > +
> >> > > +  - compatible: value must be one of:
> >> > > +* "allwinner,sun8i-a83t-dw-hdmi"
> >> > > +  - reg: two pairs of base address and size of memory-mapped region, first
> >> > > +for controller and second for PHY
> >> > > +registers.
> >> > 
> >> > Seems like the phy should be a separate node and use the phy binding.
> >> > You can use the phy binding even if you don't use the kernel phy
> >> > framework...
> >> 
> >> Unfortunately, it's not so straightforward. Phy is actually accessed through
> >> I2C implemented in HDMI controller. Second memory region in this case has
> >> small influence on phy. However, it has big influence on controller. For
> >> example, magic number has to be written in one register in second memory
> >> region in order to unlock read access to any register from first memory region
> >> (controller). However, they shouldn't be merged to one region, because first
> >> memory region requires byte access while second memory region can be accessed
> >> per byte or word.
> >> 
> >> To complicate things more, later I want to add support for another SoC which
> >> has same glue layer (unlocking read access, etc.) and uses memory mapped phy
> >> registers in second memory region.
> >> 
> >> I think current binding is the least complicated way to represent this.
> >
> >I agree with Rob here. I did a similar thing for the DSI patches I've
> >sent a few months ago and it turned out to not be that difficult, so
> >I'm sure you can come up with something :)
> 
> In A83T/H3/A64/H5/R40 this part is not purely a PHY.
> It controls the access of main controller's register (e.g. read/write
> lock and register obfuscation). So it should be called a "glue"
> with PHY part (and on A83T seems a pure glue) but not a simple
>  PHY.

It's not so simple. Actually, it has PHY settings on the A83T as well. For example, 
the value at 0x01EF0001 depends on polarity. The value at 0x01EF0002 sets the PHY I2C 
address. Bit 7 at 0x01EF0007 enables/disables the external resistor. This is information 
I discovered/received after I sent the patches, so it's not clearly marked.

Proper memory map (starts at 0x01EE):
0x0 - 0x1 -> DW HDMI controller
0x1 - 0x10010 -> (almost?) Common PHY settings
0x10010 - 0x10020 -> Allwinner 

[PATCH] f2fs: add resgid and resuid to reserve root blocks

2018-01-04 Thread Jaegeuk Kim
This patch adds mount options to reserve some blocks via resgid=%u,resuid=%u.
It only activates with reserve_root=%u.

Signed-off-by: Jaegeuk Kim 
---
 fs/f2fs/f2fs.h  | 26 --
 fs/f2fs/super.c | 46 --
 2 files changed, 68 insertions(+), 4 deletions(-)

diff --git a/fs/f2fs/f2fs.h b/fs/f2fs/f2fs.h
index 4d255aac49bb..e5554b851fd8 100644
--- a/fs/f2fs/f2fs.h
+++ b/fs/f2fs/f2fs.h
@@ -131,6 +131,12 @@ struct f2fs_mount_info {
 #define F2FS_CLEAR_FEATURE(sb, mask)   \
(F2FS_SB(sb)->raw_super->feature &= ~cpu_to_le32(mask))
 
+/*
+ * Default values for user and/or group using reserved blocks
+ */
+#defineF2FS_DEF_RESUID 0
+#defineF2FS_DEF_RESGID 0
+
 /*
  * For checkpoint manager
  */
@@ -,6 +1117,8 @@ struct f2fs_sb_info {
block_t reserved_blocks;/* configurable reserved blocks 
*/
block_t current_reserved_blocks;/* current reserved blocks */
block_t root_reserved_blocks;   /* root reserved blocks */
+   kuid_t s_resuid;/* reserved blocks for uid */
+   kgid_t s_resgid;/* reserved blocks for gid */
 
unsigned int nquota_files;  /* # of quota sysfile */
 
@@ -1563,6 +1571,20 @@ static inline bool f2fs_has_xattr_block(unsigned int ofs)
return ofs == XATTR_NODE_OFFSET;
 }
 
+static inline bool __allow_reserved_blocks(struct f2fs_sb_info *sbi)
+{
+   if (!test_opt(sbi, RESERVE_ROOT))
+   return false;
+   if (capable(CAP_SYS_RESOURCE))
+   return true;
+   if (uid_eq(sbi->s_resuid, current_fsuid()))
+   return true;
+   if (!gid_eq(sbi->s_resgid, GLOBAL_ROOT_GID) &&
+   in_group_p(sbi->s_resgid))
+   return true;
+   return false;
+}
+
 static inline void f2fs_i_blocks_write(struct inode *, block_t, bool, bool);
 static inline int inc_valid_block_count(struct f2fs_sb_info *sbi,
 struct inode *inode, blkcnt_t *count)
@@ -1593,7 +1615,7 @@ static inline int inc_valid_block_count(struct 
f2fs_sb_info *sbi,
avail_user_block_count = sbi->user_block_count -
sbi->current_reserved_blocks;
 
-   if (!(test_opt(sbi, RESERVE_ROOT) && capable(CAP_SYS_RESOURCE)))
+   if (!__allow_reserved_blocks(sbi))
avail_user_block_count -= sbi->root_reserved_blocks;
 
if (unlikely(sbi->total_valid_block_count > avail_user_block_count)) {
@@ -1794,7 +1816,7 @@ static inline int inc_valid_node_count(struct 
f2fs_sb_info *sbi,
valid_block_count = sbi->total_valid_block_count +
sbi->current_reserved_blocks + 1;
 
-   if (!(test_opt(sbi, RESERVE_ROOT) && capable(CAP_SYS_RESOURCE)))
+   if (!__allow_reserved_blocks(sbi))
valid_block_count += sbi->root_reserved_blocks;
 
if (unlikely(valid_block_count > sbi->user_block_count)) {
diff --git a/fs/f2fs/super.c b/fs/f2fs/super.c
index 4904d1644052..ef40bc3d91e8 100644
--- a/fs/f2fs/super.c
+++ b/fs/f2fs/super.c
@@ -108,6 +108,8 @@ enum {
Opt_noinline_data,
Opt_data_flush,
Opt_reserve_root,
+   Opt_resgid,
+   Opt_resuid,
Opt_mode,
Opt_io_size_bits,
Opt_fault_injection,
@@ -159,6 +161,8 @@ static match_table_t f2fs_tokens = {
{Opt_noinline_data, "noinline_data"},
{Opt_data_flush, "data_flush"},
{Opt_reserve_root, "reserve_root=%u"},
+   {Opt_resgid, "resgid=%u"},
+   {Opt_resuid, "resuid=%u"},
{Opt_mode, "mode=%s"},
{Opt_io_size_bits, "io_bits=%u"},
{Opt_fault_injection, "fault_injection=%u"},
@@ -204,6 +208,15 @@ static inline void limit_reserve_root(struct f2fs_sb_info *sbi)
"Reduce reserved blocks for root = %u",
sbi->root_reserved_blocks);
}
+   if (!test_opt(sbi, RESERVE_ROOT) &&
+   (!uid_eq(sbi->s_resuid,
+   make_kuid(&init_user_ns, F2FS_DEF_RESUID)) ||
+   !gid_eq(sbi->s_resgid,
+   make_kgid(&init_user_ns, F2FS_DEF_RESGID))))
+   f2fs_msg(sbi->sb, KERN_INFO,
+   "Ignore s_resuid=%u, s_resgid=%u w/o reserve_root",
+   from_kuid_munged(&init_user_ns, sbi->s_resuid),
+   from_kgid_munged(&init_user_ns, sbi->s_resgid));
 }
 
 static void init_once(void *foo)
@@ -336,6 +349,8 @@ static int parse_options(struct super_block *sb, char *options)
substring_t args[MAX_OPT_ARGS];
char *p, *name;
int arg = 0;
+   kuid_t uid;
+   kgid_t gid;
 #ifdef CONFIG_QUOTA
int ret;
 #endif
@@ -515,6 +530,28 @@ static int parse_options(struct super_block *sb, char 

Re: [PATCH 05/23] x86, kaiser: unmap kernel from userspace page tables (core patch)

2018-01-04 Thread Yisheng Xie
Hi Dave,

On 2018/1/5 13:18, Dave Hansen wrote:
> On 01/04/2018 08:16 PM, Yisheng Xie wrote:
>>> === Page Table Poisoning ===
>>>
>>> KAISER has two copies of the page tables: one for the kernel and
>>> one for when running in userspace.  
>>
>> So, we have 2 page table, thinking about this case:
>> If _ONE_ process includes _TWO_ threads, one run in user space, the other
>> run in kernel, they can run in one core with Hyper-Threading, right?
> 
> Yes.
> 
>> So both userspace and kernel space is valid, right? And for one core
>> with Hyper-Threading, they may share TLB, so the timing problem
>> described in the paper may still exist?
> 
> No.  The TLB is managed per logical CPU (hyperthread), as is the CR3
> register that points to the page tables.  Two threads running the same
> process might use the same CR3 _value_, but that does not mean they
> share TLB entries.

Got it, and thanks for the explanation.

BTW, we have just reported a bug caused by kaiser[1], which looks like
it is caused by SMEP. Could you please take a look?

[1] https://lkml.org/lkml/2018/1/5/3

Thanks
Yisheng

> 
> One thread *can* be in the kernel with the kernel page tables while the
> other is in userspace with the user page tables active.  They will even
> use a different PCID/ASID for the same page tables normally.
> 
>> Can this case still be protected by KAISER?
> 
> Yes.
> 
> .
> 



[PATCH v2] mm/fadvise: discard partial page if endbyte is also EOF

2018-01-04 Thread 夷则(Caspar)
From: shidao.ytt 

During our recent testing with fadvise(FADV_DONTNEED), we found that if
the given offset/length is not page-aligned, the last page is not
discarded. The tool we use is vmtouch (https://hoytech.com/vmtouch/). We
map a 10KB file into memory and then run this tool to evict
the whole file mapping, but the last page always remains
in memory:

$./vmtouch -e test_10K
   Files: 1
 Directories: 0
   Evicted Pages: 3 (12K)
 Elapsed: 2.1e-05 seconds

$./vmtouch test_10K
   Files: 1
 Directories: 0
  Resident Pages: 1/3  4K/12K  33.3%
 Elapsed: 5.5e-05 seconds

However, when we test with an older kernel, say 3.10, this problem is
gone, so we wondered whether this is a regression:

$./vmtouch -e test_10K
   Files: 1
 Directories: 0
   Evicted Pages: 3 (12K)
 Elapsed: 8.2e-05 seconds

$./vmtouch test_10K
   Files: 1
 Directories: 0
  Resident Pages: 0/3  0/12K  0%  <-- partial page also discarded
 Elapsed: 5e-05 seconds

After digging a little into this problem, we found that it does not seem
to be a regression. Not discarding the partial page is likely on purpose,
according to commit 441c228f817f7 ("mm: fadvise: document the
fadvise(FADV_DONTNEED) behaviour for partial pages") written by
Mel Gorman. He explained why partial pages should be preserved instead
of being discarded when using fadvise(FADV_DONTNEED). However, the
interesting part is that the actual code did NOT behave as
described: the partial page was still discarded anyway, due to a
miscalculation of the `end_index' passed to invalidate_mapping_pages().
This mistake was not fixed until recently, which is why we failed to
reproduce our problem on old kernels. The fix was made in commit
18aba41cbf ("mm/fadvise.c: do not discard partial pages with
POSIX_FADV_DONTNEED") by Oleg Drokin.

Back to the original test: there is a special case where, if the
page-unaligned `endbyte' is also the end of the file, there is no need
at all to preserve the last partial page, since no one else will use
the rest of it. It should be safe enough to discard the whole page, so
this patch adds an EOF check.
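
To make the index arithmetic concrete, here is a small userspace sketch of
how the range of pages to drop could be computed, with and without the EOF
check. It is not the kernel code: dontneed_range(), struct evict_range and
the SKETCH_* constants are made-up names, a 4 KiB page size is assumed, and
the len == 0 ("to end of file") case of the real syscall is not modelled.

```c
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12	/* assumption: 4 KiB pages */
#define SKETCH_PAGE_SIZE  (1ULL << SKETCH_PAGE_SHIFT)

/* Hypothetical result type: pages first..last (inclusive) would be dropped. */
struct evict_range {
	bool any;
	uint64_t first;
	uint64_t last;
};

/*
 * Model of the page range chosen for FADV_DONTNEED on [offset, offset + len).
 * With eof_check set, a partial last page is still dropped when the request
 * ends exactly at the last byte of the file.
 */
static struct evict_range dontneed_range(uint64_t offset, uint64_t len,
					 uint64_t file_size, bool eof_check)
{
	struct evict_range r = { false, 0, 0 };
	uint64_t endbyte, start_index, end_index;
	bool partial_tail, tail_is_eof;

	if (len == 0)
		return r;	/* not modelled in this sketch */

	endbyte = offset + len - 1;	/* inclusive last byte */

	/* Round the start up: a partially covered first page is kept. */
	start_index = (offset + SKETCH_PAGE_SIZE - 1) >> SKETCH_PAGE_SHIFT;
	end_index = endbyte >> SKETCH_PAGE_SHIFT;

	partial_tail = (endbyte & (SKETCH_PAGE_SIZE - 1)) != SKETCH_PAGE_SIZE - 1;
	tail_is_eof = eof_check && file_size > 0 && endbyte == file_size - 1;

	if (partial_tail && !tail_is_eof) {
		if (end_index == 0)
			return r;	/* only a partial first page: keep it */
		end_index--;		/* keep the partial last page */
	}

	if (end_index < start_index)
		return r;

	r.any = true;
	r.first = start_index;
	r.last = end_index;
	return r;
}
```

For a 10240-byte file and a request covering the whole file, this sketch
yields pages 0..1 without the EOF check and pages 0..2 with it; for a
1024-byte file it yields nothing without the check and page 0 with it.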

We also found a possible real-world issue in the mainline kernel. Consider
this scenario: a userspace backup application wants to back up a huge
number of small files (<4k) at once, and the developer might (I guess)
want to use fadvise(FADV_DONTNEED) to save memory. However, FADV_DONTNEED
has no real effect, since the only page mapped is a partial page and the
kernel preserves it. Our patch also fixes this problem: since we know the
endbyte is EOF, we discard the page.

Here is a simple reproducer to verify each scenario described above:

  test_fadvise.c
  ==
  #include <fcntl.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>
  #include <sys/mman.h>
  #include <sys/stat.h>
  #include <unistd.h>

  /* usage: ./test_fadvise <file> <size-in-bytes> */
  int main(int argc, char **argv)
  {
          int i, fd, ret, len;
          struct stat buf;
          void *addr;
          unsigned char *vec;
          char *strbuf;
          ssize_t pagesize = getpagesize();
          ssize_t filesize;

          if (argc < 3)
                  return -1;

          fd = open(argv[1], O_RDWR|O_CREAT, S_IRUSR|S_IWUSR);
          if (fd < 0)
                  return -1;
          filesize = strtoul(argv[2], NULL, 10);

          /* fill the file with filesize bytes and flush them to disk */
          strbuf = malloc(filesize);
          memset(strbuf, 42, filesize);
          write(fd, strbuf, filesize);
          free(strbuf);
          fsync(fd);

          /* number of pages covering the file, rounded up */
          len = (filesize + pagesize - 1) / pagesize;
          printf("length of pages: %d\n", len);

          addr = mmap(NULL, filesize, PROT_READ, MAP_SHARED, fd, 0);
          if (addr == MAP_FAILED)
                  return -1;

          /* posix_fadvise() returns an error number on failure, not -1 */
          ret = posix_fadvise(fd, 0, filesize, POSIX_FADV_DONTNEED);
          if (ret != 0)
                  return -1;

          vec = malloc(len);
          ret = mincore(addr, filesize, (void *)vec);
          if (ret < 0)
                  return -1;

          /* bit 0 of each entry is set if the page is still resident */
          for (i = 0; i < len; i++)
                  printf("pages[%d]: %x\n", i, vec[i] & 0x1);

          free(vec);
          close(fd);

          return 0;
  }
  ==

Test 1: running on kernel with commit 18aba41cbf reverted:

[root@caspar ~]# uname -r
4.15.0-rc6.revert+
[root@caspar ~]# ./test_fadvise file1 1024
length of pages: 1
pages[0]: 0    # <-- partial page discarded
[root@caspar ~]# ./test_fadvise file2 8192
length of pages: 2
pages[0]: 0
pages[1]: 0
[root@caspar ~]# ./test_fadvise file3 10240
length of pages: 3
pages[0]: 0
pages[1]: 0
pages[2]: 0    # <-- partial page discarded

Test 2: running on mainline kernel:

[root@caspar ~]# uname -r
4.15.0-rc6+
[root@caspar ~]# ./test_fadvise test1 1024
length of pages: 1
pages[0]: 1    # <-- partial and the only page not discarded
[root@caspar ~]# ./test_fadvise test2 8192
length of pages: 2
pages[0]: 0
pages[1]: 0
[root@caspar ~]# ./test_fadvise test3 10240
length of pages: 3
pages[0]: 0
pages[1]: 0
pages[2]: 1    # <-- partial page not discarded

Test 3: running on 

Re: [PATCH 0/7] IBRS patch series

2018-01-04 Thread Florian Weimer
* Linus Torvalds:

> On Thu, Jan 4, 2018 at 9:56 AM, Tim Chen  wrote:
>>
>> Speculation on Skylake and later requires these patches ("dynamic IBRS")
>> be used instead of retpoline[1].
>
> Can somebody explain this part?
>
> I was assuming that retpoline would work around this issue on all uarchs.
>
> This seems to say "retpoline does nothing on Skylake+"

Retpoline also looks incompatible with CET, so future Intel CPUs will
eventually need a different approach anyway.


Re: [PATCH] driver: input :touchscreen :Modify Raydium Firmware update input file

2018-01-04 Thread Dmitry Torokhov
Hi Jeffrey,

On Thu, Dec 21, 2017 at 09:51:22PM +0800, jeffrey.lin wrote:
> Modify update firmware to accept alternative file name
> 
> Signed-off-by: jeffrey.lin 
> ---
>  drivers/input/touchscreen/raydium_i2c_ts.c | 11 ---
>  1 file changed, 8 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/input/touchscreen/raydium_i2c_ts.c 
> b/drivers/input/touchscreen/raydium_i2c_ts.c
> index a99fb5cac5a0..439d43c3519c 100644
> --- a/drivers/input/touchscreen/raydium_i2c_ts.c
> +++ b/drivers/input/touchscreen/raydium_i2c_ts.c
> @@ -130,6 +130,7 @@ struct raydium_data {
>   struct gpio_desc *reset_gpio;
>  
>   struct raydium_info info;
> + char fw_file[64];

You do not really need to keep the firmware name in driver data, just
use a temporary in raydium_i2c_fw_update().

>  
>   struct mutex sysfs_mutex;
>  
> @@ -752,12 +753,16 @@ static int raydium_i2c_fw_update(struct raydium_data 
> *ts)
>  {
>   struct i2c_client *client = ts->client;
>   const struct firmware *fw = NULL;
> - const char *fw_file = "raydium.fw";
>   int error;
>  
> - error = request_firmware(&fw, fw_file, &client->dev);
> + /* Firmware name */
> + snprintf(ts->fw_file, sizeof(ts->fw_file),
> + "raydium_%x.fw", ts->info.hw_ver);

hw_ver is LE32, you need to convert it to CPU endianness before using.
Also it would be better if we used the same encoding for the hardware
version as the one that we use when we output it in sysfs. It makes
userspace life a bit easier I think.
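
As a rough userspace illustration of the two points above (not driver
code): the version is decoded from little-endian bytes first, and then
formatted with the same "%#04x" conversion that the revised patch in this
message uses. sketch_le32_decode() and sketch_fw_name() are made-up helper
names for this sketch only.

```c
#include <stdint.h>
#include <stdio.h>

/* Decode a 32-bit little-endian value, independent of host endianness. */
static uint32_t sketch_le32_decode(const unsigned char b[4])
{
	return (uint32_t)b[0] | ((uint32_t)b[1] << 8) |
	       ((uint32_t)b[2] << 16) | ((uint32_t)b[3] << 24);
}

/*
 * Build the firmware file name from the raw little-endian hardware version.
 * Returns what snprintf() returns: the length the full name needs.
 */
static int sketch_fw_name(char *buf, size_t len,
			  const unsigned char hw_ver_le[4])
{
	return snprintf(buf, len, "raydium_%#04x.fw",
			(unsigned int)sketch_le32_decode(hw_ver_le));
}
```

With the bytes 21 00 00 00 this produces "raydium_0x21.fw" on both little-
and big-endian hosts, which is the point of converting before formatting.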

How about the version of the patch below?

Thanks.

-- 
Dmitry


Input: raydium_i2c_ts - include hardware version in firmware name

From: Jeffrey Lin 

Add hardware version to the firmware file name to handle scenarios where
a single system image supports a variety of devices.

Signed-off-by: Jeffrey Lin 
Signed-off-by: Dmitry Torokhov 
---
 drivers/input/touchscreen/raydium_i2c_ts.c |   14 --
 1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/drivers/input/touchscreen/raydium_i2c_ts.c b/drivers/input/touchscreen/raydium_i2c_ts.c
index 100538d64fff..d1c09e6a2cb6 100644
--- a/drivers/input/touchscreen/raydium_i2c_ts.c
+++ b/drivers/input/touchscreen/raydium_i2c_ts.c
@@ -752,13 +752,20 @@ static int raydium_i2c_fw_update(struct raydium_data *ts)
 {
struct i2c_client *client = ts->client;
const struct firmware *fw = NULL;
-   const char *fw_file = "raydium.fw";
+   char *fw_file;
int error;
 
+   fw_file = kasprintf(GFP_KERNEL, "raydium_%#04x.fw",
+   le32_to_cpu(ts->info.hw_ver));
+   if (!fw_file)
+   return -ENOMEM;
+
+   dev_dbg(&client->dev, "firmware name: %s\n", fw_file);
+
error = request_firmware(&fw, fw_file, &client->dev);
if (error) {
dev_err(&client->dev, "Unable to open firmware %s\n", fw_file);
-   return error;
+   goto out_free_fw_file;
}
 
disable_irq(client->irq);
@@ -787,6 +794,9 @@ static int raydium_i2c_fw_update(struct raydium_data *ts)
 
release_firmware(fw);
 
+out_free_fw_file:
+   kfree(fw_file);
+
return error;
 }
 


linux-next: Tree for Jan 5

2018-01-04 Thread Stephen Rothwell
Hi all,

Changes since 20180104:

The drm tree gained a conflict against the drm-intel-fixes tree.

The akpm-current tree gained a build failure for which I applied a patch.

Non-merge commits (relative to Linus' tree): 6981
 7369 files changed, 288333 insertions(+), 202735 deletions(-)



I have created today's linux-next tree at
git://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git
(patches at http://www.kernel.org/pub/linux/kernel/next/ ).  If you
are tracking the linux-next tree using git, you should not use "git pull"
to do so as that will try to merge the new linux-next release with the
old one.  You should use "git fetch" and checkout or reset to the new
master.

You can see which trees have been included by looking in the Next/Trees
file in the source.  There are also quilt-import.log and merge.log
files in the Next directory.  Between each merge, the tree was built
with a ppc64_defconfig for powerpc, an allmodconfig for x86_64, a
multi_v7_defconfig for arm and a native build of tools/perf. After
the final fixups (if any), I do an x86_64 modules_install followed by
builds for x86_64 allnoconfig, powerpc allnoconfig (32 and 64 bit),
ppc44x_defconfig, allyesconfig and pseries_le_defconfig and i386, sparc
and sparc64 defconfig. And finally, a simple boot test of the powerpc
pseries_le_defconfig kernel in qemu (with and without kvm enabled).

Below is a summary of the state of the merge.

I am currently merging 255 trees (counting Linus' and 43 trees of bug
fix patches pending for the current merge release).

Stats about the size of the tree over time can be seen at
http://neuling.org/linux-next-size.html .

Status of my local build tests will be at
http://kisskb.ellerman.id.au/linux-next .  If maintainers want to give
advice about cross compilers/configs that work, we are always open to add
more builds.

Thanks to Randy Dunlap for doing many randconfig builds.  And to Paul
Gortmaker for triage and bug fixes.

-- 
Cheers,
Stephen Rothwell

$ git checkout master
$ git reset --hard stable
Merging origin/master (e1915c8195b3 Merge tag 'armsoc-fixes' of 
git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc)
Merging fixes/master (820bf5c419e4 Merge tag 'scsi-fixes' of 
git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi)
Merging kbuild-current/fixes (cfe17c9bbe6a kbuild: move cc-option and 
cc-disable-warning after incl. arch Makefile)
Merging arc-current/for-curr (af1be2e21203 ARC: handle gcc generated 
__builtin_trap for older compiler)
Merging arm-current/fixes (36b0cb84ee85 ARM: 8731/1: Fix 
csum_partial_copy_from_user() stack mismatch)
Merging m68k-current/for-linus (5e387199c17c m68k/defconfig: Update defconfigs 
for v4.14-rc7)
Merging metag-fixes/fixes (b884a190afce metag/usercopy: Add missing fixups)
Merging powerpc-fixes/fixes (ecb101aed861 powerpc/mm: Fix SEGV on mapped region 
to return SEGV_ACCERR)
Merging sparc/master (59585b4be9ae sparc64: repair calling incorrect hweight 
function from stubs)
Merging fscrypt-current/for-stable (42d97eb0ade3 fscrypt: fix renaming and 
linking special files)
Merging net/master (6926e041a892 uapi/if_ether.h: prevent redefinition of 
struct ethhdr)
Merging bpf/master (820d1d5eba5e Merge branch '40GbE' of 
git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/net-queue)
Merging ipsec/master (2f10a61cee8f xfrm: fix rcu usage in xfrm_get_type_offload)
Merging netfilter/master (8bea728dce89 netfilter: nf_tables: fix potential 
NULL-ptr deref in nf_tables_dump_obj_done())
Merging ipvs/master (f7fb77fc1235 netfilter: nft_compat: check extension hook 
mask only if set)
Merging wireless-drivers/master (a41886f56b7b Merge tag 
'iwlwifi-for-kalle-2017-12-05' of 
git://git.kernel.org/pub/scm/linux/kernel/git/iwlwifi/iwlwifi-fixes)
Merging mac80211/master (736a80bbfda7 mac80211: mesh: drop frames appearing to 
be from us)
Merging sound-current/for-linus (db6f09448550 ALSA: pcm: Workaround for weird 
PulseAudio behavior on rewind error)
Merging pci-current/for-linus (1291a0d5049d Linux 4.15-rc4)
Merging driver-core.current/driver-core-linus (30a7acd57389 Linux 4.15-rc6)
Merging tty.current/tty-linus (30a7acd57389 Linux 4.15-rc6)
Merging usb.current/usb-linus (5fd77a3a0e40 usbip: vudc_tx: fix 
v_send_ret_submit() vulnerability to null xfer buffer)
Merging usb-gadget-fixes/fixes (1291a0d5049d Linux 4.15-rc4)
Merging usb-serial-fixes/usb-linus (d14ac576d10f USB: serial: cp210x: add new 
device ID ELV ALC 8xxx)
Merging usb-chipidea-fixes/ci-for-usb-stable (964728f9f407 USB: chipidea: msm: 
fix ulpi-node lookup)
Merging phy/fixes (2b88212c4cc6 phy: rcar-gen3-usb2: select USB_COMMON)
Merging staging.current/staging-linus (30a7acd57389 Linux 4.15-rc6)
Merging char-misc.current/char-misc-linus (06e7e776ca4d Bluetooth: Prevent 
stack info leak from the EFS element.)
Merging input-current/for-linus (8b7e9d9e2d8b Input: hideep - fix compile error 
due to missing i

Merging usb-gadget-fixes/fixes (1291a0d5049d Linux 4.15-rc4)
Merging usb-serial-fixes/usb-linus (d14ac576d10f USB: serial: cp210x: add new 
device ID ELV ALC 8xxx)
Merging usb-chipidea-fixes/ci-for-usb-stable (964728f9f407 USB: chipidea: msm: 
fix ulpi-node lookup)
Merging phy/fixes (2b88212c4cc6 phy: rcar-gen3-usb2: select USB_COMMON)
Merging staging.current/staging-linus (30a7acd57389 Linux 4.15-rc6)
Merging char-misc.current/char-misc-linus (06e7e776ca4d Bluetooth: Prevent 
stack info leak from the EFS element.)
Merging input-current/for-linus (8b7e9d9e2d8b Input: hideep - fix compile error 
due to missing i

[PATCH] [v3] x86/doc: add PTI description

2018-01-04 Thread Dave Hansen

Changes from v2:
 * Update some wording
 * Minor typo and grammar fixes
 * Further clarify what INVPCID is.

Changes from v1:
 * update kernel-parameters.txt to clarify that the pti= option
   is not just for disabling.  Also describe what 'pti=auto' does
   and why
 * Add a note about the presence of NX in the user portion of the
   kernel page tables
 * Clarify _additional_ 4k of PGD space
 * Add a note about the runtime overhead of PCID without INVPCID

---

From: Dave Hansen 

Add some details about how PTI works, what some of the downsides
are, and how to debug it when things go wrong.

Also document the kernel parameter: 'nopti'.

Signed-off-by: Dave Hansen 
Reviewed-by: Kees Cook 
Cc: Moritz Lipp 
Cc: Daniel Gruss 
Cc: Michael Schwarz 
Cc: Richard Fellner 
Cc: Andy Lutomirski 
Cc: Linus Torvalds 
Cc: Hugh Dickins 
Cc: x...@kernel.org
---

 b/Documentation/admin-guide/kernel-parameters.txt |   21 +-
 b/Documentation/x86/pti.txt   |  187 ++
 2 files changed, 201 insertions(+), 7 deletions(-)

diff -puN Documentation/admin-guide/kernel-parameters.txt~kpti-doc 
Documentation/admin-guide/kernel-parameters.txt
--- a/Documentation/admin-guide/kernel-parameters.txt~kpti-doc  2018-01-03 
17:04:23.255028797 -0800
+++ b/Documentation/admin-guide/kernel-parameters.txt   2018-01-04 
21:30:58.402773426 -0800
@@ -2712,8 +2712,6 @@
steal time is computed, but won't influence scheduler
behaviour
 
-   nopti   [X86-64] Disable kernel page table isolation
-
nolapic [X86-32,APIC] Do not enable or use the local APIC.
 
nolapic_timer   [X86-32,APIC] Do not use the local APIC timer.
@@ -3288,11 +3286,20 @@
pt. [PARIDE]
See Documentation/blockdev/paride.txt.
 
-   pti=[X86_64]
-   Control user/kernel address space isolation:
-   on - enable
-   off - disable
-   auto - default setting
+   pti=[X86_64] Control Page Table Isolation of user and
+   kernel address spaces.  Disabling this feature
+   removes hardening, but improves performance of
+   system calls and interrupts.
+
+   on   - unconditionally enable
+   off  - unconditionally disable
+   auto - kernel detects whether your CPU model is
+  vulnerable to issues that PTI mitigates
+
+   Not specifying this option is equivalent to pti=auto.
+
+   nopti   [X86_64]
+   Equivalent to pti=off
 
pty.legacy_count=
[KNL] Number of legacy pty's. Overwrites compiled-in
diff -puN /dev/null Documentation/x86/pti.txt
--- /dev/null   2017-12-15 13:48:30.454245127 -0800
+++ b/Documentation/x86/pti.txt 2018-01-04 21:38:28.826772303 -0800
@@ -0,0 +1,187 @@
+Overview
+========
+
+Page Table Isolation (pti, previously known as KAISER[1]) is a
+countermeasure against attacks on the shared user/kernel address
+space such as the "Meltdown" approach[2].
+
+To mitigate this class of attacks, we create an independent set of
+page tables for use only when running userspace applications.  When
+the kernel is entered via syscalls, interrupts or exceptions, the
+page tables are switched to the full "kernel" copy.  When the system
+switches back to user mode, the user copy is used again.
+
+The userspace page tables contain only a minimal amount of kernel
+data: only what is needed to enter/exit the kernel such as the
+entry/exit functions themselves and the interrupt descriptor table
+(IDT).  There are a few strictly unnecessary things that get mapped
+such as the first C function when entering an interrupt (see
+comments in pti.c).
+
+This approach helps to ensure that side-channel attacks leveraging
+the paging structures do not function when PTI is enabled.  It can be
+enabled by setting CONFIG_PAGE_TABLE_ISOLATION=y at compile time.
+Once enabled at compile-time, it can be disabled at boot with the
+'nopti' or 'pti=' kernel parameters (see kernel-parameters.txt).
+
+Page Table Management
+=====================
+
+When PTI is enabled, the kernel manages two sets of page tables.
+The first set is very similar to the single set which is present in
+kernels without PTI.  This includes a complete mapping of userspace
+that the kernel can use for things like copy_to_user().
+
+Although _complete_, the user portion of the kernel page tables is
+crippled by setting the NX bit in the top level.  This ensures
+that any missed kernel->user CR3 switch will immediately crash
+userspace upon executing its first instruction.
+
+The userspace page tables map only the kernel data needed to enter
+and exit the kernel.  This data is entirely contained in the 'struct
+cpu_entry_area' structure which is 

[RFC] selftests/x86: Add test_vsyscall

2018-01-04 Thread Andy Lutomirski
This tests that the vsyscall entries do what they're expected to do.
It also confirms that attempts to read the vsyscall page behave as
expected.

If changes are made to the vsyscall code or its memory map handling,
running this test in all three of vsyscall=none, vsyscall=emulate,
and vsyscall=native is helpful.

(Because it's easy, this also compares the vsyscall results to their
 vDSO equivalents.)

Signed-off-by: Andy Lutomirski 
---

It's RFC because I want to re-read it myself first.  It's also missing
a test that will reliably make sure that vsyscall=none prevents use of
vsyscalls.

Also, I want to add vsyscall=emulate_noread that makes the vsyscall
page be --x.  And I want to add a per-process option to turn off
vsyscalls.

 tools/testing/selftests/x86/Makefile|   2 +-
 tools/testing/selftests/x86/test_vsyscall.c | 435 
 2 files changed, 436 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/x86/test_vsyscall.c

diff --git a/tools/testing/selftests/x86/Makefile 
b/tools/testing/selftests/x86/Makefile
index 939a337128db..5d4f10ac2af2 100644
--- a/tools/testing/selftests/x86/Makefile
+++ b/tools/testing/selftests/x86/Makefile
@@ -7,7 +7,7 @@ include ../lib.mk
 
 TARGETS_C_BOTHBITS := single_step_syscall sysret_ss_attrs syscall_nt 
ptrace_syscall test_mremap_vdso \
check_initial_reg_state sigreturn ldt_gdt iopl 
mpx-mini-test ioperm \
-   protection_keys test_vdso
+   protection_keys test_vdso test_vsyscall
 TARGETS_C_32BIT_ONLY := entry_from_vm86 syscall_arg_fault test_syscall_vdso 
unwind_vdso \
test_FCMOV test_FCOMI test_FISTTP \
vdso_restorer
diff --git a/tools/testing/selftests/x86/test_vsyscall.c 
b/tools/testing/selftests/x86/test_vsyscall.c
new file mode 100644
index 000000000000..44d873d71b85
--- /dev/null
+++ b/tools/testing/selftests/x86/test_vsyscall.c
@@ -0,0 +1,435 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#define _GNU_SOURCE
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#ifdef __x86_64__
+# define VSYS(x) (x)
+#else
+# define VSYS(x) 0
+#endif
+
+#ifndef SYS_getcpu
+# ifdef __x86_64__
+#  define SYS_getcpu 309
+# else
+#  define SYS_getcpu 318
+# endif
+#endif
+
+static void sethandler(int sig, void (*handler)(int, siginfo_t *, void *),
+  int flags)
+{
+   struct sigaction sa;
+   memset(&sa, 0, sizeof(sa));
+   sa.sa_sigaction = handler;
+   sa.sa_flags = SA_SIGINFO | flags;
+   sigemptyset(&sa.sa_mask);
+   if (sigaction(sig, &sa, 0))
+   err(1, "sigaction");
+}
+
+/* vsyscalls and vDSO */
+bool should_read_vsyscall = false;
+
+typedef long (*gtod_t)(struct timeval *tv, struct timezone *tz);
+gtod_t vgtod = (gtod_t)VSYS(0xffffffffff600000);
+gtod_t vdso_gtod;
+
+typedef int (*vgettime_t)(clockid_t, struct timespec *);
+vgettime_t vdso_gettime;
+
+typedef long (*time_func_t)(time_t *t);
+time_func_t vtime = (time_func_t)VSYS(0xffffffffff600400);
+time_func_t vdso_time;
+
+typedef long (*getcpu_t)(unsigned *, unsigned *, void *);
+getcpu_t vgetcpu = (getcpu_t)VSYS(0xffffffffff600800);
+getcpu_t vdso_getcpu;
+
+static void init_vdso(void)
+{
+   void *vdso = dlopen("linux-vdso.so.1", RTLD_LAZY | RTLD_LOCAL | 
RTLD_NOLOAD);
+   if (!vdso)
+   vdso = dlopen("linux-gate.so.1", RTLD_LAZY | RTLD_LOCAL | 
RTLD_NOLOAD);
+   if (!vdso) {
+   printf("Warning: failed to find vDSO\n");
+   return;
+   }
+
+   vdso_gtod = (gtod_t)dlsym(vdso, "__vdso_gettimeofday");
+   if (!vdso_gtod)
+   printf("Warning: failed to find gettimeofday in vDSO\n");
+
+   vdso_gettime = (vgettime_t)dlsym(vdso, "__vdso_clock_gettime");
+   if (!vdso_gettime)
+   printf("Warning: failed to find clock_gettime in vDSO\n");
+
+   vdso_time = (time_func_t)dlsym(vdso, "__vdso_time");
+   if (!vdso_time)
+   printf("Warning: failed to find time in vDSO\n");
+
+   vdso_getcpu = (getcpu_t)dlsym(vdso, "__vdso_getcpu");
+   if (!vdso_getcpu)
+   printf("Warning: failed to find getcpu in vDSO\n");
+}
+
+static int init_vsys(void)
+{
+#ifdef __x86_64__
+   int nerrs = 0;
+   FILE *maps;
+   char line[128];
+   bool found = false;
+
+   maps = fopen("/proc/self/maps", "r");
+   if (!maps) {
+   printf("[WARN]\tCould not open /proc/self/maps -- assuming 
vsyscall is r-x\n");
+   should_read_vsyscall = true;
+   return 0;
+   }
+
+   while (fgets(line, sizeof(line), maps)) {
+   char r, x;
+   void *start, *end;
+   char name[128];
+   if (sscanf(line, "%p-%p %c-%cp %*x %*x:%*x %*u %s",
+  &start, &end, &r, &x, name) != 5)
+   

Re: [PATCH] nvme-pci: fix the timeout case when reset is ongoing

2018-01-04 Thread jianchao.wang
Hi Christoph

Many thanks for your kind response.

On 01/04/2018 06:35 PM, Christoph Hellwig wrote:
> On Wed, Jan 03, 2018 at 06:31:44AM +0800, Jianchao Wang wrote:
>> NVME_CTRL_RESETTING used to indicate the range of nvme initializing
>> strictly in fd634f41(nvme: merge probe_work and reset_work), but it
>> is not now. The NVME_CTRL_RESETTING is set before queue the
>> reset_work, there could be a big gap before the reset work handles
>> the outstanding requests. So when the NVME_CTRL_RESETTING is set,
>> nvme_timeout will not only meet the admin requests from the
>> initializing procedure, but also the IO and admin requests from
>> previous work before nvme_dev_disable is invoked.
>>
>> To fix it, introduce a flag NVME_DEV_FLAG_INITIALIZING to mark the
>> range of initializing. When this flag is not set, handle the expired
>> requests as nvme_cancel_request. Otherwise, the requests should be
>> from the initializing procedure. Handle them as before. Because the
>> nvme_reset_work will see the error and disable the dev itself, so
>> discard the nvme_dev_disable here.
> 
> Instead of a parallel set of states we'll need to split
> NVME_CTRL_RESET into NVME_CTRL_RESET_SCHEDULED and NVME_CTRL_RESETTING.
> 
> And if my memory doesn't fail me we were already considering that a while
> ago.
> 
Yes, it is indeed more reasonable to split the current NVME_CTRL_RESETTING
into two states, but the nvme_dev_disable() call in nvme_reset_work() should
be the boundary. After that, all the in-flight requests have been requeued and
the request queue is quiesced, so the nvme driver is in a clean state. So the
new state may be something like NEW_CTRL_RESET_PREPARE. :)

Thanks
Jianchao 
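
The split discussed above can be sketched as a small decision function. This is only an illustration under stated assumptions: every name below (the `SKETCH_*` states, `sketch_on_timeout()`) is hypothetical and is not the driver's actual enum or timeout handler. It encodes the idea that a request timing out before nvme_dev_disable() has run is stale and should simply be cancelled, while one timing out afterwards must come from the initializing procedure.

```c
/*
 * Hypothetical controller states for illustration only; these are not
 * the values of the real enum nvme_ctrl_state.
 */
enum sketch_ctrl_state {
	SKETCH_LIVE,
	SKETCH_RESET_PREPARE,	/* reset queued, nvme_dev_disable() not done yet */
	SKETCH_RESETTING,	/* old requests requeued, queues quiesced */
};

/* Hypothetical outcomes of a request timeout. */
enum sketch_timeout_result {
	SKETCH_NORMAL_HANDLING,	/* live controller: outside this thread's scope */
	SKETCH_CANCEL_REQUEST,	/* stale request from before the reset */
	SKETCH_FAIL_INIT,	/* request issued by the initializing procedure */
};

/* Decide how a timed-out request is treated in each state. */
static enum sketch_timeout_result
sketch_on_timeout(enum sketch_ctrl_state state)
{
	switch (state) {
	case SKETCH_RESET_PREPARE:
		/* Still before the nvme_dev_disable() boundary. */
		return SKETCH_CANCEL_REQUEST;
	case SKETCH_RESETTING:
		/* Past the boundary: let the reset work see the error. */
		return SKETCH_FAIL_INIT;
	default:
		return SKETCH_NORMAL_HANDLING;
	}
}
```

With a state per phase, the timeout path needs no separate flag to tell the two cases apart.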


Re: [PATCH 4.4 00/37] 4.4.110-stable review

2018-01-04 Thread Andy Lutomirski
On Thu, Jan 4, 2018 at 12:43 PM, Andy Lutomirski  wrote:
>
>> On Jan 4, 2018, at 12:29 PM, Linus Torvalds  
>> wrote:
>>
>>> On Thu, Jan 4, 2018 at 12:16 PM, Thomas Voegtle  wrote:
>>>
>>> Attached a screenshot.
>>> Is that useful? Are there some debug options I can add?
>>
>> Not much of an oops, because the SIGSEGV happens in user space. The
>> only reason you get any kernel stack printout at all is because 'init'
>> dying will make the kernel print that out.
>>
>> The segfault address for init looks like the fixmap area to me (first
>> byte in the last page of the fixmap?). "Error 5" means that it's a
>> user-space read that got a protection fault. So it's not an LDT or GDT
>> update or anything like that, it's a normal access from user space (or
>> a qemu emulation bug, but that sounds unlikely).
>>
>> Is that the vsyscall page?
>>
>> Adding Luto to the participants. I think he noticed one of the
>> vsyscall patches missing earlier in the 4.9 series. Maybe the 4.4
>> series had something similar..
>>
>
> That's almost certainly it.
>
> I'll try to find some time today or tomorrow to add a proper selftest.
>

Give this a shot:

https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/commit/?h=x86/pti&id=17c5ebeb2e00879b0af1a9c32bf37ecdd9b9b31b

Boot with each of vsyscall=none, vsyscall=native, and vsyscall=emulate
and run both the 32-bit and 64-bit variants of that test.  All six
combinations should pass.  But I bet they don't on 4.4.
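
Before running the test under each boot option, it may help to check what the running kernel exposes for the vsyscall page. A minimal sketch, assuming the usual /proc/<pid>/maps line layout; the helper name `vsyscall_perms()` is invented here and is not part of the selftest:

```c
#include <stdio.h>
#include <string.h>

/*
 * Illustrative helper: if `line` is a /proc/<pid>/maps entry naming the
 * [vsyscall] page, copy its permission field (for example "r-xp") into
 * perms and return 1.  Return 0 for any other line; perms is only
 * meaningful when 1 is returned.
 */
static int vsyscall_perms(const char *line, char perms[5])
{
	unsigned long long start, end;
	char name[64];

	/* Fields: start-end perms offset dev inode pathname */
	if (sscanf(line, "%llx-%llx %4s %*s %*s %*s %63s",
		   &start, &end, perms, name) != 4)
		return 0;	/* no pathname, or not a maps line */

	return strcmp(name, "[vsyscall]") == 0;
}
```

A caller would read /proc/self/maps line by line with fgets() and pass each line to the helper.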


Re: [PATCH] [v2] x86/doc: add PTI description

2018-01-04 Thread Dave Hansen
On 01/04/2018 05:43 PM, Hector Martin 'marcan' wrote:
> On 2018-01-05 09:24, Dave Hansen wrote:
>> +Not specifying this option nothing is equivalent to
>> +pti=auto.
> 
> -nothing

Sure, will fix.
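
The documented semantics (no option means pti=auto, and nopti is equivalent to pti=off) can be sketched as a small command-line parser. This is an illustration only, not the kernel's parsing code: the names `sketch_pti_mode` and `sketch_parse_pti()` are hypothetical, and letting the last matching option win is an assumption of this sketch that the patch does not specify.

```c
#include <string.h>

enum sketch_pti_mode { SKETCH_PTI_AUTO, SKETCH_PTI_ON, SKETCH_PTI_OFF };

/* Return 1 if the token of length len at tok equals the string word. */
static int sketch_tok_is(const char *tok, size_t len, const char *word)
{
	return strlen(word) == len && memcmp(tok, word, len) == 0;
}

/* Walk a whitespace-separated command line and pick the PTI mode. */
static enum sketch_pti_mode sketch_parse_pti(const char *cmdline)
{
	/* Not specifying the option is equivalent to pti=auto. */
	enum sketch_pti_mode mode = SKETCH_PTI_AUTO;
	const char *p = cmdline;

	while (*p) {
		size_t len;

		p += strspn(p, " \t\n");	/* skip separators */
		len = strcspn(p, " \t\n");	/* length of this token */

		if (sketch_tok_is(p, len, "nopti") ||
		    sketch_tok_is(p, len, "pti=off"))
			mode = SKETCH_PTI_OFF;
		else if (sketch_tok_is(p, len, "pti=on"))
			mode = SKETCH_PTI_ON;
		else if (sketch_tok_is(p, len, "pti=auto"))
			mode = SKETCH_PTI_AUTO;

		p += len;
	}
	return mode;
}
```

Matching whole tokens keeps an unrelated option that merely starts with "nopti" from disabling the feature.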

>> +Page Table Isolation (pti, previously known as KAISER[1]) is a
>> +countermeasure against attacks on kernel address information such
>> +as the "Meltdown" approach[2].
> 
> It's not really just address information, but any data. Maybe "attacks
> that leak kernel memory"?

It's not just kernel leaks either, though.

>> +To avoid leaking address information, we create an new, independent
> 
> Same issue here. Also an -> a.

Will fix.

>> +copy of the page tables which are used only when running userspace
> 
> are -> is. The copy is singular.

I've reworded the sentence to remove the ambiguity.

>> +applications.  When the kernel is entered via syscalls, interrupts or
>> +exceptions, page tables are switched to the full "kernel" copy.  When
> 
> "the page tables".

No thanks.  It's fine the way it is.

>> +crippled by setting the NX bit in the top level.  This ensures
>> +that if a kernel->user CR3 switch is missed that userspace will
>> +crash immediately upon executing its first instruction.
> 
> "that userspace" -> "then userspace"




Re: [PATCH V7 12/12] arm64: dts: add clocks for SC9860

2018-01-04 Thread Chunyan Zhang
On 5 January 2018 at 07:01, Arnd Bergmann  wrote:
> On Thu, Jan 4, 2018 at 10:34 PM, Arnd Bergmann  wrote:
>> On Thu, Dec 7, 2017 at 1:57 PM, Chunyan Zhang
>>  wrote:
>>> Some clocks on SC9860 are in the same address area with syscon devices,
>>> those are what have a property of 'sprd,syscon' which would refer to
>>> syscon devices, others would have a reg property indicated their address
>>> ranges.
>>>
>>> Signed-off-by: Chunyan Zhang 
>>> ---
>>>  arch/arm64/boot/dts/sprd/sc9860.dtsi | 115 
>>> +++
>>>  arch/arm64/boot/dts/sprd/whale2.dtsi |  18 +-
>>>  2 files changed, 131 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/arch/arm64/boot/dts/sprd/sc9860.dtsi 
>>> b/arch/arm64/boot/dts/sprd/sc9860.dtsi
>>> index 7b7d8ce..bf03da4 100644
>>> --- a/arch/arm64/boot/dts/sprd/sc9860.dtsi
>>> +++ b/arch/arm64/boot/dts/sprd/sc9860.dtsi
>>> @@ -7,6 +7,7 @@
>>>   */
>>>
>>>  #include 
>>> +#include 
>>>  #include "whale2.dtsi"
>>
>> This caused a build error since the sprd,sc9860-clk.h file does not
>> exist, I'll revert or undo the patch tomorrow.
>
> I've taken another look, and fixing it by removing the broken #include
> was easier than undoing the patches, so I did that now, see
> https://patchwork.kernel.org/patch/10145773/

Ok, thanks Arnd!

Chunyan

>
>   Arnd


Re: [PATCH 05/23] x86, kaiser: unmap kernel from userspace page tables (core patch)

2018-01-04 Thread Dave Hansen
On 01/04/2018 08:16 PM, Yisheng Xie wrote:
>> === Page Table Poisoning ===
>>
>> KAISER has two copies of the page tables: one for the kernel and
>> one for when running in userspace.  
> 
> So, we have 2 page table, thinking about this case:
> If _ONE_ process includes _TWO_ threads, one run in user space, the other
> run in kernel, they can run in one core with Hyper-Threading, right?

Yes.

> So both userspace and kernel space is valid, right? And for one core
> with Hyper-Threading, they may share TLB, so the timing problem
> described in the paper may still exist?

No.  The TLB is managed per logical CPU (hyperthread), as is the CR3
register that points to the page tables.  Two threads running the same
process might use the same CR3 _value_, but that does not mean they
share TLB entries.

One thread *can* be in the kernel with the kernel page tables while the
other is in userspace with the user page tables active.  They will even
use a different PCID/ASID for the same page tables normally.

> Can this case still be protected by KAISER?

Yes.
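
A toy model of the point made above, as a sketch only: each logical CPU (hyperthread) carries its own CR3 value and its own TLB entries, and entries are tagged with a PCID, so a translation cached by the thread running in the kernel is not visible to its sibling running in userspace. The struct layout, slot count, and function names are assumptions made for illustration; real TLB hardware is considerably more involved.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy model only: names and sizes are hypothetical, not kernel code. */
#define TLB_SLOTS 4

struct tlb_entry {
    bool     valid;
    uint16_t pcid;   /* address-space tag the entry was filled under */
    uint64_t vpage;  /* virtual page number */
};

struct logical_cpu {
    uint64_t cr3_base;               /* page-table root active on this thread */
    uint16_t pcid;                   /* tag currently in use on this thread */
    struct tlb_entry tlb[TLB_SLOTS]; /* entries private to this logical CPU */
};

/* Cache a translation under the PCID that is active on this logical CPU. */
static void tlb_fill(struct logical_cpu *cpu, uint64_t vpage)
{
    struct tlb_entry *e = &cpu->tlb[vpage % TLB_SLOTS];

    e->valid = true;
    e->pcid  = cpu->pcid;
    e->vpage = vpage;
}

/* A lookup hits only on this CPU's own entries and only for the active PCID. */
static bool tlb_hit(const struct logical_cpu *cpu, uint64_t vpage)
{
    const struct tlb_entry *e = &cpu->tlb[vpage % TLB_SLOTS];

    return e->valid && e->pcid == cpu->pcid && e->vpage == vpage;
}
```

In this model, filling an entry on one `struct logical_cpu` never changes what `tlb_hit()` reports on the other, and switching the PCID on the same logical CPU hides entries filled under the previous tag.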


[RFC] boot failed when enable KAISER/KPTI

2018-01-04 Thread Xishi Qiu
I ran the latest RHEL 7.2 kernel with the KAISER/KPTI patch applied, and the boot failed.

...
[0.00] PM: Registered nosave memory: [mem 0x810-0x8ff]
[0.00] PM: Registered nosave memory: [mem 0x910-0xfff]
[0.00] PM: Registered nosave memory: [mem 0x1010-0x10ff]
[0.00] PM: Registered nosave memory: [mem 0x1110-0x17ff]
[0.00] PM: Regitered nosave memory: [mem 0x1810-0x18ff]
[0.00] e820: [mem 0x9000-0xfed1bfff] available for PCI devices
[0.00] Booting paravirtualized kernel on bare hardware
[0.00] setup_percpu: NR_CPUS:5120 nr_cpumask_bits:1536 nr_cpu_ids:1536 
nr_node_ids:8
[0.00] PERCPU: max_distance=0x180ffe24 too large for vmalloc space 
0x1fff
[0.00] setup_percpu: auto allocator failed (-22), falling back to page 
size
[0.00] PERCPU: 32 4K pages/cpu @c900 s107200 r8192 d15680
[0.00] Built 8 zonelists in Zone order, mobility grouping on.  Total 
pages: 132001804
[0.00] Policy zone: Normal
iosdevname=0 8250.nr_uarts=8 efi=old_map rdloaddriver=usb_storage 
rdloaddriver=sd_mod udev.event-timeout=600 softlockup_panic=0 
rcupdate.rcu_cpu_stall_timeout=300
[0.00] Intel-IOMMU: enabled
[0.00] PID hash table entries: 4096 (order: 3, 32768 bytes)
[0.00] x86/fpu: xstate_offset[2]: 0240, xstate_sizes[2]: 0100
[0.00] xsave: enabled xstate_bv 0x7, cntxt size 0x340
[0.00] AGP: Checking aperture...
[0.00] AGP: No AGP bridge found
[0.00] Memory: 526901612k/26910638080k available (6528k kernel code, 
26374249692k absent, 9486776k reserved, 4302k data, 1676k init)
[0.00] SLUB: HWalign=64, Order=0-3, MinObjects=0, CPUs=1536, Nodes=8
[0.00] x86/pti: Unmapping kernel while in userspace
[0.00] Hierarchical RCU implementation.
[0.00]  RCU restricting CPUs from NR_CPUS=5120 to nr_cpu_ids=1536.
[0.00]  Offload RCU callbacks from all CPUs
[0.00]  Offload RCU callbacks from CPUs: 0-1535.
[0.00] NR_IRQS:327936 nr_irqs:15976 0
[0.00] Console: colour dummy device 80x25
[0.00] console [tty0] enabled
[0.00] console [ttyS0] enabled
[0.00] allocated 2145910784 bytes of page_cgroup
[0.00] please try 'cgroup_disable=memory' option if you don't want 
memory cgroups
[0.00] Enabling automatic NUMA balancing. Configure with 
numa_balancing= or the kernel.numa_balancing sysctl
[0.00] tsc: Fast TSC calibration using PIT
[0.00] tsc: Detected 2799.999 MHz processor
[0.001803] Calibrating delay loop (skipped), value calculated using timer 
frequency.. 5599.99 BogoMIPS (lpj=279)
[0.012408] pid_max: default: 1572864 minimum: 12288
[0.017987] init_memory_mapping: [mem 0x5947f000-0x5b47efff]
[0.023701] init_memory_mapping: [mem 0x5b47f000-0x5b87efff]
[0.029369] init_memory_mapping: [mem 0x6d368000-0x6d3edfff]
[0.039130] BUG: unable to handle kernel paging request at 5b835f90
[0.046101] IP: [<5b835f90>] 0x5b835f8f
[0.050637] PGD 81f61067 PUD 190ffefff067 PMD 190ffeffd067 PTE 
5b835063
[0.057989] Oops: 0011 [#1] SMP 
[0.061241] Modules linked in:
[0.064304] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 
3.10.0-327.59.59.46.h42.x86_64 #1
[0.072280] Hardware name: Huawei FusionServer9032/IT91SMUB, BIOS BLXSV316 
11/14/2017
[0.080082] task: 8196e440 ti: 81958000 task.ti: 
81958000
[0.087539] RIP: 0010:[<5b835f90>]  [<5b835f90>] 0x5b835f8f
[0.094494] RSP: :8195be28  EFLAGS: 00010046
[0.099788] RAX: 80050033 RBX: 910fbc802000 RCX: 02d0
[0.106897] RDX: 0030 RSI: 02d0 RDI: 5b835f90
[0.114006] RBP: 8195bf38 R08: 0001 R09: 090fbc802000
[0.121116] R10: 88ffbcc07340 R11: 0001 R12: 0001
[0.128225] R13: 090fbc802000 R14: 02d0 R15: 0001
[0.135336] FS:  () GS:c900() 
knlGS:
[0.143398] CS:  0010 DS:  ES:  CR0: 80050033
[0.149124] CR2: 5b835f90 CR3: 01966000 CR4: 000606b0
[0.156234] DR0:  DR1:  DR2: 
[0.163344] DR3:  DR6: fffe0ff0 DR7: 0400
[0.170454] Call Trace:
[0.172899]  [] ? efi_call4+0x6c/0xf0
[0.178108]  [] ? native_flush_tlb_global+0x8e/0xc0
[0.184527]  [] ? set_memory_x+0x43/0x50
[0.189997]  [] ? efi_enter_virtual_mode+0x3bc/0x538
[0.196505]  [] start_kernel+0x39f/0x44f
[0.201972]  [] ? repair_env_string+0x5c/0x5c
[0.207872]  [] ? early_idt_handlers+0x120/0x120
[0.214030]  [] x86_64_start_reservations+0x2a/0x2c
[0.220449]  [] x86_64_start_kernel+0x152/0x175
[0.226521] Code:  Bad 
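
For reading the "Oops: 0011" line in the log above, the following sketch decodes an x86 page-fault error code using the architecturally defined bits (bit 0 present/protection, bit 1 write, bit 2 user, bit 3 reserved bit, bit 4 instruction fetch). The constant and function names are local to this example and are assumptions, not kernel identifiers.

```c
#include <stddef.h>
#include <stdio.h>

/* x86 page-fault error code bits; names here are local to this sketch. */
enum {
    PF_PROT  = 1u << 0, /* 0: page not present, 1: protection violation */
    PF_WRITE = 1u << 1, /* access was a write */
    PF_USER  = 1u << 2, /* access came from user mode */
    PF_RSVD  = 1u << 3, /* reserved bit set in a paging structure */
    PF_INSTR = 1u << 4, /* access was an instruction fetch */
};

/* Write a one-line description of the error code into buf. */
static int describe_pf(unsigned int code, char *buf, size_t len)
{
    return snprintf(buf, len, "%s %s in %s mode%s",
                    (code & PF_PROT) ? "protection violation on"
                                     : "not-present page during",
                    (code & PF_INSTR) ? "instruction fetch"
                    : (code & PF_WRITE) ? "write" : "read",
                    (code & PF_USER) ? "user" : "kernel",
                    (code & PF_RSVD) ? " (reserved bit set)" : "");
}
```

Calling `describe_pf(0x11, buf, sizeof(buf))` yields "protection violation on instruction fetch in kernel mode": the faulting page was present, but the CPU was not allowed to execute from it. That points at a permission problem, for example a no-execute restriction on that mapping, rather than a missing mapping, though the log alone does not settle the cause.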

Re: [PATCH 3/7] x86/enter: Use IBRS on syscall and interrupts

2018-01-04 Thread Dave Hansen
On 01/04/2018 08:51 PM, Andy Lutomirski wrote:
> Do we need an arch_prctl() to enable IBRS for user mode?

Eventually, once the dust settles.  I think there's a spectrum of
paranoia here, that is roughly (with increasing paranoia):

1. do nothing
2. do retpoline
3. do IBRS in kernel
4. do IBRS always

I think you're asking for ~3.5.

Patches for 1-3 are out there and 4 is pretty straightforward.  Doing an
arch_prctl() is still straightforward, but will be a much more niche
thing than any of the other choices.  Plus, with a user interface, we
have to argue over the ABI for at least a month or two. ;)
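
The spectrum above can be sketched as an ordered enum, with "~3.5" modelled as level 3 plus a per-task opt-in for user mode. All names are hypothetical and only illustrate the ordering; nothing here reflects an actual kernel interface or a settled ABI.

```c
#include <stdbool.h>

/* Hypothetical levels mirroring the numbered list; not a kernel interface. */
enum paranoia_level {
    PARANOIA_NONE        = 1, /* do nothing */
    PARANOIA_RETPOLINE   = 2, /* do retpoline */
    PARANOIA_IBRS_KERNEL = 3, /* do IBRS in kernel */
    PARANOIA_IBRS_ALWAYS = 4, /* do IBRS always */
};

/* IBRS is set while running kernel code at level 3 and above. */
static bool ibrs_in_kernel(enum paranoia_level level)
{
    return level >= PARANOIA_IBRS_KERNEL;
}

/*
 * IBRS stays set in user mode at level 4, or at level 3 when the task
 * has opted in (the "~3.5" case, e.g. via some arch_prctl()-style knob).
 */
static bool ibrs_in_user(enum paranoia_level level, bool task_opted_in)
{
    if (level >= PARANOIA_IBRS_ALWAYS)
        return true;
    return level == PARANOIA_IBRS_KERNEL && task_opted_in;
}
```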


Re: [PATCH 2/7] x86/enter: MACROS to set/clear IBRS

2018-01-04 Thread Dave Hansen
On 01/04/2018 08:54 PM, Andy Lutomirski wrote:
> On Thu, Jan 4, 2018 at 2:23 PM, Dave Hansen  wrote:
>> On 01/04/2018 02:21 PM, Tim Chen wrote:
>>>> Does this really have to live outside of arch/x86/entry/ ?
>>>>
>>> There are some inline C routines later in this file
>>> that will be needed by other functions.  Want to consolidate
>>> them in the same file.
>>
>> We could put all of the assembly into calling.h along with the PTI
>> assembly.  Seems as sane a place as anywhere else to put it.
> 
> We should also stop thinking that NMI is at all special.  All the
> paranoid entry paths + NMI should just save and restore it, just like
> CR3.  Otherwise we get nasty corner cases with MCE, kprobes, etc.

I've probably been too imprecise in my language here.  The goal is
absolutely to deal with all the paranoid paths.  It's just that the NMI
one is the easiest to understand and easiest to exercise.

It also *is* special because it's the only one needing paranoid handling
that does not use paranoid_exit itself.


linux-next: build failure after merge of the akpm-current tree

2018-01-04 Thread Stephen Rothwell
Hi Andrew,

After merging the akpm-current tree, today's linux-next build (x86_64
allmodconfig) failed like this:

mm/migrate.c: In function 'migrate_misplaced_page':
mm/migrate.c:1933:46: error: passing argument 2 of 'migrate_pages' from 
incompatible pointer type [-Werror=incompatible-pointer-types]
  nr_remaining = migrate_pages(, alloc_misplaced_dst_page,
  ^
mm/migrate.c:1358:5: note: expected 'struct page * (*)(struct page *, long 
unsigned int)' but argument is of type 'struct page * (*)(struct page *, long 
unsigned int,  int **)'
 int migrate_pages(struct list_head *from, new_page_t get_new_page,
 ^

Caused by commit

  d6f08a86f78a ("mm, migrate: remove reason argument from new_page_t")
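
The class of error reported above can be shown with a stand-alone sketch: once a callback type takes two parameters, every function passed through it must have exactly that signature. The types and names below are simplified stand-ins invented for this example, not the kernel's definitions.

```c
#include <stddef.h>

/* Simplified stand-ins; not the kernel's struct page or new_page_t. */
struct page {
    int nid;
};

/* Callback type with two parameters (no trailing "int **" argument). */
typedef struct page *(*new_page_fn)(struct page *page, unsigned long data);

static struct page pool[2] = { { .nid = 0 }, { .nid = 1 } };

/* Matches new_page_fn exactly, so it can be passed without a warning. */
static struct page *alloc_on_node(struct page *page, unsigned long data)
{
    (void)page;
    return &pool[data % 2];
}

/* Calls the callback and returns the node of the page it provided. */
static int migrate_one(struct page *page, new_page_fn get_new_page,
                       unsigned long data)
{
    struct page *newpage = get_new_page(page, data);

    return newpage ? newpage->nid : -1;
}
```

A callback that still declared a third `int **result` parameter would not match `new_page_fn`, and passing it to `migrate_one()` would produce an incompatible-pointer-type diagnostic of the same kind as the one quoted, which becomes a hard error under `-Werror=incompatible-pointer-types`.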

I applied the following fix patch for today (the mm/memory-failure.c
error turned up after fixing the above):

From: Stephen Rothwell 
Date: Fri, 5 Jan 2018 15:46:02 +1100
Subject: [PATCH] mm, migrate: remove reason argument from new_page_t fix

Signed-off-by: Stephen Rothwell 
---
 mm/memory-failure.c | 2 +-
 mm/migrate.c        | 3 +--
 2 files changed, 2 insertions(+), 3 deletions(-)

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 4acdf393a801..d530ac1db680 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1483,7 +1483,7 @@ int unpoison_memory(unsigned long pfn)
 }
 EXPORT_SYMBOL(unpoison_memory);
 
-static struct page *new_page(struct page *p, unsigned long private, int **x)
+static struct page *new_page(struct page *p, unsigned long private)
 {
int nid = page_to_nid(p);
 
diff --git a/mm/migrate.c b/mm/migrate.c
index 3cb0f5955b41..5d0dc7b85f90 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1797,8 +1797,7 @@ static bool migrate_balanced_pgdat(struct pglist_data 
*pgdat,
 }
 
 static struct page *alloc_misplaced_dst_page(struct page *page,
-  unsigned long data,
-  int **result)
+  unsigned long data)
 {
int nid = (int) data;
struct page *newpage;
-- 
2.15.0

-- 
Cheers,
Stephen Rothwell


Re: KASAN: slab-out-of-bounds Read in cap_inode_getsecurity

2018-01-04 Thread Eric Biggers
On Thu, Jan 04, 2018 at 08:58:02AM -0800, syzbot wrote:
> Hello,
> 
> syzkaller hit the following crash on
> 71ee203389f7cb1c1927eab22b95baa01405791c
> git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/master
> compiler: gcc (GCC) 7.1.1 20170620
> .config is attached
> Raw console output is attached.
> C reproducer is attached
> syzkaller reproducer is attached. See https://goo.gl/kgGztJ
> for information about syzkaller reproducers
> 
> 
> IMPORTANT: if you fix the bug, please add the following tag to the commit:
> Reported-by: syzbot+37db7b2a61b64a9ab...@syzkaller.appspotmail.com
> It will help syzbot understand when the bug is fixed. See footer for
> details.
> If you forward the report, please keep this part and the footer.
> 
> audit: type=1400 audit(1514753657.623:7): avc:  denied  { map } for
> pid=3504 comm="syzkaller926656" path="/root/syzkaller926656864" dev="sda1"
> ino=16481 scontext=unconfined_u:system_r:insmod_t:s0-s0:c0.c1023
> tcontext=unconfined_u:object_r:user_home_t:s0 tclass=file permissive=1
> ==
> BUG: KASAN: slab-out-of-bounds in cap_inode_getsecurity+0x621/0x7d0
> security/commoncap.c:408
> Read of size 4 at addr 8801bea30b00 by task syzkaller926656/3504
> 
> CPU: 1 PID: 3504 Comm: syzkaller926656 Not tainted 4.15.0-rc5+ #244
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
> Google 01/01/2011
> Call Trace:
>  __dump_stack lib/dump_stack.c:17 [inline]
>  dump_stack+0x194/0x257 lib/dump_stack.c:53
>  print_address_description+0x73/0x250 mm/kasan/report.c:252
>  kasan_report_error mm/kasan/report.c:351 [inline]
>  kasan_report+0x25b/0x340 mm/kasan/report.c:409
>  __asan_report_load4_noabort+0x14/0x20 mm/kasan/report.c:429
>  cap_inode_getsecurity+0x621/0x7d0 security/commoncap.c:408
>  security_inode_getsecurity+0xcd/0x110 security/security.c:809
>  xattr_getsecurity+0xd3/0x1f0 fs/xattr.c:244
>  vfs_getxattr+0xc8/0x110 fs/xattr.c:333
>  getxattr+0x116/0x2a0 fs/xattr.c:540
>  path_getxattr+0xed/0x170 fs/xattr.c:568
>  SYSC_getxattr fs/xattr.c:580 [inline]
>  SyS_getxattr+0x33/0x40 fs/xattr.c:577
>  entry_SYSCALL_64_fastpath+0x23/0x9a

Already fixed in Linus's tree.

#syz fix: capabilities: fix buffer overread on very short xattr
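
The report describes a 4-byte read beyond the end of a slab object, and the fix title mentions a very short xattr. The general pattern for avoiding that kind of overread is to compare the buffer length against the size of the field before reading it. The sketch below shows only that pattern; the function name and layout are assumptions made for this example, and it is not the actual kernel patch.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Read a leading 32-bit field from an attribute buffer.
 * Returns 0 and stores the value on success, or -1 if the buffer is
 * missing or too short to contain the field. Hypothetical helper.
 */
static int read_leading_u32(const void *buf, size_t len, uint32_t *out)
{
    if (buf == NULL || len < sizeof(*out))
        return -1; /* refuse to read past the end of a short buffer */

    memcpy(out, buf, sizeof(*out));
    return 0;
}
```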


Re: [PATCH 2/7] x86/enter: MACROS to set/clear IBRS

2018-01-04 Thread Andy Lutomirski
On Thu, Jan 4, 2018 at 2:23 PM, Dave Hansen  wrote:
> On 01/04/2018 02:21 PM, Tim Chen wrote:
>>> Does this really have to live outside of arch/x86/entry/ ?
>>>
>> There are some inline C routines later in this file
>> that will be needed by other functions.  Want to consolidate
>> them in the same file.
>
> We could put all of the assembly into calling.h along with the PTI
> assembly.  Seems as sane a place as anywhere else to put it.
>

We should also stop thinking that NMI is at all special.  All the
paranoid entry paths + NMI should just save and restore it, just like
CR3.  Otherwise we get nasty corner cases with MCE, kprobes, etc.
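
A minimal sketch of the save-and-restore idea, assuming a plain variable as a stand-in for the speculation-control state. The names are hypothetical and this is not the entry code: the point is only that a handler which may interrupt kernel code puts back the value it found on entry, instead of unconditionally clearing it on exit.

```c
#include <stdint.h>

/* Stand-in for the speculation-control state; hypothetical, not an MSR. */
static uint64_t toy_spec_ctrl;

/* On entry: remember the value that was in effect, then set IBRS (bit 0). */
static uint64_t paranoid_enter(void)
{
    uint64_t saved = toy_spec_ctrl;

    toy_spec_ctrl = 1;
    return saved;
}

/* On exit: restore whatever was in effect before the handler ran. */
static void paranoid_leave(uint64_t saved)
{
    toy_spec_ctrl = saved;
}
```

If the handler interrupts code that already had the bit set, the bit is still set after `paranoid_leave()`; if it interrupts code that had it clear, it is clear again afterwards. An exit path that always cleared the bit would get the first case wrong.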


Re: [PATCH 3/7] x86/enter: Use IBRS on syscall and interrupts

2018-01-04 Thread Andy Lutomirski
On Thu, Jan 4, 2018 at 4:08 PM, Dave Hansen  wrote:
> On 01/04/2018 02:33 PM, Peter Zijlstra wrote:
>> On Thu, Jan 04, 2018 at 09:56:44AM -0800, Tim Chen wrote:
>>> Set IBRS upon kernel entrance via syscall and interrupts. Clear it
>>> upon exit.
>>
>> So not only did we add a CR3 write, we're now adding an MSR write to the
>> entry/exit paths. Please tell me that these are 'fast' MSRs? Given
>> people are already reporting stupid numbers with just the existing
>> PTI/CR3, what kind of pain are we going to get from adding this?
>
> This "dynamic IBRS" that does runtime switching will not be on by
> default and will be patched around by alternatives unless someone
> explicitly opts in.
>
> If you decide you want the additional protection that it provides, you
> can take the performance hit.  How much is that?  We've been saying that
> these new MSRs are roughly as expensive as the CR3 writes.  How
> expensive are those?  Don't take my word for it, a few folks were
> talking about it today:
>
> Google says[1]: "We see negligible impact on performance."
> Amazon says[2]: "We don’t expect meaningful performance impact."
>
> I chopped a few qualifiers out of there, but I think that roughly
> captures the sentiment.
>
> 1.
> https://security.googleblog.com/2018/01/more-details-about-mitigations-for-cpu_4.html
> 2.
> http://www.businessinsider.com/google-amazon-performance-hit-meltdown-spectre-fixes-overblown-2018-1

Do we need an arch_prctl() to enable IBRS for user mode?


Re: [PATCH 06/12] IB/core: Add optional PCI P2P flag to rdma_rw_ctx_[init|destroy]()

2018-01-04 Thread Jason Gunthorpe
On Thu, Jan 04, 2018 at 04:44:00PM -0700, Logan Gunthorpe wrote:
> On 04/01/18 03:13 PM, Jason Gunthorpe wrote:
> >On Thu, Jan 04, 2018 at 12:52:24PM -0700, Logan Gunthorpe wrote:
> >>We tried things like this in an earlier iteration[1] which assumed the SG
> >>was homogenous (all P2P or all regular memory). This required serious
> >>ugliness to try and ensure SGs were in fact homogenous[2].
> >
> >I'm confused, these patches already assume the sg is homogenous,
> >right? Sure looks that way. So [2] is just debugging??
> 
> Yes, but it's a bit different to expect that someone calling
> pci_p2pmem_map_sg() will know what they're doing and provide a homogenous
> SG. It is relatively clear by convention that the entire SG must be
> homogenous given they're calling a pci_p2pmem function. Where as, allowing
> P2P SGs into the core DMA code means all we can do is hope that future
> developers don't screw it up and allow P2P pages to mix in with regular
> pages.

Well that argument applies equally to the RDMA RW API wrappers around
the DMA API. I think it is fine if sgl are defined to only have P2P or
not, and that debugging support seemed reasonable to me..

> It's also very difficult to add similar functionality to dma_map_page seeing
> dma_unmap_page won't have any way to know what it's dealing with. It just
> seems confusing to support P2P in the SG version and not the page version.

Well, this proposal is to support P2P in only some RDMA APIs and not
others, so it seems about as confusing to me..

> >Then we don't need to patch RDMA because RDMA is not special when it
> >comes to P2P. P2P should work with everything.
> 
> Yes, I agree this would be very nice.

Well, it is more than very nice. We have to keep RDMA working after
all, and if you make it even more special things become harder for us.

It is already the case that DMA in RDMA is very strange. We have
drivers that provide their own DMA ops, for instance.

And on that topic, does this scheme work with HFI?

At first glance, it looks like no. The PCI device the HFI device is
attached to may be able to do P2P, so it should be able to trigger the
support.

However, substituting the p2p_dma_map for the real device op dma_map
will cause a kernel crash when working with HFI. HFI uses custom DMA
ops that return CPU addresses in the dma_addr_t, which the driver
handles in various special ways. One cannot just replace them with PCI
bus addresses.

So, this kinda looks to me like it causes bad breakage for some RDMA
drivers??

This is why P2P must fit into the common DMA framework somehow: we
rely on these abstractions to work properly and fully in RDMA.

I think you should consider pushing this directly into the dma_ops
implementations. Add a p2p_supported flag to struct dma_map_ops, and
only if it is true can a caller pass a homogeneous SGL to ops->map_sg.
Only map_sg would be supported for P2P. Upgraded implementations can
call the helper function.
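
One way to picture the suggestion, as a rough sketch only: the ops table carries a capability flag, and the common mapping entry point refuses a P2P scatterlist unless the implementation has declared support. Every type and name below is a toy stand-in invented for illustration; none of this is the real struct dma_map_ops or an existing kernel API.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy scatterlist: homogeneous by definition, either all P2P or all regular. */
struct toy_sgl {
    bool is_p2p;
    int  nents;
};

/* Toy ops table with the proposed capability flag. */
struct toy_dma_map_ops {
    bool p2p_supported;
    int (*map_sg)(const struct toy_sgl *sgl);
};

/* Trivial implementation: pretend every entry was mapped. */
static int toy_map_all(const struct toy_sgl *sgl)
{
    return sgl->nents;
}

/*
 * Common entry point: returns the number of mapped entries, or 0 when
 * a P2P list is handed to ops that have not opted in.
 */
static int toy_dma_map_sg(const struct toy_dma_map_ops *ops,
                          const struct toy_sgl *sgl)
{
    if (sgl->is_p2p && !ops->p2p_supported)
        return 0;
    return ops->map_sg(sgl);
}
```

With this shape, a driver that supplies its own DMA ops and leaves the flag clear keeps working for regular memory and never sees a P2P list, while upgraded implementations set the flag and handle P2P inside their own map_sg.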

Jason


Re: [PATCH 06/12] IB/core: Add optional PCI P2P flag to rdma_rw_ctx_[init|destroy]()

2018-01-04 Thread Jason Gunthorpe
On Thu, Jan 04, 2018 at 04:44:00PM -0700, Logan Gunthorpe wrote:
> On 04/01/18 03:13 PM, Jason Gunthorpe wrote:
> >On Thu, Jan 04, 2018 at 12:52:24PM -0700, Logan Gunthorpe wrote:
> >>We tried things like this in an earlier iteration[1] which assumed the SG
> >>was homogenous (all P2P or all regular memory). This required serious
> >>ugliness to try and ensure SGs were in fact homogenous[2].
> >
> >I'm confused, these patches already assume the sg is homogenous,
> >right? Sure looks that way. So [2] is just debugging??
> 
> Yes, but it's a bit different to expect that someone calling
> pci_p2pmem_map_sg() will know what they're doing and provide a homogenous
> SG. It is relatively clear by convention that the entire SG must be
> homogenous given they're calling a pci_p2pmem function. Whereas allowing
> P2P SGs into the core DMA code means all we can do is hope that future
> developers don't screw it up and allow P2P pages to mix in with regular
> pages.

Well that argument applies equally to the RDMA RW API wrappers around
the DMA API. I think it is fine if sgl are defined to only have P2P or
not, and that debugging support seemed reasonable to me..

> It's also very difficult to add similar functionality to dma_map_page seeing
> dma_unmap_page won't have any way to know what it's dealing with. It just
> seems confusing to support P2P in the SG version and not the page version.

Well, this proposal is to support P2P in only some RDMA APIs and not
others, so it seems about as confusing to me..

> >Then we don't need to patch RDMA because RDMA is not special when it
> >comes to P2P. P2P should work with everything.
> 
> Yes, I agree this would be very nice.

Well, it is more than very nice. We have to keep RDMA working after
all, and if you make it even more special things become harder for us.

It is already the case that DMA in RDMA is very strange. We have
drivers that provide their own DMA ops, for instance.

And on that topic, does this scheme work with HFI?

On first glance, it looks like no. The PCI device the HFI device is
attached to may be able to do P2P, so it should be able to trigger the
support.

However, substituting the p2p_dma_map for the real device op dma_map
will cause a kernel crash when working with HFI. HFI uses custom DMA
ops that return CPU addresses in the dma_addr_t, which the driver
handles in various special ways. One cannot just replace them with PCI
bus addresses.

So, this kinda looks to me like it causes bad breakage for some RDMA
drivers??

This is why P2P must fit into the common DMA framework somehow: we
rely on these abstractions to work properly and fully in RDMA.

I think you should consider pushing this directly into the dma_ops
implementations. Add a p2p_supported flag to struct dma_map_ops, and
only if it is true can a caller pass a homogeneous SGL to ops->map_sg.
Only map_sg would be supported for P2P. Upgraded implementations can
call the helper function.
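
Roughly what I have in mind, as a userspace sketch rather than a
patch. The struct and function names here (sg_entry,
model_dma_map_ops, sgl_kind, model_dma_map_sg, map_all) are
hypothetical stand-ins, not the kernel's definitions. The only thing
taken from the proposal is the p2p_supported flag and the rule that
P2P goes through map_sg only. Returning 0 on refusal mirrors how
dma_map_sg reports failure:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one scatterlist entry. */
struct sg_entry {
	bool is_p2p;	/* true if the page is PCI P2P memory */
};

/* Hypothetical model of struct dma_map_ops with the proposed flag. */
struct model_dma_map_ops {
	bool p2p_supported;
	int (*map_sg)(const struct sg_entry *sgl, size_t nents);
};

/* Returns 1 if every entry is P2P, 0 if none is, -1 if mixed. */
static int sgl_kind(const struct sg_entry *sgl, size_t nents)
{
	size_t i, p2p = 0;

	for (i = 0; i < nents; i++)
		if (sgl[i].is_p2p)
			p2p++;

	if (p2p == 0)
		return 0;
	return p2p == nents ? 1 : -1;
}

/* Returns the number of entries mapped, or 0 if the SGL is refused. */
static int model_dma_map_sg(const struct model_dma_map_ops *ops,
			    const struct sg_entry *sgl, size_t nents)
{
	int kind;

	assert(ops != NULL && ops->map_sg != NULL);

	kind = sgl_kind(sgl, nents);
	if (kind < 0)
		return 0;	/* mixed SGLs are never allowed */
	if (kind == 1 && !ops->p2p_supported)
		return 0;	/* ops not upgraded for P2P */

	return ops->map_sg(sgl, nents);
}

/* Trivial map_sg used for illustration: "maps" every entry. */
static int map_all(const struct sg_entry *sgl, size_t nents)
{
	(void)sgl;
	return (int)nents;
}
```

Ops that never set the flag (custom ones like HFI's included) would
simply refuse a P2P SGL instead of being handed addresses they cannot
handle, and regular SGLs keep working as before.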

Jason

