Re: [PATCH v3] arm64: v8.4: Support for new floating point multiplication instructions
On Fri, Jan 05, 2018 at 09:22:54AM +0800, gengdongjiu wrote:
> Hi will/catalin
>
> On 2017/12/13 18:09, Suzuki K Poulose wrote:
> > On 13/12/17 10:13, Dongjiu Geng wrote:
> >> ARM v8.4 extensions add new neon instructions for performing a
> >> multiplication of each FP16 element of one vector with the corresponding
> >> FP16 element of a second vector, and to add or subtract this without an
> >> intermediate rounding to the corresponding FP32 element in a third vector.
> >>
> >> This patch detects this feature and let the userspace know about it via a
> >> HWCAP bit and MRS emulation.
> >>
> >> Cc: Dave Martin
> >> Cc: Suzuki K Poulose
> >> Signed-off-by: Dongjiu Geng
> >> Reviewed-by: Dave Martin
> >
> > Looks good to me.
> >
> > Reviewed-by: Suzuki K Poulose
>
> sorry to disturb you. Reminder, hope this patch can be applied to Linux
> 4.15-rc7.

New features should not be going into 4.15-rc, that should be a 4.16-rc1
thing, right?

thanks,

greg k-h
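For readers wondering what "let the userspace know about it via a HWCAP bit" looks like from the other side, here is a minimal userspace sketch. The thread does not name the bit; the macro name HWCAP_ASIMDFHM and the value (1UL << 23) are assumptions here and should be checked against the arm64 uapi hwcap header of the kernel in use. The helper names hwcap_has() and cpu_has_fp16_fml() are made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <sys/auxv.h>	/* getauxval(), AT_HWCAP */

/*
 * Assumption: name and bit position of the feature flag. Neither is
 * stated in the thread; verify against <asm/hwcap.h> on arm64.
 */
#ifndef HWCAP_ASIMDFHM
#define HWCAP_ASIMDFHM (1UL << 23)
#endif

/* Pure helper, so the bit test can be exercised on any machine. */
static bool hwcap_has(unsigned long hwcaps, unsigned long bit)
{
	assert(bit != 0);	/* an empty mask is a caller bug */
	return (hwcaps & bit) != 0;
}

/* Asks the running kernel; the answer is only meaningful on arm64. */
static bool cpu_has_fp16_fml(void)
{
	return hwcap_has(getauxval(AT_HWCAP), HWCAP_ASIMDFHM);
}
```

A program would call cpu_has_fp16_fml() once at startup and pick a code path that uses the new instructions only when it returns true.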
Re: [PATCH 4.4 00/37] 4.4.110-stable review
On Thu, Jan 04, 2018 at 03:00:29PM -0700, Shuah Khan wrote:
> On 01/03/2018 01:11 PM, Greg Kroah-Hartman wrote:
> > This is the start of the stable review cycle for the 4.4.110 release.
> > There are 37 patches in this series, all will be posted as a response
> > to this one. If anyone has any issues with these being applied, please
> > let me know.
> >
> > Responses should be made by Fri Jan 5 19:50:38 UTC 2018.
> > Anything received after that time might be too late.
> >
> > The whole patch series can be found in one patch at:
> > 	kernel.org/pub/linux/kernel/v4.x/stable-review/patch-4.4.110-rc1.gz
> > or in the git tree and branch at:
> > 	git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git
> > linux-4.4.y
> > and the diffstat can be found below.
> >
> > thanks,
> >
> > greg k-h
> >
>
> Based on the email threads, I expected to see issues, however,
> compiled and booted on my test system. No dmesg regressions.

Hey, you got lucky :)

Thanks for testing all of these and letting me know.

greg k-h
Re: [PATCH 4.14 00/14] 4.14.12-stable review
On Thu, Jan 04, 2018 at 04:12:31PM -0800, Kevin Hilman wrote:
> kernelci.org bot writes:
>
> > stable-rc/linux-4.14.y boot: 118 boots: 4 failed, 113 passed with 1 offline
> > (v4.14.11-15-g732141e47ee6)
> >
> > Full Boot Summary:
> > https://kernelci.org/boot/all/job/stable-rc/branch/linux-4.14.y/kernel/v4.14.11-15-g732141e47ee6/
> > Full Build Summary:
> > https://kernelci.org/build/stable-rc/branch/linux-4.14.y/kernel/v4.14.11-15-g732141e47ee6/
> >
> > Tree: stable-rc
> > Branch: linux-4.14.y
> > Git Describe: v4.14.11-15-g732141e47ee6
> > Git Commit: 732141e47ee614d70aeb8ad828a977ad19447e87
> > Git URL:
> > http://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git
> > Tested: 68 unique boards, 23 SoC families, 16 builds out of 185
> >
> > Boot Regressions Detected:
>
> TL;DR; All is well.

Thanks for the summary of all of these, and for your continued testing.

greg k-h
Re: [PATCH] tty: fix data race in n_tty_receive_buf_common
Hi Alan,

Can you make that code available otherwise it's impossible to see what
the problem might be.

https://source.codeaurora.org/quic/la/kernel/msm-4.9/tree/drivers/tty/serial?h=msm-4.9

As discussed, there not seems a problem as we are getting print request
even when port seems to closed.

	tty_ldisc_lock(tty, 5 * HZ);
	tty_ldisc_setup(tty);
	tty_ldisc_unlock(tty)

But in the above lock, there is a chance that flush_to_ldisc will occur
first and acquire a lock in tty_ldisc_ref itself. So this may fail. I am
not very sure here; please correct me if I am missing something.

So can we not simply return from flush_to_ldisc when we know disc_data is
not valid, like we are already doing for tty and ldisc?

	if (tty->disc_data == NULL) {
		tty_ldisc_deref(disc);
		return;
	}

Regards
Gaurav

--
Qualcomm India Private Limited, on behalf of Qualcomm Innovation Center, Inc.
is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project.
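The early return proposed above can be modelled outside the kernel to make the control flow explicit. This is only a sketch: the struct and function names are hypothetical userspace stand-ins, no locking is modelled, and it says nothing about whether the check is race-free in the real flush_to_ldisc. What it does show is that the guard must drop the reference it took before bailing out.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel objects, illustration only. */
struct ldisc_model {
	int refs;		/* outstanding references */
};

struct tty_model {
	struct ldisc_model *ldisc;
	void *disc_data;	/* NULL when ldisc private data is not valid */
	int delivered;		/* number of flushes that reached the ldisc */
};

static struct ldisc_model *ldisc_ref(struct tty_model *tty)
{
	if (tty->ldisc)
		tty->ldisc->refs++;
	return tty->ldisc;
}

static void ldisc_deref(struct ldisc_model *ld)
{
	assert(ld->refs > 0);
	ld->refs--;
}

/* Returns 1 if data was delivered, 0 if the flush bailed out early. */
static int flush_to_ldisc_model(struct tty_model *tty)
{
	struct ldisc_model *disc = ldisc_ref(tty);

	if (disc == NULL)
		return 0;

	if (tty->disc_data == NULL) {	/* the proposed extra guard */
		ldisc_deref(disc);	/* release the reference taken above */
		return 0;
	}

	tty->delivered++;		/* stands in for the receive path */
	ldisc_deref(disc);
	return 1;
}
```

In this model a flush with disc_data == NULL leaves refs at zero and delivers nothing, which is the behaviour the mail asks for.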
Crypto Fixes for 4.15
Hi Linus:

This push fixes the following issues:

- Racy use of ctx->rcvused in af_alg.
- algif_aead crash in chacha20poly1305.
- Freeing bogus pointer in pcrypt.
- Build error on MIPS in mpi.
- Memory leak in inside-secure.
- Memory overwrite in inside-secure.
- NULL pointer dereference in inside-secure.
- State corruption in inside-secure.
- Build error without CRYPTO_GF128MUL in chelsio.
- Use after free in n2.

Please pull from

git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6.git linus

Antoine Ténart (3):
      crypto: inside-secure - free requests even if their handling failed
      crypto: inside-secure - fix request allocations in invalidation path
      crypto: inside-secure - do not use areq->result for partial results

Arnd Bergmann (1):
      crypto: chelsio - select CRYPTO_GF128MUL

Eric Biggers (2):
      crypto: chacha20poly1305 - validate the digest size
      crypto: pcrypt - fix freeing pcrypt instances

James Hogan (1):
      lib/mpi: Fix umul_ppmm() for MIPS64r6

Jan Engelhardt (1):
      crypto: n2 - cure use after free

Jonathan Cameron (1):
      crypto: af_alg - Fix race around ctx->rcvused by making it atomic_t

Ofer Heifetz (1):
      crypto: inside-secure - per request invalidation

 crypto/af_alg.c                                |  4 +-
 crypto/algif_aead.c                            |  2 +-
 crypto/algif_skcipher.c                        |  2 +-
 crypto/chacha20poly1305.c                      |  6 +-
 crypto/pcrypt.c                                | 19 ++---
 drivers/crypto/chelsio/Kconfig                 |  1 +
 drivers/crypto/inside-secure/safexcel.c        |  1 +
 drivers/crypto/inside-secure/safexcel_cipher.c | 85 --
 drivers/crypto/inside-secure/safexcel_hash.c   | 89 +---
 drivers/crypto/n2_core.c                       |  3 +
 include/crypto/if_alg.h                        |  5 +-
 lib/mpi/longlong.h                             | 18 -
 12 files changed, 173 insertions(+), 62 deletions(-)

Thanks,
--
Email: Herbert Xu
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
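The first item in the list ("Fix race around ctx->rcvused by making it atomic_t") is a common pattern: a usage counter updated from several contexts must be changed with atomic read-modify-write operations, otherwise concurrent updates can be lost. The sketch below uses C11 <stdatomic.h> in userspace as an analogue of the kernel's atomic_t; it is not the af_alg code, and struct rx_ctx and the rx_* helpers are hypothetical names.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical stand-in for a per-socket context. */
struct rx_ctx {
	atomic_uint rcvused;	/* receive-buffer bytes currently accounted */
	unsigned int rcvbuf;	/* upper limit */
};

static void rx_init(struct rx_ctx *ctx, unsigned int limit)
{
	atomic_init(&ctx->rcvused, 0);
	ctx->rcvbuf = limit;
}

/* Account len more bytes; safe against concurrent callers. */
static void rx_charge(struct rx_ctx *ctx, unsigned int len)
{
	atomic_fetch_add(&ctx->rcvused, len);
}

/* Release len bytes previously charged. */
static void rx_uncharge(struct rx_ctx *ctx, unsigned int len)
{
	assert(atomic_load(&ctx->rcvused) >= len);
	atomic_fetch_sub(&ctx->rcvused, len);
}

/* Remaining room, clamped at zero. */
static unsigned int rx_room(struct rx_ctx *ctx)
{
	unsigned int used = atomic_load(&ctx->rcvused);

	return used < ctx->rcvbuf ? ctx->rcvbuf - used : 0;
}
```

With a plain unsigned int and `ctx->rcvused += len`, two threads charging at the same time could both read the old value and one update would vanish; atomic_fetch_add() makes each update indivisible.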
Re: [GIT PULL 2/3] SOC: Keystone SOC update for 4.16
On Wed, Dec 27, 2017 at 06:07:51PM -0800, Santosh Shilimkar wrote:
> The following changes since commit 4fbd8d194f06c8a3fd2af1ce560ddb31f7ec8323:
>
>   Linux 4.15-rc1 (2017-11-26 16:01:47 -0800)
>
> are available in the git repository at:
>
>   git://git.kernel.org/pub/scm/linux/kernel/git/ssantosh/linux-keystone.git
> tags/keystone_driver_soc_for_4.16
>
> for you to fetch changes up to aefc5818553680c50c9f6840e47c01b80edd9b3a:
>
>   soc: ti: fix max dup length for kstrndup (2017-12-16 14:45:33 -0800)
>
>
> SOC: Keystone Soc driver updates for 4.16
>
>  - TI EMIF-SRAM driver
>  - TI SCI print format fix
>  - Navigator strndup lenth fix
>
>
> Arnd Bergmann (1):
>       memory: ti-emif-sram: remove unused variable
>
> Dave Gerlach (2):
>       Documentation: dt: Update ti,emif bindings
>       memory: ti-emif-sram: introduce relocatable suspend/resume handlers
>
> Ma Shimiao (1):
>       soc: ti: fix max dup length for kstrndup
>
> Nishanth Menon (1):
>       firmware: ti_sci: Use %zu for size_t print format
>
>  .../bindings/memory-controllers/ti/emif.txt |  17 +-
>  drivers/firmware/ti_sci.c                   |   4 +-
>  drivers/memory/Kconfig                      |  10 +
>  drivers/memory/Makefile                     |   8 +
>  drivers/memory/Makefile.asm-offsets         |   5 +
>  drivers/memory/emif-asm-offsets.c           |  92 ++
>  drivers/memory/emif.h                       |  17 ++
>  drivers/memory/ti-emif-pm.c                 | 324
>  drivers/memory/ti-emif-sram-pm.S            | 334
> +
>  drivers/soc/ti/knav_qmss_queue.c            |   4 +-
>  include/linux/ti-emif-sram.h                |  69 +

Based on the contents, I merged this into next/drivers instead of next/soc.

-Olof
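The "Use %zu for size_t print format" entry in the shortlog above refers to a general C rule rather than anything specific to ti_sci: size_t has different widths on 32-bit and 64-bit targets, so %d or %lu is wrong on one of them, while %zu always matches. A small userspace sketch (format_len() is a made-up helper, not code from the driver):

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Writes "len=<n>" into buf and returns what snprintf() returns,
 * i.e. the number of characters the full string needs.
 */
static int format_len(char *buf, size_t bufsz, size_t len)
{
	assert(buf != NULL && bufsz > 0);

	/* %zu is the conversion for size_t regardless of word size. */
	return snprintf(buf, bufsz, "len=%zu", len);
}
```

Compilers that check format strings warn when the conversion and the argument type disagree, which is usually how such mismatches are found.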
Re: [GIT PULL] arm64: dts: uniphier: UniPhier DT updates (64bit) for v4.16
On Fri, Dec 29, 2017 at 10:35:38PM +0900, Masahiro Yamada wrote:
> Hi Arnd, Olof,
>
> Here are UniPhier DT (64bit) updates for the v4.16 merge window.
> Please pull!
>
>
> The following changes since commit 50c4c4e268a2d7a3e58ebb698ac74da0de40ae36:
>
>   Linux 4.15-rc3 (2017-12-10 17:56:26 -0800)
>
> are available in the git repository at:
>
>   git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-uniphier.git
> tags/uniphier-dt64-v4.16
>
> for you to fetch changes up to dbdae8474e08fc1194102bef95dc96db435c15da:
>
>   arm64: dts: uniphier: enable more serial ports for PXs3 ref board
> (2017-12-29 22:03:26 +0900)
>
>
> UniPhier ARM64 SoC DT updates for v4.16
>
> - clean up gpios properties by macro
> - add GPIO hog for PXs3 reference node
> - add has-transaction-translator property to generic-ehci nodes
> - enable more serial ports for PXs3 reference node

Merged, thanks!

-Olof
Re: [GIT PULL] ARM: at91: drivers for 4.16
On Sun, Dec 31, 2017 at 04:34:42PM +0100, Alexandre Belloni wrote:
> Arnd, Olof,
>
> A single harmless change for this pull request. I hope you'll enjoy this
> New Year's Eve.
>
> The following changes since commit 4fbd8d194f06c8a3fd2af1ce560ddb31f7ec8323:
>
>   Linux 4.15-rc1 (2017-11-26 16:01:47 -0800)
>
> are available in the Git repository at:
>
>   git://git.kernel.org/pub/scm/linux/kernel/git/abelloni/linux.git
> tags/at91-ab-4.16-drivers
>
> for you to fetch changes up to 1203839290f151b84f5e54165d6d039e9514b236:
>
>   pcmcia: at91_cf: Use PTR_ERR_OR_ZERO() (2017-11-29 21:58:58 +0100)
>
>
> drivers for 4.16
>
>  - use PTR_ERR_OR_ZERO were relevant in at91_cf

Merged, thanks.

-Olof
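For context on the single change in this pull: PTR_ERR_OR_ZERO() collapses the common "if the pointer encodes an error, return that error, otherwise return 0" sequence into one call. The block below is a userspace re-creation of the error-pointer helpers, modelled on the kernel's include/linux/err.h and written from memory, so treat it as a sketch of the idea rather than the kernel source. probe_result() is a hypothetical caller.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define MAX_ERRNO 4095

/* Encode a negative errno value in a pointer. */
static inline void *ERR_PTR(long error)
{
	assert(error < 0 && error >= -MAX_ERRNO);
	return (void *)(intptr_t)error;
}

/* Decode the errno value again. */
static inline long PTR_ERR(const void *ptr)
{
	return (long)(intptr_t)ptr;
}

/* Error pointers occupy the top MAX_ERRNO addresses. */
static inline int IS_ERR(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

static inline int PTR_ERR_OR_ZERO(const void *ptr)
{
	return IS_ERR(ptr) ? (int)PTR_ERR(ptr) : 0;
}

/*
 * Long form:  if (IS_ERR(res)) return PTR_ERR(res); return 0;
 * Short form: the single call below.
 */
static int probe_result(const void *res)
{
	return PTR_ERR_OR_ZERO(res);
}
```

The two forms behave identically; the helper only removes boilerplate at the end of functions such as probe routines.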
Re: [GIT PULL] ARM: dts: uniphier: UniPhier DT updates for v4.16
Hi!

On Fri, Dec 29, 2017 at 10:32:24PM +0900, Masahiro Yamada wrote:
> Hi Arnd, Olof,
>
> Here are UniPhier DT (32bit) updates for the v4.16 merge window.
> Please pull!
>
>
> The following changes since commit 50c4c4e268a2d7a3e58ebb698ac74da0de40ae36:
>
>   Linux 4.15-rc3 (2017-12-10 17:56:26 -0800)
>
> are available in the git repository at:
>
>   git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-uniphier.git
> tags/uniphier-dt-v4.16

Tiny tiny nit: It makes our life a little easier if you don't linewrap
the URL+tag, since it's easier to copy-paste without a linebreak.

> for you to fetch changes up to 6fa9b0255099fcd289f7e3857714532843044c76:
>
>   ARM: dts: uniphier: add has-transaction-translator property to usb
> node for LD4, sLD8 and Pro4 (2017-12-27 23:59:37 +0900)
>
>
> UniPhier ARM SoC DT updates for v4.16
>
> - clean up gpios properties by macro
> - add efuse nodes
> - add has-transaction-translator property to generic-ehci nodes
>
>
> Keiji Hayashibara (1):
>       ARM: dts: uniphier: add efuse node for UniPhier 32bit SoC
>
> Kunihiko Hayashi (1):
>       ARM: dts: uniphier: add has-transaction-translator property to
> usb node for LD4, sLD8 and Pro4

Another small nit: This patch subject is a bit on the long side. Try to
keep it to ~60 characters if you can.

Merged the branch. Thanks!

-Olof

>
> Masahiro Yamada (1):
>       ARM: dts: uniphier: use macros in dt-bindings header
>
>  arch/arm/boot/dts/uniphier-ld4-ref.dts  |  2 +-
>  arch/arm/boot/dts/uniphier-ld4.dtsi     | 23 +
>  arch/arm/boot/dts/uniphier-ld6b-ref.dts |  2 +-
>  arch/arm/boot/dts/uniphier-pro4-ref.dts |  2 +-
>  arch/arm/boot/dts/uniphier-pro4.dtsi    | 27 +++
>  arch/arm/boot/dts/uniphier-pro5.dtsi    | 33
>  arch/arm/boot/dts/uniphier-pxs2.dtsi    | 19 ++
>  arch/arm/boot/dts/uniphier-sld8-ref.dts |  2 +-
>  arch/arm/boot/dts/uniphier-sld8.dtsi    | 23 +
>  9 files changed, 129 insertions(+), 4 deletions(-)
>
>
> --
> Best Regards
> Masahiro Yamada
Re: [GIT PULL 3/3] ARM: Keystone config update for 4.16
On Wed, Dec 27, 2017 at 06:07:52PM -0800, Santosh Shilimkar wrote:
> Also had patch to sync up multi-v7 config but because of conflicts
> in next, have to drop it. Will send that post merge window separately
>
> The following changes since commit 4fbd8d194f06c8a3fd2af1ce560ddb31f7ec8323:
>
>   Linux 4.15-rc1 (2017-11-26 16:01:47 -0800)
>
> are available in the git repository at:
>
>   git://git.kernel.org/pub/scm/linux/kernel/git/ssantosh/linux-keystone.git
> tags/keystone_config_for_4.16
>
> for you to fetch changes up to 10f06c70f337494fc2fec623542186fec80fc395:
>
>   ARM: configs: keystone_defconfig: Enable few peripheral drivers (2017-12-02
> 19:34:36 -0800)
>
>
> ARM: Keystone configs for 4.16
>
>  - Enable QSPI
>  - Enable LEDs
>  - Enable GPIO-decoder
>
>
> Vignesh R (1):
>       ARM: configs: keystone_defconfig: Enable few peripheral drivers

Merged, thanks.

-Olof
Re: [GIT PULL] ARM: at91: DT for 4.16
On Sun, Dec 31, 2017 at 04:11:27PM +0100, Alexandre Belloni wrote:
> Arnd, Olof,
>
> This is the at91 DT pull request. The bulk of it is the switch to the
> new TCB bindings that were acked a long time ago. These changes are
> compatible with the current driver and taking them now will allow for a
> smooth transition.
>
> The following changes since commit 4fbd8d194f06c8a3fd2af1ce560ddb31f7ec8323:
>
>   Linux 4.15-rc1 (2017-11-26 16:01:47 -0800)
>
> are available in the Git repository at:
>
>   git://git.kernel.org/pub/scm/linux/kernel/git/abelloni/linux.git
> tags/at91-ab-4.16-dt
>
> for you to fetch changes up to 34a7fc3147bcc14127d941f228ce3b1737e66381:
>
>   ARM: dts: at91: sama5d2_ptc_ek: use TCB0 as timers (2017-12-31 15:50:20
> +0100)

Merged, thanks!

-Olof
Re: [GIT PULL 1/3] ARM: Keystone DTS for 4.16
On Wed, Dec 27, 2017 at 06:07:50PM -0800, Santosh Shilimkar wrote:
> The following changes since commit 4fbd8d194f06c8a3fd2af1ce560ddb31f7ec8323:
>
>   Linux 4.15-rc1 (2017-11-26 16:01:47 -0800)
>
> are available in the git repository at:
>
>   git://git.kernel.org/pub/scm/linux/kernel/git/ssantosh/linux-keystone.git
> tags/keystone_dts_for_4.16
>
> for you to fetch changes up to 4fe85b0cdd06f8fef2631923799bdc95380badb5:
>
>   ARM: dts: keystone-k2l-clocks: Add missing unit name to clock nodes that
> have regs (2017-12-16 14:36:57 -0800)
>
>
> ARM: Keystone DTS update for 4.16
>
>  - Enable GPIO bank2 for K2L
>  - Enable QSPI for K2G & K2G-EVM
>  - Enable UART1/2 for K2G & K2G-EVM
>  - Enable peripherals for K2G-ICE
>  - Fix C1 and C2 DTS warnings

Merged, thanks.

-Olof
[PATCH V3] nvme-pci: fix NULL pointer reference in nvme_alloc_ns
When the io queues setup or tagset allocation failed, ctrl.tagset is
NULL. But the scan work will still be queued and executed, then panic
comes up due to NULL pointer reference of ctrl.tagset.

To fix this, add a new ctrl state NVME_CTRL_ADMIN_ONLY to indicate
only admin queue is live. When non io queues or tagset allocation
failed, ctrl enters into this state, scan work will not be started.
But async event work and nvme dev ioctl will be still available. This
will be helpful to do further investigation and recovery.

V3:
 - s/NVME_CTRL_ADMIN_LIVE/NVME_CTRL_ADMIN_ONLY/
 - s/BUG_ON/WARN_ON_ONCE/
 - Other misc code changes
V2:
 - Based on Sagi's suggestion, add new state NVME_CTRL_ADMIN_LIVE.
 - Change patch name and comment.

Suggested-by: Sagi Grimberg
Signed-off-by: Jianchao Wang
---
 drivers/nvme/host/core.c | 25 ++---
 drivers/nvme/host/nvme.h |  1 +
 drivers/nvme/host/pci.c  | 30 +-
 3 files changed, 44 insertions(+), 12 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 1e46e60..a614cd7 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -232,6 +232,15 @@ bool nvme_change_ctrl_state(struct nvme_ctrl *ctrl,
 
 	old_state = ctrl->state;
 	switch (new_state) {
+	case NVME_CTRL_ADMIN_ONLY:
+		switch (old_state) {
+		case NVME_CTRL_RESETTING:
+			changed = true;
+			/* FALLTHRU */
+		default:
+			break;
+		}
+		break;
 	case NVME_CTRL_LIVE:
 		switch (old_state) {
 		case NVME_CTRL_NEW:
@@ -247,6 +256,7 @@ bool nvme_change_ctrl_state(struct nvme_ctrl *ctrl,
 		switch (old_state) {
 		case NVME_CTRL_NEW:
 		case NVME_CTRL_LIVE:
+		case NVME_CTRL_ADMIN_ONLY:
 			changed = true;
 			/* FALLTHRU */
 		default:
@@ -266,6 +276,7 @@ bool nvme_change_ctrl_state(struct nvme_ctrl *ctrl,
 	case NVME_CTRL_DELETING:
 		switch (old_state) {
 		case NVME_CTRL_LIVE:
+		case NVME_CTRL_ADMIN_ONLY:
 		case NVME_CTRL_RESETTING:
 		case NVME_CTRL_RECONNECTING:
 			changed = true;
@@ -2337,8 +2348,14 @@ static int nvme_dev_open(struct inode *inode, struct file *file)
 	struct nvme_ctrl *ctrl =
 		container_of(inode->i_cdev, struct nvme_ctrl, cdev);
 
-	if (ctrl->state != NVME_CTRL_LIVE)
+	switch(ctrl->state) {
+	case NVME_CTRL_LIVE:
+	case NVME_CTRL_ADMIN_ONLY:
+		break;
+	default:
 		return -EWOULDBLOCK;
+	}
+
 	file->private_data = ctrl;
 	return 0;
 }
@@ -2602,6 +2619,7 @@ static ssize_t nvme_sysfs_show_state(struct device *dev,
 	static const char *const state_name[] = {
 		[NVME_CTRL_NEW]		= "new",
 		[NVME_CTRL_LIVE]	= "live",
+		[NVME_CTRL_ADMIN_ONLY]	= "only-admin",
 		[NVME_CTRL_RESETTING]	= "resetting",
 		[NVME_CTRL_RECONNECTING]= "reconnecting",
 		[NVME_CTRL_DELETING]	= "deleting",
@@ -3074,6 +3092,8 @@ static void nvme_scan_work(struct work_struct *work)
 	if (ctrl->state != NVME_CTRL_LIVE)
 		return;
 
+	WARN_ON_ONCE(!ctrl->tagset);
+
 	if (nvme_identify_ctrl(ctrl, &id))
 		return;
 
@@ -3094,8 +3114,7 @@ static void nvme_scan_work(struct work_struct *work)
 void nvme_queue_scan(struct nvme_ctrl *ctrl)
 {
 	/*
-	 * Do not queue new scan work when a controller is reset during
-	 * removal.
+	 * Only new queue scan work when admin and IO queues are both alive
 	 */
 	if (ctrl->state == NVME_CTRL_LIVE)
 		queue_work(nvme_wq, &ctrl->scan_work);
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index ea1aa52..eecf71c 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -119,6 +119,7 @@ static inline struct nvme_request *nvme_req(struct request *req)
 enum nvme_ctrl_state {
 	NVME_CTRL_NEW,
 	NVME_CTRL_LIVE,
+	NVME_CTRL_ADMIN_ONLY,	/* Only admin queue live */
 	NVME_CTRL_RESETTING,
 	NVME_CTRL_RECONNECTING,
 	NVME_CTRL_DELETING,
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index f5800c3..e758c5a 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -2035,13 +2035,12 @@ static void nvme_disable_io_queues(struct nvme_dev *dev, int queues)
 }
 
 /*
- * Return: error value if an error occurred setting up the queues or calling
- * Identify Device.  0 if these succeeded, even if adding some of the
- * namespaces failed.  At the moment, these failures are silent.  TBD which
- * failures should be reported.
+ * return error value only when tagset allocation failed
  */
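For anyone following the state changes, the rules this patch touches
can be modelled in a few lines of userspace C. This is only a sketch
under stated assumptions: the enum mirrors the state names in the
patch, the four helper functions are hypothetical, and only the
transitions visible in the hunks above are covered (the other cases
of nvme_change_ctrl_state() are not shown in this mail and are left
out).

```c
#include <stdbool.h>

/* Userspace model; names mirror enum nvme_ctrl_state from the patch. */
enum ctrl_state {
	CTRL_NEW,
	CTRL_LIVE,
	CTRL_ADMIN_ONLY,	/* only the admin queue is live */
	CTRL_RESETTING,
	CTRL_RECONNECTING,
	CTRL_DELETING,
};

/* Hypothetical helper: ADMIN_ONLY may only be entered from RESETTING. */
static inline bool can_enter_admin_only(enum ctrl_state old_state)
{
	return old_state == CTRL_RESETTING;
}

/* Hypothetical helper: DELETING sources as listed in the third hunk. */
static inline bool can_enter_deleting(enum ctrl_state old_state)
{
	switch (old_state) {
	case CTRL_LIVE:
	case CTRL_ADMIN_ONLY:
	case CTRL_RESETTING:
	case CTRL_RECONNECTING:
		return true;
	default:
		return false;
	}
}

/* Models nvme_dev_open() after the patch: LIVE and ADMIN_ONLY may open. */
static inline bool may_open_chardev(enum ctrl_state s)
{
	return s == CTRL_LIVE || s == CTRL_ADMIN_ONLY;
}

/* Models nvme_queue_scan(): scan work is queued only when fully LIVE. */
static inline bool may_queue_scan(enum ctrl_state s)
{
	return s == CTRL_LIVE;
}
```

The point of the split is visible in the last two helpers: an
ADMIN_ONLY controller still accepts ioctls on the char device, but
never reaches the scan path that dereferences ctrl->tagset.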
Re: [PATCH net-next] net: tracepoint: adding new tracepoint arguments in inet_sock_set_state
> On Jan 4, 2018, at 10:42 PM, Yafang Shao wrote:
> 
> sk->sk_protocol and sk->sk_family are exposed as tracepoint arguments.
> Then we can conveniently use these two arguments to do the filter.
> 
> Suggested-by: Brendan Gregg
> Signed-off-by: Yafang Shao
> ---
>  include/trace/events/sock.h | 24 ++--
>  net/ipv4/af_inet.c          |  6 --
>  2 files changed, 22 insertions(+), 8 deletions(-)
> 
> diff --git a/include/trace/events/sock.h b/include/trace/events/sock.h
> index 3537c5f..c7df70f 100644
> --- a/include/trace/events/sock.h
> +++ b/include/trace/events/sock.h
> @@ -11,7 +11,11 @@
>  #include
>  #include
>  
> -/* The protocol traced by sock_set_state */
> +#define family_names			\
> +		EM(AF_INET)		\
> +		EMe(AF_INET6)
> +
> +/* The protocol traced by inet_sock_set_state */
>  #define inet_protocol_names		\
>  		EM(IPPROTO_TCP)		\
>  		EM(IPPROTO_DCCP)	\
> @@ -37,6 +41,7 @@
>  #define EM(a)       TRACE_DEFINE_ENUM(a);
>  #define EMe(a)      TRACE_DEFINE_ENUM(a);
>  
> +family_names
>  inet_protocol_names
>  tcp_state_names
>  
> @@ -45,6 +50,9 @@
>  #define EM(a)       { a, #a },
>  #define EMe(a)      { a, #a }
>  
> +#define show_family_name(val)		\
> +	__print_symbolic(val, family_names)
> +
>  #define show_inet_protocol_name(val)    \
>  	__print_symbolic(val, inet_protocol_names)
>  
> @@ -108,9 +116,10 @@
>  
>  TRACE_EVENT(inet_sock_set_state,
>  
> -	TP_PROTO(const struct sock *sk, const int oldstate, const int newstate),
> +	TP_PROTO(const struct sock *sk, const int family, const int protocol,
> +		const int oldstate, const int newstate),

Are there cases we need protocol and/or family that is different to
sk->sk_protocol/sk_family? If not, I think we don't need to change the
TP_PROTO.

Thanks,
Song

>  
> -	TP_ARGS(sk, oldstate, newstate),
> +	TP_ARGS(sk, family, protocol, oldstate, newstate),
>  
>  	TP_STRUCT__entry(
>  		__field(const void *, skaddr)
> @@ -118,6 +127,7 @@
>  		__field(int, newstate)
>  		__field(__u16, sport)
>  		__field(__u16, dport)
> +		__field(__u16, family)
>  		__field(__u8, protocol)
>  		__array(__u8, saddr, 4)
>  		__array(__u8, daddr, 4)
> @@ -133,8 +143,9 @@
>  		__entry->skaddr = sk;
>  		__entry->oldstate = oldstate;
>  		__entry->newstate = newstate;
> +		__entry->family = family;
> +		__entry->protocol = protocol;
>  
> -		__entry->protocol = sk->sk_protocol;
>  		__entry->sport = ntohs(inet->inet_sport);
>  		__entry->dport = ntohs(inet->inet_dport);
>  
> @@ -145,7 +156,7 @@
>  		*p32 = inet->inet_daddr;
>  
>  #if IS_ENABLED(CONFIG_IPV6)
> -		if (sk->sk_family == AF_INET6) {
> +		if (family == AF_INET6) {
>  			pin6 = (struct in6_addr *)__entry->saddr_v6;
>  			*pin6 = sk->sk_v6_rcv_saddr;
>  			pin6 = (struct in6_addr *)__entry->daddr_v6;
> @@ -160,7 +171,8 @@
>  		}
>  	),
>  
> -	TP_printk("protocol=%s sport=%hu dport=%hu saddr=%pI4 daddr=%pI4 
> saddrv6=%pI6c daddrv6=%pI6c oldstate=%s newstate=%s",
> +	TP_printk("family=%s protocol=%s sport=%hu dport=%hu saddr=%pI4 
> daddr=%pI4 saddrv6=%pI6c daddrv6=%pI6c oldstate=%s newstate=%s",
> +		show_family_name(__entry->family),
>  		show_inet_protocol_name(__entry->protocol),
>  		__entry->sport, __entry->dport,
>  		__entry->saddr, __entry->daddr,
> diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
> index bab98a4..1d52796 100644
> --- a/net/ipv4/af_inet.c
> +++ b/net/ipv4/af_inet.c
> @@ -1223,14 +1223,16 @@ int inet_sk_rebuild_header(struct sock *sk)
>  
>  void inet_sk_set_state(struct sock *sk, int state)
>  {
> -	trace_inet_sock_set_state(sk, sk->sk_state, state);
> +	trace_inet_sock_set_state(sk, sk->sk_family, sk->sk_protocol,
> +				  sk->sk_state, state);
>  	sk->sk_state = state;
>  }
>  EXPORT_SYMBOL(inet_sk_set_state);
>  
>  void inet_sk_state_store(struct sock *sk, int newstate)
>  {
> -	trace_inet_sock_set_state(sk, sk->sk_state, newstate);
> +	trace_inet_sock_set_state(sk, sk->sk_family, sk->sk_protocol,
> +				  sk->sk_state, newstate);
>  	smp_store_release(&sk->sk_state, newstate);
>  }
>  
> -- 
> 1.8.3.1
> 
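As a side note on the EM()/EMe() lists in the quoted patch, the same
X-macro trick can be tried outside the kernel. The following is a
minimal userspace sketch, not the tracing implementation: struct sym,
family_syms and the lookup loop are made up for illustration, the
function here only stands in for what __print_symbolic() does with
the table, and returning "UNKNOWN" for an unlisted value is a choice
of this sketch.

```c
#include <stddef.h>
#include <sys/socket.h>	/* AF_INET, AF_INET6 */

/* Same list shape as family_names in the quoted patch. */
#define family_names	\
	EM(AF_INET)	\
	EMe(AF_INET6)

/* Hypothetical table entry: a value and its spelled-out name. */
struct sym {
	int val;
	const char *name;
};

/* Expand every list entry into a { value, "name" } initializer. */
#define EM(a)	{ a, #a },
#define EMe(a)	{ a, #a }
static const struct sym family_syms[] = { family_names };
#undef EM
#undef EMe

/* Stand-in for the symbolic print: linear lookup of val in the table. */
static inline const char *show_family_name(int val)
{
	size_t i;

	for (i = 0; i < sizeof(family_syms) / sizeof(family_syms[0]); i++)
		if (family_syms[i].val == val)
			return family_syms[i].name;
	return "UNKNOWN";
}
```

Because `#a` stringifies the argument before it is macro-expanded,
the table holds the name "AF_INET" next to its numeric value, which
is what lets a single list serve both the enum export and the
printing side.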
[PATCH -next] um: vector: fix missing unlock on error in vector_net_open()
Add the missing unlock before return from function
vector_net_open() in the error handling case.

Fixes: ad1f62ab2bd4 ("High Performance UML Vector Network Driver")
Signed-off-by: Wei Yongjun
---
 arch/um/drivers/vector_kern.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/arch/um/drivers/vector_kern.c b/arch/um/drivers/vector_kern.c
index d1d5301..bb83a2d 100644
--- a/arch/um/drivers/vector_kern.c
+++ b/arch/um/drivers/vector_kern.c
@@ -1156,8 +1156,10 @@ static int vector_net_open(struct net_device *dev)
 	struct vector_device *vdevice;
 
 	spin_lock_irqsave(&vp->lock, flags);
-	if (vp->opened)
+	if (vp->opened) {
+		spin_unlock_irqrestore(&vp->lock, flags);
 		return -ENXIO;
+	}
 	vp->opened = true;
 	spin_unlock_irqrestore(&vp->lock, flags);
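The pattern being fixed is easy to reproduce in miniature. Below is a
rough userspace sketch, assuming a toy lock that only counts its
depth in place of the spinlock; struct toy_lock, struct toy_dev and
toy_open() are invented for illustration and are not part of the
driver.

```c
#include <errno.h>
#include <stdbool.h>

/* Toy lock that only tracks nesting depth; stands in for a spinlock. */
struct toy_lock {
	int depth;
};

static inline void toy_lock_acquire(struct toy_lock *l) { l->depth++; }
static inline void toy_lock_release(struct toy_lock *l) { l->depth--; }

struct toy_dev {
	struct toy_lock lock;
	bool opened;
};

/*
 * Returns 0 on the first open and -ENXIO when already opened.
 * Every return path releases the lock it took.
 */
static inline int toy_open(struct toy_dev *d)
{
	toy_lock_acquire(&d->lock);
	if (d->opened) {
		toy_lock_release(&d->lock);	/* the unlock the fix adds */
		return -ENXIO;
	}
	d->opened = true;
	toy_lock_release(&d->lock);
	return 0;
}
```

Without the release in the error branch, a second toy_open() would
return with depth still at 1, which is the userspace analogue of
returning from vector_net_open() with the lock held and interrupts
disabled.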
Re: Avoid speculative indirect calls in kernel
On Thu, Jan 04, 2018 at 10:57:19PM -0800, Dave Hansen wrote:
> On 01/04/2018 10:49 PM, Willy Tarreau wrote:
> > On Fri, Jan 05, 2018 at 01:54:13AM +0100, Thomas Gleixner wrote:
> >> On Thu, 4 Jan 2018, Jon Masters wrote:
> >>> P.S. I've an internal document where I've been tracking "nice to haves"
> >>> for later, and one of them is whether it makes sense to tag binaries as
> >>> "trusted" (e.g. extended attribute, label, whatever). It was something I
> >>> wanted to bring up at some point as potentially worth considering.
> >> Scratch that. There is no such thing as a trusted binary.
> > I disagree with you on this Thomas. "trusted" means "we agree to share the
> > risk this binary takes because it's critical to our service". When you
> > build a load balancing appliance on which 100% of the service is assured
> > by a single executable and the rest is just config management, you'd better
> > trust that process.
> 
> So you want to run this "one binary" as fast as possible and without
> mitigations in place?  But, you want mitigations *available* on that
> system at the same time?  For what?  If there's only one binary, why not
> just disable the mitigations entirely?

I'm not fond of running the mitigations, but given that a few sysops can
connect to the machine to collect stats or counters, I think it would be
better to ensure these people can't happily play with the exploits to
dump stuff they shouldn't have access to. It's even easier to understand
on a database or key-value server for example, where you may expect the
highest performance the CPU can bring for a specific process and the rest
can be mitigated and will never ever notice any performance impact at all.

That's why I was saying in another thread that it would be nice over the
long term if we could 1) make the mitigation dynamic, and 2) make it
possible for an admin to disable it for certain processes/programs.

Don't get me wrong, I'm perfectly aware that it's far from being simple
and for now we need to get a reliable mitigation. I'm just saying that
the performance impact is a huge loss for certain use cases and that once
things settle down we should start to work on ways to recover what was
lost.

Regards,
Willy
Re: [PATCH] of: Use SPDX license tag for DT files
On Fri, Jan 5, 2018 at 12:05 AM, Rob Herring wrote:
> Convert remaining DT files to use SPDX-License-Identifier tags.
>
> Cc: Benjamin Herrenschmidt
> Cc: Guennadi Liakhovetski
> Cc: Paul Mackerras
> Cc: Pantelis Antoniou
> Signed-off-by: Rob Herring
> ---
>  drivers/of/Kconfig                              |  1 +
>  drivers/of/address.c                            |  2 +-
>  drivers/of/base.c                               |  6 +-
>  drivers/of/device.c                             |  1 +
>  drivers/of/dynamic.c                            |  1 +
>  drivers/of/fdt.c                                |  5 +
>  drivers/of/fdt_address.c                        |  6 +-
>  drivers/of/irq.c                                |  6 +-
>  drivers/of/kobj.c                               |  2 +-
>  drivers/of/of_numa.c                            | 13 +
>  drivers/of/of_private.h                         |  6 +-
>  drivers/of/of_reserved_mem.c                    |  6 +-
>  drivers/of/overlay.c                            |  5 +
>  drivers/of/pdt.c                                |  6 +-
>  drivers/of/platform.c                           |  7 +--
>  drivers/of/property.c                           |  6 +-
>  drivers/of/resolver.c                           |  5 +
>  drivers/of/unittest-data/overlay_bad_symbol.dts |  1 +
>  include/linux/of.h                              |  6 +-
>  include/linux/of_dma.h                          |  5 +
>  include/linux/of_fdt.h                          |  5 +
>  include/linux/of_gpio.h                         |  6 +-
>  include/linux/of_graph.h                        |  5 +
>  include/linux/of_pdt.h                          |  6 +-
>  include/linux/of_platform.h                     |  7 +--
>  25 files changed, 25 insertions(+), 100 deletions(-)
>
> diff --git a/drivers/of/Kconfig b/drivers/of/Kconfig
> index c2b6c11d29d1..572942c3cb15 100644
> --- a/drivers/of/Kconfig
> +++ b/drivers/of/Kconfig
> @@ -1,3 +1,4 @@
> +# SPDX-License-Identifier: GPL-2.0
>  config DTC
>         bool
>
> diff --git a/drivers/of/address.c b/drivers/of/address.c
> index 8591afbdfe99..b48b68c4a7a9 100644
> --- a/drivers/of/address.c
> +++ b/drivers/of/address.c
> @@ -1,4 +1,4 @@
> -
> +// SPDX-License-Identifier: GPL-2.0
>  #define pr_fmt(fmt)    "OF: " fmt
>
>  #include
> diff --git a/drivers/of/base.c b/drivers/of/base.c
> index 26618ba8f92a..dd0b4201f1cc 100644
> --- a/drivers/of/base.c
> +++ b/drivers/of/base.c
> @@ -1,3 +1,4 @@
> +// SPDX-License-Identifier: GPL-2.0+
>  /*
>   * Procedures for creating, accessing and interpreting the device tree.
>   *
> @@ -11,11 +12,6 @@
>   *
>   *  Reconsolidated from arch/x/kernel/prom.c by Stephen Rothwell and
>   *  Grant Likely.
> - *
> - *      This program is free software; you can redistribute it and/or
> - *      modify it under the terms of the GNU General Public License
> - *      as published by the Free Software Foundation; either version
> - *      2 of the License, or (at your option) any later version.
>   */
>
>  #define pr_fmt(fmt)    "OF: " fmt
> diff --git a/drivers/of/device.c b/drivers/of/device.c
> index 25bddf9c9fe1..064c818105bd 100644
> --- a/drivers/of/device.c
> +++ b/drivers/of/device.c
> @@ -1,3 +1,4 @@
> +// SPDX-License-Identifier: GPL-2.0
>  #include
>  #include
>  #include
> diff --git a/drivers/of/dynamic.c b/drivers/of/dynamic.c
> index ab988d88704d..7bb33d22b4e2 100644
> --- a/drivers/of/dynamic.c
> +++ b/drivers/of/dynamic.c
> @@ -1,3 +1,4 @@
> +// SPDX-License-Identifier: GPL-2.0
>  /*
>   * Support for dynamic device trees.
>   *
> diff --git a/drivers/of/fdt.c b/drivers/of/fdt.c
> index 4675e5ac4d11..7db5353a24c0 100644
> --- a/drivers/of/fdt.c
> +++ b/drivers/of/fdt.c
> @@ -1,12 +1,9 @@
> +// SPDX-License-Identifier: GPL-2.0
>  /*
>   * Functions for working with the Flattened Device Tree data format
>   *
>   * Copyright 2009 Benjamin Herrenschmidt, IBM Corp
>   * b...@kernel.crashing.org
> - *
> - * This program is free software; you can redistribute it and/or
> - * modify it under the terms of the GNU General Public License
> - * version 2 as published by the Free Software Foundation.
>   */
>
>  #define pr_fmt(fmt)    "OF: fdt: " fmt
> diff --git a/drivers/of/fdt_address.c b/drivers/of/fdt_address.c
> index 843a542dac7d..1dc15ab78b10 100644
> --- a/drivers/of/fdt_address.c
> +++ b/drivers/of/fdt_address.c
> @@ -1,3 +1,4 @@
> +// SPDX-License-Identifier: GPL-2.0+
>  /*
>   * FDT Address translation based on u-boot fdt_support.c which in turn was
>   * based on the kernel unflattened DT address translation code.
> @@ -6,11 +7,6 @@
>   * Gerald Van Baren, Custom IDEAS, vanba...@cideas.com
>   *
>   * Copyright 2010-2011 Freescale Semiconductor, Inc.
> - *
> - * This program is free software; you can redistribute it and/or modify
> - * it under the terms of the GNU General Public License as published by
> - * the Free Software Foundation; either
Re: [f2fs-dev] [PATCH 1/2] f2fs: show precise # of blocks that user/root can use
NACK

man statfs shows:

struct statfs {
...
fsblkcnt_t f_bfree; /* free blocks in fs */
fsblkcnt_t f_bavail; /* free blocks available to unprivileged user */
...
}

f_bfree is free blocks in fs, so buf->f_bfree should be
buf->f_bfree = user_block_count - valid_user_blocks(sbi) + ovp_count;

On 2018/1/4 2:58, Jaegeuk Kim wrote:
> Let's show precise # of blocks that user/root can use through bavail and bfree
> respectively.
>
> Signed-off-by: Jaegeuk Kim
> ---
>  fs/f2fs/super.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/fs/f2fs/super.c b/fs/f2fs/super.c
> index 0a820ba55b10..4c1c99cf54ef 100644
> --- a/fs/f2fs/super.c
> +++ b/fs/f2fs/super.c
> @@ -1005,9 +1005,9 @@ static int f2fs_statfs(struct dentry *dentry, struct kstatfs *buf)
>  	buf->f_bsize = sbi->blocksize;
>
>  	buf->f_blocks = total_count - start_count;
> -	buf->f_bfree = user_block_count - valid_user_blocks(sbi) + ovp_count;
> -	buf->f_bavail = user_block_count - valid_user_blocks(sbi) -
> +	buf->f_bfree = user_block_count - valid_user_blocks(sbi) -
>  						sbi->current_reserved_blocks;
> +	buf->f_bavail = buf->f_bfree;
>
>  	avail_node_count = sbi->total_node_count - sbi->nquota_files -
>  						F2FS_RESERVED_NODE_NUM;

--
Thanks,
Yunlong Song
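To make the disagreement concrete, the two sets of formulas can be
compared with plain arithmetic. This is only a sketch: struct counts,
struct result and both functions are hypothetical, and the numbers
suggested below are made up; only the expressions themselves are
taken from the removed and added lines of the diff.

```c
#include <stdint.h>

/* Hypothetical snapshot of the counters f2fs_statfs() works from. */
struct counts {
	uint64_t user_block_count;
	uint64_t valid_user_blocks;
	uint64_t ovp_count;			/* overprovision blocks */
	uint64_t current_reserved_blocks;
};

struct result {
	uint64_t bfree;
	uint64_t bavail;
};

/* Formulas from the removed lines; the NACK asks to keep this f_bfree. */
static inline struct result statfs_before(const struct counts *c)
{
	struct result r;

	r.bfree = c->user_block_count - c->valid_user_blocks + c->ovp_count;
	r.bavail = c->user_block_count - c->valid_user_blocks -
		   c->current_reserved_blocks;
	return r;
}

/* Formulas from the added lines: f_bavail becomes a copy of f_bfree. */
static inline struct result statfs_patched(const struct counts *c)
{
	struct result r;

	r.bfree = c->user_block_count - c->valid_user_blocks -
		  c->current_reserved_blocks;
	r.bavail = r.bfree;
	return r;
}
```

With, say, 1000 user blocks, 400 valid, 50 overprovisioned and 20
reserved, the old code reports f_bfree = 650 and f_bavail = 580,
while the patched code reports 580 for both, so the overprovision
and reserved blocks no longer show up as the gap between the two
fields.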
Re: [PATCH] [v3] x86/doc: add PTI description
On 01/04/18 21:38, Dave Hansen wrote:

> +Page Table Management
> +=====================
> +
> +When PTI is enabled, the kernel manages two sets of page tables.
> +The first set is very similar to the single set which is present in
> +kernels without PTI.  This includes a complete mapping of userspace
> +that the kernel can use for things like copy_to_user().
> +
> +Although _complete_, the user portion of the kernel page tables is
> +crippled by setting the NX bit in the top level.  This ensures
> +that any missed kernel->user CR3 switch will immediately crash
> +userspace upon executing its first instruction.
> +
> +The userspace page tables map only the kernel data needed to enter
> +and exit the kernel.  This data is entirely contained in the 'struct
> +cpu_entry_area' structure which is placed in the fixmap which gives
> +each CPU's copy of the area has a compile-time-fixed virtual
> +address.

drop /has/ above.

> +
> +For new userspace mappings, the kernel makes the entries in its
> +page tables like normal.  The only difference is when the kernel
> +makes entries in the top (PGD) level.  In addition to setting the
> +entry in the main kernel PGD, a copy of the entry is made in the
> +userspace page tables' PGD.

-- 
~Randy
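The last quoted paragraph, together with the NX remark earlier in the
text, can be pictured with a toy model. This is a loose userspace
sketch and not the kernel's code: struct toy_mm and toy_set_pgd() are
invented, the tables are plain arrays, and the model assumes x86-64
4-level paging (512 top-level entries, the lower 256 covering
userspace, NX in bit 63). It also ignores the few kernel mappings
(such as cpu_entry_area) that the real userspace tables do carry.

```c
#include <stdint.h>

#define TOY_PTRS_PER_PGD	512
#define TOY_USER_PGD_ENTRIES	256		/* lower half: userspace */
#define TOY_PGD_NX		(1ULL << 63)	/* no-execute bit */

/* Hypothetical pair of top-level tables kept per address space. */
struct toy_mm {
	uint64_t kernel_pgd[TOY_PTRS_PER_PGD];
	uint64_t user_pgd[TOY_PTRS_PER_PGD];
};

/*
 * Install a top-level entry. For the userspace half, the entry is
 * copied into the userspace table, and the kernel table's copy gets
 * NX so a missed CR3 switch cannot run user code on kernel tables.
 */
static inline void toy_set_pgd(struct toy_mm *mm, unsigned int idx,
			       uint64_t entry)
{
	if (idx < TOY_USER_PGD_ENTRIES) {
		mm->user_pgd[idx] = entry;
		mm->kernel_pgd[idx] = entry | TOY_PGD_NX;
	} else {
		/* Kernel half: only the kernel table sees it. */
		mm->kernel_pgd[idx] = entry;
	}
}
```

Only the top level is duplicated in this picture; the lower-level
tables an entry points at are shared by both copies, which is why the
text says the kernel otherwise makes its entries "like normal".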
Re: Avoid speculative indirect calls in kernel
On 01/04/2018 10:49 PM, Willy Tarreau wrote:
> On Fri, Jan 05, 2018 at 01:54:13AM +0100, Thomas Gleixner wrote:
>> On Thu, 4 Jan 2018, Jon Masters wrote:
>>> P.S. I've an internal document where I've been tracking "nice to haves"
>>> for later, and one of them is whether it makes sense to tag binaries as
>>> "trusted" (e.g. extended attribute, label, whatever). It was something I
>>> wanted to bring up at some point as potentially worth considering.
>> Scratch that. There is no such thing as a trusted binary.
> I disagree with you on this Thomas. "trusted" means "we agree to share the
> risk this binary takes because it's critical to our service". When you
> build a load balancing appliance on which 100% of the service is assured
> by a single executable and the rest is just config management, you'd better
> trust that process.

So you want to run this "one binary" as fast as possible and without
mitigations in place?  But, you want mitigations *available* on that
system at the same time?  For what?  If there's only one binary, why not
just disable the mitigations entirely?
Re: Avoid speculative indirect calls in kernel
On Fri, Jan 05, 2018 at 01:54:13AM +0100, Thomas Gleixner wrote:
> On Thu, 4 Jan 2018, Jon Masters wrote:
> > P.S. I've an internal document where I've been tracking "nice to haves"
> > for later, and one of them is whether it makes sense to tag binaries as
> > "trusted" (e.g. extended attribute, label, whatever). It was something I
> > wanted to bring up at some point as potentially worth considering.
> 
> Scratch that. There is no such thing as a trusted binary.

I disagree with you on this Thomas. "trusted" means "we agree to share the
risk this binary takes because it's critical to our service". When you
build a load balancing appliance on which 100% of the service is assured
by a single executable and the rest is just config management, you'd better
trust that process. If the binary or process cannot be trusted, the product
is dead anyway. It doesn't mean the binary is safe. It just means that for
the product there's nothing worse than its compromise or failure. And when
it suffers from the performance impact of workarounds supposed to protect
the whole device against this process' possible abuses, you easily see how
the situation becomes ridiculous.

We still need to think about performance a lot. There's already an ongoing
trend of kernel bypass mechanisms in the wild for performance reasons, and
the new increase of syscall costs will necessarily amplify this willingness
to avoid the kernel. I personally don't want to see the kernel being
reduced to booting and executing SSH to manage the machines.

Willy
Re: [PATCH 04/12] pci-p2p: Clear ACS P2P flags for all client devices
On Thu, Jan 04, 2018 at 08:33:00PM -0700, Alex Williamson wrote:
> On Thu, 4 Jan 2018 17:00:47 -0700
> Logan Gunthorpe wrote:
>
> > On 04/01/18 03:35 PM, Alex Williamson wrote:
> > > Yep, flipping these ACS bits invalidates any IOMMU groups that depend
> > > on the isolation of that downstream port and I suspect also any peers
> > > within the same PCI slot of that port and their downstream devices. The
> > > entire sub-hierarchy grouping needs to be re-evaluated. This
> > > potentially affects running devices that depend on that isolation, so
> > > I'm not sure how that happens dynamically. A boot option might be
> > > easier. Thanks,
> >
> > I don't see how this is the case in current kernel code. It appears to
> > only enable ACS globally if the IOMMU requests it.
>
> IOMMU groups don't exist unless the IOMMU is enabled and x86 and ARM
> both request ACS be enabled if an IOMMU is present, so I'm not sure
> what you're getting at here. Also, in reply to your other email, if
> the IOMMU is enabled, every device handled by the IOMMU is a member of
> an IOMMU group, see struct device.iommu_group. There's an
> iommu_group_get() accessor to get a reference to it.
>
> > I also don't see how turning off ACS isolation for a specific device is
> > going to hurt anything. The IOMMU should still be able to keep going on
> > unaware that anything has changed. The only worry is that a security
> > hole may now be created if a user was relying on the isolation between
> > two devices that are in different VMs or something. However, if a user
> > was relying on this, they probably shouldn't have turned on P2P in the
> > first place.
>
> That's exactly what IOMMU groups represent, the smallest set of devices
> which have DMA isolation from other devices. By poking this hole, the
> IOMMU group is invalid. We cannot turn off ACS only for a specific
> device, in order to enable p2p it needs to be disabled at every
> downstream port between the devices where we want to enable p2p.
> Depending on the topology, that could mean we're also enabling p2p for
> unrelated devices. Those unrelated devices might be in active use and
> the p2p IOVAs now have a different destination which is no longer IOMMU
> translated.
>
> > We started with a fairly unintelligent choice to simply disable ACS on
> > any kernel that had CONFIG_PCI_P2P set. However, this did not seem like
> > a good idea going forward. Instead, we now selectively disable the ACS
> > bit only on the downstream ports that are involved in P2P transactions.
> > This seems like the safest choice and still allows people to (carefully)
> > use P2P adjacent to other devices that need to be isolated.
>
> I don't see that the code is doing much checking that adjacent devices
> are also affected by the p2p change and of course the IOMMU group is
> entirely invalid once the p2p holes start getting poked.
>
> > I don't think anyone wants another boot option that must be set in order
> > to use this functionality (and only some hardware would require this).
> > That's just a huge pain for users.
>
> No, but nor do we need IOMMU groups that no longer represent what
> they're intended to describe or runtime, unchecked routing changes
> through the topology for devices that might already be using
> conflicting IOVA ranges. Maybe soft hotplugs are another possibility,
> designate a sub-hierarchy to be removed and re-scanned with ACS
> disabled. Otherwise it seems like disabling and re-enabling ACS needs
> to also handle merging and splitting groups dynamically. Thanks,
>

Dumb question, can we use a PCI bar address of one device into the
IOMMU page table of another address ie like we would DMA map a
regular system page ?

It would be much better in my view to follow down such path if that
is at all possible from hardware point of view (i am not sure where
to dig in the specification to answer my above question).

Cheers,
Jérôme
Re: [PATCH 1/2] Move kfree_call_rcu() to slab_common.c
On Thu, 2018-01-04 at 16:07 -0800, Matthew Wilcox wrote:
> On Thu, Jan 04, 2018 at 03:47:32PM -0800, Paul E. McKenney wrote:
> > I was under the impression that typeof did not actually evaluate its
> > argument, but rather only returned its type. And there are a few macros
> > with this pattern in mainline.
> >
> > Or am I confused about what typeof does?
>
> I think checkpatch is confused by the '*' in the typeof argument:
>
> $ git diff |./scripts/checkpatch.pl --strict
> CHECK: Macro argument reuse 'ptr' - possible side-effects?
> #29: FILE: include/linux/rcupdate.h:896:
> +#define kfree_rcu(ptr, rcu_head)\
> +	__kfree_rcu(&((ptr)->rcu_head), offsetof(typeof(*(ptr)), rcu_head))
>
> If one removes the '*', the warning goes away.
>
> I'm no perlista, but Joe, would this regexp modification make sense?
>
> +++ b/scripts/checkpatch.pl
> @@ -4957,7 +4957,7 @@ sub process {
> 	next if ($arg =~ /\.\.\./);
> 	next if ($arg =~ /^type$/i);
> 	my $tmp_stmt = $define_stmt;
> -	$tmp_stmt =~ s/\b(typeof|__typeof__|__builtin\w+|typecheck\s*\(\s*$Type\s*,|\#+)\s*\(*\s*$arg\s*\)*\b//g;
> +	$tmp_stmt =~ s/\b(typeof|__typeof__|__builtin\w+|typecheck\s*\(\s*$Type\s*,|\#+)\s*\(*\**\(*\s*$arg\s*\)*\b//g;

I supposed ideally it'd be more like

	$tmp_stmt =~ s/\b(?:typeof|__typeof__|__builtin\w+|typecheck\s*\(\s*$Type\s*,|\#+)\s*\(*(?:\s*\*\s*)*\s*\(*\s*$arg\s*\)*\b//g;

Adding ?: at the start to not capture and
(?:\s*\*\s*)* for any number of * with any
surrounding spacings.
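Paul's reading of typeof can be checked outside the kernel. The following is a minimal user-space sketch, assuming GCC or Clang (typeof is a GNU extension); `struct rcu_head_stub`, `struct item`, `next_item()` and `FIELD_OFFSET()` are hypothetical names made up for this illustration and are not the kernel's definitions. It only shows that an expression with a side effect is not evaluated when it appears inside typeof, which is why the checkpatch warning quoted above is a false positive for this macro shape.

```c
#include <stddef.h>

/* Stand-in for the kernel's struct rcu_head; illustrative only. */
struct rcu_head_stub {
	void *next;
	void (*func)(void *);
};

/* Hypothetical structure embedding the head, as kfree_rcu() users do. */
struct item {
	long payload;
	struct rcu_head_stub rcu;
};

static int calls;		/* counts how often next_item() really runs */
static struct item pool[1];

/* An argument expression with a visible side effect. */
static struct item *next_item(void)
{
	calls++;
	return &pool[0];
}

/*
 * Same shape as the second argument of the kfree_rcu() macro quoted above:
 * offsetof(typeof(*(ptr)), field). typeof only needs the type of its
 * operand, so the operand is not evaluated (variably modified types are
 * the exception and do not apply to this struct).
 */
#define FIELD_OFFSET(ptr, field) offsetof(__typeof__(*(ptr)), field)

/* next_item() is named inside typeof here but is never called. */
size_t rcu_offset_without_side_effects(void)
{
	return FIELD_OFFSET(next_item(), rcu);
}

/* For contrast: an ordinary use of the same expression does call it. */
long payload_with_side_effect(void)
{
	return next_item()->payload;
}
```

Calling `rcu_offset_without_side_effects()` leaves `calls` untouched, while each call to `payload_with_side_effect()` increments it by one.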
Re: mmotm 2018-01-04-16-19 uploaded
On 01/05/2018 05:50 AM, a...@linux-foundation.org wrote:
> The mm-of-the-moment snapshot 2018-01-04-16-19 has been uploaded to
>
>    http://www.ozlabs.org/~akpm/mmotm/
>
> mmotm-readme.txt says
>
> README for mm-of-the-moment:
>
> http://www.ozlabs.org/~akpm/mmotm/
>
> This is a snapshot of my -mm patch queue. Uploaded at random hopefully
> more than once a week.
>
> You will need quilt to apply these patches to the latest Linus release (4.x
> or 4.x-rcY). The series file is in broken-out.tar.gz and is duplicated in
> http://ozlabs.org/~akpm/mmotm/series
>
> The file broken-out.tar.gz contains two datestamp files: .DATE and
> .DATE-yyyy-mm-dd-hh-mm-ss. Both contain the string yyyy-mm-dd-hh-mm-ss,
> followed by the base kernel version against which this patch series is to
> be applied.
>
> This tree is partially included in linux-next. To see which patches are
> included in linux-next, consult the `series' file. Only the patches
> within the #NEXT_PATCHES_START/#NEXT_PATCHES_END markers are included in
> linux-next.
>
> A git tree which contains the memory management portion of this tree is
> maintained at git://git.kernel.org/pub/scm/linux/kernel/git/mhocko/mm.git

Seems like this latest snapshot mmotm-2018-01-04-16-19 has not been
updated in this git tree. I could not fetch it, nor does it show up in
the http link below.

https://git.kernel.org/pub/scm/linux/kernel/git/mhocko/mm.git

The last one mmotm-2017-12-22-17-55 seems to have some regression on
powerpc with respect to ELF loading of binaries (see below). Seems to be
related to recent MAP_FIXED_SAFE (or MAP_FIXED_NOREPLACE as seen now in
the code). IIUC (have not been following the series last month)
MAP_FIXED_NOREPLACE will fail an allocation request if the hint address
cannot be reserved, instead of changing existing mappings. Is it possible
that ELF loading needs to be fixed at a higher level to deal with these
new possible mmap() failures because of MAP_FIXED_NOREPLACE?

[ 22.448068] 9060 (hostname): Uhuuh, elf segment at 1002 requested but the memory is mapped already
[ 22.450135] 9063 (sed): Uhuuh, elf segment at 1003 requested but the memory is mapped already
[ 22.456484] 9066 (hostname): Uhuuh, elf segment at 1002 requested but the memory is mapped already
[ 22.458171] 9069 (sed): Uhuuh, elf segment at 1003 requested but the memory is mapped already
[ 22.505341] 9078 (sed): Uhuuh, elf segment at 1003 requested but the memory is mapped already
[ 22.506961] 9081 (sed): Uhuuh, elf segment at 1003 requested but the memory is mapped already
[ 22.508736] 9084 (sed): Uhuuh, elf segment at 1003 requested but the memory is mapped already
[ 22.510589] 9087 (sed): Uhuuh, elf segment at 1003 requested but the memory is mapped already
[ 22.512442] 9090 (sed): Uhuuh, elf segment at 1003 requested but the memory is mapped already
[ 22.514685] 9093 (sed): Uhuuh, elf segment at 1003 requested but the memory is mapped already
[ 22.565793] 9103 (sed): Uhuuh, elf segment at 1003 requested but the memory is mapped already
[ 22.567874] 9106 (sed): Uhuuh, elf segment at 1003 requested but the memory is mapped already
[ 123.469490] 9173 (fprintd): Uhuuh, elf segment at 1002 requested but the memory is mapped already
[ 137.468372] 9182 (hostname): Uhuuh, elf segment at 1002 requested but the memory is mapped already
[ 137.644647] 9205 (pkg-config): Uhuuh, elf segment at 1002 requested but the memory is mapped already
[ 137.811893] 9219 (sed): Uhuuh, elf segment at 1003 requested but the memory is mapped already
[ 164.739135] 9232 (less): Uhuuh, elf segment at 1004 requested but the memory is mapped already
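The "fail instead of replacing" behaviour described in the message can be probed from user space. The sketch below is an illustration only: it assumes a Linux system, and it relies on the flag as it later landed in mainline (MAP_FIXED_NOREPLACE, documented in mmap(2) as available since Linux 4.17), not on the mmotm snapshot discussed here. The fallback value 0x100000 is the asm-generic one and is an assumption for libc headers that predate the flag; `try_map_over_existing()` and the result enum are hypothetical names.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

#ifndef MAP_FIXED_NOREPLACE
/* Assumption: asm-generic value, for headers that predate the flag. */
#define MAP_FIXED_NOREPLACE 0x100000
#endif

enum noreplace_result {
	NOREPLACE_REFUSED,	/* mmap() failed with EEXIST: range already mapped */
	NOREPLACE_MOVED,	/* older kernel ignored the flag, addr was only a hint */
	NOREPLACE_CLOBBERED,	/* existing mapping was replaced (not expected) */
	NOREPLACE_ERROR		/* unrelated failure */
};

/*
 * Map one page, write a marker byte, then request the same address again
 * with MAP_FIXED_NOREPLACE and report what the kernel did.
 */
enum noreplace_result try_map_over_existing(void)
{
	const size_t len = (size_t)sysconf(_SC_PAGESIZE);
	enum noreplace_result res;
	char *first, *second;

	first = mmap(NULL, len, PROT_READ | PROT_WRITE,
		     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (first == MAP_FAILED)
		return NOREPLACE_ERROR;
	first[0] = 'x';

	second = mmap(first, len, PROT_READ | PROT_WRITE,
		      MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED_NOREPLACE, -1, 0);
	if (second == MAP_FAILED) {
		res = (errno == EEXIST) ? NOREPLACE_REFUSED : NOREPLACE_ERROR;
	} else if (second != first) {
		res = NOREPLACE_MOVED;
		munmap(second, len);
	} else {
		res = NOREPLACE_CLOBBERED;
	}

	/* The marker survives unless the first mapping was replaced. */
	if (res != NOREPLACE_CLOBBERED && first[0] != 'x')
		res = NOREPLACE_CLOBBERED;
	munmap(first, len);
	return res;
}
```

A caller that used to rely on plain MAP_FIXED silently replacing whatever was in the way has to handle the refused case explicitly, which is the kind of adjustment the question about ELF loading is pointing at.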
[PATCH net-next] net: tracepoint: adding new tracepoint arguments in inet_sock_set_state
sk->sk_protocol and sk->sk_family are exposed as tracepoint arguments.
Then we can conveniently use these two arguments to do the filter.

Suggested-by: Brendan Gregg
Signed-off-by: Yafang Shao
---
 include/trace/events/sock.h | 24 ++--
 net/ipv4/af_inet.c | 6 --
 2 files changed, 22 insertions(+), 8 deletions(-)

diff --git a/include/trace/events/sock.h b/include/trace/events/sock.h
index 3537c5f..c7df70f 100644
--- a/include/trace/events/sock.h
+++ b/include/trace/events/sock.h
@@ -11,7 +11,11 @@
 #include
 #include
-/* The protocol traced by sock_set_state */
+#define family_names \
+	EM(AF_INET) \
+	EMe(AF_INET6)
+
+/* The protocol traced by inet_sock_set_state */
 #define inet_protocol_names\
 	EM(IPPROTO_TCP) \
 	EM(IPPROTO_DCCP)\
@@ -37,6 +41,7 @@
 #define EM(a) TRACE_DEFINE_ENUM(a);
 #define EMe(a) TRACE_DEFINE_ENUM(a);
+family_names
 inet_protocol_names
 tcp_state_names
@@ -45,6 +50,9 @@
 #define EM(a) { a, #a },
 #define EMe(a) { a, #a }
+#define show_family_name(val) \
+	__print_symbolic(val, family_names)
+
 #define show_inet_protocol_name(val)\
 	__print_symbolic(val, inet_protocol_names)
@@ -108,9 +116,10 @@
 TRACE_EVENT(inet_sock_set_state,
-	TP_PROTO(const struct sock *sk, const int oldstate, const int newstate),
+	TP_PROTO(const struct sock *sk, const int family, const int protocol,
+		const int oldstate, const int newstate),
-	TP_ARGS(sk, oldstate, newstate),
+	TP_ARGS(sk, family, protocol, oldstate, newstate),
 	TP_STRUCT__entry(
 		__field(const void *, skaddr)
@@ -118,6 +127,7 @@
 		__field(int, newstate)
 		__field(__u16, sport)
 		__field(__u16, dport)
+		__field(__u16, family)
 		__field(__u8, protocol)
 		__array(__u8, saddr, 4)
 		__array(__u8, daddr, 4)
@@ -133,8 +143,9 @@
 		__entry->skaddr = sk;
 		__entry->oldstate = oldstate;
 		__entry->newstate = newstate;
+		__entry->family = family;
+		__entry->protocol = protocol;
-		__entry->protocol = sk->sk_protocol;
 		__entry->sport = ntohs(inet->inet_sport);
 		__entry->dport = ntohs(inet->inet_dport);
@@ -145,7 +156,7 @@
 		*p32 = inet->inet_daddr;
 #if IS_ENABLED(CONFIG_IPV6)
-		if (sk->sk_family == AF_INET6) {
+		if (family == AF_INET6) {
 			pin6 = (struct in6_addr *)__entry->saddr_v6;
 			*pin6 = sk->sk_v6_rcv_saddr;
 			pin6 = (struct in6_addr *)__entry->daddr_v6;
@@ -160,7 +171,8 @@
 		}
 	),
-	TP_printk("protocol=%s sport=%hu dport=%hu saddr=%pI4 daddr=%pI4 saddrv6=%pI6c daddrv6=%pI6c oldstate=%s newstate=%s",
+	TP_printk("family=%s protocol=%s sport=%hu dport=%hu saddr=%pI4 daddr=%pI4 saddrv6=%pI6c daddrv6=%pI6c oldstate=%s newstate=%s",
+		show_family_name(__entry->family),
 		show_inet_protocol_name(__entry->protocol),
 		__entry->sport, __entry->dport,
 		__entry->saddr, __entry->daddr,
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index bab98a4..1d52796 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -1223,14 +1223,16 @@ int inet_sk_rebuild_header(struct sock *sk)
 void inet_sk_set_state(struct sock *sk, int state)
 {
-	trace_inet_sock_set_state(sk, sk->sk_state, state);
+	trace_inet_sock_set_state(sk, sk->sk_family, sk->sk_protocol,
+				  sk->sk_state, state);
 	sk->sk_state = state;
 }
 EXPORT_SYMBOL(inet_sk_set_state);
 void inet_sk_state_store(struct sock *sk, int newstate)
 {
-	trace_inet_sock_set_state(sk, sk->sk_state, newstate);
+	trace_inet_sock_set_state(sk, sk->sk_family, sk->sk_protocol,
+				  sk->sk_state, newstate);
 	smp_store_release(&sk->sk_state, newstate);
 }
--
1.8.3.1
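The EM()/EMe() lists in the patch follow the usual "write the list once, expand it several times" macro pattern. The sketch below reproduces only that pattern in user space, as an illustration under stated assumptions: `struct sym`, `family_syms` and the function form of `show_family_name()` are hypothetical stand-ins, not the kernel's `__print_symbolic()` implementation, and the "unknown" fallback string is an invented choice for this example.

```c
#include <stddef.h>
#include <sys/socket.h>

/* The list is written once, in the same form as in the patch. */
#define family_names	\
	EM(AF_INET)	\
	EMe(AF_INET6)

/* Hypothetical value/name pair standing in for a symbol table entry. */
struct sym {
	int value;
	const char *name;
};

/*
 * Expand each entry to { value, "name" }. #a stringifies the argument as
 * written, so the name stays "AF_INET" while the value is the constant.
 * EMe marks the last entry and therefore omits the trailing comma.
 */
#define EM(a)	{ a, #a },
#define EMe(a)	{ a, #a }
static const struct sym family_syms[] = { family_names };
#undef EM
#undef EMe

/* User-space stand-in for the lookup: map a value to its symbolic name. */
const char *show_family_name(int val)
{
	size_t i;

	for (i = 0; i < sizeof(family_syms) / sizeof(family_syms[0]); i++)
		if (family_syms[i].value == val)
			return family_syms[i].name;
	return "unknown";
}
```

Adding another family to the output then means touching only the `family_names` list; every expansion of EM()/EMe() picks it up.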
Re: [Ocfs2-devel] [PATCH v2] ocfs2: try a blocking lock before return AOP_TRUNCATED_PAGE
Hi Andrew,

Happy new year.
Could you help to pick up this patch, which fixes an old patch,
1cce4df04f37? Without this patch, some multi-node test cases trigger
softlockup problems and also make the HA communication daemon (e.g.
corosync) time out, so that the node has to be fenced.

Thanks
Gang

>>>
>
> On 17/12/28 15:48, Gang He wrote:
>> If we can't get inode lock immediately in the function
>> ocfs2_inode_lock_with_page() when reading a page, we should not
>> return directly here, since this will lead to a softlockup problem
>> when the kernel is configured with CONFIG_PREEMPT is not set.
>> The method is to get a blocking lock and immediately unlock before
>> returning, this can avoid CPU resource waste due to lots of retries,
>> and benefits fairness in getting lock among multiple nodes, increase
>> efficiency in case modifying the same file frequently from multiple
>> nodes.
>> The softlockup crash (when set /proc/sys/kernel/softlockup_panic to 1)
>> looks like,
>> Kernel panic - not syncing: softlockup: hung tasks
>> CPU: 0 PID: 885 Comm: multi_mmap Tainted: G L 4.12.14-6.1-default #1
>> Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
>> Call Trace:
>>
>> dump_stack+0x5c/0x82
>> panic+0xd5/0x21e
>> watchdog_timer_fn+0x208/0x210
>> ? watchdog_park_threads+0x70/0x70
>> __hrtimer_run_queues+0xcc/0x200
>> hrtimer_interrupt+0xa6/0x1f0
>> smp_apic_timer_interrupt+0x34/0x50
>> apic_timer_interrupt+0x96/0xa0
>>
>> RIP: 0010:unlock_page+0x17/0x30
>> RSP: :af154080bc88 EFLAGS: 0246 ORIG_RAX: ff10
>> RAX: dead0100 RBX: f21e009f5300 RCX: 0004
>> RDX: dead00ff RSI: 0202 RDI: f21e009f5300
>> RBP: R08: R09: af154080bb00
>> R10: af154080bc30 R11: 0040 R12: 993749a39518
>> R13: R14: f21e009f5300 R15: f21e009f5300
>> ocfs2_inode_lock_with_page+0x25/0x30 [ocfs2]
>> ocfs2_readpage+0x41/0x2d0 [ocfs2]
>> ? pagecache_get_page+0x30/0x200
>> filemap_fault+0x12b/0x5c0
>> ? recalc_sigpending+0x17/0x50
>> ? __set_task_blocked+0x28/0x70
>> ? __set_current_blocked+0x3d/0x60
>> ocfs2_fault+0x29/0xb0 [ocfs2]
>> __do_fault+0x1a/0xa0
>> __handle_mm_fault+0xbe8/0x1090
>> handle_mm_fault+0xaa/0x1f0
>> __do_page_fault+0x235/0x4b0
>> trace_do_page_fault+0x3c/0x110
>> async_page_fault+0x28/0x30
>> RIP: 0033:0x7fa75ded638e
>> RSP: 002b:7ffd6657db18 EFLAGS: 00010287
>> RAX: 55c7662fb700 RBX: 0001 RCX: 55c7662fb700
>> RDX: 1770 RSI: 7fa75e909000 RDI: 55c7662fb700
>> RBP: 0003 R08: 000e R09:
>> R10: 0483 R11: 7fa75ded61b0 R12: 7fa75e90a770
>> R13: 000e R14: 1770 R15:
>>
>> About performance improvement, we can see the testing time is reduced,
>> and CPU utilization decreases, the detailed data is as follows.
>> I ran multi_mmap test case in ocfs2-test package in a three nodes cluster.
>> Before apply this patch,
>>  PID USER     PR  NI   VIRT    RES   SHR S  %CPU  %MEM   TIME+ COMMAND
>> 2754 ocfs2te+ 20   0 170248   6980  4856 D 80.73 0.341 0:18.71 multi_mmap
>> 1505 root     rt   0     36 123060 97224 S 2.658 6.015 0:01.44 corosync
>>    5 root     20   0      0      0     0 S 1.329 0.000 0:00.19 kworker/u8:0
>>   95 root     20   0      0      0     0 S 1.329 0.000 0:00.25 kworker/u8:1
>> 2728 root     20   0      0      0     0 S 0.997 0.000 0:00.24 jbd2/sda1-33
>> 2721 root     20   0      0      0     0 S 0.664 0.000 0:00.07 ocfs2dc-3C8CFD4
>> 2750 ocfs2te+ 20   0 142976   4652  3532 S 0.664 0.227 0:00.28 mpirun
>>
>> ocfs2test@tb-node2:~>multiple_run.sh -i ens3 -k ~/linux-4.4.21-69.tar.gz -o
>> ~/ocfs2mullog -C hacluster -s pcmk -n tb-node2,tb-node1,tb-node3 -d
>> /dev/sda1 -b 4096 -c 32768 -t multi_mmap /mnt/shared
>> Tests with "-b 4096 -C 32768"
>> Thu Dec 28 14:44:52 CST 2017
>> multi_mmap..Passed.
>> Runtime 783 seconds.
>>
>> After apply this patch,
>>  PID USER     PR  NI   VIRT    RES   SHR S  %CPU  %MEM   TIME+ COMMAND
>> 2508 ocfs2te+ 20   0 170248   6804  4680 R 54.00 0.333 0:55.37 multi_mmap
>>  155 root     20   0      0      0     0 S 2.667 0.000 0:01.20 kworker/u8:3
>>   95 root     20   0      0      0     0 S 2.000 0.000 0:01.58 kworker/u8:1
>> 2504 ocfs2te+ 20   0 142976   4604  3480 R 1.667 0.225 0:01.65 mpirun
>>    5 root     20   0      0      0     0 S 1.000 0.000 0:01.36 kworker/u8:0
>> 2482 root     20   0      0      0     0 S 1.000 0.000 0:00.86 jbd2/sda1-33
>>  299 root      0 -20      0      0     0 S 0.333 0.000 0:00.13 kworker/2:1H
>>  335 root      0 -20      0      0     0 S 0.333 0.000 0:00.17 kworker/1:1H
>>  535 root
Re: [Ocfs2-devel] [PATCH v2] ocfs2: try a blocking lock before return AOP_TRUNCATED_PAGE
Hi Andrew,

Happy new year.
Could you help pick up this patch? It fixes an older patch, 1cce4df04f37.
Without this patch, some multi-node test cases trigger softlockup problems
and also make the HA communication daemon (e.g. corosync) time out, so the
node has to be fenced.

Thanks
Gang

>>>
>
> On 17/12/28 15:48, Gang He wrote:
>> If we can't get inode lock immediately in the function
>> ocfs2_inode_lock_with_page() when reading a page, we should not
>> return directly here, since this will lead to a softlockup problem
>> when the kernel is configured with CONFIG_PREEMPT is not set.
>> The method is to get a blocking lock and immediately unlock before
>> returning, this can avoid CPU resource waste due to lots of retries,
>> and benefits fairness in getting lock among multiple nodes, increase
>> efficiency in case modifying the same file frequently from multiple
>> nodes.
>> The softlockup crash (when set /proc/sys/kernel/softlockup_panic to 1)
>> looks like,
>> Kernel panic - not syncing: softlockup: hung tasks
>> CPU: 0 PID: 885 Comm: multi_mmap Tainted: G L 4.12.14-6.1-default #1
>> Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
>> Call Trace:
>>
>> dump_stack+0x5c/0x82
>> panic+0xd5/0x21e
>> watchdog_timer_fn+0x208/0x210
>> ? watchdog_park_threads+0x70/0x70
>> __hrtimer_run_queues+0xcc/0x200
>> hrtimer_interrupt+0xa6/0x1f0
>> smp_apic_timer_interrupt+0x34/0x50
>> apic_timer_interrupt+0x96/0xa0
>>
>> RIP: 0010:unlock_page+0x17/0x30
>> RSP: :af154080bc88 EFLAGS: 0246 ORIG_RAX: ff10
>> RAX: dead0100 RBX: f21e009f5300 RCX: 0004
>> RDX: dead00ff RSI: 0202 RDI: f21e009f5300
>> RBP: R08: R09: af154080bb00
>> R10: af154080bc30 R11: 0040 R12: 993749a39518
>> R13: R14: f21e009f5300 R15: f21e009f5300
>> ocfs2_inode_lock_with_page+0x25/0x30 [ocfs2]
>> ocfs2_readpage+0x41/0x2d0 [ocfs2]
>> ? pagecache_get_page+0x30/0x200
>> filemap_fault+0x12b/0x5c0
>> ? recalc_sigpending+0x17/0x50
>> ? __set_task_blocked+0x28/0x70
>> ? __set_current_blocked+0x3d/0x60
>> ocfs2_fault+0x29/0xb0 [ocfs2]
>> __do_fault+0x1a/0xa0
>> __handle_mm_fault+0xbe8/0x1090
>> handle_mm_fault+0xaa/0x1f0
>> __do_page_fault+0x235/0x4b0
>> trace_do_page_fault+0x3c/0x110
>> async_page_fault+0x28/0x30
>> RIP: 0033:0x7fa75ded638e
>> RSP: 002b:7ffd6657db18 EFLAGS: 00010287
>> RAX: 55c7662fb700 RBX: 0001 RCX: 55c7662fb700
>> RDX: 1770 RSI: 7fa75e909000 RDI: 55c7662fb700
>> RBP: 0003 R08: 000e R09:
>> R10: 0483 R11: 7fa75ded61b0 R12: 7fa75e90a770
>> R13: 000e R14: 1770 R15:
>>
>> About performance improvement, we can see the testing time is reduced,
>> and CPU utilization decreases, the detailed data is as follows.
>> I ran multi_mmap test case in ocfs2-test package in a three nodes cluster.
>> Before apply this patch,
>>   PID USER      PR  NI   VIRT    RES   SHR S  %CPU  %MEM   TIME+ COMMAND
>>  2754 ocfs2te+  20   0 170248   6980  4856 D 80.73 0.341 0:18.71 multi_mmap
>>  1505 root      rt   0     36 123060 97224 S 2.658 6.015 0:01.44 corosync
>>     5 root      20   0      0      0     0 S 1.329 0.000 0:00.19 kworker/u8:0
>>    95 root      20   0      0      0     0 S 1.329 0.000 0:00.25 kworker/u8:1
>>  2728 root      20   0      0      0     0 S 0.997 0.000 0:00.24 jbd2/sda1-33
>>  2721 root      20   0      0      0     0 S 0.664 0.000 0:00.07 ocfs2dc-3C8CFD4
>>  2750 ocfs2te+  20   0 142976   4652  3532 S 0.664 0.227 0:00.28 mpirun
>>
>> ocfs2test@tb-node2:~>multiple_run.sh -i ens3 -k ~/linux-4.4.21-69.tar.gz -o
>> ~/ocfs2mullog -C hacluster -s pcmk -n tb-node2,tb-node1,tb-node3 -d
>> /dev/sda1 -b 4096 -c 32768 -t multi_mmap /mnt/shared
>> Tests with "-b 4096 -C 32768"
>> Thu Dec 28 14:44:52 CST 2017
>> multi_mmap..Passed.
>> Runtime 783 seconds.
>>
>> After apply this patch,
>>   PID USER      PR  NI   VIRT    RES   SHR S  %CPU  %MEM   TIME+ COMMAND
>>  2508 ocfs2te+  20   0 170248   6804  4680 R 54.00 0.333 0:55.37 multi_mmap
>>   155 root      20   0      0      0     0 S 2.667 0.000 0:01.20 kworker/u8:3
>>    95 root      20   0      0      0     0 S 2.000 0.000 0:01.58 kworker/u8:1
>>  2504 ocfs2te+  20   0 142976   4604  3480 R 1.667 0.225 0:01.65 mpirun
>>     5 root      20   0      0      0     0 S 1.000 0.000 0:01.36 kworker/u8:0
>>  2482 root      20   0      0      0     0 S 1.000 0.000 0:00.86 jbd2/sda1-33
>>   299 root       0 -20      0      0     0 S 0.333 0.000 0:00.13 kworker/2:1H
>>   335 root       0 -20      0      0     0 S 0.333 0.000 0:00.17 kworker/1:1H
>>   535 root
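
The "try a non-blocking lock, otherwise take a blocking lock and drop it right away before asking the caller to retry" idea from the patch description can be sketched in userspace roughly as below. This is only an analogy under stated assumptions: a pthread mutex stands in for the ocfs2 cluster inode lock, and the names lock_or_wait_then_retry, MODEL_LOCKED and MODEL_RETRY are made up for illustration (MODEL_RETRY plays the role of AOP_TRUNCATED_PAGE). It is not the ocfs2 code.

```c
#include <pthread.h>

/* Hypothetical return codes for this sketch only. */
enum { MODEL_LOCKED = 0, MODEL_RETRY = 1 };

/*
 * Userspace analogy of the pattern described in the patch:
 *  - fast path: the lock is free, take it and let the caller continue;
 *  - slow path: the lock is busy, so sleep on a blocking acquire instead
 *    of returning at once, release it immediately, and only then tell
 *    the caller to retry. The caller's retry loop therefore waits for
 *    the holder instead of spinning on the CPU.
 */
static int lock_or_wait_then_retry(pthread_mutex_t *inode_lock)
{
	if (pthread_mutex_trylock(inode_lock) == 0)
		return MODEL_LOCKED;	/* caller now holds the lock */

	/* Contended: block until the current holder is done ... */
	if (pthread_mutex_lock(inode_lock) == 0)
		pthread_mutex_unlock(inode_lock);	/* ... and drop it again */

	return MODEL_RETRY;	/* caller restarts its operation from the top */
}
```

In the uncontended case the function returns MODEL_LOCKED with the mutex held, so the caller is responsible for unlocking it. In the contended case the caller never holds the lock on return; the blocking acquire is there only to wait in line.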
Re: [PATCH 05/23] x86, kaiser: unmap kernel from userspace page tables (core patch)
On 01/04/2018 10:16 PM, Yisheng Xie wrote:
> BTW, we have just reported a bug caused by kaiser[1], which looks like
> caused by SMEP. Could you please help to have a look?
>
> [1] https://lkml.org/lkml/2018/1/5/3

Please report that to your kernel vendor.

Your EFI page tables have the NX bit set on the low addresses. There
have been a bunch of iterations of this, but you need to make sure that
the EFI kernel mappings don't get _PAGE_NX set on them. Look at what
__pti_set_user_pgd() does in mainline.
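
For readers who do not have the mainline source at hand, the relevant decision can be modelled in a few lines. The sketch below is a simplified userspace model, not the kernel function: it assumes the NX "poisoning" of the kernel's copy of a top-level entry is applied only when the entry is both present and user-accessible, and it leaves out everything else the real code does (mirroring the entry into the user page table, checking that NX is supported). The MODEL_* names and model_pti_poison_pgd() are made up here; the bit positions are the x86-64 page-table bits.

```c
#include <stdint.h>

/* x86-64 page-table entry bits, renamed to mark them as model constants. */
#define MODEL_PAGE_PRESENT	(1ULL << 0)
#define MODEL_PAGE_USER		(1ULL << 2)
#define MODEL_PAGE_NX		(1ULL << 63)

/*
 * Simplified model: return the value that would be stored in the
 * kernel's copy of a top-level page-table entry. Only entries that are
 * present AND user-accessible get NX set; kernel-only mappings (for
 * example EFI mappings at low addresses, which lack the USER bit) are
 * returned unchanged and stay executable.
 */
static uint64_t model_pti_poison_pgd(uint64_t pgd)
{
	const uint64_t user_present = MODEL_PAGE_USER | MODEL_PAGE_PRESENT;

	if ((pgd & user_present) == user_present)
		pgd |= MODEL_PAGE_NX;

	return pgd;
}
```

With this model, an entry carrying only MODEL_PAGE_PRESENT comes back without NX, which is the property the EFI mappings need.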
[PATCH] f2fs: implement cgroup writeback support
Cgroup writeback requires explicit support from the filesystem.
f2fs's data and node writeback IOs go through __write_data_page,
which sets fio for submitting IOs. So, we add io_wbc for fio,
associate bios with blkcg by invoking wbc_init_bio() and
account issued IOs with wbc_account_io().

In addition, f2fs_fill_super() is updated to set SB_I_CGROUPWB.

Meta writeback IOs are left alone by this patch and will always be
attributed to the root cgroup.

The results show that f2fs can throttle writeback nicely for
data writing and file creating.

Signed-off-by: Yufen Yu
---
 fs/f2fs/data.c  | 11 +++++++++--
 fs/f2fs/f2fs.h  |  1 +
 fs/f2fs/node.c  |  1 +
 fs/f2fs/super.c |  1 +
 4 files changed, 12 insertions(+), 2 deletions(-)

diff --git a/fs/f2fs/data.c b/fs/f2fs/data.c
index 516fa0d..402df03 100644
--- a/fs/f2fs/data.c
+++ b/fs/f2fs/data.c
@@ -169,6 +169,7 @@ static bool __same_bdev(struct f2fs_sb_info *sbi,
  * Low-level block read/write IO operations.
  */
 static struct bio *__bio_alloc(struct f2fs_sb_info *sbi, block_t blk_addr,
+				struct writeback_control *wbc,
 				int npages, bool is_read)
 {
 	struct bio *bio;
@@ -178,6 +179,8 @@ static struct bio *__bio_alloc(struct f2fs_sb_info *sbi, block_t blk_addr,
 	f2fs_target_device(sbi, blk_addr, bio);
 	bio->bi_end_io = is_read ? f2fs_read_end_io : f2fs_write_end_io;
 	bio->bi_private = is_read ? NULL : sbi;
+	if (wbc)
+		wbc_init_bio(wbc, bio);
 
 	return bio;
 }
@@ -373,7 +376,8 @@ int f2fs_submit_page_bio(struct f2fs_io_info *fio)
 	f2fs_trace_ios(fio, 0);
 
 	/* Allocate a new bio */
-	bio = __bio_alloc(fio->sbi, fio->new_blkaddr, 1, is_read_io(fio->op));
+	bio = __bio_alloc(fio->sbi, fio->new_blkaddr, fio->io_wbc,
+				1, is_read_io(fio->op));
 
 	if (bio_add_page(bio, page, PAGE_SIZE, 0) < PAGE_SIZE) {
 		bio_put(bio);
@@ -435,7 +439,7 @@ int f2fs_submit_page_write(struct f2fs_io_info *fio)
 			dec_page_count(sbi, WB_DATA_TYPE(bio_page));
 			goto out_fail;
 		}
-		io->bio = __bio_alloc(sbi, fio->new_blkaddr,
+		io->bio = __bio_alloc(sbi, fio->new_blkaddr, fio->io_wbc,
 						BIO_MAX_PAGES, false);
 		io->fio = *fio;
 	}
@@ -443,6 +447,8 @@ int f2fs_submit_page_write(struct f2fs_io_info *fio)
 	if (bio_add_page(io->bio, bio_page, PAGE_SIZE, 0) < PAGE_SIZE) {
 		__submit_merged_bio(io);
 		goto alloc_new;
+	} else if (fio->io_wbc) {
+		wbc_account_io(fio->io_wbc, bio_page, PAGE_SIZE);
 	}
 
 	io->last_block_in_bio = fio->new_blkaddr;
@@ -1508,6 +1514,7 @@ static int __write_data_page(struct page *page, bool *submitted,
 		.submitted = false,
 		.need_lock = LOCK_RETRY,
 		.io_type = io_type,
+		.io_wbc = wbc,
 	};
 
 	trace_f2fs_writepage(page, DATA);
diff --git a/fs/f2fs/f2fs.h b/fs/f2fs/f2fs.h
index 6abf26c..4887dde 100644
--- a/fs/f2fs/f2fs.h
+++ b/fs/f2fs/f2fs.h
@@ -957,6 +957,7 @@ struct f2fs_io_info {
 	int need_lock;		/* indicate we need to lock cp_rwsem */
 	bool in_list;		/* indicate fio is in io_list */
 	enum iostat_type io_type;	/* io type */
+	struct writeback_control *io_wbc;	/* writeback control */
 };
 
 #define is_read_io(rw) ((rw) == READ)
diff --git a/fs/f2fs/node.c b/fs/f2fs/node.c
index d332275..e4f8bb0 100644
--- a/fs/f2fs/node.c
+++ b/fs/f2fs/node.c
@@ -1336,6 +1336,7 @@ static int __write_node_page(struct page *page, bool atomic, bool *submitted,
 		.encrypted_page = NULL,
 		.submitted = false,
 		.io_type = io_type,
+		.io_wbc = wbc,
 	};
 
 	trace_f2fs_writepage(page, NODE);
diff --git a/fs/f2fs/super.c b/fs/f2fs/super.c
index 708155d..deeba98 100644
--- a/fs/f2fs/super.c
+++ b/fs/f2fs/super.c
@@ -2475,6 +2475,7 @@ static int f2fs_fill_super(struct super_block *sb, void *data, int silent)
 	sb->s_flags = (sb->s_flags & ~SB_POSIXACL) |
 		(test_opt(sbi, POSIX_ACL) ? SB_POSIXACL : 0);
 	memcpy(&sb->s_uuid, raw_super->uuid, sizeof(raw_super->uuid));
+	sb->s_iflags |= SB_I_CGROUPWB;
 
 	/* init f2fs-specific super block info */
 	sbi->valid_super_block = valid_super_block;
-- 
2.9.5
Re: [linux-sunxi] Re: [PATCH 06/11] dt-bindings: display: sun4i-drm: Add A83T HDMI pipeline
Hi,

On Friday, 05 January 2018 at 03:49:09 CET, Icenowy Zheng wrote:
> On 5 January 2018 at 2:52:10 AM GMT+08:00, Maxime Ripard wrote:
> >On Wed, Jan 03, 2018 at 10:32:26PM +0100, Jernej Škrabec wrote:
> >> Hi Rob,
> >>
> >> On Wednesday, 03 January 2018 at 21:21:54 CET, Rob Herring wrote:
> >> > On Sat, Dec 30, 2017 at 10:01:58PM +0100, Jernej Skrabec wrote:
> >> > > This commit adds all necessary compatibles and descriptions needed to
> >> > > implement A83T HDMI pipeline.
> >> > >
> >> > > Mixer is already properly described, so only compatible is added.
> >> > >
> >> > > However, A83T TCON1, which is connected to HDMI, doesn't have channel 0,
> >> > > contrary to all TCONs currently described. Because of that, TCON
> >> > > documentation is extended.
> >> > >
> >> > > A83T features Synopsys DW HDMI controller with a custom PHY which looks
> >> > > like Synopsys Gen2 PHY with few additions. Since there is no
> >> > > documentation, needed properties were found out through experimentation
> >> > > and reading BSP code.
> >> > >
> >> > > At the end, example is added for newer SoCs, which features DE2 and DW
> >> > > HDMI.
> >> > >
> >> > > Signed-off-by: Jernej Skrabec
> >> > > ---
> >> > >
> >> > >  .../bindings/display/sunxi/sun4i-drm.txt | 188
> >> > >  1 file changed, 181 insertions(+), 7 deletions(-)
> >> > >
> >> > > diff --git a/Documentation/devicetree/bindings/display/sunxi/sun4i-drm.txt
> >> > > b/Documentation/devicetree/bindings/display/sunxi/sun4i-drm.txt
> >> > > index 9f073af4c711..3eca258096a5 100644
> >> > > --- a/Documentation/devicetree/bindings/display/sunxi/sun4i-drm.txt
> >> > > +++ b/Documentation/devicetree/bindings/display/sunxi/sun4i-drm.txt
> >> > > @@ -64,6 +64,40 @@ Required properties:
> >> > >      first port should be the input endpoint. The second should be the
> >> > >      output, usually to an HDMI connector.
> >> > >
> >> > > +DWC HDMI TX Encoder
> >> > > +-------------------
> >> > > +
> >> > > +The HDMI transmitter is a Synopsys DesignWare HDMI 1.4 TX controller IP
> >> > > +with Allwinner's own PHY IP. It supports audio and video outputs and CEC.
> >> > > +
> >> > > +These DT bindings follow the Synopsys DWC HDMI TX bindings defined in
> >> > > +Documentation/devicetree/bindings/display/bridge/dw_hdmi.txt with the
> >> > > +following device-specific properties.
> >> > > +
> >> > > +Required properties:
> >> > > +
> >> > > +  - compatible: value must be one of:
> >> > > +    * "allwinner,sun8i-a83t-dw-hdmi"
> >> > > +  - reg: two pairs of base address and size of memory-mapped region,
> >> > > +    first for controller and second for PHY registers.
> >> >
> >> > Seems like the phy should be a separate node and use the phy binding.
> >> > You can use the phy binding even if you don't use the kernel phy
> >> > framework...
> >>
> >> Unfortunately, it's not so straightforward. Phy is actually accessed through
> >> I2C implemented in HDMI controller. Second memory region in this case has
> >> small influence on phy. However, it has big influence on controller. For
> >> example, magic number has to be written in one register in second memory
> >> region in order to unlock read access to any register from first memory region
> >> (controller). However, they shouldn't be merged to one region, because first
> >> memory region requires byte access while second memory region can be accessed
> >> per byte or word.
> >>
> >> To complicate things more, later I want to add support for another SoC which
> >> has same glue layer (unlocking read access, etc.) and uses memory mapped phy
> >> registers in second memory region.
> >>
> >> I think current binding is the least complicated way to represent this.
> >
> >I agree with Rob here. I did a similar thing for the DSI patches I've
> >sent a few months ago and it turned out to not be that difficult, so
> >I'm sure you can come up with something :)
>
> In A83T/H3/A64/H5/R40 this part is not purely a PHY.
> It controls the access of main controller's register (e.g. read/write
> lock and register obfuscation). So it should be called a "glue"
> with PHY part (and on A83T seems a pure glue) but not a simple
> PHY.

It's not so simple. Actually it has PHY settings also on A83T. For example,
value at 0x01EF0001 depends on polarity. Value at 0x01EF0002 sets PHY I2C
address. Bit 7 at 0x01EF0007 enables/disables external resistor. That is info
I discovered/received after I sent patches, so it's not clearly marked.

Proper memory map (starts at 0x01EE):
0x0 - 0x1 -> DW HDMI controller
0x1 - 0x10010 -> (almost?) Common PHY settings
0x10010 - 0x10020 -> Allwinner
[PATCH] f2fs: add resgid and resuid to reserve root blocks
This patch adds mount options to reserve some blocks via resgid=%u,resuid=%u.
It only activates with reserve_root=%u.

Signed-off-by: Jaegeuk Kim
---
 fs/f2fs/f2fs.h  | 26 ++++++++++++++++++++++++--
 fs/f2fs/super.c | 46 ++++++++++++++++++++++++++++++++++++++++++++--
 2 files changed, 68 insertions(+), 4 deletions(-)

diff --git a/fs/f2fs/f2fs.h b/fs/f2fs/f2fs.h
index 4d255aac49bb..e5554b851fd8 100644
--- a/fs/f2fs/f2fs.h
+++ b/fs/f2fs/f2fs.h
@@ -131,6 +131,12 @@ struct f2fs_mount_info {
 #define F2FS_CLEAR_FEATURE(sb, mask)					\
 	(F2FS_SB(sb)->raw_super->feature &= ~cpu_to_le32(mask))
 
+/*
+ * Default values for user and/or group using reserved blocks
+ */
+#define	F2FS_DEF_RESUID		0
+#define	F2FS_DEF_RESGID		0
+
 /*
  * For checkpoint manager
  */
@@ -1111,6 +1117,8 @@ struct f2fs_sb_info {
 	block_t reserved_blocks;		/* configurable reserved blocks */
 	block_t current_reserved_blocks;	/* current reserved blocks */
 	block_t root_reserved_blocks;		/* root reserved blocks */
+	kuid_t s_resuid;			/* reserved blocks for uid */
+	kgid_t s_resgid;			/* reserved blocks for gid */
 
 	unsigned int nquota_files;		/* # of quota sysfile */
 
@@ -1563,6 +1571,20 @@ static inline bool f2fs_has_xattr_block(unsigned int ofs)
 	return ofs == XATTR_NODE_OFFSET;
 }
 
+static inline bool __allow_reserved_blocks(struct f2fs_sb_info *sbi)
+{
+	if (!test_opt(sbi, RESERVE_ROOT))
+		return false;
+	if (capable(CAP_SYS_RESOURCE))
+		return true;
+	if (uid_eq(sbi->s_resuid, current_fsuid()))
+		return true;
+	if (!gid_eq(sbi->s_resgid, GLOBAL_ROOT_GID) &&
+					in_group_p(sbi->s_resgid))
+		return true;
+	return false;
+}
+
 static inline void f2fs_i_blocks_write(struct inode *, block_t, bool, bool);
 static inline int inc_valid_block_count(struct f2fs_sb_info *sbi,
 				 struct inode *inode, blkcnt_t *count)
@@ -1593,7 +1615,7 @@ static inline int inc_valid_block_count(struct f2fs_sb_info *sbi,
 	avail_user_block_count = sbi->user_block_count -
 					sbi->current_reserved_blocks;
 
-	if (!(test_opt(sbi, RESERVE_ROOT) && capable(CAP_SYS_RESOURCE)))
+	if (!__allow_reserved_blocks(sbi))
 		avail_user_block_count -= sbi->root_reserved_blocks;
 
 	if (unlikely(sbi->total_valid_block_count > avail_user_block_count)) {
@@ -1794,7 +1816,7 @@ static inline int inc_valid_node_count(struct f2fs_sb_info *sbi,
 	valid_block_count = sbi->total_valid_block_count +
 					sbi->current_reserved_blocks + 1;
 
-	if (!(test_opt(sbi, RESERVE_ROOT) && capable(CAP_SYS_RESOURCE)))
+	if (!__allow_reserved_blocks(sbi))
 		valid_block_count += sbi->root_reserved_blocks;
 
 	if (unlikely(valid_block_count > sbi->user_block_count)) {
diff --git a/fs/f2fs/super.c b/fs/f2fs/super.c
index 4904d1644052..ef40bc3d91e8 100644
--- a/fs/f2fs/super.c
+++ b/fs/f2fs/super.c
@@ -108,6 +108,8 @@ enum {
 	Opt_noinline_data,
 	Opt_data_flush,
 	Opt_reserve_root,
+	Opt_resgid,
+	Opt_resuid,
 	Opt_mode,
 	Opt_io_size_bits,
 	Opt_fault_injection,
@@ -159,6 +161,8 @@ static match_table_t f2fs_tokens = {
 	{Opt_noinline_data, "noinline_data"},
 	{Opt_data_flush, "data_flush"},
 	{Opt_reserve_root, "reserve_root=%u"},
+	{Opt_resgid, "resgid=%u"},
+	{Opt_resuid, "resuid=%u"},
 	{Opt_mode, "mode=%s"},
 	{Opt_io_size_bits, "io_bits=%u"},
 	{Opt_fault_injection, "fault_injection=%u"},
@@ -204,6 +208,15 @@ static inline void limit_reserve_root(struct f2fs_sb_info *sbi)
 			"Reduce reserved blocks for root = %u",
 				sbi->root_reserved_blocks);
 	}
+	if (!test_opt(sbi, RESERVE_ROOT) &&
+		(!uid_eq(sbi->s_resuid,
+				make_kuid(&init_user_ns, F2FS_DEF_RESUID)) ||
+		!gid_eq(sbi->s_resgid,
+				make_kgid(&init_user_ns, F2FS_DEF_RESGID))))
+		f2fs_msg(sbi->sb, KERN_INFO,
+			"Ignore s_resuid=%u, s_resgid=%u w/o reserve_root",
+				from_kuid_munged(&init_user_ns, sbi->s_resuid),
+				from_kgid_munged(&init_user_ns, sbi->s_resgid));
 }
 
 static void init_once(void *foo)
@@ -336,6 +349,8 @@ static int parse_options(struct super_block *sb, char *options)
 	substring_t args[MAX_OPT_ARGS];
 	char *p, *name;
 	int arg = 0;
+	kuid_t uid;
+	kgid_t gid;
 #ifdef CONFIG_QUOTA
 	int ret;
 #endif
@@ -515,6 +530,28 @@ static int parse_options(struct super_block *sb, char *options)
Re: [PATCH 05/23] x86, kaiser: unmap kernel from userspace page tables (core patch)
Hi Dave,

On 2018/1/5 13:18, Dave Hansen wrote:
> On 01/04/2018 08:16 PM, Yisheng Xie wrote:
>>> === Page Table Poisoning ===
>>>
>>> KAISER has two copies of the page tables: one for the kernel and
>>> one for when running in userspace.
>>
>> So, we have 2 page table, thinking about this case:
>> If _ONE_ process includes _TWO_ threads, one run in user space, the other
>> run in kernel, they can run in one core with Hyper-Threading, right?
>
> Yes.
>
>> So both userspace and kernel space is valid, right? And for one core
>> with Hyper-Threading, they may share TLB, so the timing problem
>> described in the paper may still exist?
>
> No. The TLB is managed per logical CPU (hyperthread), as is the CR3
> register that points to the page tables. Two threads running the same
> process might use the same CR3 _value_, but that does not mean they
> share TLB entries.

Got it, and thanks for your explanation.

BTW, we have just reported a bug caused by kaiser[1], which looks like it is
caused by SMEP. Could you please help have a look?

[1] https://lkml.org/lkml/2018/1/5/3

Thanks
Yisheng

>
> One thread *can* be in the kernel with the kernel page tables while the
> other is in userspace with the user page tables active. They will even
> use a different PCID/ASID for the same page tables normally.
>
>> Can this case still be protected by KAISER?
>
> Yes.
>
> .
>
[PATCH v2] mm/fadvise: discard partial page if endbyte is also EOF
From: shidao.ytt

During our recent testing with fadvise(FADV_DONTNEED), we found that if
the given offset/length is not page-aligned, the last page will not be
discarded. The tool we use is vmtouch (https://hoytech.com/vmtouch/). We
map a 10KB-sized file into memory and then run this tool to evict the
whole file mapping, but the last single page always stays in memory:

$./vmtouch -e test_10K
           Files: 1
     Directories: 0
   Evicted Pages: 3 (12K)
         Elapsed: 2.1e-05 seconds

$./vmtouch test_10K
           Files: 1
     Directories: 0
  Resident Pages: 1/3  4K/12K  33.3%
         Elapsed: 5.5e-05 seconds

However, when we test with an older kernel, say 3.10, this problem is
gone. So we wondered if this is a regression:

$./vmtouch -e test_10K
           Files: 1
     Directories: 0
   Evicted Pages: 3 (12K)
         Elapsed: 8.2e-05 seconds

$./vmtouch test_10K
           Files: 1
     Directories: 0
  Resident Pages: 0/3  0/12K  0%  <-- partial page also discarded
         Elapsed: 5e-05 seconds

After digging a little bit into this problem, we found that it does not
seem to be a regression. Not discarding the partial page is likely
intentional, according to commit 441c228f817f7 ("mm: fadvise: document the
fadvise(FADV_DONTNEED) behaviour for partial pages") written by Mel Gorman.
He explained why partial pages should be preserved instead of being
discarded when using fadvise(FADV_DONTNEED).

However, the interesting part is that the actual code did NOT work as
described: the partial page was still discarded anyway, due to a
calculation mistake in the `end_index' passed to
invalidate_mapping_pages(). This mistake was not fixed until recently,
which is why we fail to reproduce our problem in old kernels. The fix is
done in commit 18aba41cbf ("mm/fadvise.c: do not discard partial pages
with POSIX_FADV_DONTNEED") by Oleg Drokin.
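
To make the page-index arithmetic concrete, here is a small userspace model of how the range handed to invalidate_mapping_pages() could be derived. It is a sketch under assumptions, not the mm/fadvise.c code: it assumes 4K pages, that the range is [offset, offset+len) with len > 0, that a partial head page is skipped by rounding the start up, and that a partial tail page is dropped from the range unless the optional EOF check (the behaviour the subject line of this patch asks for) is enabled. The names struct page_range, dontneed_range() and the MODEL_* macros are invented for this example.

```c
#include <stdbool.h>
#include <stdint.h>

#define MODEL_PAGE_SHIFT	12			/* assume 4K pages */
#define MODEL_PAGE_SIZE		(1ULL << MODEL_PAGE_SHIFT)

/* Inclusive range of page indexes to discard; valid == false means none. */
struct page_range {
	uint64_t start;
	uint64_t end;
	bool valid;
};

/*
 * Model of the FADV_DONTNEED index calculation for [offset, offset+len).
 * eof_check == false: a partial last page is always preserved.
 * eof_check == true:  a partial last page is discarded too when the
 *                     range ends exactly at the last byte of the file.
 */
static struct page_range dontneed_range(uint64_t offset, uint64_t len,
					uint64_t file_size, bool eof_check)
{
	struct page_range r = { 0, 0, false };
	uint64_t endbyte;
	bool partial_tail, tail_is_eof;

	if (len == 0)
		return r;	/* len == 0 is not modelled in this sketch */

	endbyte = offset + len - 1;	/* inclusive last byte of the range */

	/* Round the start up so a partial first page is left alone. */
	r.start = (offset + MODEL_PAGE_SIZE - 1) >> MODEL_PAGE_SHIFT;
	r.end = endbyte >> MODEL_PAGE_SHIFT;

	partial_tail = (endbyte & (MODEL_PAGE_SIZE - 1)) != MODEL_PAGE_SIZE - 1;
	tail_is_eof = eof_check && file_size > 0 && endbyte == file_size - 1;

	if (partial_tail && !tail_is_eof) {
		if (r.end == 0)
			return r;	/* only page is partial: nothing to drop */
		r.end--;		/* keep the partial last page */
	}

	r.valid = r.start <= r.end;
	return r;
}
```

For the 10KB file used above, dontneed_range(0, 10240, 10240, false) covers pages 0 to 1 and leaves page 2 resident, matching the "Resident Pages: 1/3" output; with eof_check set to true the range becomes pages 0 to 2.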
Back to the original testing, our problem becomes that there is a speical case that, if the page-unaligned `endbyte' is also the end of file, it is not necessary at all to preserve the last partial page, as we all know no one else will use the rest of it. It should be safe enough if we just discard the whole page. So we add an EOF check in this patch. We also find a poosbile real world issue in mainline kernel. Assume such scenario: A userspace backup application want to backup a huge amount of small files (<4k) at once, the developer might (I guess) want to use fadvise(FADV_DONTNEED) to save memory. However, FADV_DONTNEED won't really happen since the only page mapped is a partial page, and kernel will preserve it. Our patch also fixes this problem, since we know the endbyte is EOF, so we discard it. Here is a simple reproducer to reproduce and verify each scenario we described above: test_fadvise.c == #include #include #include #include #include #include #include int main(int argc, char **argv) { int i, fd, ret, len; struct stat buf; void *addr; unsigned char *vec; char *strbuf; ssize_t pagesize = getpagesize(); ssize_t filesize; fd = open(argv[1], O_RDWR|O_CREAT, S_IRUSR|S_IWUSR); if (fd < 0) return -1; filesize = strtoul(argv[2], NULL, 10); strbuf = malloc(filesize); memset(strbuf, 42, filesize); write(fd, strbuf, filesize); free(strbuf); fsync(fd); len = (filesize + pagesize - 1) / pagesize; printf("length of pages: %d\n", len); addr = mmap(NULL, filesize, PROT_READ, MAP_SHARED, fd, 0); if (addr == MAP_FAILED) return -1; ret = posix_fadvise(fd, 0, filesize, POSIX_FADV_DONTNEED); if (ret < 0) return -1; vec = malloc(len); ret = mincore(addr, filesize, (void *)vec); if (ret < 0) return -1; for (i = 0; i < len; i++) printf("pages[%d]: %x\n", i, vec[i] & 0x1); free(vec); close(fd); return 0; } == Test 1: running on kernel with commit 18aba41cbf reverted: [root@caspar ~]# uname -r 4.15.0-rc6.revert+ [root@caspar ~]# ./test_fadvise file1 1024 length of pages: 1 
pages[0]: 0# <-- partial page discarded [root@caspar ~]# ./test_fadvise file2 8192 length of pages: 2 pages[0]: 0 pages[1]: 0 [root@caspar ~]# ./test_fadvise file3 10240 length of pages: 3 pages[0]: 0 pages[1]: 0 pages[2]: 0# <-- partial page discarded Test 2: running on mainline kernel: [root@caspar ~]# uname -r 4.15.0-rc6+ [root@caspar ~]# ./test_fadvise test1 1024 length of pages: 1 pages[0]: 1# <-- partial and the only page not discarded [root@caspar ~]# ./test_fadvise test2 8192 length of pages: 2 pages[0]: 0 pages[1]: 0 [root@caspar ~]# ./test_fadvise test3 10240 length of pages: 3 pages[0]: 0 pages[1]: 0 pages[2]: 1# <-- partial page not
[PATCH v2] mm/fadvise: discard partial page if endbyte is also EOF
From: shidao.ytt

During our recent testing with fadvise(FADV_DONTNEED), we found that if the
given offset/length is not page-aligned, the last page is not discarded.
The tool we use is vmtouch (https://hoytech.com/vmtouch/). We map a
10KB-sized file into memory and then run this tool to evict the whole file
mapping, but the last page always stays in memory:

$./vmtouch -e test_10K
           Files: 1
     Directories: 0
   Evicted Pages: 3 (12K)
         Elapsed: 2.1e-05 seconds

$./vmtouch test_10K
           Files: 1
     Directories: 0
  Resident Pages: 1/3  4K/12K  33.3%
         Elapsed: 5.5e-05 seconds

However, when we test with an older kernel, say 3.10, this problem is gone.
So we wondered whether this is a regression:

$./vmtouch -e test_10K
           Files: 1
     Directories: 0
   Evicted Pages: 3 (12K)
         Elapsed: 8.2e-05 seconds

$./vmtouch test_10K
           Files: 1
     Directories: 0
  Resident Pages: 0/3  0/12K  0%  <-- partial page also discarded
         Elapsed: 5e-05 seconds

After digging into this problem a little, we found that it does not seem to
be a regression. Not discarding the partial page is likely to be on purpose,
according to commit 441c228f817f7 ("mm: fadvise: document the
fadvise(FADV_DONTNEED) behaviour for partial pages") written by Mel Gorman.
He explained why partial pages should be preserved instead of being
discarded when using fadvise(FADV_DONTNEED).

However, the interesting part is that the actual code did NOT work as
described: the partial page was still discarded anyway, due to a
miscalculation of the `end_index' passed to invalidate_mapping_pages().
This mistake was not fixed until recently, which is why we fail to reproduce
our problem on old kernels. The fix is commit 18aba41cbf ("mm/fadvise.c: do
not discard partial pages with POSIX_FADV_DONTNEED") by Oleg Drokin.
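The index arithmetic discussed above can be sketched in userspace C. This is only an illustration under stated assumptions, not the kernel's fadvise code: the helper name `dontneed_page_range()`, its parameters, and the `discard_partial_at_eof` switch are hypothetical. It models a request that starts discarding at the first fully covered page, backs off a trailing partial page, and optionally keeps that trailing page in the range when the request ends exactly at end of file.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper (not kernel code): compute the inclusive range of
 * page indexes [*first, *last] that a DONTNEED request may discard.
 * Returns false when no whole page qualifies.
 */
static bool dontneed_page_range(uint64_t offset, uint64_t len,
                                uint64_t file_size, uint64_t page_size,
                                bool discard_partial_at_eof,
                                uint64_t *first, uint64_t *last)
{
    assert(page_size > 0 && len > 0);

    uint64_t endbyte = offset + len - 1;                 /* inclusive end */
    uint64_t start_index = (offset + page_size - 1) / page_size; /* round up */
    uint64_t end_index = endbyte / page_size;            /* round down */

    /* The last page is partial unless endbyte is the final byte of a page. */
    bool end_is_partial = (endbyte % page_size) != page_size - 1;
    /* Assumption for this sketch: EOF means endbyte is the file's last byte. */
    bool end_is_eof = file_size > 0 && endbyte == file_size - 1;

    if (end_is_partial && !(discard_partial_at_eof && end_is_eof)) {
        if (end_index == 0)
            return false;   /* only page is partial: nothing to discard */
        end_index--;        /* preserve the trailing partial page */
    }

    if (end_index < start_index)
        return false;

    *first = start_index;
    *last = end_index;
    return true;
}
```

With 4096-byte pages and a 10240-byte file, a whole-file request yields pages 0..1 when the EOF switch is off and pages 0..2 when it is on; a 1024-byte file yields no pages with the switch off and page 0 with it on.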
Back to the original testing: there is a special case. If the page-unaligned
`endbyte' is also the end of the file, it is not necessary at all to preserve
the last partial page, since no one else will use the rest of it. It should
be safe enough to discard the whole page, so we add an EOF check in this
patch.

We also found a possible real-world issue in the mainline kernel. Consider
this scenario: a userspace backup application wants to back up a huge number
of small files (<4k) at once, and the developer might (I guess) want to use
fadvise(FADV_DONTNEED) to save memory. However, FADV_DONTNEED won't really
take effect, since the only page mapped is a partial page and the kernel will
preserve it. Our patch also fixes this problem: since we know the endbyte is
EOF, we discard the page.

Here is a simple reproducer to reproduce and verify each scenario described
above:

test_fadvise.c
==
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	int i, fd, ret, len;
	struct stat buf;
	void *addr;
	unsigned char *vec;
	char *strbuf;
	ssize_t pagesize = getpagesize();
	ssize_t filesize;

	/* usage: test_fadvise <file> <size-in-bytes> */
	if (argc < 3)
		return -1;

	fd = open(argv[1], O_RDWR|O_CREAT, S_IRUSR|S_IWUSR);
	if (fd < 0)
		return -1;
	filesize = strtoul(argv[2], NULL, 10);

	/* fill the file with filesize bytes and flush it to disk */
	strbuf = malloc(filesize);
	memset(strbuf, 42, filesize);
	write(fd, strbuf, filesize);
	free(strbuf);
	fsync(fd);

	/* number of pages covering the file, rounded up */
	len = (filesize + pagesize - 1) / pagesize;
	printf("length of pages: %d\n", len);

	addr = mmap(NULL, filesize, PROT_READ, MAP_SHARED, fd, 0);
	if (addr == MAP_FAILED)
		return -1;

	/* posix_fadvise() returns an error number on failure, not -1 */
	ret = posix_fadvise(fd, 0, filesize, POSIX_FADV_DONTNEED);
	if (ret != 0)
		return -1;

	/* bit 0 of each vec[] entry tells whether the page is resident */
	vec = malloc(len);
	ret = mincore(addr, filesize, (void *)vec);
	if (ret < 0)
		return -1;

	for (i = 0; i < len; i++)
		printf("pages[%d]: %x\n", i, vec[i] & 0x1);

	free(vec);
	close(fd);

	return 0;
}
==

Test 1: running on kernel with commit 18aba41cbf reverted:

[root@caspar ~]# uname -r
4.15.0-rc6.revert+
[root@caspar ~]# ./test_fadvise file1 1024
length of pages: 1
pages[0]: 0    # <-- partial page discarded
[root@caspar ~]# ./test_fadvise file2 8192
length of pages: 2
pages[0]: 0
pages[1]: 0
[root@caspar ~]# ./test_fadvise file3 10240
length of pages: 3
pages[0]: 0
pages[1]: 0
pages[2]: 0    # <-- partial page discarded

Test 2: running on mainline kernel:

[root@caspar ~]# uname -r
4.15.0-rc6+
[root@caspar ~]# ./test_fadvise test1 1024
length of pages: 1
pages[0]: 1    # <-- partial and the only page not discarded
[root@caspar ~]# ./test_fadvise test2 8192
length of pages: 2
pages[0]: 0
pages[1]: 0
[root@caspar ~]# ./test_fadvise test3 10240
length of pages: 3
pages[0]: 0
pages[1]: 0
pages[2]: 1    # <-- partial page not discarded

Test 3: running on
Re: [PATCH 0/7] IBRS patch series
* Linus Torvalds:

> On Thu, Jan 4, 2018 at 9:56 AM, Tim Chen wrote:
>>
>> Speculation on Skylake and later requires these patches ("dynamic IBRS")
>> be used instead of retpoline[1].
>
> Can somebody explain this part?
>
> I was assuming that retpoline would work around this issue on all uarchs.
>
> This seems to say "retpoline does nothing on Skylake+"

Retpoline also looks incompatible with CET, so future Intel CPUs will
eventually need a different approach anyway.
Re: [PATCH] driver: input :touchscreen :Modify Raydium Firmware update input file
Hi Jeffrey,

On Thu, Dec 21, 2017 at 09:51:22PM +0800, jeffrey.lin wrote:
> Modify update firmware to accept alternative file name
>
> Signed-off-by: jeffrey.lin
> ---
>  drivers/input/touchscreen/raydium_i2c_ts.c | 11 ++++++++---
>  1 file changed, 8 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/input/touchscreen/raydium_i2c_ts.c b/drivers/input/touchscreen/raydium_i2c_ts.c
> index a99fb5cac5a0..439d43c3519c 100644
> --- a/drivers/input/touchscreen/raydium_i2c_ts.c
> +++ b/drivers/input/touchscreen/raydium_i2c_ts.c
> @@ -130,6 +130,7 @@ struct raydium_data {
>  	struct gpio_desc *reset_gpio;
>
>  	struct raydium_info info;
> +	char fw_file[64];

You do not really need to keep the firmware name in driver data, just
use a temporary in raydium_i2c_fw_update().

>
>  	struct mutex sysfs_mutex;
>
> @@ -752,12 +753,16 @@ static int raydium_i2c_fw_update(struct raydium_data *ts)
>  {
>  	struct i2c_client *client = ts->client;
>  	const struct firmware *fw = NULL;
> -	const char *fw_file = "raydium.fw";
>  	int error;
>
> -	error = request_firmware(&fw, fw_file, &client->dev);
> +	/* Firmware name */
> +	snprintf(ts->fw_file, sizeof(ts->fw_file),
> +		"raydium_%x.fw", ts->info.hw_ver);

hw_ver is LE32, you need to convert it to CPU endianness before using.
Also it would be better if we used the same encoding for the hardware
version as the one that we use when we output it in sysfs. It makes
userspace life a bit easier I think.

How about the version of the patch below?

Thanks.

-- 
Dmitry


Input: raydium_i2c_ts - include hardware version in firmware name

From: Jeffrey Lin

Add hardware version to the firmware file name to handle scenarios where
single system image supports variety of devices.

Signed-off-by: Jeffrey Lin
Signed-off-by: Dmitry Torokhov
---
 drivers/input/touchscreen/raydium_i2c_ts.c | 14 ++++++++++++--
 1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/drivers/input/touchscreen/raydium_i2c_ts.c b/drivers/input/touchscreen/raydium_i2c_ts.c
index 100538d64fff..d1c09e6a2cb6 100644
--- a/drivers/input/touchscreen/raydium_i2c_ts.c
+++ b/drivers/input/touchscreen/raydium_i2c_ts.c
@@ -752,13 +752,20 @@ static int raydium_i2c_fw_update(struct raydium_data *ts)
 {
 	struct i2c_client *client = ts->client;
 	const struct firmware *fw = NULL;
-	const char *fw_file = "raydium.fw";
+	char *fw_file;
 	int error;
 
+	fw_file = kasprintf(GFP_KERNEL, "raydium_%#04x.fw",
+			    le32_to_cpu(ts->info.hw_ver));
+	if (!fw_file)
+		return -ENOMEM;
+
+	dev_dbg(&client->dev, "firmware name: %s\n", fw_file);
+
 	error = request_firmware(&fw, fw_file, &client->dev);
 	if (error) {
 		dev_err(&client->dev, "Unable to open firmware %s\n", fw_file);
-		return error;
+		goto out_free_fw_file;
 	}
 
 	disable_irq(client->irq);
@@ -787,6 +794,9 @@ static int raydium_i2c_fw_update(struct raydium_data *ts)
 	release_firmware(fw);
 
+out_free_fw_file:
+	kfree(fw_file);
+
 	return error;
 }
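The naming scheme in the patch above (decode the little-endian hardware version, then format it with "%#04x") can be tried outside the kernel. The following is a minimal userspace sketch, assuming the version arrives as four raw bytes; `le32_decode()` and `fw_name_from_hw_ver()` are hypothetical names that stand in for le32_to_cpu() and kasprintf(), and are not part of the driver.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-in for le32_to_cpu(): decode 4 little-endian bytes. */
static uint32_t le32_decode(const uint8_t b[4])
{
    return (uint32_t)b[0] |
           ((uint32_t)b[1] << 8) |
           ((uint32_t)b[2] << 16) |
           ((uint32_t)b[3] << 24);
}

/*
 * Hypothetical helper: build the firmware file name with the same format
 * string as the patch. Returns the snprintf() result, i.e. the length the
 * full name needs, so callers can detect truncation.
 */
static int fw_name_from_hw_ver(char *buf, size_t size,
                               const uint8_t hw_ver_le[4])
{
    return snprintf(buf, size, "raydium_%#04x.fw",
                    (unsigned int)le32_decode(hw_ver_le));
}
```

Decoding byte by byte gives the same result on little- and big-endian hosts, which is the point of the conversion requested in the review. For the raw bytes 34 12 00 00 the helper produces "raydium_0x1234.fw".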
linux-next: Tree for Jan 5
Hi all,

Changes since 20180104:

The drm tree gained a conflict against the drm-intel-fixes tree.

The akpm-current tree gained a build failure for which I applied a patch.

Non-merge commits (relative to Linus' tree): 6981
 7369 files changed, 288333 insertions(+), 202735 deletions(-)

I have created today's linux-next tree at
git://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git
(patches at http://www.kernel.org/pub/linux/kernel/next/ ). If you
are tracking the linux-next tree using git, you should not use "git pull"
to do so as that will try to merge the new linux-next release with the
old one. You should use "git fetch" and checkout or reset to the new
master.

You can see which trees have been included by looking in the Next/Trees
file in the source. There are also quilt-import.log and merge.log
files in the Next directory. Between each merge, the tree was built
with a ppc64_defconfig for powerpc, an allmodconfig for x86_64, a
multi_v7_defconfig for arm and a native build of tools/perf. After
the final fixups (if any), I do an x86_64 modules_install followed by
builds for x86_64 allnoconfig, powerpc allnoconfig (32 and 64 bit),
ppc44x_defconfig, allyesconfig and pseries_le_defconfig and i386, sparc
and sparc64 defconfig. And finally, a simple boot test of the powerpc
pseries_le_defconfig kernel in qemu (with and without kvm enabled).

Below is a summary of the state of the merge.

I am currently merging 255 trees (counting Linus' and 43 trees of bug
fix patches pending for the current merge release).

Stats about the size of the tree over time can be seen at
http://neuling.org/linux-next-size.html .

Status of my local build tests will be at
http://kisskb.ellerman.id.au/linux-next . If maintainers want to give
advice about cross compilers/configs that work, we are always open to add
more builds.

Thanks to Randy Dunlap for doing many randconfig builds. And to Paul
Gortmaker for triage and bug fixes.
-- 
Cheers,
Stephen Rothwell

$ git checkout master
$ git reset --hard stable
Merging origin/master (e1915c8195b3 Merge tag 'armsoc-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc)
Merging fixes/master (820bf5c419e4 Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi)
Merging kbuild-current/fixes (cfe17c9bbe6a kbuild: move cc-option and cc-disable-warning after incl. arch Makefile)
Merging arc-current/for-curr (af1be2e21203 ARC: handle gcc generated __builtin_trap for older compiler)
Merging arm-current/fixes (36b0cb84ee85 ARM: 8731/1: Fix csum_partial_copy_from_user() stack mismatch)
Merging m68k-current/for-linus (5e387199c17c m68k/defconfig: Update defconfigs for v4.14-rc7)
Merging metag-fixes/fixes (b884a190afce metag/usercopy: Add missing fixups)
Merging powerpc-fixes/fixes (ecb101aed861 powerpc/mm: Fix SEGV on mapped region to return SEGV_ACCERR)
Merging sparc/master (59585b4be9ae sparc64: repair calling incorrect hweight function from stubs)
Merging fscrypt-current/for-stable (42d97eb0ade3 fscrypt: fix renaming and linking special files)
Merging net/master (6926e041a892 uapi/if_ether.h: prevent redefinition of struct ethhdr)
Merging bpf/master (820d1d5eba5e Merge branch '40GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/net-queue)
Merging ipsec/master (2f10a61cee8f xfrm: fix rcu usage in xfrm_get_type_offload)
Merging netfilter/master (8bea728dce89 netfilter: nf_tables: fix potential NULL-ptr deref in nf_tables_dump_obj_done())
Merging ipvs/master (f7fb77fc1235 netfilter: nft_compat: check extension hook mask only if set)
Merging wireless-drivers/master (a41886f56b7b Merge tag 'iwlwifi-for-kalle-2017-12-05' of git://git.kernel.org/pub/scm/linux/kernel/git/iwlwifi/iwlwifi-fixes)
Merging mac80211/master (736a80bbfda7 mac80211: mesh: drop frames appearing to be from us)
Merging sound-current/for-linus (db6f09448550 ALSA: pcm: Workaround for weird PulseAudio behavior on rewind error)
Merging pci-current/for-linus (1291a0d5049d Linux 4.15-rc4)
Merging driver-core.current/driver-core-linus (30a7acd57389 Linux 4.15-rc6)
Merging tty.current/tty-linus (30a7acd57389 Linux 4.15-rc6)
Merging usb.current/usb-linus (5fd77a3a0e40 usbip: vudc_tx: fix v_send_ret_submit() vulnerability to null xfer buffer)
Merging usb-gadget-fixes/fixes (1291a0d5049d Linux 4.15-rc4)
Merging usb-serial-fixes/usb-linus (d14ac576d10f USB: serial: cp210x: add new device ID ELV ALC 8xxx)
Merging usb-chipidea-fixes/ci-for-usb-stable (964728f9f407 USB: chipidea: msm: fix ulpi-node lookup)
Merging phy/fixes (2b88212c4cc6 phy: rcar-gen3-usb2: select USB_COMMON)
Merging staging.current/staging-linus (30a7acd57389 Linux 4.15-rc6)
Merging char-misc.current/char-misc-linus (06e7e776ca4d Bluetooth: Prevent stack info leak from the EFS element.)
Merging input-current/for-linus (8b7e9d9e2d8b Input: hideep - fix compile error due to missing i
[PATCH] [v3] x86/doc: add PTI description
Changes from v2:
 * Update some wording
 * Minor typo and grammar fixes
 * Further clarify what INVPCID is.

Changes from v1:
 * update kernel-parameters.txt to clarify that the pti= option
   is not just for disabling. Also describe what 'pti=auto'
   does and why
 * Add a note about the presence of NX in the user portion of
   the kernel page tables
 * Clarify _additional_ 4k of PGD space
 * Add a note about the runtime overhead of PCID without INVPCID

---

From: Dave Hansen

Add some details about how PTI works, what some of the downsides
are, and how to debug it when things go wrong.

Also document the kernel parameter: 'nopti'.

Signed-off-by: Dave Hansen
Reviewed-by: Kees Cook
Cc: Moritz Lipp
Cc: Daniel Gruss
Cc: Michael Schwarz
Cc: Richard Fellner
Cc: Andy Lutomirski
Cc: Linus Torvalds
Cc: Hugh Dickins
Cc: x...@kernel.org
---

 b/Documentation/admin-guide/kernel-parameters.txt |   21 +-
 b/Documentation/x86/pti.txt                       |  187 ++
 2 files changed, 201 insertions(+), 7 deletions(-)

diff -puN Documentation/admin-guide/kernel-parameters.txt~kpti-doc Documentation/admin-guide/kernel-parameters.txt
--- a/Documentation/admin-guide/kernel-parameters.txt~kpti-doc	2018-01-03 17:04:23.255028797 -0800
+++ b/Documentation/admin-guide/kernel-parameters.txt	2018-01-04 21:30:58.402773426 -0800
@@ -2712,8 +2712,6 @@
 			steal time is computed, but won't influence scheduler
 			behaviour
 
-	nopti		[X86-64] Disable kernel page table isolation
-
 	nolapic		[X86-32,APIC] Do not enable or use the local APIC.
 
 	nolapic_timer	[X86-32,APIC] Do not use the local APIC timer.
@@ -3288,11 +3286,20 @@
 	pt.		[PARIDE]
 			See Documentation/blockdev/paride.txt.
 
-	pti=		[X86_64]
-			Control user/kernel address space isolation:
-			on - enable
-			off - disable
-			auto - default setting
+	pti=		[X86_64] Control Page Table Isolation of user and
+			kernel address spaces. Disabling this feature
+			removes hardening, but improves performance of
+			system calls and interrupts.
+
+			on   - unconditionally enable
+			off  - unconditionally disable
+			auto - kernel detects whether your CPU model is
+			       vulnerable to issues that PTI mitigates
+
+			Not specifying this option is equivalent to pti=auto.
+
+	nopti		[X86_64]
+			Equivalent to pti=off
 
 	pty.legacy_count=
 			[KNL] Number of legacy pty's. Overwrites compiled-in
diff -puN /dev/null Documentation/x86/pti.txt
--- /dev/null	2017-12-15 13:48:30.454245127 -0800
+++ b/Documentation/x86/pti.txt	2018-01-04 21:38:28.826772303 -0800
@@ -0,0 +1,187 @@
+Overview
+========
+
+Page Table Isolation (pti, previously known as KAISER[1]) is a
+countermeasure against attacks on the shared user/kernel address
+space such as the "Meltdown" approach[2].
+
+To mitigate this class of attacks, we create an independent set of
+page tables for use only when running userspace applications. When
+the kernel is entered via syscalls, interrupts or exceptions, the
+page tables are switched to the full "kernel" copy. When the system
+switches back to user mode, the user copy is used again.
+
+The userspace page tables contain only a minimal amount of kernel
+data: only what is needed to enter/exit the kernel such as the
+entry/exit functions themselves and the interrupt descriptor table
+(IDT). There are a few strictly unnecessary things that get mapped
+such as the first C function when entering an interrupt (see
+comments in pti.c).
+
+This approach helps to ensure that side-channel attacks leveraging
+the paging structures do not function when PTI is enabled. It can be
+enabled by setting CONFIG_PAGE_TABLE_ISOLATION=y at compile time.
+Once enabled at compile-time, it can be disabled at boot with the
+'nopti' or 'pti=' kernel parameters (see kernel-parameters.txt).
+
+Page Table Management
+=====================
+
+When PTI is enabled, the kernel manages two sets of page tables.
+The first set is very similar to the single set which is present in
+kernels without PTI. This includes a complete mapping of userspace
+that the kernel can use for things like copy_to_user().
+
+Although _complete_, the user portion of the kernel page tables is
+crippled by setting the NX bit in the top level. This ensures
+that any missed kernel->user CR3 switch will immediately crash
+userspace upon executing its first instruction.
+
+The userspace page tables map only the kernel data needed to enter
+and exit the kernel. This data is entirely contained in the 'struct
+cpu_entry_area' structure which is
[RFC] selftests/x86: Add test_vsyscall
This tests that the vsyscall entries do what they're expected to do.
It also confirms that attempts to read the vsyscall page behave as
expected.

If changes are made to the vsyscall code or its memory map handling,
running this test in all three of vsyscall=none, vsyscall=emulate,
and vsyscall=native is helpful.

(Because it's easy, this also compares the vsyscall results to their
vDSO equivalents.)

Signed-off-by: Andy Lutomirski
---

It's RFC because I want to re-read it myself first. It's also missing a
test that will reliably make sure that vsyscall=none prevents use of
vsyscalls. Also, I want to add vsyscall=emulate_noread that makes the
vsyscall page be --x. And I want to add a per-process option to turn off
vsyscalls.

 tools/testing/selftests/x86/Makefile        |   2 +-
 tools/testing/selftests/x86/test_vsyscall.c | 435
 2 files changed, 436 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/x86/test_vsyscall.c

diff --git a/tools/testing/selftests/x86/Makefile b/tools/testing/selftests/x86/Makefile
index 939a337128db..5d4f10ac2af2 100644
--- a/tools/testing/selftests/x86/Makefile
+++ b/tools/testing/selftests/x86/Makefile
@@ -7,7 +7,7 @@ include ../lib.mk
 TARGETS_C_BOTHBITS := single_step_syscall sysret_ss_attrs syscall_nt ptrace_syscall test_mremap_vdso \
 			check_initial_reg_state sigreturn ldt_gdt iopl mpx-mini-test ioperm \
-			protection_keys test_vdso
+			protection_keys test_vdso test_vsyscall
 TARGETS_C_32BIT_ONLY := entry_from_vm86 syscall_arg_fault test_syscall_vdso unwind_vdso \
 			test_FCMOV test_FCOMI test_FISTTP \
 			vdso_restorer
diff --git a/tools/testing/selftests/x86/test_vsyscall.c b/tools/testing/selftests/x86/test_vsyscall.c
new file mode 100644
index 000000000000..44d873d71b85
--- /dev/null
+++ b/tools/testing/selftests/x86/test_vsyscall.c
@@ -0,0 +1,435 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#define _GNU_SOURCE
+
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+
+#ifdef __x86_64__
+# define VSYS(x) (x)
+#else
+# define VSYS(x) 0
+#endif
+
+#ifndef SYS_getcpu
+# ifdef __x86_64__
+#  define SYS_getcpu 309
+# else
+#  define SYS_getcpu 318
+# endif
+#endif
+
+static void sethandler(int sig, void (*handler)(int, siginfo_t *, void *),
+		       int flags)
+{
+	struct sigaction sa;
+	memset(&sa, 0, sizeof(sa));
+	sa.sa_sigaction = handler;
+	sa.sa_flags = SA_SIGINFO | flags;
+	sigemptyset(&sa.sa_mask);
+	if (sigaction(sig, &sa, 0))
+		err(1, "sigaction");
+}
+
+/* vsyscalls and vDSO */
+bool should_read_vsyscall = false;
+
+typedef long (*gtod_t)(struct timeval *tv, struct timezone *tz);
+gtod_t vgtod = (gtod_t)VSYS(0xffffffffff600000);
+gtod_t vdso_gtod;
+
+typedef int (*vgettime_t)(clockid_t, struct timespec *);
+vgettime_t vdso_gettime;
+
+typedef long (*time_func_t)(time_t *t);
+time_func_t vtime = (time_func_t)VSYS(0xffffffffff600400);
+time_func_t vdso_time;
+
+typedef long (*getcpu_t)(unsigned *, unsigned *, void *);
+getcpu_t vgetcpu = (getcpu_t)VSYS(0xffffffffff600800);
+getcpu_t vdso_getcpu;
+
+static void init_vdso(void)
+{
+	void *vdso = dlopen("linux-vdso.so.1", RTLD_LAZY | RTLD_LOCAL | RTLD_NOLOAD);
+	if (!vdso)
+		vdso = dlopen("linux-gate.so.1", RTLD_LAZY | RTLD_LOCAL | RTLD_NOLOAD);
+	if (!vdso) {
+		printf("Warning: failed to find vDSO\n");
+		return;
+	}
+
+	vdso_gtod = (gtod_t)dlsym(vdso, "__vdso_gettimeofday");
+	if (!vdso_gtod)
+		printf("Warning: failed to find gettimeofday in vDSO\n");
+
+	vdso_gettime = (vgettime_t)dlsym(vdso, "__vdso_clock_gettime");
+	if (!vdso_gettime)
+		printf("Warning: failed to find clock_gettime in vDSO\n");
+
+	vdso_time = (time_func_t)dlsym(vdso, "__vdso_time");
+	if (!vdso_time)
+		printf("Warning: failed to find time in vDSO\n");
+
+	vdso_getcpu = (getcpu_t)dlsym(vdso, "__vdso_getcpu");
+	if (!vdso_getcpu)
+		printf("Warning: failed to find getcpu in vDSO\n");
+}
+
+static int init_vsys(void)
+{
+#ifdef __x86_64__
+	int nerrs = 0;
+	FILE *maps;
+	char line[128];
+	bool found = false;
+
+	maps = fopen("/proc/self/maps", "r");
+	if (!maps) {
+		printf("[WARN]\tCould not open /proc/self/maps -- assuming vsyscall is r-x\n");
+		should_read_vsyscall = true;
+		return 0;
+	}
+
+	while (fgets(line, sizeof(line), maps)) {
+		char r, x;
+		void *start, *end;
+		char name[128];
+		if (sscanf(line, "%p-%p %c-%cp %*x %*x:%*x %*u %s",
+			   &start, &end, &r, &x, name) != 5)
Re: [PATCH] nvme-pci: fix the timeout case when reset is ongoing
Hi Christoph

Many thanks for your kind response.

On 01/04/2018 06:35 PM, Christoph Hellwig wrote:
> On Wed, Jan 03, 2018 at 06:31:44AM +0800, Jianchao Wang wrote:
>> NVME_CTRL_RESETTING used to indicate the range of nvme initializing
>> strictly in fd634f41(nvme: merge probe_work and reset_work), but it
>> is not now. The NVME_CTRL_RESETTING is set before queue the
>> reset_work, there could be a big gap before the reset work handles
>> the outstanding requests. So when the NVME_CTRL_RESETTING is set,
>> nvme_timeout will not only meet the admin requests from the
>> initializing procedure, but also the IO and admin requests from
>> previous work before nvme_dev_disable is invoked.
>>
>> To fix it, introduce a flag NVME_DEV_FLAG_INITIALIZING to mark the
>> range of initializing. When this flag is not set, handle the expired
>> requests as nvme_cancel_request. Otherwise, the requests should be
>> from the initializing procedure. Handle them as before. Because the
>> nvme_reset_work will see the error and disable the dev itself, so
>> discard the nvme_dev_disable here.
>
> Instead of a parallel set of states we'll need to split
> NVME_CTRL_RESET into NVME_CTRL_RESET_SCHEDULED and NVME_CTRL_RESETTING.
>
> And if my memory doesn't fail me we were already considering that a while
> ago.
>

Yes, it is indeed more reasonable to split the current NVME_CTRL_RESETTING
into two states, but the nvme_dev_disable() in nvme_reset_work() should be
the boundary. After that, all the in-flight requests are requeued and the
request queue is quiesced, so the nvme driver is clear.
So the new state may be something like NEW_CTRL_RESET_PREPARE. :)

Thanks
Jianchao
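
A minimal sketch of the split discussed above, to make the boundary concrete. Everything in it is an assumption for illustration: the enum names (CTRL_RESET_PREPARE, TIMEOUT_CANCEL and so on) and the helper timeout_action_for() are hypothetical and are not the driver's real identifiers. It only models the idea that a timeout seen before nvme_dev_disable() has run belongs to older work and is cancelled, while a timeout seen afterwards belongs to the initializing procedure.

```c
/* Hypothetical controller states; not the nvme driver's real enum. */
enum ctrl_state {
	CTRL_LIVE,
	CTRL_RESET_PREPARE,	/* reset requested, nvme_dev_disable() not done yet */
	CTRL_RESETTING,		/* device disabled, initialization in progress */
};

/* Hypothetical decisions a timeout handler could take. */
enum timeout_action {
	TIMEOUT_NORMAL,		/* regular timeout handling on a live controller */
	TIMEOUT_CANCEL,		/* stale request from before the reset: cancel it */
	TIMEOUT_FAIL_INIT,	/* init command: fail it so the reset work sees the error */
};

/* Map the controller state to the action for an expired request. */
static enum timeout_action timeout_action_for(enum ctrl_state state)
{
	switch (state) {
	case CTRL_RESET_PREPARE:
		return TIMEOUT_CANCEL;
	case CTRL_RESETTING:
		return TIMEOUT_FAIL_INIT;
	case CTRL_LIVE:
	default:
		return TIMEOUT_NORMAL;
	}
}
```

With two distinct states the timeout path needs no extra flag: the state alone says which side of nvme_dev_disable() the request came from.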
Re: [PATCH 4.4 00/37] 4.4.110-stable review
On Thu, Jan 4, 2018 at 12:43 PM, Andy Lutomirski wrote:
>
>> On Jan 4, 2018, at 12:29 PM, Linus Torvalds wrote:
>>
>>> On Thu, Jan 4, 2018 at 12:16 PM, Thomas Voegtle wrote:
>>>
>>> Attached a screenshot.
>>> Is that useful? Are there some debug options I can add?
>>
>> Not much of an oops, because the SIGSEGV happens in user space. The
>> only reason you get any kernel stack printout at all is because 'init'
>> dying will make the kernel print that out.
>>
>> The segfault address for init looks like the fixmap area to me (first
>> byte in the last page of the fixmap?). "Error 5" means that it's a
>> user-space read that got a protection fault. So it's not a LDT or GDT
>> update or anything like that, it's a normal access from user space (or
>> a qemu emulation bug, but that sounds unlikely).
>>
>> Is that the vsyscall page?
>>
>> Adding Luto to the participants. I think he noticed one of the
>> vsyscall patches missing earlier in the 4.9 series. Maybe the 4.4
>> series had something similar..
>>
>
> That's almost certainly it.
>
> I'll try to find some time today or tomorrow to add a proper selftest.
>

Give this a shot:

https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/commit/?h=x86/pti&id=17c5ebeb2e00879b0af1a9c32bf37ecdd9b9b31b

Boot with each of vsyscall=none, vsyscall=native, and vsyscall=emulate
and run both the 32-bit and 64-bit variants of that test. All six
combinations should pass. But I bet they don't on 4.4.
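
For readers following the "Error 5" reasoning quoted above, here is a small sketch that decodes an x86 page-fault error code into its architectural bits. The bit positions are the ones defined by the architecture; the struct pf_info type and the decode_pf_error() helper are names made up for this example and are not kernel APIs.

```c
#include <stdbool.h>
#include <stdint.h>

/* Architectural x86 page-fault error code bits. */
#define PF_PROT  (1u << 0)	/* 1: protection violation, 0: page not present */
#define PF_WRITE (1u << 1)	/* 1: write access, 0: read access */
#define PF_USER  (1u << 2)	/* 1: fault taken in user mode */
#define PF_RSVD  (1u << 3)	/* 1: reserved bit set in a paging entry */
#define PF_INSTR (1u << 4)	/* 1: instruction fetch */

/* Hypothetical helper type for this sketch. */
struct pf_info {
	bool protection;
	bool write;
	bool user;
	bool reserved_bit;
	bool instr_fetch;
};

/* Split an error code into named flags. */
static struct pf_info decode_pf_error(uint32_t code)
{
	struct pf_info info = {
		.protection   = (code & PF_PROT) != 0,
		.write        = (code & PF_WRITE) != 0,
		.user         = (code & PF_USER) != 0,
		.reserved_bit = (code & PF_RSVD) != 0,
		.instr_fetch  = (code & PF_INSTR) != 0,
	};
	return info;
}
```

Decoding 5 sets only the protection and user flags with the write flag clear, which matches the reading above: a user-space read that hit a protection fault.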
Re: [PATCH] [v2] x86/doc: add PTI description
On 01/04/2018 05:43 PM, Hector Martin 'marcan' wrote:
> On 2018-01-05 09:24, Dave Hansen wrote:
>> +Not specifying this option nothing is equivalent to
>> +pti=auto.
>
> -nothing

Sure, will fix.

>> +Page Table Isolation (pti, previously known as KAISER[1]) is a
>> +countermeasure against attacks on kernel address information such
>> +as the "Meltdown" approach[2].
>
> It's not really just address information, but any data. Maybe "attacks
> that leak kernel memory"?

It's not just kernel leaks either, though.

>> +To avoid leaking address information, we create an new, independent
>
> Same issue here. Also an -> a.

Will fix.

>> +copy of the page tables which are used only when running userspace
>
> are -> is. The copy is singular.

I've reworded the sentence to remove the ambiguity.

>> +applications. When the kernel is entered via syscalls, interrupts or
>> +exceptions, page tables are switched to the full "kernel" copy. When
>
> "the page tables".

No thanks. It's fine the way it is.

>> +crippled by setting the NX bit in the top level. This ensures
>> +that if a kernel->user CR3 switch is missed that userspace will
>> +crash immediately upon executing its first instruction.
>
> "that userspace" -> "then userspace"
Re: [PATCH V7 12/12] arm64: dts: add clocks for SC9860
On 5 January 2018 at 07:01, Arnd Bergmann wrote:
> On Thu, Jan 4, 2018 at 10:34 PM, Arnd Bergmann wrote:
>> On Thu, Dec 7, 2017 at 1:57 PM, Chunyan Zhang wrote:
>>> Some clocks on SC9860 are in the same address area with syscon devices,
>>> those are what have a property of 'sprd,syscon' which would refer to
>>> syscon devices, others would have a reg property indicated their address
>>> ranges.
>>>
>>> Signed-off-by: Chunyan Zhang
>>> ---
>>>  arch/arm64/boot/dts/sprd/sc9860.dtsi | 115 +++
>>>  arch/arm64/boot/dts/sprd/whale2.dtsi |  18 +-
>>>  2 files changed, 131 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/arch/arm64/boot/dts/sprd/sc9860.dtsi b/arch/arm64/boot/dts/sprd/sc9860.dtsi
>>> index 7b7d8ce..bf03da4 100644
>>> --- a/arch/arm64/boot/dts/sprd/sc9860.dtsi
>>> +++ b/arch/arm64/boot/dts/sprd/sc9860.dtsi
>>> @@ -7,6 +7,7 @@
>>>   */
>>>
>>>  #include
>>> +#include
>>>  #include "whale2.dtsi"
>>
>> This caused a build error since the sprd,sc9860-clk.h file does not
>> exist, I'll revert or undo the patch tomorrow.
>
> I've taken another look, and fixing it by removing the broken #include
> was easier than undoing the patches, so I did that now, see
> https://patchwork.kernel.org/patch/10145773/

Ok, thanks Arnd!

Chunyan

>
>       Arnd
Re: [PATCH 05/23] x86, kaiser: unmap kernel from userspace page tables (core patch)
On 01/04/2018 08:16 PM, Yisheng Xie wrote:
>> === Page Table Poisoning ===
>>
>> KAISER has two copies of the page tables: one for the kernel and
>> one for when running in userspace.
>
> So, we have 2 page table, thinking about this case:
> If _ONE_ process includes _TWO_ threads, one run in user space, the other
> run in kernel, they can run in one core with Hyper-Threading, right?

Yes.

> So both userspace and kernel space is valid, right? And for one core
> with Hyper-Threading, they may share TLB, so the timing problem
> described in the paper may still exist?

No. The TLB is managed per logical CPU (hyperthread), as is the CR3
register that points to the page tables. Two threads running the same
process might use the same CR3 _value_, but that does not mean they
share TLB entries.

One thread *can* be in the kernel with the kernel page tables while the
other is in userspace with the user page tables active. They will even
use a different PCID/ASID for the same page tables normally.

> Can this case still be protected by KAISER?

Yes.
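
To illustrate the CR3 and PCID point above, here is a rough sketch of how a CR3 value is composed when PCIDs are enabled: the low 12 bits carry the PCID and the bits above them carry the physical address of the top-level page table. The field layout is architectural; the helper names (make_cr3(), cr3_pgd(), cr3_pcid()) are invented for this example and are not how the kernel spells it.

```c
#include <stdint.h>

/* With CR4.PCIDE=1, CR3 bits 11:0 are the PCID and bits 51:12 the table base. */
#define CR3_PCID_MASK 0x0000000000000FFFull
#define CR3_ADDR_MASK 0x000FFFFFFFFFF000ull

/* Compose a CR3 value from a page-table physical address and a PCID. */
static uint64_t make_cr3(uint64_t pgd_phys, uint16_t pcid)
{
	return (pgd_phys & CR3_ADDR_MASK) | ((uint64_t)pcid & CR3_PCID_MASK);
}

/* Extract the page-table base address from a CR3 value. */
static uint64_t cr3_pgd(uint64_t cr3)
{
	return cr3 & CR3_ADDR_MASK;
}

/* Extract the PCID from a CR3 value. */
static uint16_t cr3_pcid(uint64_t cr3)
{
	return (uint16_t)(cr3 & CR3_PCID_MASK);
}
```

Each logical CPU loads its own CR3, so one hyperthread can hold a value that points at the kernel tables while its sibling holds one that points at the user tables, and TLB entries are tagged with the PCID that was active when they were filled.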
[RFC] boot failed when enable KAISER/KPTI
I ran the latest RHEL 7.2 with the KAISER/KPTI patch, and boot failed.

...
[0.00] PM: Registered nosave memory: [mem 0x810-0x8ff]
[0.00] PM: Registered nosave memory: [mem 0x910-0xfff]
[0.00] PM: Registered nosave memory: [mem 0x1010-0x10ff]
[0.00] PM: Registered nosave memory: [mem 0x1110-0x17ff]
[0.00] PM: Regitered nosave memory: [mem 0x1810-0x18ff]
[0.00] e820: [mem 0x9000-0xfed1bfff] available for PCI devices
[0.00] Booting paravirtualized kernel on bare hardware
[0.00] setup_percpu: NR_CPUS:5120 nr_cpumask_bits:1536 nr_cpu_ids:1536 nr_node_ids:8
[0.00] PERCPU: max_distance=0x180ffe24 too large for vmalloc space 0x1fff
[0.00] setup_percpu: auto allocator failed (-22), falling back to page size
[0.00] PERCPU: 32 4K pages/cpu @c900 s107200 r8192 d15680
[0.00] Built 8 zonelists in Zone order, mobility grouping on. Total pages: 132001804
[0.00] Policy zone: Normal iosdevname=0 8250.nr_uarts=8 efi=old_map rdloaddriver=usb_storage rdloaddriver=sd_mod udev.event-timeout=600 softlockup_panic=0 rcupdate.rcu_cpu_stall_timeout=300
[0.00] Intel-IOMMU: enabled
[0.00] PID hash table entries: 4096 (order: 3, 32768 bytes)
[0.00] x86/fpu: xstate_offset[2]: 0240, xstate_sizes[2]: 0100
[0.00] xsave: enabled xstate_bv 0x7, cntxt size 0x340
[0.00] AGP: Checking aperture...
[0.00] AGP: No AGP bridge found
[0.00] Memory: 526901612k/26910638080k available (6528k kernel code, 26374249692k absent, 9486776k reserved, 4302k data, 1676k init)
[0.00] SLUB: HWalign=64, Order=0-3, MinObjects=0, CPUs=1536, Nodes=8
[0.00] x86/pti: Unmapping kernel while in userspace
[0.00] Hierarchical RCU implementation.
[0.00] RCU restricting CPUs from NR_CPUS=5120 to nr_cpu_ids=1536.
[0.00] Offload RCU callbacks from all CPUs
[0.00] Offload RCU callbacks from CPUs: 0-1535.
[0.00] NR_IRQS:327936 nr_irqs:15976 0
[0.00] Console: colour dummy device 80x25
[0.00] console [tty0] enabled
[0.00] console [ttyS0] enabled
[0.00] allocated 2145910784 bytes of page_cgroup
[0.00] please try 'cgroup_disable=memory' option if you don't want memory cgroups
[0.00] Enabling automatic NUMA balancing. Configure with numa_balancing= or the kernel.numa_balancing sysctl
[0.00] tsc: Fast TSC calibration using PIT
[0.00] tsc: Detected 2799.999 MHz processor
[0.001803] Calibrating delay loop (skipped), value calculated using timer frequency.. 5599.99 BogoMIPS (lpj=279)
[0.012408] pid_max: default: 1572864 minimum: 12288
[0.017987] init_memory_mapping: [mem 0x5947f000-0x5b47efff]
[0.023701] init_memory_mapping: [mem 0x5b47f000-0x5b87efff]
[0.029369] init_memory_mapping: [mem 0x6d368000-0x6d3edfff]
[0.039130] BUG: unable to handle kernel paging request at 5b835f90
[0.046101] IP: [<5b835f90>] 0x5b835f8f
[0.050637] PGD 81f61067 PUD 190ffefff067 PMD 190ffeffd067 PTE 5b835063
[0.057989] Oops: 0011 [#1] SMP
[0.061241] Modules linked in:
[0.064304] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 3.10.0-327.59.59.46.h42.x86_64 #1
[0.072280] Hardware name: Huawei FusionServer9032/IT91SMUB, BIOS BLXSV316 11/14/2017
[0.080082] task: 8196e440 ti: 81958000 task.ti: 81958000
[0.087539] RIP: 0010:[<5b835f90>] [<5b835f90>] 0x5b835f8f
[0.094494] RSP: :8195be28 EFLAGS: 00010046
[0.099788] RAX: 80050033 RBX: 910fbc802000 RCX: 02d0
[0.106897] RDX: 0030 RSI: 02d0 RDI: 5b835f90
[0.114006] RBP: 8195bf38 R08: 0001 R09: 090fbc802000
[0.121116] R10: 88ffbcc07340 R11: 0001 R12: 0001
[0.128225] R13: 090fbc802000 R14: 02d0 R15: 0001
[0.135336] FS: () GS:c900() knlGS:
[0.143398] CS: 0010 DS: ES: CR0: 80050033
[0.149124] CR2: 5b835f90 CR3: 01966000 CR4: 000606b0
[0.156234] DR0: DR1: DR2:
[0.163344] DR3: DR6: fffe0ff0 DR7: 0400
[0.170454] Call Trace:
[0.172899] [] ? efi_call4+0x6c/0xf0
[0.178108] [] ? native_flush_tlb_global+0x8e/0xc0
[0.184527] [] ? set_memory_x+0x43/0x50
[0.189997] [] ? efi_enter_virtual_mode+0x3bc/0x538
[0.196505] [] start_kernel+0x39f/0x44f
[0.201972] [] ? repair_env_string+0x5c/0x5c
[0.207872] [] ? early_idt_handlers+0x120/0x120
[0.214030] [] x86_64_start_reservations+0x2a/0x2c
[0.220449] [] x86_64_start_kernel+0x152/0x175
[0.226521] Code: Bad
Re: [PATCH 3/7] x86/enter: Use IBRS on syscall and interrupts
On 01/04/2018 08:51 PM, Andy Lutomirski wrote:
> Do we need an arch_prctl() to enable IBRS for user mode?

Eventually, once the dust settles. I think there's a spectrum of
paranoia here, that is roughly (with increasing paranoia):

1. do nothing
2. do retpoline
3. do IBRS in kernel
4. do IBRS always

I think you're asking for ~3.5.

Patches for 1-3 are out there and 4 is pretty straightforward. Doing an
arch_prctl() is still straightforward, but will be a much more niche
thing than any of the other choices. Plus, with a user interface, we
have to argue over the ABI for at least a month or two. ;)
Re: [PATCH 2/7] x86/enter: MACROS to set/clear IBRS
On 01/04/2018 08:54 PM, Andy Lutomirski wrote:
> On Thu, Jan 4, 2018 at 2:23 PM, Dave Hansen wrote:
>> On 01/04/2018 02:21 PM, Tim Chen wrote:
>>>> Does this really have to live outside of arch/x86/entry/ ?
>>> There are some inline C routines later in this file
>>> that will be needed by other functions. Want to consolidate
>>> them in the same file.
>>
>> We could put all of the assembly into calling.h along with the PTI
>> assembly. Seems as sane a place as anywhere else to put it.
>
> We should also stop thinking that NMI is at all special. All the
> paranoid entry paths + NMI should just save and restore it, just like
> CR3. Otherwise we get nasty corner cases with MCE, kprobes, etc.

I've probably been too imprecise in my language here. The goal is
absolutely to deal with all the paranoid paths. It's just that the NMI
one is the easiest to understand and easiest to exercise.

It also *is* special because it's the only one needing paranoid handling
that does not use paranoid_exit itself.
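
To make the "save and restore it, just like CR3" point concrete, here is a rough userspace sketch, not the patch itself. It assumes "it" refers to the IA32_SPEC_CTRL value (IBRS is bit 0 of that MSR). The fake_rdmsr()/fake_wrmsr() helpers and the paranoid_ibrs_* names are invented for illustration; a real entry path would do this in assembly with rdmsr/wrmsr.

```c
#include <stdint.h>

#define SPEC_CTRL_IBRS (1ULL << 0)   /* bit 0 of IA32_SPEC_CTRL */

/* Stand-in for the hardware MSR; hypothetical, for illustration only. */
static uint64_t fake_spec_ctrl;

static uint64_t fake_rdmsr(void)
{
	return fake_spec_ctrl;
}

static void fake_wrmsr(uint64_t val)
{
	fake_spec_ctrl = val;
}

/*
 * Paranoid entry: we may have interrupted user mode or kernel mode,
 * so remember whatever value was live, then force IBRS on.
 */
static uint64_t paranoid_ibrs_enter(void)
{
	uint64_t saved = fake_rdmsr();

	fake_wrmsr(saved | SPEC_CTRL_IBRS);
	return saved;
}

/*
 * Paranoid exit: put back exactly what was interrupted instead of
 * unconditionally clearing IBRS.
 */
static void paranoid_ibrs_exit(uint64_t saved)
{
	fake_wrmsr(saved);
}
```

The point of returning the saved value is that an NMI or MCE arriving while the kernel already has IBRS set must not clear it on the way out, which an unconditional "clear on exit" would do.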
linux-next: build failure after merge of the akpm-current tree
Hi Andrew,

After merging the akpm-current tree, today's linux-next build (x86_64
allmodconfig) failed like this:

mm/migrate.c: In function 'migrate_misplaced_page':
mm/migrate.c:1933:46: error: passing argument 2 of 'migrate_pages' from incompatible pointer type [-Werror=incompatible-pointer-types]
  nr_remaining = migrate_pages(, alloc_misplaced_dst_page,
                                              ^
mm/migrate.c:1358:5: note: expected 'struct page * (*)(struct page *, long unsigned int)' but argument is of type 'struct page * (*)(struct page *, long unsigned int, int **)'
 int migrate_pages(struct list_head *from, new_page_t get_new_page,
     ^

Caused by commit

  d6f08a86f78a ("mm, migrate: remove reason argument from new_page_t")

I applied the following fix patch for today (the mm/memory-failure.c
error turned up after fixing the above):

From: Stephen Rothwell
Date: Fri, 5 Jan 2018 15:46:02 +1100
Subject: [PATCH] mm, migrate: remove reason argument from new_page_t fix

Signed-off-by: Stephen Rothwell
---
 mm/memory-failure.c | 2 +-
 mm/migrate.c        | 3 +--
 2 files changed, 2 insertions(+), 3 deletions(-)

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 4acdf393a801..d530ac1db680 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1483,7 +1483,7 @@ int unpoison_memory(unsigned long pfn)
 }
 EXPORT_SYMBOL(unpoison_memory);
 
-static struct page *new_page(struct page *p, unsigned long private, int **x)
+static struct page *new_page(struct page *p, unsigned long private)
 {
 	int nid = page_to_nid(p);
 
diff --git a/mm/migrate.c b/mm/migrate.c
index 3cb0f5955b41..5d0dc7b85f90 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1797,8 +1797,7 @@ static bool migrate_balanced_pgdat(struct pglist_data *pgdat,
 }
 
 static struct page *alloc_misplaced_dst_page(struct page *page,
-					   unsigned long data,
-					   int **result)
+					   unsigned long data)
 {
 	int nid = (int) data;
 	struct page *newpage;
-- 
2.15.0

-- 
Cheers,
Stephen Rothwell
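
For readers who have not hit this class of error before, the following is a small standalone sketch of why the build broke. It is a mock, not kernel code: struct page is a stand-in, and migrate_one() and alloc_dst() are made-up names. Only the two-argument shape of new_page_t is taken from the compiler note above.

```c
#include <stddef.h>

struct page {
	int nid;	/* stand-in for the kernel's struct page */
};

/* Callback type after the "reason" argument was dropped. */
typedef struct page *new_page_t(struct page *page, unsigned long private);

/* Mock of a migrate_pages()-style caller: invokes the callback once. */
static int migrate_one(struct page *src, new_page_t get_new_page,
		       unsigned long private)
{
	struct page *dst = get_new_page(src, private);

	return dst ? 0 : -1;
}

/*
 * Two-argument callback: matches new_page_t, so it can be passed as is.
 * A callback that still took a third 'int **result' parameter would have
 * a different function type, and passing it to migrate_one() is what
 * gcc reports under -Wincompatible-pointer-types.
 */
static struct page *alloc_dst(struct page *page, unsigned long data)
{
	static struct page dst;

	(void)page;
	dst.nid = (int)data;
	return &dst;
}
```

So once the typedef lost its third parameter, every callback passed through it had to lose it too, which is what the fix patch does for new_page() and alloc_misplaced_dst_page().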
Re: KASAN: slab-out-of-bounds Read in cap_inode_getsecurity
On Thu, Jan 04, 2018 at 08:58:02AM -0800, syzbot wrote:
> Hello,
>
> syzkaller hit the following crash on
> 71ee203389f7cb1c1927eab22b95baa01405791c
> git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/master
> compiler: gcc (GCC) 7.1.1 20170620
> .config is attached
> Raw console output is attached.
> C reproducer is attached
> syzkaller reproducer is attached. See https://goo.gl/kgGztJ
> for information about syzkaller reproducers
>
>
> IMPORTANT: if you fix the bug, please add the following tag to the commit:
> Reported-by: syzbot+37db7b2a61b64a9ab...@syzkaller.appspotmail.com
> It will help syzbot understand when the bug is fixed. See footer for
> details.
> If you forward the report, please keep this part and the footer.
>
> audit: type=1400 audit(1514753657.623:7): avc: denied { map } for
> pid=3504 comm="syzkaller926656" path="/root/syzkaller926656864" dev="sda1"
> ino=16481 scontext=unconfined_u:system_r:insmod_t:s0-s0:c0.c1023
> tcontext=unconfined_u:object_r:user_home_t:s0 tclass=file permissive=1
> ==
> BUG: KASAN: slab-out-of-bounds in cap_inode_getsecurity+0x621/0x7d0
> security/commoncap.c:408
> Read of size 4 at addr 8801bea30b00 by task syzkaller926656/3504
>
> CPU: 1 PID: 3504 Comm: syzkaller926656 Not tainted 4.15.0-rc5+ #244
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
> Google 01/01/2011
> Call Trace:
> __dump_stack lib/dump_stack.c:17 [inline]
> dump_stack+0x194/0x257 lib/dump_stack.c:53
> print_address_description+0x73/0x250 mm/kasan/report.c:252
> kasan_report_error mm/kasan/report.c:351 [inline]
> kasan_report+0x25b/0x340 mm/kasan/report.c:409
> __asan_report_load4_noabort+0x14/0x20 mm/kasan/report.c:429
> cap_inode_getsecurity+0x621/0x7d0 security/commoncap.c:408
> security_inode_getsecurity+0xcd/0x110 security/security.c:809
> xattr_getsecurity+0xd3/0x1f0 fs/xattr.c:244
> vfs_getxattr+0xc8/0x110 fs/xattr.c:333
> getxattr+0x116/0x2a0 fs/xattr.c:540
> path_getxattr+0xed/0x170 fs/xattr.c:568
> SYSC_getxattr fs/xattr.c:580 [inline]
> SyS_getxattr+0x33/0x40 fs/xattr.c:577
> entry_SYSCALL_64_fastpath+0x23/0x9a

Already fixed in Linus's tree.

#syz fix: capabilities: fix buffer overread on very short xattr
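
The report is a 4-byte read past the end of a slab object, and the fix title points at a very short xattr. As a hedged illustration of that bug class only (this is not the upstream patch, and struct cap_header and read_cap_header() are hypothetical names), a length check before touching a fixed-size header looks roughly like this:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical layout: a 4-byte field at the start of the xattr value. */
struct cap_header {
	uint32_t magic_etc;
};

/*
 * Copy the header out of an xattr value that is 'size' bytes long.
 * Returns 0 and fills *out on success, or -1 when the value is too
 * short to contain the header, so no bytes past the buffer are read.
 */
static int read_cap_header(const void *buf, size_t size, uint32_t *out)
{
	struct cap_header hdr;

	if (buf == NULL || size < sizeof(hdr))
		return -1;

	memcpy(&hdr, buf, sizeof(hdr));
	*out = hdr.magic_etc;
	return 0;
}
```

Without the size check, a 1- or 2-byte value would still have 4 bytes read from it, which is the kind of access KASAN flags above.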
Re: [PATCH 2/7] x86/enter: MACROS to set/clear IBRS
On Thu, Jan 4, 2018 at 2:23 PM, Dave Hansen wrote:
> On 01/04/2018 02:21 PM, Tim Chen wrote:
>>> Does this really have to live outside of arch/x86/entry/ ?
>>>
>> There are some inline C routines later in this file
>> that will be needed by other functions. Want to consolidate
>> them in the same file.
>
> We could put all of the assembly into calling.h along with the PTI
> assembly. Seems as sane a place as anywhere else to put it.
>

We should also stop thinking that NMI is at all special. All the
paranoid entry paths + NMI should just save and restore it, just like
CR3. Otherwise we get nasty corner cases with MCE, kprobes, etc.
Re: [PATCH 3/7] x86/enter: Use IBRS on syscall and interrupts
On Thu, Jan 4, 2018 at 4:08 PM, Dave Hansen wrote:
> On 01/04/2018 02:33 PM, Peter Zijlstra wrote:
>> On Thu, Jan 04, 2018 at 09:56:44AM -0800, Tim Chen wrote:
>>> Set IBRS upon kernel entrance via syscall and interrupts. Clear it
>>> upon exit.
>>
>> So not only did we add a CR3 write, we're now adding an MSR write to the
>> entry/exit paths. Please tell me that these are 'fast' MSRs? Given
>> people are already reporting stupid numbers with just the existing
>> PTI/CR3, what kind of pain are we going to get from adding this?
>
> This "dynamic IBRS" that does runtime switching will not be on by
> default and will be patched around by alternatives unless someone
> explicitly opts in.
>
> If you decide you want the additional protection that it provides, you
> can take the performance hit. How much is that? We've been saying that
> these new MSRs are roughly as expensive as the CR3 writes. How
> expensive are those? Don't take my word for it, a few folks were
> talking about it today:
>
> Google says[1]: "We see negligible impact on performance."
> Amazon says[2]: "We don’t expect meaningful performance impact."
>
> I chopped a few qualifiers out of there, but I think that roughly
> captures the sentiment.
>
> 1.
> https://security.googleblog.com/2018/01/more-details-about-mitigations-for-cpu_4.html
> 2.
> http://www.businessinsider.com/google-amazon-performance-hit-meltdown-spectre-fixes-overblown-2018-1

Do we need an arch_prctl() to enable IBRS for user mode?
Re: [PATCH 06/12] IB/core: Add optional PCI P2P flag to rdma_rw_ctx_[init|destroy]()
On Thu, Jan 04, 2018 at 04:44:00PM -0700, Logan Gunthorpe wrote:
> On 04/01/18 03:13 PM, Jason Gunthorpe wrote:
> >On Thu, Jan 04, 2018 at 12:52:24PM -0700, Logan Gunthorpe wrote:
> >>We tried things like this in an earlier iteration[1] which assumed the SG
> >>was homogenous (all P2P or all regular memory). This required serious
> >>ugliness to try and ensure SGs were in fact homogenous[2].
> >
> >I'm confused, these patches already assume the sg is homogenous,
> >right? Sure looks that way. So [2] is just debugging??
>
> Yes, but it's a bit different to expect that someone calling
> pci_p2pmem_map_sg() will know what they're doing and provide a homogenous
> SG. It is relatively clear by convention that the entire SG must be
> homogenous given they're calling a pci_p2pmem function. Whereas, allowing
> P2P SGs into the core DMA code means all we can do is hope that future
> developers don't screw it up and allow P2P pages to mix in with regular
> pages.

Well that argument applies equally to the RDMA RW API wrappers around
the DMA API. I think it is fine if sgl are defined to only have P2P or
not, and that debugging support seemed reasonable to me..

> It's also very difficult to add similar functionality to dma_map_page seeing
> dma_unmap_page won't have any way to know what it's dealing with. It just
> seems confusing to support P2P in the SG version and not the page version.

Well, this proposal is to support P2P in only some RDMA APIs and not
others, so it seems about as confusing to me..

> >Then we don't need to patch RDMA because RDMA is not special when it
> >comes to P2P. P2P should work with everything.
>
> Yes, I agree this would be very nice.

Well, it is more than very nice. We have to keep RDMA working after
all, and if you make it even more special things become harder for us.

It is already the case that DMA in RDMA is very strange. We have
drivers that provide their own DMA ops, for instance.

And on that topic, does this scheme work with HFI?

On first glance, it looks like no. The PCI device the HFI device is
attached to may be able to do P2P, so it should be able to trigger the
support.

However, substituting the p2p_dma_map for the real device op dma_map
will cause a kernel crash when working with HFI. HFI uses a custom DMA
ops that returns CPU addresses in the dma_addr_t which the driver
handles in various special ways. One cannot just replace them with PCI
bus addresses.

So, this kinda looks to me like it causes bad breakage for some RDMA
drivers??

This is why P2P must fit in to the common DMA framework somehow, we
rely on these abstractions to work properly and fully in RDMA.

I think you should consider pushing this directly into the dma_ops
implementations. Add a p2p_supported flag to struct dma_map_ops, and
only if it is true can a caller pass a homogeneous SGL to ops->map_sg.
Only map_sg would be supported for P2P. Upgraded implementations can
call the helper function.

Jason
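
A rough mock of the suggestion in the last paragraph, to show the shape of the check being proposed. Everything here is a userspace stand-in: the structs only borrow the kernel names, p2p_supported is the flag proposed in the mail and not an existing field, and map_sg_checked() and map_all() are invented for the sketch.

```c
#include <stdbool.h>
#include <stddef.h>

/* Mock scatterlist entry: only records whether it points at P2P memory. */
struct scatterlist {
	bool is_p2p;
};

/* Mock of struct dma_map_ops with the proposed flag added. */
struct dma_map_ops {
	bool p2p_supported;	/* proposed flag, hypothetical */
	int (*map_sg)(struct scatterlist *sg, int nents);
};

/*
 * Hypothetical gate in front of ops->map_sg: a list must be all P2P or
 * all regular memory, and an all-P2P list is only passed through when
 * the ops implementation has opted in. Returns the mapped entry count,
 * or -1 when the list is rejected.
 */
static int map_sg_checked(const struct dma_map_ops *ops,
			  struct scatterlist *sg, int nents)
{
	int p2p = 0;

	for (int i = 0; i < nents; i++)
		p2p += sg[i].is_p2p ? 1 : 0;

	if (p2p != 0 && p2p != nents)
		return -1;	/* mixed list: never allowed */
	if (p2p != 0 && !ops->p2p_supported)
		return -1;	/* P2P list, ops not upgraded */

	return ops->map_sg(sg, nents);
}

/* Trivial map_sg for illustration: pretends every entry was mapped. */
static int map_all(struct scatterlist *sg, int nents)
{
	(void)sg;
	return nents;
}
```

With this shape, a driver with custom DMA ops (the HFI case above) simply leaves the flag clear and never sees a P2P list, while regular lists keep working unchanged.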