RE: [PATCH 3/5] cifsd: add file operations
> On Mon, Mar 22, 2021 at 02:13:42PM +0900, Namjae Jeon wrote: > > This adds file operations and buffer pool for cifsd. > > Some random notes: > > > +static void rollback_path_modification(char *filename) { > > + if (filename) { > > + filename--; > > + *filename = '/'; > What an odd way to spell filename[-1] = '/';... Okay. Will fix. > > > +int ksmbd_vfs_inode_permission(struct dentry *dentry, int acc_mode, > > +bool delete) { > > > + if (delete) { > > + struct dentry *parent; > > + > > + parent = dget_parent(dentry); > > + if (!parent) > > + return -EINVAL; > > + > > + if (inode_permission(&init_user_ns, d_inode(parent), MAY_EXEC | > > MAY_WRITE)) { > > + dput(parent); > > + return -EACCES; > > + } > > + dput(parent); > > Who's to guarantee that parent is stable? IOW, by the time of that > inode_permission() call dentry might very well not be a child of that thing... Okay, Will fix. > > > + parent = dget_parent(dentry); > > + if (!parent) > > + return 0; > > + > > + if (!inode_permission(&init_user_ns, d_inode(parent), MAY_EXEC | > > MAY_WRITE)) > > + *daccess |= FILE_DELETE_LE; > > Ditto. Okay. > > > +int ksmbd_vfs_mkdir(struct ksmbd_work *work, > > + const char *name, > > + umode_t mode) > > > > + err = vfs_mkdir(&init_user_ns, d_inode(path.dentry), dentry, mode); > > + if (!err) { > > + ksmbd_vfs_inherit_owner(work, d_inode(path.dentry), > > + d_inode(dentry)); > > ->mkdir() might very well return success, with dentry left unhashed negative. > Look at the callers of vfs_mkdir() to see how it should be handled. Okay, Will fix. > > > +static int check_lock_range(struct file *filp, > > + loff_t start, > > + loff_t end, > > + unsigned char type) > > +{ > > + struct file_lock *flock; > > + struct file_lock_context *ctx = file_inode(filp)->i_flctx; > > + int error = 0; > > + > > + if (!ctx || list_empty_careful(&ctx->flc_posix)) > > + return 0; > > + > > + spin_lock(&ctx->flc_lock); > > + list_for_each_entry(flock, &ctx->flc_posix, fl_list) { > > + /* check conflict locks */ > > + if (flock->fl_end >= start && end >= flock->fl_start) { > > + if (flock->fl_type == F_RDLCK) { > > + if (type == WRITE) { > > + ksmbd_err("not allow write by shared > > lock\n"); > > + error = 1; > > + goto out; > > + } > > + } else if (flock->fl_type == F_WRLCK) { > > + /* check owner in lock */ > > + if (flock->fl_file != filp) { > > + error = 1; > > + ksmbd_err("not allow rw access by > > exclusive lock from other > opens\n"); > > + goto out; > > + } > > + } > > + } > > + } > > +out: > > + spin_unlock(&ctx->flc_lock); > > + return error; > > +} > > WTF is that doing in smbd? This code was added to pass the smb2 lock test of samba's smbtorture. Will fix it. > > > + filp = fp->filp; > > + inode = d_inode(filp->f_path.dentry); > > That should be file_inode(). Try it on overlayfs, watch it do interesting > things... Okay. > > > + nbytes = kernel_read(filp, rbuf, count, pos); > > + if (nbytes < 0) { > > + name = d_path(&filp->f_path, namebuf, sizeof(namebuf)); > > + if (IS_ERR(name)) > > + name = "(error)"; > > + ksmbd_err("smb read failed for (%s), err = %zd\n", > > + name, nbytes); > > Do you really want the full pathname here? For (presumably) spew into syslog? No, Will fix. > > > +int ksmbd_vfs_remove_file(struct ksmbd_work *work, char *name) { > > + struct path parent; > > + struct dentry *dir, *dentry; > > + char *last; > > + int err = -ENOENT; > > + > > + last = extract_last_component(name); > > + if (!last) > > + return -ENOENT; > > Yeccchhh... I guess I change it err instead of -ENOENT. > > > + if (ksmbd_override_fsids(work)) > > + return -ENOMEM; > > + > > + err = kern_path(name, LOOKUP_FOLLOW | LOOKUP_DIRECTORY, &parent); > > + if (err) { > > + ksmbd_debug(VFS, "can't get %s, err %d\n", name, err); > > + ksmbd_revert_fsids(work); > > + rollback_path_modification(last); > > + return err; > > + } > > + > > + dir = parent.dentry; > > + if (!d_inode(dir)) > > + goto out; > > Really? When does that happen? Will fix. > > > +static int __ksmbd_vfs_rename(struct ksmbd_work *work, > > +
[PATCH] clkdev: remove unnecessary __ref annotations
clkdev_alloc and vclkdev_alloc do not call any functions living in .init.text (nor are there comments as to when and why that would be safe). The original annotation dates back to when the code was imported from ARM in 6d803ba736ab, and AFAICT the last reason to have the annotation vanished with ea17c9d868c1 (sh: Use generic clkdev.h header). Signed-off-by: Rasmus Villemoes --- drivers/clk/clkdev.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/drivers/clk/clkdev.c b/drivers/clk/clkdev.c index 0f2e3fcf0f19..d446c2c697d5 100644 --- a/drivers/clk/clkdev.c +++ b/drivers/clk/clkdev.c @@ -153,7 +153,7 @@ struct clk_lookup_alloc { charcon_id[MAX_CON_ID]; }; -static struct clk_lookup * __ref +static struct clk_lookup * vclkdev_alloc(struct clk_hw *hw, const char *con_id, const char *dev_fmt, va_list ap) { @@ -190,7 +190,7 @@ vclkdev_create(struct clk_hw *hw, const char *con_id, const char *dev_fmt, return cl; } -struct clk_lookup * __ref +struct clk_lookup * clkdev_alloc(struct clk *clk, const char *con_id, const char *dev_fmt, ...) { struct clk_lookup *cl; -- 2.29.2
Re: [PATCH 3/5] cifsd: add file operations
On (21/03/22 17:09), Matthew Wilcox wrote: > On Mon, Mar 22, 2021 at 06:03:21PM +0900, Sergey Senozhatsky wrote: > > On (21/03/22 08:15), Matthew Wilcox wrote: > > > > > > What's the scenario for which your allocator performs better than slub > > > > > > > IIRC request and reply buffers can be up to 4M in size. So this stuff > > just allocates a number of fat buffers and keeps them around so that > > it doesn't have to vmalloc(4M) for every request and every response. > > Hang on a minute, this speaks to a deeper design problem. If we're doing > a 'request' or 'reply' that is that large, the I/O should be coming from > or going to the page cache. If it goes via a 4MB virtually contiguous > buffer first, that's a memcpy that could/should be avoided. But huge vmalloc buffers are still needed. For instance, `ls -la` in a directory with a huge number of entries. > But now i'm looking for how ksmbd_find_buffer() is used, and it isn't. > So it looks like someone came to the same conclusion I did, but forgot > to delete the wm code. Yes, I think it's disabled by default and requires some smb.conf configuration. So I guess that wm code can be removed. Especially given that > That said, there are plenty of opportunities to optimise the vmalloc code, > and that's worth pursuing. That would be really interesting to see! > And here's the receive path which contains > the memcpy that should be avoided (ok, it's actually the call to ->read; > we shouldn't be reading in the entire 4MB but rather the header): > + conn->request_buf = ksmbd_alloc_request(size); > + if (!conn->request_buf) > + continue; > + > + memcpy(conn->request_buf, hdr_buf, sizeof(hdr_buf)); > + if (!ksmbd_smb_request(conn)) > + break; > + > + /* > + * We already read 4 bytes to find out PDU size, now > + * read in PDU > + */ > + size = t->ops->read(t, conn->request_buf + 4, pdu_size); // A side note, it seems that the maximum read/write/trans buffer size that // windows supports is 8MB, not 4MB.
Re: [PATCH] xsysace: Remove SYSACE driver
Hi, On Mon, 09 Nov 2020, Michal Simek wrote: Sysace IP is no longer used on Xilinx PowerPC 405/440 and Microblaze systems. The driver is not regularly tested and very likely not working for quite a long time that's why remove it. Is there a reason this patch was never merged? can the driver be removed? I ran into this as a potential tasklet user that can be replaced/removed. Thanks, Davidlohr Signed-off-by: Michal Simek --- Based on discussion https://lore.kernel.org/linux-arm-kernel/5ab9a2a1-20e3-c7b2-f666-2034df436...@kernel.dk/ I have grepped the kernel and found any old ppc platform. I have included it in this patch to have a discussion about it. --- MAINTAINERS |1 - arch/microblaze/boot/dts/system.dts |8 - arch/powerpc/boot/dts/icon.dts |7 - arch/powerpc/configs/44x/icon_defconfig |1 - drivers/block/Kconfig |6 - drivers/block/Makefile |1 - drivers/block/xsysace.c | 1273 --- 7 files changed, 1297 deletions(-) delete mode 100644 drivers/block/xsysace.c diff --git a/MAINTAINERS b/MAINTAINERS index cba8ddf87a08..38556c009758 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -2741,7 +2741,6 @@ T:git https://github.com/Xilinx/linux-xlnx.git F: Documentation/devicetree/bindings/i2c/cdns,i2c-r1p10.yaml F: Documentation/devicetree/bindings/i2c/xlnx,xps-iic-2.00.a.yaml F: arch/arm/mach-zynq/ -F: drivers/block/xsysace.c F: drivers/clocksource/timer-cadence-ttc.c F: drivers/cpuidle/cpuidle-zynq.c F: drivers/edac/synopsys_edac.c diff --git a/arch/microblaze/boot/dts/system.dts b/arch/microblaze/boot/dts/system.dts index 5b236527176e..b7ee1056779e 100644 --- a/arch/microblaze/boot/dts/system.dts +++ b/arch/microblaze/boot/dts/system.dts @@ -310,14 +310,6 @@ RS232_Uart_1: serial@8400 { xlnx,odd-parity = <0x0>; xlnx,use-parity = <0x0>; } ; - SysACE_CompactFlash: sysace@8360 { - compatible = "xlnx,xps-sysace-1.00.a"; - interrupt-parent = <&xps_intc_0>; - interrupts = < 4 2 >; - reg = < 0x8360 0x1 >; - xlnx,family = "virtex5"; - xlnx,mem-width = <0x10>; - } ; debug_module: debug@8440 { compatible = "xlnx,mdm-1.00.d"; reg = < 0x8440 0x1 >; diff --git a/arch/powerpc/boot/dts/icon.dts b/arch/powerpc/boot/dts/icon.dts index fbaa60b8f87a..4fd7a4fbb4fb 100644 --- a/arch/powerpc/boot/dts/icon.dts +++ b/arch/powerpc/boot/dts/icon.dts @@ -197,13 +197,6 @@ partition@fa { reg = <0x00fa 0x0006>; }; }; - - SysACE_CompactFlash: sysace@1,0 { - compatible = "xlnx,sysace"; - interrupt-parent = <&UIC2>; - interrupts = <24 0x4>; - reg = <0x0001 0x 0x1>; - }; }; UART0: serial@f200 { diff --git a/arch/powerpc/configs/44x/icon_defconfig b/arch/powerpc/configs/44x/icon_defconfig index 930948a1da76..fb9a15573546 100644 --- a/arch/powerpc/configs/44x/icon_defconfig +++ b/arch/powerpc/configs/44x/icon_defconfig @@ -28,7 +28,6 @@ CONFIG_MTD_CFI_AMDSTD=y CONFIG_MTD_PHYSMAP_OF=y CONFIG_BLK_DEV_RAM=y CONFIG_BLK_DEV_RAM_SIZE=35000 -CONFIG_XILINX_SYSACE=y CONFIG_SCSI=y CONFIG_BLK_DEV_SD=y CONFIG_SCSI_CONSTANTS=y diff --git a/drivers/block/Kconfig b/drivers/block/Kconfig index ecceaaa1a66f..9cb02861298d 100644 --- a/drivers/block/Kconfig +++ b/drivers/block/Kconfig @@ -388,12 +388,6 @@ config SUNVDC source "drivers/s390/block/Kconfig" -config XILINX_SYSACE - tristate "Xilinx SystemACE support" - depends on 4xx || MICROBLAZE - help - Include support for the Xilinx SystemACE CompactFlash interface - config XEN_BLKDEV_FRONTEND tristate "Xen virtual block device support" depends on XEN diff --git a/drivers/block/Makefile b/drivers/block/Makefile index e1f63117ee94..5ddd9370972a 100644 --- a/drivers/block/Makefile +++ b/drivers/block/Makefile @@ -19,7 +19,6 @@ obj-$(CONFIG_ATARI_FLOPPY)+= ataflop.o obj-$(CONFIG_AMIGA_Z2RAM) += z2ram.o obj-$(CONFIG_BLK_DEV_RAM) += brd.o obj-$(CONFIG_BLK_DEV_LOOP) += loop.o -obj-$(CONFIG_XILINX_SYSACE)+= xsysace.o obj-$(CONFIG_CDROM_PKTCDVD) += pktcdvd.o obj-$(CONFIG_SUNVDC)+= sunvdc.o obj-$(CONFIG_BLK_DEV_SKD) += skd.o diff --git a/drivers/block/xsysace.c b/drivers/block/xsysace.c deleted file mode 100644 index eb8ef65778c3..000
Re: [PATCH -tip v4 08/12] arm: kprobes: Make a space for regs->ARM_pc at kretprobe_trampoline
On Mon, 22 Mar 2021 15:41:18 +0900 Masami Hiramatsu wrote: > Change kretprobe_trampoline to make a space for regs->ARM_pc so that > kretprobe_trampoline_handler can call instruction_pointer_set() > safely. BTW, if kretprobe_trampoline is replaced with the assembly code, I think it should fill all the regs as much as possible, because originally it is written by a software break. Thus the regs->sp should point the stack address at the entry of kretprobe_trampoline, and also regs->lr and regs->pc will be kretprobe_trampoline, so that user handler can access caller stack. Thanks, > > Signed-off-by: Masami Hiramatsu > --- > arch/arm/probes/kprobes/core.c |2 ++ > 1 file changed, 2 insertions(+) > > diff --git a/arch/arm/probes/kprobes/core.c b/arch/arm/probes/kprobes/core.c > index 1782b41df095..5f3c2b42787f 100644 > --- a/arch/arm/probes/kprobes/core.c > +++ b/arch/arm/probes/kprobes/core.c > @@ -397,11 +397,13 @@ int __kprobes kprobe_exceptions_notify(struct > notifier_block *self, > void __naked __kprobes kretprobe_trampoline(void) > { > __asm__ __volatile__ ( > + "subsp, sp, #16 \n\t" > "stmdb sp!, {r0 - r11} \n\t" > "movr0, sp \n\t" > "bl trampoline_handler \n\t" > "movlr, r0 \n\t" > "ldmia sp!, {r0 - r11} \n\t" > + "addsp, sp, #16 \n\t" > #ifdef CONFIG_THUMB2_KERNEL > "bx lr \n\t" > #else > -- Masami Hiramatsu
Re: [PATCH net-next v2 2/2] net: ipa: fix IPA validation
On 3/22/21 5:56 PM, Andrew Lunn wrote: The solution is to create a user space tool inside the drivers/net/ipa directory that will link with the kernel source files and will perform all the basic one-time checks I want to make. Hi Alex Have you found any other driver doing this? Where do they keep there code? Could this be a selftest, put somewhere in tools/testing/selftests. Or can this be a test kernel module. Eg. we have crypt/testmsg.c which runs a number of tests on the crypto subsystem, ./kernel/time/test_udelay.c which runs times on udelay. Rather than inventing something new, please follow other examples already in the kernel. I will. I did see the tools/testing directory and I'll look at how people have done things there. I need to try to get it working first, then I'll figure out where it belongs. I think I'll be able to do a user space test, but it's a little tricky to be sure it's actually testing what I want. If that ends up being too hard, I'll look into kernel test module. Thanks for the suggestions. -Alex Andrew
Re: [RFC 2/7] ath10k: Add support to process rx packet in thread
On 7/22/20 6:00 AM, Felix Fietkau wrote: On 2020-07-22 14:55, Johannes Berg wrote: On Wed, 2020-07-22 at 14:27 +0200, Felix Fietkau wrote: I'm considering testing a different approach (with mt76 initially): - Add a mac80211 rx function that puts processed skbs into a list instead of handing them to the network stack directly. Would this be *after* all the mac80211 processing, i.e. in place of the rx-up-to-stack? Yes, it would run all the rx handlers normally and then put the resulting skbs into a list instead of calling netif_receive_skb or napi_gro_frags. Whatever came of this? I realized I'm running Felix's patch since his mt76 driver needs it. Any chance it will go upstream? Thanks, Ben - Felix -- Ben Greear Candela Technologies Inc http://www.candelatech.com
Re: [selftests] e48d82b67a: BUG_TestSlub_RZ_alloc(Not_tainted):Redzone_overwritten
On 3/17/21 7:53 PM, David Rientjes wrote: > On Wed, 17 Mar 2021, Vlastimil Babka wrote: >> > >> > [ 22.154049] random: get_random_u32 called from >> > __kmem_cache_create+0x23/0x3e0 with crng_init=0 >> > [ 22.154070] random: get_random_u32 called from >> > cache_random_seq_create+0x7c/0x140 with crng_init=0 >> > [ 22.154167] random: get_random_u32 called from >> > allocate_slab+0x155/0x5e0 with crng_init=0 >> > [ 22.154690] test_slub: 1. kmem_cache: Clobber Redzone 0x12->0x(ptrval) >> > [ 22.164499] >> > = >> > [ 22.166629] BUG TestSlub_RZ_alloc (Not tainted): Redzone overwritten >> > [ 22.168179] >> > - >> > [ 22.168179] >> > [ 22.168372] Disabling lock debugging due to kernel taint >> > [ 22.168372] INFO: 0x(ptrval)-0x(ptrval) @offset=1064. First byte 0x12 >> > instead of 0xcc >> > [ 22.168372] INFO: Allocated in resiliency_test+0x47/0x1be age=3 cpu=0 >> > pid=1 >> > [ 22.168372] __slab_alloc+0x57/0x80 >> > [ 22.168372] kmem_cache_alloc (kbuild/src/consumer/mm/slub.c:2871 >> > kbuild/src/consumer/mm/slub.c:2915 kbuild/src/consumer/mm/slub.c:2920) >> > [ 22.168372] resiliency_test (kbuild/src/consumer/lib/test_slub.c:34 >> > kbuild/src/consumer/lib/test_slub.c:107) >> > [ 22.168372] test_slub_init (kbuild/src/consumer/lib/test_slub.c:124) >> > [ 22.168372] do_one_initcall (kbuild/src/consumer/init/main.c:1226) >> > [ 22.168372] kernel_init_freeable (kbuild/src/consumer/init/main.c:1298 >> > kbuild/src/consumer/init/main.c:1315 kbuild/src/consumer/init/main.c:1335 >> > kbuild/src/consumer/init/main.c:1537) >> > [ 22.168372] kernel_init (kbuild/src/consumer/init/main.c:1426) >> > [ 22.168372] ret_from_fork >> > (kbuild/src/consumer/arch/x86/entry/entry_32.S:856) >> > [ 22.168372] INFO: Slab 0x(ptrval) objects=16 used=1 fp=0x(ptrval) >> > flags=0x4201 >> > [ 22.168372] INFO: Object 0x(ptrval) @offset=1000 fp=0x(ptrval) >> > [ 22.168372] >> > [ 22.168372] Redzone (ptrval): cc cc cc cc cc cc cc cc >> > >> > [ 22.168372] Object (ptrval): 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b >> > 6b 6b >> > [ 22.168372] Object (ptrval): 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b >> > 6b a5 kkk. >> > [ 22.168372] Redzone (ptrval): 12 cc cc cc >> > >> > [ 22.168372] Padding (ptrval): 5a 5a 5a 5a 5a 5a 5a 5a >> > >> > [ 22.168372] CPU: 0 PID: 1 Comm: swapper/0 Tainted: GB >> > 5.12.0-rc2-1-ge48d82b67a2b #1 >> > [ 22.168372] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS >> > 1.12.0-1 04/01/2014 >> > [ 22.168372] Call Trace: >> > [ 22.168372] dump_stack (kbuild/src/consumer/lib/dump_stack.c:122) >> > [ 22.168372] print_trailer (kbuild/src/consumer/mm/slub.c:737) >> > [ 22.168372] check_bytes_and_report.cold >> > (kbuild/src/consumer/mm/slub.c:807) >> > [ 22.168372] check_object (kbuild/src/consumer/mm/slub.c:914) >> > [ 22.168372] validate_slab (kbuild/src/consumer/mm/slub.c:4635) >> >> Hm but in this case the output means the tested functionality (slub >> debugging) >> is working as intended. So what can we do? Indicate/teach somehow to the bot >> that this is OK? Does kselftest have some support for this? Or silence the >> validation output for testing purposes? (I would prefer not to) >> > > Unless you're familiar with everything that CONFIG_TEST_SLUB does, maybe > this could be inferred as an actual issue that the test has uncovered that > is unexpected? > > I don't have a good way of silencing the check_bytes_and_report() output > other than a big hammer: implement {disable,enable}_slub_warnings() that > the resiliency test could call into before triggering these checks. So Oliver has implemented this, but now I got a different idea that should be much cleaner IMHO. We could add a per-cache flag SLAB_SILENT_ERRORS (similar to SLAB_RED_ZONE etc) instead of a global bool. The test would just create the caches with this flag. The flag should be added to the SLAB_NEVER_MERGE, SLAB_DEBUG_FLAGS, SLAB_FLAGS_PERMITTED macros as well. A similar suggestion is that adding the errors counter parameter to all validate_slab_cache() and relevant functions is tedious - there are more that had to be modified like this than initially expected. Instead the error counter can be added to SLUB's struct kmem_cache definition, incremented by the various checks and the tests can look at that after validation. Thanks, Vlastimil
Re: [PATCH v2 0/4] drm/bridge: ti-sn65dsi86: Support EDID reading
Hi Stephen, On Mon, Nov 02, 2020 at 05:15:24PM -0800, Stephen Boyd wrote: > Quoting Sam Ravnborg (2020-11-01 09:37:41) > > Hi Stephen. > > > > On Thu, Oct 29, 2020 at 06:17:34PM -0700, Stephen Boyd wrote: > > > This patch series cleans up the DDC code a little bit so that > > > it is more efficient time wise and supports grabbing the EDID > > > of the eDP panel over the aux channel. I timed this on a board > > > I have on my desk and it takes about 20ms to grab the EDID out > > > of the panel and make sure it is valid. > > > > > > The first two patches seem less controversial so I stuck them at > > > the beginning. The third patch does the EDID reading and caches > > > it so we don't have to keep grabbing it over and over again. And > > > finally the last patch updates the reply field so that short > > > reads and nacks over the channel are reflected properly instead of > > > treating them as some sort of error that can't be discerned. > > > > > > Stephen Boyd (4): > > > drm/bridge: ti-sn65dsi86: Combine register accesses in > > > ti_sn_aux_transfer() > > > drm/bridge: ti-sn65dsi86: Make polling a busy loop > > > drm/bridge: ti-sn65dsi86: Read EDID blob over DDC > > > drm/bridge: ti-sn65dsi86: Update reply on aux failures > > > > Series looks good. You can add my a-b on the full series. > > Acked-by: Sam Ravnborg > > > > I can apply after Douglas have had a look at the patches he did not r-b > > yet. > > > > Any chance we can convince you to prepare this bridge driver for use in > > a chained bridge setup where the connector is created by the display > > driver and uses drm_bridge_funcs? > > > > First step wuld be to introduce the use of a panel_bridge. > > Then add get_edid to drm_bridge_funcs and maybe more helpers. > > > > Then natural final step would be to move connector creation to the > > display driver - see how other uses drm_bridge_connector_init() to do so > > - it is relatively simple. > > > > Should be doable - and reach out if you need some help. > > I started to look at this and got stuck at ti_sn_bridge_get_bpp(). Where > can I get the details of the bpc for the downstream bridge or panel? Is > there some function that can tell this bridge what the bpc is for the > attached connector? I've posted a patch series to convert to DRM_BRIDGE_ATTACH_NO_CONNECTOR yesterday (and have CC'ed you), but I've overlooked this particular problem :-S You can't get the connector in the .enable() operation, but you can get it in .atomic_enable(), with drm_atomic_get_new_connector_for_encoder(). This being said, it's probably not the right option. What matters here isn't the bpc for the connector, but the format expected by the next bridge in the chain. drm_bridge_funcs has two operations, .atomic_get_output_bus_fmts() and .atomic_get_input_bus_fmts(), to negotiate that format along a chain of bridges. The panel bridge driver (drivers/gpu/drm/bridge/panel.c) doesn't implement those operations, and neither does display-connector.c, so that may be what we should start with. > I see that td_mode_valid() in > drivers/gpu/drm/bridge/tc358775.c stores away the bpc from the incoming > drm_display_info pointer but I'm not sure that is correct because can't > that be called for various and not necessarily the one we're using? You're right, .mode_valid() shouldn't do that. -- Regards, Laurent Pinchart
Re: [PATCH 3/3] powerpc/mm/hash: Avoid multiple HPT resize-downs on memory hotunplug
On Fri, Mar 12, 2021 at 04:29:41AM -0300, Leonardo Bras wrote: > During memory hotunplug, after each LMB is removed, the HPT may be > resized-down if it would map a max of 4 times the current amount of memory. > (2 shifts, due to introduced histeresis) > > It usually is not an issue, but it can take a lot of time if HPT > resizing-down fails. This happens because resize-down failures > usually repeat at each LMB removal, until there are no more bolted entries > conflict, which can take a while to happen. > > This can be solved by doing a single HPT resize at the end of memory > hotunplug, after all requested entries are removed. > > To make this happen, it's necessary to temporarily disable all HPT > resize-downs before hotunplug, re-enable them after hotunplug ends, > and then resize-down HPT to the current memory size. > > As an example, hotunplugging 256GB from a 385GB guest took 621s without > this patch, and 100s after applied. > > Signed-off-by: Leonardo Bras > --- > arch/powerpc/include/asm/book3s/64/hash.h | 2 ++ > arch/powerpc/include/asm/sparsemem.h | 2 ++ > arch/powerpc/mm/book3s64/hash_utils.c | 28 +++ > arch/powerpc/mm/book3s64/pgtable.c| 12 > .../platforms/pseries/hotplug-memory.c| 16 +++ > 5 files changed, 60 insertions(+) > > diff --git a/arch/powerpc/include/asm/book3s/64/hash.h > b/arch/powerpc/include/asm/book3s/64/hash.h > index 843b0a178590..f92697c107f7 100644 > --- a/arch/powerpc/include/asm/book3s/64/hash.h > +++ b/arch/powerpc/include/asm/book3s/64/hash.h > @@ -256,6 +256,8 @@ int hash__create_section_mapping(unsigned long start, > unsigned long end, > int hash__remove_section_mapping(unsigned long start, unsigned long end); > > void hash_memory_batch_expand_prepare(unsigned long newsize); > +void hash_memory_batch_shrink_begin(void); > +void hash_memory_batch_shrink_end(void); > > #endif /* !__ASSEMBLY__ */ > #endif /* __KERNEL__ */ > diff --git a/arch/powerpc/include/asm/sparsemem.h > b/arch/powerpc/include/asm/sparsemem.h > index 16b5f5300c84..a7a8a0d070fc 100644 > --- a/arch/powerpc/include/asm/sparsemem.h > +++ b/arch/powerpc/include/asm/sparsemem.h > @@ -18,6 +18,8 @@ extern int memory_add_physaddr_to_nid(u64 start); > #define memory_add_physaddr_to_nid memory_add_physaddr_to_nid > > void memory_batch_expand_prepare(unsigned long newsize); > +void memory_batch_shrink_begin(void); > +void memory_batch_shrink_end(void); > > #ifdef CONFIG_NUMA > extern int hot_add_scn_to_nid(unsigned long scn_addr); > diff --git a/arch/powerpc/mm/book3s64/hash_utils.c > b/arch/powerpc/mm/book3s64/hash_utils.c > index 1f6aa0bf27e7..e16f207de8e4 100644 > --- a/arch/powerpc/mm/book3s64/hash_utils.c > +++ b/arch/powerpc/mm/book3s64/hash_utils.c > @@ -794,6 +794,9 @@ static unsigned long __init htab_get_table_size(void) > } > > #ifdef CONFIG_MEMORY_HOTPLUG > + > +atomic_t hpt_resize_disable = ATOMIC_INIT(0); > + > static int resize_hpt_for_hotplug(unsigned long new_mem_size, bool shrinking) > { > unsigned target_hpt_shift; > @@ -805,6 +808,10 @@ static int resize_hpt_for_hotplug(unsigned long > new_mem_size, bool shrinking) > > if (shrinking) { > > + /* When batch removing entries, only resizes HPT at the end. */ > + if (atomic_read_acquire(&hpt_resize_disable)) > + return 0; > + I'm not quite convinced by this locking. Couldn't hpt_resize_disable be set after this point, but while you're still inside resize_hpt_for_hotplug()? Probably better to use an explicit mutex (and mutex_trylock()) to make the critical sections clearer. Except... do we even need the fancy mechanics to suppress the resizes in one place to do them elswhere. Couldn't we just replace the existing resize calls with the batched ones? > /* >* To avoid lots of HPT resizes if memory size is fluctuating >* across a boundary, we deliberately have some hysterisis > @@ -872,6 +879,27 @@ void hash_memory_batch_expand_prepare(unsigned long > newsize) > pr_warn("Hash collision while resizing HPT\n"); > } > } > + > +void hash_memory_batch_shrink_begin(void) > +{ > + /* Disable HPT resize-down during hot-unplug */ > + atomic_set_release(&hpt_resize_disable, 1); > +} > + > +void hash_memory_batch_shrink_end(void) > +{ > + unsigned long newsize; > + > + /* Re-enables HPT resize-down after hot-unplug */ > + atomic_set_release(&hpt_resize_disable, 0); > + > + newsize = memblock_phys_mem_size(); > + /* Resize to smallest SHIFT possible */ > + while (resize_hpt_for_hotplug(newsize, true) == -ENOSPC) { > + newsize *= 2; As noted earlier, doing this without an explicit cap on the new hpt size (of the existing size) this makes me nervous. Less so, but doing the calculations on memory size, rather than explictly on HPT size / HPT order also seems kinda clunky. >
Re: [PATCH -tip v4 12/12] tracing: Show kretprobe unknown indicator only for kretprobe_trampoline
Hi Steve, On Mon, 22 Mar 2021 11:11:42 -0400 Steven Rostedt wrote: > On Mon, 22 Mar 2021 15:42:02 +0900 > Masami Hiramatsu wrote: > > > ftrace shows "[unknown/kretprobe'd]" indicator all addresses in the > > kretprobe_trampoline, but the modified address by kretprobe should > > be only kretprobe_trampoline+0. > > > > Signed-off-by: Masami Hiramatsu > > Acked-by: Steven Rostedt (VMware) Thank you for the Ack! > > -- Steve > -- Masami Hiramatsu
Re: [PATCH] mm, thp: Relax the VM_DENYWRITE constraint on file-backed THPs
On Mon, Mar 22, 2021 at 3:00 PM Collin Fijalkovich wrote: > > Transparent huge pages are supported for read-only non-shmem filesystems, > but are only used for vmas with VM_DENYWRITE. This condition ensures that > file THPs are protected from writes while an application is running > (ETXTBSY). Any existing file THPs are then dropped from the page cache > when a file is opened for write in do_dentry_open(). Since sys_mmap > ignores MAP_DENYWRITE, this constrains the use of file THPs to vmas > produced by execve(). > > Systems that make heavy use of shared libraries (e.g. Android) are unable > to apply VM_DENYWRITE through the dynamic linker, preventing them from > benefiting from the resultant reduced contention on the TLB. > > This patch reduces the constraint on file THPs allowing use with any > executable mapping from a file not opened for write (see > inode_is_open_for_write()). It also introduces additional conditions to > ensure that files opened for write will never be backed by file THPs. Thanks for working on this. We could also use this in many data center workloads. Question: when we use this on shared library, the library is still writable. When the shared library is opened for write, these pages will refault in as 4kB pages, right? Thanks, Song
linux-next: manual merge of the v4l-dvb tree with the hwmon-staging tree
Hi all, Today's linux-next merge of the v4l-dvb tree got a conflict in: MAINTAINERS between commit: 38f15506d965 ("hwmon: add driver for NZXT Kraken X42/X52/X62/X72") from the hwmon-staging tree and commit: be157db0a3d8 ("media: Add maintainer for IMX jpeg v4l2 driver") from the v4l-dvb tree. I fixed it up (see below) and can carry the fix as necessary. This is now fixed as far as linux-next is concerned, but any non trivial conflicts should be mentioned to your upstream maintainer when your tree is submitted for merging. You may also want to consider cooperating with the maintainer of the conflicting tree to minimise any particularly complex conflicts. -- Cheers, Stephen Rothwell diff --cc MAINTAINERS index dda782e9e578,04e6df934eb0.. --- a/MAINTAINERS +++ b/MAINTAINERS @@@ -12941,13 -12910,14 +12941,21 @@@ L:linux-...@lists.01.org (moderated fo S:Supported F:drivers/nfc/nxp-nci + NXP i.MX 8QXP/8QM JPEG V4L2 DRIVER + M:Mirela Rabulea + R:NXP Linux Team + L:linux-me...@vger.kernel.org + S:Maintained + F:Documentation/devicetree/bindings/media/imx8-jpeg.yaml + F:drivers/media/platform/imx-jpeg + +NZXT-KRAKEN2 HARDWARE MONITORING DRIVER +M:Jonas Malaco +L:linux-hw...@vger.kernel.org +S:Maintained +F:Documentation/hwmon/nzxt-kraken2.rst +F:drivers/hwmon/nzxt-kraken2.c + OBJAGG M:Jiri Pirko L:net...@vger.kernel.org pgpHSyPNfPDwt.pgp Description: OpenPGP digital signature
[PATCH v4 net-next 11/11] net: ocelot: replay switchdev events when joining bridge
From: Vladimir Oltean The premise of this change is that the switchdev port attributes and objects offloaded by ocelot might have been missed when we are joining an already existing bridge port, such as a bonding interface. The patch pulls these switchdev attributes and objects from the bridge, on behalf of the 'bridge port' net device which might be either the ocelot switch interface, or the bonding upper interface. The ocelot_net.c belongs strictly to the switchdev ocelot driver, while ocelot.c is part of a library shared with the DSA felix driver. The ocelot_port_bridge_leave function (part of the common library) used to call ocelot_port_vlan_filtering(false), something which is not necessary for DSA, since the framework deals with that already there. So we move this function to ocelot_switchdev_unsync, which is specific to the switchdev driver. The code movement described above makes ocelot_port_bridge_leave no longer return an error code, so we change its type from int to void. Signed-off-by: Vladimir Oltean --- drivers/net/dsa/ocelot/felix.c | 4 +- drivers/net/ethernet/mscc/Kconfig | 3 +- drivers/net/ethernet/mscc/ocelot.c | 18 ++-- drivers/net/ethernet/mscc/ocelot_net.c | 117 + include/soc/mscc/ocelot.h | 6 +- 5 files changed, 113 insertions(+), 35 deletions(-) diff --git a/drivers/net/dsa/ocelot/felix.c b/drivers/net/dsa/ocelot/felix.c index 628afb47b579..6b5442be0230 100644 --- a/drivers/net/dsa/ocelot/felix.c +++ b/drivers/net/dsa/ocelot/felix.c @@ -719,7 +719,9 @@ static int felix_bridge_join(struct dsa_switch *ds, int port, { struct ocelot *ocelot = ds->priv; - return ocelot_port_bridge_join(ocelot, port, br); + ocelot_port_bridge_join(ocelot, port, br); + + return 0; } static void felix_bridge_leave(struct dsa_switch *ds, int port, diff --git a/drivers/net/ethernet/mscc/Kconfig b/drivers/net/ethernet/mscc/Kconfig index 05cb040c2677..2d3157e4d081 100644 --- a/drivers/net/ethernet/mscc/Kconfig +++ b/drivers/net/ethernet/mscc/Kconfig @@ -11,7 +11,7 @@ config NET_VENDOR_MICROSEMI if NET_VENDOR_MICROSEMI -# Users should depend on NET_SWITCHDEV, HAS_IOMEM +# Users should depend on NET_SWITCHDEV, HAS_IOMEM, BRIDGE config MSCC_OCELOT_SWITCH_LIB select NET_DEVLINK select REGMAP_MMIO @@ -24,6 +24,7 @@ config MSCC_OCELOT_SWITCH_LIB config MSCC_OCELOT_SWITCH tristate "Ocelot switch driver" + depends on BRIDGE || BRIDGE=n depends on NET_SWITCHDEV depends on HAS_IOMEM depends on OF_NET diff --git a/drivers/net/ethernet/mscc/ocelot.c b/drivers/net/ethernet/mscc/ocelot.c index ce57929ba3d1..1a36b416fd9b 100644 --- a/drivers/net/ethernet/mscc/ocelot.c +++ b/drivers/net/ethernet/mscc/ocelot.c @@ -1514,34 +1514,28 @@ int ocelot_port_mdb_del(struct ocelot *ocelot, int port, } EXPORT_SYMBOL(ocelot_port_mdb_del); -int ocelot_port_bridge_join(struct ocelot *ocelot, int port, - struct net_device *bridge) +void ocelot_port_bridge_join(struct ocelot *ocelot, int port, +struct net_device *bridge) { struct ocelot_port *ocelot_port = ocelot->ports[port]; ocelot_port->bridge = bridge; - return 0; + ocelot_apply_bridge_fwd_mask(ocelot); } EXPORT_SYMBOL(ocelot_port_bridge_join); -int ocelot_port_bridge_leave(struct ocelot *ocelot, int port, -struct net_device *bridge) +void ocelot_port_bridge_leave(struct ocelot *ocelot, int port, + struct net_device *bridge) { struct ocelot_port *ocelot_port = ocelot->ports[port]; struct ocelot_vlan pvid = {0}, native_vlan = {0}; - int ret; ocelot_port->bridge = NULL; - ret = ocelot_port_vlan_filtering(ocelot, port, false); - if (ret) - return ret; - ocelot_port_set_pvid(ocelot, port, pvid); ocelot_port_set_native_vlan(ocelot, port, native_vlan); - - return 0; + ocelot_apply_bridge_fwd_mask(ocelot); } EXPORT_SYMBOL(ocelot_port_bridge_leave); diff --git a/drivers/net/ethernet/mscc/ocelot_net.c b/drivers/net/ethernet/mscc/ocelot_net.c index d1376f7b34fd..36f32a4d9b0f 100644 --- a/drivers/net/ethernet/mscc/ocelot_net.c +++ b/drivers/net/ethernet/mscc/ocelot_net.c @@ -1117,47 +1117,126 @@ static int ocelot_port_obj_del(struct net_device *dev, return ret; } +static void ocelot_inherit_brport_flags(struct ocelot *ocelot, int port, + struct net_device *brport_dev) +{ + struct switchdev_brport_flags flags = {0}; + int flag; + + flags.mask = BR_LEARNING | BR_FLOOD | BR_MCAST_FLOOD | BR_BCAST_FLOOD; + + for_each_set_bit(flag, &flags.mask, 32) + if (br_port_flag_is_set(brport_dev, BIT(flag))) + flags.val |= BIT(flag); + + ocelot_port_bridge_flags(ocelot, port, flags); +} + +sta
[PATCH v4 net-next 07/11] net: dsa: pass extack to dsa_port_{bridge,lag}_join
From: Vladimir Oltean This is a pretty noisy change that was broken out of the larger change for replaying switchdev attributes and objects at bridge join time, which is when these extack objects are actually used. Signed-off-by: Vladimir Oltean Reviewed-by: Florian Fainelli Reviewed-by: Tobias Waldekranz --- net/dsa/dsa_priv.h | 6 -- net/dsa/port.c | 8 +--- net/dsa/slave.c| 7 +-- 3 files changed, 14 insertions(+), 7 deletions(-) diff --git a/net/dsa/dsa_priv.h b/net/dsa/dsa_priv.h index 4c43c5406834..b8778c5d8529 100644 --- a/net/dsa/dsa_priv.h +++ b/net/dsa/dsa_priv.h @@ -181,12 +181,14 @@ int dsa_port_enable_rt(struct dsa_port *dp, struct phy_device *phy); int dsa_port_enable(struct dsa_port *dp, struct phy_device *phy); void dsa_port_disable_rt(struct dsa_port *dp); void dsa_port_disable(struct dsa_port *dp); -int dsa_port_bridge_join(struct dsa_port *dp, struct net_device *br); +int dsa_port_bridge_join(struct dsa_port *dp, struct net_device *br, +struct netlink_ext_ack *extack); void dsa_port_bridge_leave(struct dsa_port *dp, struct net_device *br); int dsa_port_lag_change(struct dsa_port *dp, struct netdev_lag_lower_state_info *linfo); int dsa_port_lag_join(struct dsa_port *dp, struct net_device *lag_dev, - struct netdev_lag_upper_info *uinfo); + struct netdev_lag_upper_info *uinfo, + struct netlink_ext_ack *extack); void dsa_port_lag_leave(struct dsa_port *dp, struct net_device *lag_dev); int dsa_port_vlan_filtering(struct dsa_port *dp, bool vlan_filtering, struct netlink_ext_ack *extack); diff --git a/net/dsa/port.c b/net/dsa/port.c index d39262a9fe0e..fcbe5b1545b8 100644 --- a/net/dsa/port.c +++ b/net/dsa/port.c @@ -144,7 +144,8 @@ static void dsa_port_change_brport_flags(struct dsa_port *dp, } } -int dsa_port_bridge_join(struct dsa_port *dp, struct net_device *br) +int dsa_port_bridge_join(struct dsa_port *dp, struct net_device *br, +struct netlink_ext_ack *extack) { struct dsa_notifier_bridge_info info = { .tree_index = dp->ds->dst->index, @@ -241,7 +242,8 @@ int dsa_port_lag_change(struct dsa_port *dp, } int dsa_port_lag_join(struct dsa_port *dp, struct net_device *lag, - struct netdev_lag_upper_info *uinfo) + struct netdev_lag_upper_info *uinfo, + struct netlink_ext_ack *extack) { struct dsa_notifier_lag_info info = { .sw_index = dp->ds->index, @@ -263,7 +265,7 @@ int dsa_port_lag_join(struct dsa_port *dp, struct net_device *lag, if (!bridge_dev || !netif_is_bridge_master(bridge_dev)) return 0; - err = dsa_port_bridge_join(dp, bridge_dev); + err = dsa_port_bridge_join(dp, bridge_dev, extack); if (err) goto err_bridge_join; diff --git a/net/dsa/slave.c b/net/dsa/slave.c index 992fcab4b552..1ff48be476bb 100644 --- a/net/dsa/slave.c +++ b/net/dsa/slave.c @@ -1976,11 +1976,14 @@ static int dsa_slave_changeupper(struct net_device *dev, struct netdev_notifier_changeupper_info *info) { struct dsa_port *dp = dsa_slave_to_port(dev); + struct netlink_ext_ack *extack; int err = NOTIFY_DONE; + extack = netdev_notifier_info_to_extack(&info->info); + if (netif_is_bridge_master(info->upper_dev)) { if (info->linking) { - err = dsa_port_bridge_join(dp, info->upper_dev); + err = dsa_port_bridge_join(dp, info->upper_dev, extack); if (!err) dsa_bridge_mtu_normalization(dp); err = notifier_from_errno(err); @@ -1991,7 +1994,7 @@ static int dsa_slave_changeupper(struct net_device *dev, } else if (netif_is_lag_master(info->upper_dev)) { if (info->linking) { err = dsa_port_lag_join(dp, info->upper_dev, - info->upper_info); + info->upper_info, extack); if (err == -EOPNOTSUPP) { NL_SET_ERR_MSG_MOD(info->info.extack, "Offloading not supported"); -- 2.25.1
[PATCH v4 net-next 10/11] net: ocelot: call ocelot_netdevice_bridge_join when joining a bridged LAG
From: Vladimir Oltean Similar to the DSA situation, ocelot supports LAG offload but treats this scenario improperly: ip link add br0 type bridge ip link add bond0 type bond ip link set bond0 master br0 ip link set swp0 master bond0 We do the same thing as we do there, which is to simulate a 'bridge join' on 'lag join', if we detect that the bonding upper has a bridge upper. Again, same as DSA, ocelot supports software fallback for LAG, and in that case, we should avoid calling ocelot_netdevice_changeupper. Signed-off-by: Vladimir Oltean --- drivers/net/ethernet/mscc/ocelot_net.c | 111 +++-- 1 file changed, 86 insertions(+), 25 deletions(-) diff --git a/drivers/net/ethernet/mscc/ocelot_net.c b/drivers/net/ethernet/mscc/ocelot_net.c index c08164cd88f4..d1376f7b34fd 100644 --- a/drivers/net/ethernet/mscc/ocelot_net.c +++ b/drivers/net/ethernet/mscc/ocelot_net.c @@ -1117,10 +1117,15 @@ static int ocelot_port_obj_del(struct net_device *dev, return ret; } -static int ocelot_netdevice_bridge_join(struct ocelot *ocelot, int port, - struct net_device *bridge) +static int ocelot_netdevice_bridge_join(struct net_device *dev, + struct net_device *bridge, + struct netlink_ext_ack *extack) { + struct ocelot_port_private *priv = netdev_priv(dev); + struct ocelot_port *ocelot_port = &priv->port; + struct ocelot *ocelot = ocelot_port->ocelot; struct switchdev_brport_flags flags; + int port = priv->chip_port; int err; flags.mask = BR_LEARNING | BR_FLOOD | BR_MCAST_FLOOD | BR_BCAST_FLOOD; @@ -1135,10 +1140,14 @@ static int ocelot_netdevice_bridge_join(struct ocelot *ocelot, int port, return 0; } -static int ocelot_netdevice_bridge_leave(struct ocelot *ocelot, int port, +static int ocelot_netdevice_bridge_leave(struct net_device *dev, struct net_device *bridge) { + struct ocelot_port_private *priv = netdev_priv(dev); + struct ocelot_port *ocelot_port = &priv->port; + struct ocelot *ocelot = ocelot_port->ocelot; struct switchdev_brport_flags flags; + int port = priv->chip_port; int err; flags.mask = BR_LEARNING | BR_FLOOD | BR_MCAST_FLOOD | BR_BCAST_FLOOD; @@ -1151,43 +1160,89 @@ static int ocelot_netdevice_bridge_leave(struct ocelot *ocelot, int port, return err; } -static int ocelot_netdevice_changeupper(struct net_device *dev, - struct netdev_notifier_changeupper_info *info) +static int ocelot_netdevice_lag_join(struct net_device *dev, +struct net_device *bond, +struct netdev_lag_upper_info *info, +struct netlink_ext_ack *extack) { struct ocelot_port_private *priv = netdev_priv(dev); struct ocelot_port *ocelot_port = &priv->port; struct ocelot *ocelot = ocelot_port->ocelot; + struct net_device *bridge_dev; int port = priv->chip_port; + int err; + + err = ocelot_port_lag_join(ocelot, port, bond, info); + if (err == -EOPNOTSUPP) { + NL_SET_ERR_MSG_MOD(extack, "Offloading not supported"); + return 0; + } + + bridge_dev = netdev_master_upper_dev_get(bond); + if (!bridge_dev || !netif_is_bridge_master(bridge_dev)) + return 0; + + err = ocelot_netdevice_bridge_join(dev, bridge_dev, extack); + if (err) + goto err_bridge_join; + + return 0; + +err_bridge_join: + ocelot_port_lag_leave(ocelot, port, bond); + return err; +} + +static int ocelot_netdevice_lag_leave(struct net_device *dev, + struct net_device *bond) +{ + struct ocelot_port_private *priv = netdev_priv(dev); + struct ocelot_port *ocelot_port = &priv->port; + struct ocelot *ocelot = ocelot_port->ocelot; + struct net_device *bridge_dev; + int port = priv->chip_port; + + ocelot_port_lag_leave(ocelot, port, bond); + + bridge_dev = netdev_master_upper_dev_get(bond); + if (!bridge_dev || !netif_is_bridge_master(bridge_dev)) + return 0; + + return ocelot_netdevice_bridge_leave(dev, bridge_dev); +} + +static int ocelot_netdevice_changeupper(struct net_device *dev, + struct netdev_notifier_changeupper_info *info) +{ + struct netlink_ext_ack *extack; int err = 0; + extack = netdev_notifier_info_to_extack(&info->info); + if (netif_is_bridge_master(info->upper_dev)) { - if (info->linking) { - err = ocelot_netdevice_bridge_join(ocelot, port, - info->upper_dev); - } else { -
[PATCH v4 net-next 09/11] net: dsa: sync up switchdev objects and port attributes when joining the bridge
From: Vladimir Oltean If we join an already-created bridge port, such as a bond master interface, then we can miss the initial switchdev notifications emitted by the bridge for this port, while it wasn't offloaded by anybody. Signed-off-by: Vladimir Oltean --- net/dsa/dsa_priv.h | 3 +++ net/dsa/port.c | 46 ++ net/dsa/slave.c| 4 ++-- 3 files changed, 51 insertions(+), 2 deletions(-) diff --git a/net/dsa/dsa_priv.h b/net/dsa/dsa_priv.h index b8778c5d8529..92282de54230 100644 --- a/net/dsa/dsa_priv.h +++ b/net/dsa/dsa_priv.h @@ -262,6 +262,9 @@ static inline bool dsa_tree_offloads_bridge_port(struct dsa_switch_tree *dst, /* slave.c */ extern const struct dsa_device_ops notag_netdev_ops; +extern struct notifier_block dsa_slave_switchdev_notifier; +extern struct notifier_block dsa_slave_switchdev_blocking_notifier; + void dsa_slave_mii_bus_init(struct dsa_switch *ds); int dsa_slave_create(struct dsa_port *dp); void dsa_slave_destroy(struct net_device *slave_dev); diff --git a/net/dsa/port.c b/net/dsa/port.c index c712bf3da0a0..01e30264b25b 100644 --- a/net/dsa/port.c +++ b/net/dsa/port.c @@ -170,12 +170,46 @@ static void dsa_port_clear_brport_flags(struct dsa_port *dp) static int dsa_port_switchdev_sync(struct dsa_port *dp, struct netlink_ext_ack *extack) { + struct net_device *brport_dev = dsa_port_to_bridge_port(dp); + struct net_device *br = dp->bridge_dev; int err; err = dsa_port_inherit_brport_flags(dp, extack); if (err) return err; + err = dsa_port_set_state(dp, br_port_get_stp_state(brport_dev)); + if (err && err != -EOPNOTSUPP) + return err; + + err = dsa_port_vlan_filtering(dp, br_vlan_enabled(br), extack); + if (err && err != -EOPNOTSUPP) + return err; + + err = dsa_port_mrouter(dp->cpu_dp, br_multicast_router(br), extack); + if (err && err != -EOPNOTSUPP) + return err; + + err = dsa_port_ageing_time(dp, br_get_ageing_time(br)); + if (err && err != -EOPNOTSUPP) + return err; + + err = br_mdb_replay(br, brport_dev, + &dsa_slave_switchdev_blocking_notifier, + extack); + if (err && err != -EOPNOTSUPP) + return err; + + err = br_fdb_replay(br, brport_dev, &dsa_slave_switchdev_notifier); + if (err && err != -EOPNOTSUPP) + return err; + + err = br_vlan_replay(br, brport_dev, +&dsa_slave_switchdev_blocking_notifier, +extack); + if (err && err != -EOPNOTSUPP) + return err; + return 0; } @@ -198,6 +232,18 @@ static void dsa_port_switchdev_unsync(struct dsa_port *dp) * so allow it to be in BR_STATE_FORWARDING to be kept functional */ dsa_port_set_state_now(dp, BR_STATE_FORWARDING); + + /* VLAN filtering is handled by dsa_switch_bridge_leave */ + + /* Some drivers treat the notification for having a local multicast +* router by allowing multicast to be flooded to the CPU, so we should +* allow this in standalone mode too. +*/ + dsa_port_mrouter(dp->cpu_dp, true, NULL); + + /* Ageing time may be global to the switch chip, so don't change it +* here because we have no good reason (or value) to change it to. +*/ } int dsa_port_bridge_join(struct dsa_port *dp, struct net_device *br, diff --git a/net/dsa/slave.c b/net/dsa/slave.c index 1ff48be476bb..c51e52418a62 100644 --- a/net/dsa/slave.c +++ b/net/dsa/slave.c @@ -2392,11 +2392,11 @@ static struct notifier_block dsa_slave_nb __read_mostly = { .notifier_call = dsa_slave_netdevice_event, }; -static struct notifier_block dsa_slave_switchdev_notifier = { +struct notifier_block dsa_slave_switchdev_notifier = { .notifier_call = dsa_slave_switchdev_event, }; -static struct notifier_block dsa_slave_switchdev_blocking_notifier = { +struct notifier_block dsa_slave_switchdev_blocking_notifier = { .notifier_call = dsa_slave_switchdev_blocking_event, }; -- 2.25.1
[PATCH v4 net-next 08/11] net: dsa: inherit the actual bridge port flags at join time
From: Vladimir Oltean DSA currently assumes that the bridge port starts off with this constellation of bridge port flags: - learning on - unicast flooding on - multicast flooding on - broadcast flooding on just by virtue of code copy-pasta from the bridge layer (new_nbp). This was a simple enough strategy thus far, because the 'bridge join' moment always coincided with the 'bridge port creation' moment. But with sandwiched interfaces, such as: br0 | bond0 | swp0 it may happen that the user has had time to change the bridge port flags of bond0 before enslaving swp0 to it. In that case, swp0 will falsely assume that the bridge port flags are those determined by new_nbp, when in fact this can happen: ip link add br0 type bridge ip link add bond0 type bond ip link set bond0 master br0 ip link set bond0 type bridge_slave learning off ip link set swp0 master br0 Now swp0 has learning enabled, bond0 has learning disabled. Not nice. Fix this by "dumpster diving" through the actual bridge port flags with br_port_flag_is_set, at bridge join time. We use this opportunity to split dsa_port_change_brport_flags into two distinct functions called dsa_port_inherit_brport_flags and dsa_port_clear_brport_flags, now that the implementation for the two cases is no longer similar. This patch also creates two functions called dsa_port_switchdev_sync and dsa_port_switchdev_unsync which collect what we have so far, even if that's asymmetrical. More is going to be added in the next patch. Signed-off-by: Vladimir Oltean --- net/dsa/port.c | 123 - 1 file changed, 82 insertions(+), 41 deletions(-) diff --git a/net/dsa/port.c b/net/dsa/port.c index fcbe5b1545b8..c712bf3da0a0 100644 --- a/net/dsa/port.c +++ b/net/dsa/port.c @@ -122,28 +122,84 @@ void dsa_port_disable(struct dsa_port *dp) rtnl_unlock(); } -static void dsa_port_change_brport_flags(struct dsa_port *dp, -bool bridge_offload) +static int dsa_port_inherit_brport_flags(struct dsa_port *dp, +struct netlink_ext_ack *extack) { - struct switchdev_brport_flags flags; - int flag; + const unsigned long mask = BR_LEARNING | BR_FLOOD | BR_MCAST_FLOOD | + BR_BCAST_FLOOD; + struct net_device *brport_dev = dsa_port_to_bridge_port(dp); + int flag, err; - flags.mask = BR_LEARNING | BR_FLOOD | BR_MCAST_FLOOD | BR_BCAST_FLOOD; - if (bridge_offload) - flags.val = flags.mask; - else - flags.val = flags.mask & ~BR_LEARNING; + for_each_set_bit(flag, &mask, 32) { + struct switchdev_brport_flags flags = {0}; + + flags.mask = BIT(flag); - for_each_set_bit(flag, &flags.mask, 32) { - struct switchdev_brport_flags tmp; + if (br_port_flag_is_set(brport_dev, BIT(flag))) + flags.val = BIT(flag); + + err = dsa_port_bridge_flags(dp, flags, extack); + if (err && err != -EOPNOTSUPP) + return err; + } - tmp.val = flags.val & BIT(flag); - tmp.mask = BIT(flag); + return 0; +} - dsa_port_bridge_flags(dp, tmp, NULL); +static void dsa_port_clear_brport_flags(struct dsa_port *dp) +{ + const unsigned long val = BR_FLOOD | BR_MCAST_FLOOD | BR_BCAST_FLOOD; + const unsigned long mask = BR_LEARNING | BR_FLOOD | BR_MCAST_FLOOD | + BR_BCAST_FLOOD; + int flag, err; + + for_each_set_bit(flag, &mask, 32) { + struct switchdev_brport_flags flags = {0}; + + flags.mask = BIT(flag); + flags.val = val & BIT(flag); + + err = dsa_port_bridge_flags(dp, flags, NULL); + if (err && err != -EOPNOTSUPP) + dev_err(dp->ds->dev, + "failed to clear bridge port flag %lu: %pe\n", + flags.val, ERR_PTR(err)); } } +static int dsa_port_switchdev_sync(struct dsa_port *dp, + struct netlink_ext_ack *extack) +{ + int err; + + err = dsa_port_inherit_brport_flags(dp, extack); + if (err) + return err; + + return 0; +} + +static void dsa_port_switchdev_unsync(struct dsa_port *dp) +{ + /* Configure the port for standalone mode (no address learning, +* flood everything). +* The bridge only emits SWITCHDEV_ATTR_ID_PORT_BRIDGE_FLAGS events +* when the user requests it through netlink or sysfs, but not +* automatically at port join or leave, so we need to handle resetting +* the brport flags ourselves. But we even prefer it that way, because +* otherwise, some setups might never get the notification they need, +* for example, when a port leaves a LAG
[PATCH v4 net-next 05/11] net: bridge: add helper to replay VLANs installed on port
From: Vladimir Oltean Currently this simple setup with DSA: ip link add br0 type bridge vlan_filtering 1 ip link add bond0 type bond ip link set bond0 master br0 ip link set swp0 master bond0 will not work because the bridge has created the PVID in br_add_if -> nbp_vlan_init, and it has notified switchdev of the existence of VLAN 1, but that was too early, since swp0 was not yet a lower of bond0, so it had no reason to act upon that notification. We need a helper in the bridge to replay the switchdev VLAN objects that were notified since the bridge port creation, because some of them may have been missed. As opposed to the br_mdb_replay function, the vg->vlan_list write side protection is offered by the rtnl_mutex which is sleepable, so we don't need to queue up the objects in atomic context, we can replay them right away. Signed-off-by: Vladimir Oltean --- include/linux/if_bridge.h | 10 ++ net/bridge/br_vlan.c | 73 +++ 2 files changed, 83 insertions(+) diff --git a/include/linux/if_bridge.h b/include/linux/if_bridge.h index b564c4486a45..2cc35038a8ca 100644 --- a/include/linux/if_bridge.h +++ b/include/linux/if_bridge.h @@ -111,6 +111,8 @@ int br_vlan_get_pvid_rcu(const struct net_device *dev, u16 *p_pvid); int br_vlan_get_proto(const struct net_device *dev, u16 *p_proto); int br_vlan_get_info(const struct net_device *dev, u16 vid, struct bridge_vlan_info *p_vinfo); +int br_vlan_replay(struct net_device *br_dev, struct net_device *dev, + struct notifier_block *nb, struct netlink_ext_ack *extack); #else static inline bool br_vlan_enabled(const struct net_device *dev) { @@ -137,6 +139,14 @@ static inline int br_vlan_get_info(const struct net_device *dev, u16 vid, { return -EINVAL; } + +static inline int br_vlan_replay(struct net_device *br_dev, +struct net_device *dev, +struct notifier_block *nb, +struct netlink_ext_ack *extack) +{ + return -EOPNOTSUPP; +} #endif #if IS_ENABLED(CONFIG_BRIDGE) diff --git a/net/bridge/br_vlan.c b/net/bridge/br_vlan.c index 8829f621b8ec..ca8daccff217 100644 --- a/net/bridge/br_vlan.c +++ b/net/bridge/br_vlan.c @@ -1751,6 +1751,79 @@ void br_vlan_notify(const struct net_bridge *br, kfree_skb(skb); } +static int br_vlan_replay_one(struct notifier_block *nb, + struct net_device *dev, + struct switchdev_obj_port_vlan *vlan, + struct netlink_ext_ack *extack) +{ + struct switchdev_notifier_port_obj_info obj_info = { + .info = { + .dev = dev, + .extack = extack, + }, + .obj = &vlan->obj, + }; + int err; + + err = nb->notifier_call(nb, SWITCHDEV_PORT_OBJ_ADD, &obj_info); + return notifier_to_errno(err); +} + +int br_vlan_replay(struct net_device *br_dev, struct net_device *dev, + struct notifier_block *nb, struct netlink_ext_ack *extack) +{ + struct net_bridge_vlan_group *vg; + struct net_bridge_vlan *v; + struct net_bridge_port *p; + struct net_bridge *br; + int err = 0; + u16 pvid; + + ASSERT_RTNL(); + + if (!netif_is_bridge_master(br_dev)) + return -EINVAL; + + if (!netif_is_bridge_master(dev) && !netif_is_bridge_port(dev)) + return -EINVAL; + + if (netif_is_bridge_master(dev)) { + br = netdev_priv(dev); + vg = br_vlan_group(br); + p = NULL; + } else { + p = br_port_get_rtnl(dev); + if (WARN_ON(!p)) + return -EINVAL; + vg = nbp_vlan_group(p); + br = p->br; + } + + if (!vg) + return 0; + + pvid = br_get_pvid(vg); + + list_for_each_entry(v, &vg->vlan_list, vlist) { + struct switchdev_obj_port_vlan vlan = { + .obj.orig_dev = dev, + .obj.id = SWITCHDEV_OBJ_ID_PORT_VLAN, + .flags = br_vlan_flags(v, pvid), + .vid = v->vid, + }; + + if (!br_vlan_should_use(v)) + continue; + + br_vlan_replay_one(nb, dev, &vlan, extack); + if (err) + return err; + } + + return err; +} +EXPORT_SYMBOL_GPL(br_vlan_replay); + /* check if v_curr can enter a range ending in range_end */ bool br_vlan_can_enter_range(const struct net_bridge_vlan *v_curr, const struct net_bridge_vlan *range_end) -- 2.25.1
[PATCH v4 net-next 03/11] net: bridge: add helper to replay port and host-joined mdb entries
From: Vladimir Oltean I have a system with DSA ports, and udhcpcd is configured to bring interfaces up as soon as they are created. I create a bridge as follows: ip link add br0 type bridge As soon as I create the bridge and udhcpcd brings it up, I also have avahi which automatically starts sending IPv6 packets to advertise some local services, and because of that, the br0 bridge joins the following IPv6 groups due to the code path detailed below: 33:33:ff:6d:c1:9c vid 0 33:33:00:00:00:6a vid 0 33:33:00:00:00:fb vid 0 br_dev_xmit -> br_multicast_rcv -> br_ip6_multicast_add_group -> __br_multicast_add_group -> br_multicast_host_join -> br_mdb_notify This is all fine, but inside br_mdb_notify we have br_mdb_switchdev_host hooked up, and switchdev will attempt to offload the host joined groups to an empty list of ports. Of course nobody offloads them. Then when we add a port to br0: ip link set swp0 master br0 the bridge doesn't replay the host-joined MDB entries from br_add_if, and eventually the host joined addresses expire, and a switchdev notification for deleting it is emitted, but surprise, the original addition was already completely missed. The strategy to address this problem is to replay the MDB entries (both the port ones and the host joined ones) when the new port joins the bridge, similar to what vxlan_fdb_replay does (in that case, its FDB can be populated and only then attached to a bridge that you offload). However there are 2 possibilities: the addresses can be 'pushed' by the bridge into the port, or the port can 'pull' them from the bridge. Considering that in the general case, the new port can be really late to the party, and there may have been many other switchdev ports that already received the initial notification, we would like to avoid delivering duplicate events to them, since they might misbehave. And currently, the bridge calls the entire switchdev notifier chain, whereas for replaying it should just call the notifier block of the new guy. But the bridge doesn't know what is the new guy's notifier block, it just knows where the switchdev notifier chain is. So for simplification, we make this a driver-initiated pull for now, and the notifier block is passed as an argument. To emulate the calling context for mdb objects (deferred and put on the blocking notifier chain), we must iterate under RCU protection through the bridge's mdb entries, queue them, and only call them once we're out of the RCU read-side critical section. There was some opportunity for reuse between br_mdb_switchdev_host_port, br_mdb_notify and the newly added br_mdb_queue_one in how the switchdev mdb object is created, so a helper was created. Suggested-by: Ido Schimmel Signed-off-by: Vladimir Oltean --- include/linux/if_bridge.h | 9 +++ include/net/switchdev.h | 1 + net/bridge/br_mdb.c | 148 +- 3 files changed, 141 insertions(+), 17 deletions(-) diff --git a/include/linux/if_bridge.h b/include/linux/if_bridge.h index ebd16495459c..f6472969bb44 100644 --- a/include/linux/if_bridge.h +++ b/include/linux/if_bridge.h @@ -69,6 +69,8 @@ bool br_multicast_has_querier_anywhere(struct net_device *dev, int proto); bool br_multicast_has_querier_adjacent(struct net_device *dev, int proto); bool br_multicast_enabled(const struct net_device *dev); bool br_multicast_router(const struct net_device *dev); +int br_mdb_replay(struct net_device *br_dev, struct net_device *dev, + struct notifier_block *nb, struct netlink_ext_ack *extack); #else static inline int br_multicast_list_adjacent(struct net_device *dev, struct list_head *br_ip_list) @@ -93,6 +95,13 @@ static inline bool br_multicast_router(const struct net_device *dev) { return false; } +static inline int br_mdb_replay(struct net_device *br_dev, + struct net_device *dev, + struct notifier_block *nb, + struct netlink_ext_ack *extack) +{ + return -EOPNOTSUPP; +} #endif #if IS_ENABLED(CONFIG_BRIDGE) && IS_ENABLED(CONFIG_BRIDGE_VLAN_FILTERING) diff --git a/include/net/switchdev.h b/include/net/switchdev.h index b7fc7d0f54e2..8c3218177136 100644 --- a/include/net/switchdev.h +++ b/include/net/switchdev.h @@ -68,6 +68,7 @@ enum switchdev_obj_id { }; struct switchdev_obj { + struct list_head list; struct net_device *orig_dev; enum switchdev_obj_id id; u32 flags; diff --git a/net/bridge/br_mdb.c b/net/bridge/br_mdb.c index 8846c5bcd075..95fa4af0e8dd 100644 --- a/net/bridge/br_mdb.c +++ b/net/bridge/br_mdb.c @@ -506,6 +506,134 @@ static void br_mdb_complete(struct net_device *dev, int err, void *priv) kfree(priv); } +static void br_switchdev_mdb_populate(struct switchdev_obj_port_mdb *mdb, + const struct net_bridge_mdb_entry *mp)
[PATCH v4 net-next 06/11] net: dsa: call dsa_port_bridge_join when joining a LAG that is already in a bridge
From: Vladimir Oltean DSA can properly detect and offload this sequence of operations: ip link add br0 type bridge ip link add bond0 type bond ip link set swp0 master bond0 ip link set bond0 master br0 But not this one: ip link add br0 type bridge ip link add bond0 type bond ip link set bond0 master br0 ip link set swp0 master bond0 Actually the second one is more complicated, due to the elapsed time between the enslavement of bond0 and the offloading of it via swp0, a lot of things could have happened to the bond0 bridge port in terms of switchdev objects (host MDBs, VLANs, altered STP state etc). So this is a bit of a can of worms, and making sure that the DSA port's state is in sync with this already existing bridge port is handled in the next patches. Signed-off-by: Vladimir Oltean Reviewed-by: Florian Fainelli Reviewed-by: Tobias Waldekranz --- net/dsa/port.c | 22 ++ 1 file changed, 18 insertions(+), 4 deletions(-) diff --git a/net/dsa/port.c b/net/dsa/port.c index c9c6d7ab3f47..d39262a9fe0e 100644 --- a/net/dsa/port.c +++ b/net/dsa/port.c @@ -249,17 +249,31 @@ int dsa_port_lag_join(struct dsa_port *dp, struct net_device *lag, .lag = lag, .info = uinfo, }; + struct net_device *bridge_dev; int err; dsa_lag_map(dp->ds->dst, lag); dp->lag_dev = lag; err = dsa_port_notify(dp, DSA_NOTIFIER_LAG_JOIN, &info); - if (err) { - dp->lag_dev = NULL; - dsa_lag_unmap(dp->ds->dst, lag); - } + if (err) + goto err_lag_join; + bridge_dev = netdev_master_upper_dev_get(lag); + if (!bridge_dev || !netif_is_bridge_master(bridge_dev)) + return 0; + + err = dsa_port_bridge_join(dp, bridge_dev); + if (err) + goto err_bridge_join; + + return 0; + +err_bridge_join: + dsa_port_notify(dp, DSA_NOTIFIER_LAG_LEAVE, &info); +err_lag_join: + dp->lag_dev = NULL; + dsa_lag_unmap(dp->ds->dst, lag); return err; } -- 2.25.1
[PATCH v4 net-next 04/11] net: bridge: add helper to replay port and local fdb entries
From: Vladimir Oltean When a switchdev port starts offloading a LAG that is already in a bridge and has an FDB entry pointing to it: ip link set bond0 master br0 bridge fdb add dev bond0 00:01:02:03:04:05 master static ip link set swp0 master bond0 the switchdev driver will have no idea that this FDB entry is there, because it missed the switchdev event emitted at its creation. Ido Schimmel pointed this out during a discussion about challenges with switchdev offloading of stacked interfaces between the physical port and the bridge, and recommended to just catch that condition and deny the CHANGEUPPER event: https://lore.kernel.org/netdev/20210210105949.gb287...@shredder.lan/ But in fact, we might need to deal with the hard thing anyway, which is to replay all FDB addresses relevant to this port, because it isn't just static FDB entries, but also local addresses (ones that are not forwarded but terminated by the bridge). There, we can't just say 'oh yeah, there was an upper already so I'm not joining that'. So, similar to the logic for replaying MDB entries, add a function that must be called by individual switchdev drivers and replays local FDB entries as well as ones pointing towards a bridge port. This time, we use the atomic switchdev notifier block, since that's what FDB entries expect for some reason. Reported-by: Ido Schimmel Signed-off-by: Vladimir Oltean --- include/linux/if_bridge.h | 9 +++ net/bridge/br_fdb.c | 50 +++ 2 files changed, 59 insertions(+) diff --git a/include/linux/if_bridge.h b/include/linux/if_bridge.h index f6472969bb44..b564c4486a45 100644 --- a/include/linux/if_bridge.h +++ b/include/linux/if_bridge.h @@ -147,6 +147,8 @@ void br_fdb_clear_offload(const struct net_device *dev, u16 vid); bool br_port_flag_is_set(const struct net_device *dev, unsigned long flag); u8 br_port_get_stp_state(const struct net_device *dev); clock_t br_get_ageing_time(struct net_device *br_dev); +int br_fdb_replay(struct net_device *br_dev, struct net_device *dev, + struct notifier_block *nb); #else static inline struct net_device * br_fdb_find_port(const struct net_device *br_dev, @@ -175,6 +177,13 @@ static inline clock_t br_get_ageing_time(struct net_device *br_dev) { return 0; } + +static inline int br_fdb_replay(struct net_device *br_dev, + struct net_device *dev, + struct notifier_block *nb) +{ + return -EOPNOTSUPP; +} #endif #endif diff --git a/net/bridge/br_fdb.c b/net/bridge/br_fdb.c index b7490237f3fc..698b79747d32 100644 --- a/net/bridge/br_fdb.c +++ b/net/bridge/br_fdb.c @@ -726,6 +726,56 @@ static inline size_t fdb_nlmsg_size(void) + nla_total_size(sizeof(u8)); /* NFEA_ACTIVITY_NOTIFY */ } +static int br_fdb_replay_one(struct notifier_block *nb, +struct net_bridge_fdb_entry *fdb, +struct net_device *dev) +{ + struct switchdev_notifier_fdb_info item; + int err; + + item.addr = fdb->key.addr.addr; + item.vid = fdb->key.vlan_id; + item.added_by_user = test_bit(BR_FDB_ADDED_BY_USER, &fdb->flags); + item.offloaded = test_bit(BR_FDB_OFFLOADED, &fdb->flags); + item.info.dev = dev; + + err = nb->notifier_call(nb, SWITCHDEV_FDB_ADD_TO_DEVICE, &item); + return notifier_to_errno(err); +} + +int br_fdb_replay(struct net_device *br_dev, struct net_device *dev, + struct notifier_block *nb) +{ + struct net_bridge_fdb_entry *fdb; + struct net_bridge *br; + int err = 0; + + if (!netif_is_bridge_master(br_dev) || !netif_is_bridge_port(dev)) + return -EINVAL; + + br = netdev_priv(br_dev); + + rcu_read_lock(); + + hlist_for_each_entry_rcu(fdb, &br->fdb_list, fdb_node) { + struct net_bridge_port *dst = READ_ONCE(fdb->dst); + struct net_device *dst_dev; + + dst_dev = dst ? dst->dev : br->dev; + if (dst_dev != br_dev && dst_dev != dev) + continue; + + err = br_fdb_replay_one(nb, fdb, dst_dev); + if (err) + break; + } + + rcu_read_unlock(); + + return err; +} +EXPORT_SYMBOL_GPL(br_fdb_replay); + static void fdb_notify(struct net_bridge *br, const struct net_bridge_fdb_entry *fdb, int type, bool swdev_notify) -- 2.25.1
[PATCH v4 net-next 02/11] net: bridge: add helper to retrieve the current ageing time
From: Vladimir Oltean The SWITCHDEV_ATTR_ID_BRIDGE_AGEING_TIME attribute is only emitted from: sysfs/ioctl/netlink -> br_set_ageing_time -> __set_ageing_time therefore not at bridge port creation time, so: (a) switchdev drivers have to hardcode the initial value for the address ageing time, because they didn't get any notification (b) that hardcoded value can be out of sync, if the user changes the ageing time before enslaving the port to the bridge We need a helper in the bridge, such that switchdev drivers can query the current value of the bridge ageing time when they start offloading it. Signed-off-by: Vladimir Oltean Reviewed-by: Florian Fainelli Reviewed-by: Tobias Waldekranz --- include/linux/if_bridge.h | 6 ++ net/bridge/br_stp.c | 13 + 2 files changed, 19 insertions(+) diff --git a/include/linux/if_bridge.h b/include/linux/if_bridge.h index 920d3a02cc68..ebd16495459c 100644 --- a/include/linux/if_bridge.h +++ b/include/linux/if_bridge.h @@ -137,6 +137,7 @@ struct net_device *br_fdb_find_port(const struct net_device *br_dev, void br_fdb_clear_offload(const struct net_device *dev, u16 vid); bool br_port_flag_is_set(const struct net_device *dev, unsigned long flag); u8 br_port_get_stp_state(const struct net_device *dev); +clock_t br_get_ageing_time(struct net_device *br_dev); #else static inline struct net_device * br_fdb_find_port(const struct net_device *br_dev, @@ -160,6 +161,11 @@ static inline u8 br_port_get_stp_state(const struct net_device *dev) { return BR_STATE_DISABLED; } + +static inline clock_t br_get_ageing_time(struct net_device *br_dev) +{ + return 0; +} #endif #endif diff --git a/net/bridge/br_stp.c b/net/bridge/br_stp.c index 86b5e05d3f21..3dafb6143cff 100644 --- a/net/bridge/br_stp.c +++ b/net/bridge/br_stp.c @@ -639,6 +639,19 @@ int br_set_ageing_time(struct net_bridge *br, clock_t ageing_time) return 0; } +clock_t br_get_ageing_time(struct net_device *br_dev) +{ + struct net_bridge *br; + + if (!netif_is_bridge_master(br_dev)) + return 0; + + br = netdev_priv(br_dev); + + return jiffies_to_clock_t(br->ageing_time); +} +EXPORT_SYMBOL_GPL(br_get_ageing_time); + /* called under bridge lock */ void __br_set_topology_change(struct net_bridge *br, unsigned char val) { -- 2.25.1
[PATCH v4 net-next 01/11] net: bridge: add helper for retrieving the current bridge port STP state
From: Vladimir Oltean It may happen that we have the following topology with DSA or any other switchdev driver with LAG offload: ip link add br0 type bridge stp_state 1 ip link add bond0 type bond ip link set bond0 master br0 ip link set swp0 master bond0 ip link set swp1 master bond0 STP decides that it should put bond0 into the BLOCKING state, and that's that. The ports that are actively listening for the switchdev port attributes emitted for the bond0 bridge port (because they are offloading it) and have the honor of seeing that switchdev port attribute can react to it, so we can program swp0 and swp1 into the BLOCKING state. But if then we do: ip link set swp2 master bond0 then as far as the bridge is concerned, nothing has changed: it still has one bridge port. But this new bridge port will not see any STP state change notification and will remain FORWARDING, which is how the standalone code leaves it in. We need a function in the bridge driver which retrieves the current STP state, such that drivers can synchronize to it when they may have missed switchdev events. Signed-off-by: Vladimir Oltean Reviewed-by: Florian Fainelli Reviewed-by: Tobias Waldekranz --- include/linux/if_bridge.h | 6 ++ net/bridge/br_stp.c | 14 ++ 2 files changed, 20 insertions(+) diff --git a/include/linux/if_bridge.h b/include/linux/if_bridge.h index b979005ea39c..920d3a02cc68 100644 --- a/include/linux/if_bridge.h +++ b/include/linux/if_bridge.h @@ -136,6 +136,7 @@ struct net_device *br_fdb_find_port(const struct net_device *br_dev, __u16 vid); void br_fdb_clear_offload(const struct net_device *dev, u16 vid); bool br_port_flag_is_set(const struct net_device *dev, unsigned long flag); +u8 br_port_get_stp_state(const struct net_device *dev); #else static inline struct net_device * br_fdb_find_port(const struct net_device *br_dev, @@ -154,6 +155,11 @@ br_port_flag_is_set(const struct net_device *dev, unsigned long flag) { return false; } + +static inline u8 br_port_get_stp_state(const struct net_device *dev) +{ + return BR_STATE_DISABLED; +} #endif #endif diff --git a/net/bridge/br_stp.c b/net/bridge/br_stp.c index 21c6781906aa..86b5e05d3f21 100644 --- a/net/bridge/br_stp.c +++ b/net/bridge/br_stp.c @@ -64,6 +64,20 @@ void br_set_state(struct net_bridge_port *p, unsigned int state) } } +u8 br_port_get_stp_state(const struct net_device *dev) +{ + struct net_bridge_port *p; + + ASSERT_RTNL(); + + p = br_port_get_rtnl(dev); + if (!p) + return BR_STATE_DISABLED; + + return p->state; +} +EXPORT_SYMBOL_GPL(br_port_get_stp_state); + /* called under bridge lock */ struct net_bridge_port *br_get_port(struct net_bridge *br, u16 port_no) { -- 2.25.1
[PATCH v4 net-next 00/11] Better support for sandwiched LAGs with bridge and DSA
From: Vladimir Oltean Changes in v4: - Added missing EXPORT_SYMBOL_GPL - Using READ_ONCE(fdb->dst) - Split patches into (a) adding the bridge helpers (b) making DSA use them - br_mdb_replay went back to the v1 approach where it allocated memory in atomic context - Created a br_switchdev_mdb_populate which reduces some of the code duplication - Fixed the error message in dsa_port_clear_brport_flags - Replaced "dsa_port_vlan_filtering(dp, br, extack)" with "dsa_port_vlan_filtering(dp, br_vlan_enabled(br), extack)" (duh) - Added review tags (sorry if I missed any) The objective of this series is to make LAG uppers on top of switchdev ports work regardless of which order we link interfaces to their masters (first make the port join the LAG, then the LAG join the bridge, or the other way around). There was a design decision to be made in patches 2-4 on whether we should adopt the "push" model (which attempts to solve the problem centrally, in the bridge layer) where the driver just calls: switchdev_bridge_port_offloaded(brport_dev, &atomic_notifier_block, &blocking_notifier_block, extack); and the bridge just replays the entire collection of switchdev port attributes and objects that it has, in some predefined order and with some predefined error handling logic; or the "pull" model (which attempts to solve the problem by giving the driver the rope to hang itself), where the driver, apart from calling: switchdev_bridge_port_offloaded(brport_dev, extack); has the task of "dumpster diving" (as Tobias puts it) through the bridge attributes and objects by itself, by calling: - br_vlan_replay - br_fdb_replay - br_mdb_replay - br_vlan_enabled - br_port_flag_is_set - br_port_get_stp_state - br_multicast_router - br_get_ageing_time (not necessarily all of them, and not necessarily in this order, and with driver-defined error handling). Even though I'm not in love myself with the "pull" model, I chose it because there is a fundamental trick with replaying switchdev events like this: ip link add br0 type bridge ip link add bond0 type bond ip link set bond0 master br0 ip link set swp0 master bond0 <- this will replay the objects once for the bond0 bridge port, and the swp0 switchdev port will process them ip link set swp1 master bond0 <- this will replay the objects again for the bond0 bridge port, and the swp1 switchdev port will see them, but swp0 will see them for the second time now Basically I believe that it is implementation defined whether the driver wants to error out on switchdev objects seen twice on a port, and the bridge should not enforce a certain model for that. For example, for FDB entries added to a bonding interface, the underling switchdev driver might have an abstraction for just that: an FDB entry pointing towards a logical (as opposed to physical) port. So when the second port joins the bridge, it doesn't realy need to replay FDB entries, since there is already at least one hardware port which has been receiving those events, and the FDB entries don't need to be added a second time to the same logical port. In the other corner, we have the drivers that handle switchdev port attributes on a LAG as individual switchdev port attributes on physical ports (example: VLAN filtering). In fact, the switchdev_handle_port_attr_set helper facilitates this: it is a fan-out from a single orig_dev towards multiple lowers that pass the check_cb(). But that's the point: switchdev_handle_port_attr_set is just a helper which the driver _opts_ to use. The bridge can't enforce the "push" model, because that would assume that all drivers handle port attributes in the same way, which is probably false. For this reason, I preferred to go with the "pull" mode for this patch set. Just to see how bad it is for other switchdev drivers to copy-paste this logic, I added the pull support to ocelot too, and I think it's pretty manageable. Vladimir Oltean (11): net: bridge: add helper for retrieving the current bridge port STP state net: bridge: add helper to retrieve the current ageing time net: bridge: add helper to replay port and host-joined mdb entries net: bridge: add helper to replay port and local fdb entries net: bridge: add helper to replay VLANs installed on port net: dsa: call dsa_port_bridge_join when joining a LAG that is already in a bridge net: dsa: pass extack to dsa_port_{bridge,lag}_join net: dsa: inherit the actual bridge port flags at join time net: dsa: sync up switchdev objects and port attributes when joining the bridge net: ocelot: call ocelot_netdevice_bridge_join when joining a bridged LAG net: ocelot: replay switchdev events when joining bridge drivers/net/dsa/o
Re: [External] : Re: [LKP] Re: [locking/qspinlock] 0e8d8f4f12: stress-ng.zero.ops_per_sec -9.7% regression
> On Mar 22, 2021, at 7:15 PM, Alex Kogan wrote: > > Many thanks to Zhengjun Xing for the help in reproducing the issue. > > On our system, the regression is less than 7% (the numbers are below), > however, > at least at the full capacity, the numbers are very stable. This allowed me > to track down the > issue and identify unnecessary stores into the queue node structure, which > may cause > cache misses during lock transfers. Moving those stores into the > initialization code (cna_init_nodes()) > solves the problem. > > Below are the numbers of “bogo ops/s” reported by stress-ng with various > numbers of workers. > Each number represents an average over 25 runs, with the standard deviation > reported in (). > > #workers stock CNA / speedup > CNA+patch / speedup > 18 16327.844 (581.744) 15480.061 (582.654) / 0.948 16422.349 (473.729) / > 1.006 > 368573.557 (285.058) 8003.888 (196.125) / 0.9348457.436 (258.065) > / 0.986 > 724042.535 (28.766) 3960.407 (28.648) / 0.980 4107.143 (23.037) > / 1.016 > 108 2735.913 (7.440) 2678.888 (7.102) / 0.9792774.751 (4.375) > / 1.014 > 144 2093.477 (3.341) 2042.968 (1.982) / 0.9762109.879 (1.714) > / 1.008 Those are "bogo ops/s (usr+sys time)", btw. Just in case, below are "bogo ops/s (real time)” numbers, which I believe is what is reported by the kernel test robot: #workers stock CNA / speedup CNA+patch / speedup 18 262932.282 (12638.248) 249653.081 (11822.940) / 0.949 265189.104 (9271.447) / 1.009 36 277315.640 (11100.324) 260177.335 (7186.451) / 0.938 274691.250 (10329.523) / 0.991 72 263904.000 (2128.206)259967.180 (1857.393) / 0.985 268971.483 (1713.639) / 1.019 108 273811.373 (664.517) 268949.947 (690.329) / 0.982278196.867 (403.978) / 1.016 144 284321.364 (399.281) 278153.208 (210.776) / 0.978287343.806 (280.963) / 1.011 Regards, — Alex > The patch is attached. As always, comments are welcome! > > Unless there any objections, I will reintegrate the patch into the series, > and post a new > revision. > > Regards, > — Alex
Re: [PATCH v5 00/14] Add BLK_CTL support for i.MX8MP
On Wed, Mar 3, 2021 at 4:54 AM Marek Vasut wrote: > > On 3/3/21 11:47 AM, Abel Vesa wrote: > > On 20-11-03 13:18:12, Abel Vesa wrote: > >> The BLK_CTL according to HW design is basically the wrapper of the entire > >> function specific group of IPs and holds GPRs that usually cannot be placed > >> into one specific IP from that group. Some of these GPRs are used to > >> control > >> some clocks, other some resets, others some very specific function that > >> does > >> not fit into clocks or resets. Since the clocks are registered using the > >> i.MX > >> clock subsystem API, the driver is placed into the clock subsystem, but it > >> also registers the resets. For the other functionalities that other GPRs > >> might > >> have, the syscon is used. > >> > > > > This approach seems to be introducing a possible ABBA deadlock due to > > the core clock and genpd locking. Here is a backtrace I got from Pete > > Zhang (he reported the issue on the internal mailing list): > > > > [ 11.667711][ T108] -> #1 (&genpd->mlock){+.+.}-{3:3}: > > [ 11.675041][ T108]__lock_acquire+0xae4/0xef8 > > [ 11.680093][ T108]lock_acquire+0xfc/0x2f8 > > [ 11.684888][ T108]__mutex_lock+0x90/0x870 > > [ 11.689685][ T108]mutex_lock_nested+0x44/0x50 > > [ 11.694826][ T108]genpd_lock_mtx+0x18/0x24 > > [ 11.699706][ T108]genpd_runtime_resume+0x90/0x214 (hold > > genpd->mlock) > > [ 11.705194][ T108]__rpm_callback+0x80/0x2c0 > > [ 11.710160][ T108]rpm_resume+0x468/0x650 > > [ 11.714866][ T108]__pm_runtime_resume+0x60/0x88 > > [ 11.720180][ T108]clk_pm_runtime_get+0x28/0x9c > > [ 11.725410][ T108]clk_disable_unused_subtree+0x8c/0x144 > > [ 11.731420][ T108]clk_disable_unused_subtree+0x124/0x144 > > [ 11.737518][ T108]clk_disable_unused+0xa4/0x11c (hold > > prepare_lock) > > [ 11.742833][ T108]do_one_initcall+0x98/0x178 > > [ 11.747888][ T108]do_initcall_level+0x9c/0xb8 > > [ 11.753028][ T108]do_initcalls+0x54/0x94 > > [ 11.757736][ T108]do_basic_setup+0x24/0x30 > > [ 11.762614][ T108]kernel_init_freeable+0x70/0xa4 > > [ 11.768014][ T108]kernel_init+0x14/0x18c > > [ 11.772722][ T108]ret_from_fork+0x10/0x18 > > > > [ 11.777512][ T108] -> #0 (prepare_lock){+.+.}-{3:3}: > > [ 11.784749][ T108]check_noncircular+0x134/0x13c > > [ 11.790064][ T108]validate_chain+0x590/0x2a04 > > [ 11.795204][ T108]__lock_acquire+0xae4/0xef8 > > [ 11.800258][ T108]lock_acquire+0xfc/0x2f8 > > [ 11.805050][ T108]__mutex_lock+0x90/0x870 > > [ 11.809841][ T108]mutex_lock_nested+0x44/0x50 > > [ 11.814983][ T108]clk_unprepare+0x5c/0x100 ((hold prepare_lock)) > > [ 11.819864][ T108]imx8m_pd_power_off+0xac/0x110 > > [ 11.825179][ T108]genpd_power_off+0x1b4/0x2dc > > [ 11.830318][ T108]genpd_power_off_work_fn+0x38/0x58 (hold > > genpd->mlock) > > [ 11.835981][ T108]process_one_work+0x270/0x444 > > [ 11.841208][ T108]worker_thread+0x280/0x4e4 > > [ 11.846176][ T108]kthread+0x13c/0x14 > > [ 11.850621][ T108]ret_from_fork+0x10/0x18 > > > > Now, this has been reproduced only on the NXP internal tree, but I think > > it is pretty obvious this could happen in upstream too, with this > > patchset applied. > > > > First, my thought was to change the prepare_lock/enable_lock in clock > > core, from a global approach to a per clock basis. But that doesn't > > actually fix the issue. > > > > The usecase seen above is due to clk_disable_unused, but the same could > > happen when a clock consumer calls prepare/unprepare on a clock. > > > > I guess the conclusion is that the current state of the clock core and > > genpd implementation does not support a usecase where a clock controller > > has a PD which in turn uses another clock (from another clock controller). > > > > Jacky, Pete, did I miss anything here ? > > Just for completeness, I have a feeling I already managed to trigger > this and discussed this with Lucas before, so this concern is certainly > valid. I know it may not be ideal, someone tied a soft-reset and soft-enable to the driver of the Hantro VPU on the IMXMQ [1], and I wonder if something similar could be done for the drivers who are consumers of the clocks. For example: lcdif could request the power domain. The power domain soft-resets and enables bus clock (vis syscon) After successful enabling of power-domain, the LCDIF requests the soft reset and respective clock bits (also via syscon) similar to how [1] and [2] do it for the Hantro VPU. The syscon node could be a common node similar to what was done in [2], and multiple consumers could control when each soft-reset and clock-enable get activated. I know it's probably more of an abuse of the syscon system, but having the individual dri
Re: [PATCH] powerpc/asm: Fix a typo
Randy Dunlap writes: > On 3/22/21 4:32 AM, Bhaskar Chowdhury wrote: >> >> s/poiner/pointer/ >> >> Signed-off-by: Bhaskar Chowdhury > > Acked-by: Randy Dunlap > > However, it would be a GOOD THING to collect multiple similar > patches that are in e.g. arch/powerpc/ and send them as one patch > instead of many little patches. Yes. Please send me one patch for all of the spelling issues you can currently find in arch/powerpc. cheers
[PATCH] kbuild: Merge module sections if and only if CONFIG_LTO_CLANG is enabled
Merge module sections only when using Clang LTO. With gcc-10, merging sections does not appear to update the symbol tables for the module, e.g. 'readelf -s' shows the value that a symbol would have had, if sections were not merged. The stale symbol table breaks gdb's function disassambler, and presumably other things, e.g. gdb -batch -ex "file arch/x86/kvm/kvm.ko" -ex "disassemble kvm_init" reads the wrong bytes and dumps garbage. Fixes: dd2776222abb ("kbuild: lto: merge module sections") Cc: Nick Desaulniers Cc: Sami Tolvanen Cc: Kees Cook Signed-off-by: Sean Christopherson --- This is obviously the quick and dirty approach to fixing the problem, presumably there is a way to actually update the symbols, but that is far, far outside my area of expertise. IIUC how the disassemblers work, the section headers are correctly updated, e.g. objdump displays the correct info, and even gdb's disassembler shows the "correct" offset, it's just the symbol tables that are out of whack. An earlier version of the LTO series did have exactly this #ifdef, but it was dropped when no one objected to Kees' suggestion to unconditionally merge sections. https://lkml.kernel.org/r/202010141548.47CB1BC@keescook scripts/module.lds.S | 2 ++ 1 file changed, 2 insertions(+) diff --git a/scripts/module.lds.S b/scripts/module.lds.S index 168cd27e6122..3580c6d02957 100644 --- a/scripts/module.lds.S +++ b/scripts/module.lds.S @@ -25,6 +25,7 @@ SECTIONS { * -ffunction-sections, which increases the size of the final module. * Merge the split sections in the final binary. */ +#ifdef CONFIG_LTO_CLANG .bss : { *(.bss .bss.[0-9a-zA-Z_]*) *(.bss..L*) @@ -41,6 +42,7 @@ SECTIONS { } .text : { *(.text .text.[0-9a-zA-Z_]*) } +#endif } /* bring in arch-specific sections */ -- 2.31.0.rc2.261.g7f71774620-goog
Re: [PATCH v4 0/3] net: dsa: lantiq: add support for xRX300 and xRX330
Hello: This series was applied to netdev/net-next.git (refs/heads/master): On Mon, 22 Mar 2021 21:37:14 +0100 you wrote: > Changed since v3: > * fixed last compilation warning > > Changed since v2: > * fixed compilation warnings > * removed example bindings for xrx330 > * patches has been refactored due to upstream changes > > [...] Here is the summary with links: - [v4,1/3] net: dsa: lantiq: allow to use all GPHYs on xRX300 and xRX330 https://git.kernel.org/netdev/net-next/c/a09d042b0862 - [v4,2/3] net: dsa: lantiq: verify compatible strings against hardware https://git.kernel.org/netdev/net-next/c/204c7614738e - [v4,3/3] dt-bindings: net: dsa: lantiq: add xRx300 and xRX330 switch bindings https://git.kernel.org/netdev/net-next/c/ee83d82407e4 You are awesome, thank you! -- Deet-doot-dot, I am a bot. https://korg.docs.kernel.org/patchwork/pwbot.html
Re: [PATCH v3 2/3] net: ethernet: actions: Add Actions Semi Owl Ethernet MAC driver
> +static void owl_emac_set_multicast(struct net_device *netdev, int count) > +{ > + struct owl_emac_priv *priv = netdev_priv(netdev); > + struct netdev_hw_addr *ha; > + int index = 0; > + > + if (count <= 0) { > + priv->mcaddr_list.count = 0; > + return; > + } > + > + netdev_for_each_mc_addr(ha, netdev) { > + if (!is_multicast_ether_addr(ha->addr)) > + continue; Is this possible? > + > + WARN_ON(index >= OWL_EMAC_MAX_MULTICAST_ADDRS); > + ether_addr_copy(priv->mcaddr_list.addrs[index++], ha->addr); > + } > + > + priv->mcaddr_list.count = index; > + > + owl_emac_setup_frame_xmit(priv); > +} > + > +static void owl_emac_ndo_set_rx_mode(struct net_device *netdev) > +{ > + struct owl_emac_priv *priv = netdev_priv(netdev); > + u32 status, val = 0; > + int mcast_count = 0; > + > + if (netdev->flags & IFF_PROMISC) { > + val = OWL_EMAC_BIT_MAC_CSR6_PR; > + } else if (netdev->flags & IFF_ALLMULTI) { > + val = OWL_EMAC_BIT_MAC_CSR6_PM; > + } else if (netdev->flags & IFF_MULTICAST) { > + mcast_count = netdev_mc_count(netdev); > + > + if (mcast_count > OWL_EMAC_MAX_MULTICAST_ADDRS) { > + val = OWL_EMAC_BIT_MAC_CSR6_PM; > + mcast_count = 0; > + } > + } > + > + spin_lock_bh(&priv->lock); > + > + /* Temporarily stop DMA TX & RX. */ > + status = owl_emac_dma_cmd_stop(priv); > + > + /* Update operation modes. */ > + owl_emac_reg_update(priv, OWL_EMAC_REG_MAC_CSR6, > + OWL_EMAC_BIT_MAC_CSR6_PR | OWL_EMAC_BIT_MAC_CSR6_PM, > + val); > + > + /* Restore DMA TX & RX status. */ > + owl_emac_dma_cmd_set(priv, status); > + > + spin_unlock_bh(&priv->lock); > + > + /* Set/reset multicast addr list. */ > + owl_emac_set_multicast(netdev, mcast_count); > +} I think this can be simplified. At least, you had me going around in circles a while trying to see if WARN_ON() could be triggered from user space. If you have more than OWL_EMAC_MAX_MULTICAST_ADDRS MC addresses, you go into promisc mode. Can you then skip calling owl_emac_set_multicast(), which appears not to do too much useful when passed 0? Andrew
RE: [PATCH -next] x86: Fix unused variable 'msr_val' warning
From: Ingo Molnar Sent: Monday, March 22, 2021 2:08 PM > > * Xu Yihang wrote: > > > Fixes the following W=1 kernel build warning(s): > > arch/x86/hyperv/hv_spinlock.c:28:16: warning: variable 'msr_val' set but > > not used [- > Wunused-but-set-variable] > > unsigned long msr_val; > > > > As Hypervisor Top-Level Functional Specification states in chapter 7.5 > > Virtual Processor > Idle Sleep State, "A partition which possesses the AccessGuestIdleMsr > privilege (refer to > section 4.2.2) may trigger entry into the virtual processor idle sleep state > through a read to > the hypervisor-defined MSR HV_X64_MSR_GUEST_IDLE". That means only a read is > necessary, msr_val is not uesed, so __maybe_unused should be added. > > > > Reference: > > > > https://docs.microsoft.com/en-us/virtualization/hyper-v-on-windows/reference/tlfs > > > > Reported-by: Hulk Robot > > Signed-off-by: Xu Yihang > > --- > > arch/x86/hyperv/hv_spinlock.c | 2 +- > > 1 file changed, 1 insertion(+), 1 deletion(-) > > > > diff --git a/arch/x86/hyperv/hv_spinlock.c b/arch/x86/hyperv/hv_spinlock.c > > index f3270c1fc48c..67bc15c7752a 100644 > > --- a/arch/x86/hyperv/hv_spinlock.c > > +++ b/arch/x86/hyperv/hv_spinlock.c > > @@ -25,7 +25,7 @@ static void hv_qlock_kick(int cpu) > > > > static void hv_qlock_wait(u8 *byte, u8 val) > > { > > - unsigned long msr_val; > > + unsigned long msr_val __maybe_unused; > > unsigned long flags; > > Please don't add new __maybe_unused annotations to the x86 tree - > improve the flow instead to help GCC recognize the initialization > sequence better. > > Thanks, > > Ingo Could you elaborate on the thinking here, or point to some written discussion? I'm just curious. In this particular case, it's not a problem with the flow or gcc detection. This code really does read an MSR and ignore that value that is read, so it's not clear how gcc would ever figure out that's OK. Michael
[tip: x86/cleanups] x86/fpu/math-emu: Fix function cast warning
The following commit has been merged into the x86/cleanups branch of tip: Commit-ID: 279d56abc67ed7568168cb31bf1c7d735efc89a7 Gitweb: https://git.kernel.org/tip/279d56abc67ed7568168cb31bf1c7d735efc89a7 Author:Arnd Bergmann AuthorDate:Mon, 22 Mar 2021 22:48:19 +01:00 Committer: Ingo Molnar CommitterDate: Tue, 23 Mar 2021 00:08:02 +01:00 x86/fpu/math-emu: Fix function cast warning Building with 'make W=1', gcc points out that casting between incompatible function types can be dangerous: arch/x86/math-emu/fpu_trig.c:1638:60: error: cast between incompatible function types from ‘int (*)(FPU_REG *, u_char)’ {aka ‘int (*)(struct fpu__reg *, unsigned char)’} to ‘void (*)(FPU_REG *, u_char)’ {aka ‘void (*)(struct fpu__reg *, unsigned char)’} [-Werror=cast-function-type] 1638 | fprem, fyl2xp1, fsqrt_, fsincos, frndint_, fscale, (FUNC_ST0) fsin, fcos |^ This one seems harmless, but it is easy enough to work around it by adding an intermediate function that adjusts the return type. Signed-off-by: Arnd Bergmann Signed-off-by: Ingo Molnar Link: https://lore.kernel.org/r/20210322214824.974323-1-a...@kernel.org --- arch/x86/math-emu/fpu_trig.c | 11 --- 1 file changed, 8 insertions(+), 3 deletions(-) diff --git a/arch/x86/math-emu/fpu_trig.c b/arch/x86/math-emu/fpu_trig.c index 4a98878..990d847 100644 --- a/arch/x86/math-emu/fpu_trig.c +++ b/arch/x86/math-emu/fpu_trig.c @@ -547,7 +547,7 @@ static void frndint_(FPU_REG *st0_ptr, u_char st0_tag) single_arg_error(st0_ptr, st0_tag); } -static int fsin(FPU_REG *st0_ptr, u_char tag) +static int f_sin(FPU_REG *st0_ptr, u_char tag) { u_char arg_sign = getsign(st0_ptr); @@ -608,6 +608,11 @@ static int fsin(FPU_REG *st0_ptr, u_char tag) } } +static void fsin(FPU_REG *st0_ptr, u_char tag) +{ + f_sin(st0_ptr, tag); +} + static int f_cos(FPU_REG *st0_ptr, u_char tag) { u_char st0_sign; @@ -724,7 +729,7 @@ static void fsincos(FPU_REG *st0_ptr, u_char st0_tag) } reg_copy(st0_ptr, &arg); - if (!fsin(st0_ptr, st0_tag)) { + if (!f_sin(st0_ptr, st0_tag)) { push(); FPU_copy_to_reg0(&arg, st0_tag); f_cos(&st(0), st0_tag); @@ -1635,7 +1640,7 @@ void FPU_triga(void) } static FUNC_ST0 const trig_table_b[] = { - fprem, fyl2xp1, fsqrt_, fsincos, frndint_, fscale, (FUNC_ST0) fsin, fcos + fprem, fyl2xp1, fsqrt_, fsincos, frndint_, fscale, fsin, fcos }; void FPU_trigb(void)
[tip: x86/boot] x86/boot/tboot: Avoid Wstringop-overread-warning
The following commit has been merged into the x86/boot branch of tip: Commit-ID: cdc34cb8f25d3125d30868376b8eae6fe690119b Gitweb: https://git.kernel.org/tip/cdc34cb8f25d3125d30868376b8eae6fe690119b Author:Arnd Bergmann AuthorDate:Mon, 22 Mar 2021 17:02:40 +01:00 Committer: Ingo Molnar CommitterDate: Tue, 23 Mar 2021 00:16:13 +01:00 x86/boot/tboot: Avoid Wstringop-overread-warning gcc-11 warns about using string operations on pointers that are defined at compile time as offsets from a NULL pointer. Unfortunately that also happens on the result of fix_to_virt(), which is a compile-time constant for a constant input: arch/x86/kernel/tboot.c: In function 'tboot_probe': arch/x86/kernel/tboot.c:70:13: error: '__builtin_memcmp_eq' specified bound 16 exceeds source size 0 [-Werror=stringop-overread] 70 | if (memcmp(&tboot_uuid, &tboot->uuid, sizeof(tboot->uuid))) { | ^~ I hope this can get addressed in gcc-11 before the release. As a workaround, split up the tboot_probe() function in two halves to separate the pointer generation from the usage. This is a bit ugly, and hopefully gcc understands that the code is actually correct before it learns to peek into the noinline function. Signed-off-by: Arnd Bergmann Signed-off-by: Ingo Molnar Cc: Linus Torvalds Cc: Martin Sebor Link: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99578 Link: https://lore.kernel.org/r/20210322160253.4032422-3-a...@kernel.org --- arch/x86/kernel/tboot.c | 44 +++- 1 file changed, 26 insertions(+), 18 deletions(-) diff --git a/arch/x86/kernel/tboot.c b/arch/x86/kernel/tboot.c index 4c09ba1..f9af561 100644 --- a/arch/x86/kernel/tboot.c +++ b/arch/x86/kernel/tboot.c @@ -49,6 +49,30 @@ bool tboot_enabled(void) return tboot != NULL; } +/* noinline to prevent gcc from warning about dereferencing constant fixaddr */ +static noinline __init bool check_tboot_version(void) +{ + if (memcmp(&tboot_uuid, &tboot->uuid, sizeof(tboot->uuid))) { + pr_warn("tboot at 0x%llx is invalid\n", boot_params.tboot_addr); + return false; + } + + if (tboot->version < 5) { + pr_warn("tboot version is invalid: %u\n", tboot->version); + return false; + } + + pr_info("found shared page at phys addr 0x%llx:\n", + boot_params.tboot_addr); + pr_debug("version: %d\n", tboot->version); + pr_debug("log_addr: 0x%08x\n", tboot->log_addr); + pr_debug("shutdown_entry: 0x%x\n", tboot->shutdown_entry); + pr_debug("tboot_base: 0x%08x\n", tboot->tboot_base); + pr_debug("tboot_size: 0x%x\n", tboot->tboot_size); + + return true; +} + void __init tboot_probe(void) { /* Look for valid page-aligned address for shared page. */ @@ -66,25 +90,9 @@ void __init tboot_probe(void) /* Map and check for tboot UUID. */ set_fixmap(FIX_TBOOT_BASE, boot_params.tboot_addr); - tboot = (struct tboot *)fix_to_virt(FIX_TBOOT_BASE); - if (memcmp(&tboot_uuid, &tboot->uuid, sizeof(tboot->uuid))) { - pr_warn("tboot at 0x%llx is invalid\n", boot_params.tboot_addr); + tboot = (void *)fix_to_virt(FIX_TBOOT_BASE); + if (!check_tboot_version()) tboot = NULL; - return; - } - if (tboot->version < 5) { - pr_warn("tboot version is invalid: %u\n", tboot->version); - tboot = NULL; - return; - } - - pr_info("found shared page at phys addr 0x%llx:\n", - boot_params.tboot_addr); - pr_debug("version: %d\n", tboot->version); - pr_debug("log_addr: 0x%08x\n", tboot->log_addr); - pr_debug("shutdown_entry: 0x%x\n", tboot->shutdown_entry); - pr_debug("tboot_base: 0x%08x\n", tboot->tboot_base); - pr_debug("tboot_size: 0x%x\n", tboot->tboot_size); } static pgd_t *tboot_pg_dir;
[tip: x86/boot] x86/boot/compressed: Avoid gcc-11 -Wstringop-overread warning
The following commit has been merged into the x86/boot branch of tip: Commit-ID: e14cfb3bdd0f82147d09e9f46bedda6302f28ee1 Gitweb: https://git.kernel.org/tip/e14cfb3bdd0f82147d09e9f46bedda6302f28ee1 Author:Arnd Bergmann AuthorDate:Mon, 22 Mar 2021 17:02:39 +01:00 Committer: Ingo Molnar CommitterDate: Tue, 23 Mar 2021 00:16:25 +01:00 x86/boot/compressed: Avoid gcc-11 -Wstringop-overread warning GCC gets confused by the comparison of a pointer to an integer literal, with the assumption that this is an offset from a NULL pointer and that dereferencing it is invalid: In file included from arch/x86/boot/compressed/misc.c:18: In function ‘parse_elf’, inlined from ‘extract_kernel’ at arch/x86/boot/compressed/misc.c:442:2: arch/x86/boot/compressed/../string.h:15:23: error: ‘__builtin_memcpy’ reading 64 bytes from a region of size 0 [-Werror=stringop-overread] 15 | #define memcpy(d,s,l) __builtin_memcpy(d,s,l) | ^~~ arch/x86/boot/compressed/misc.c:283:9: note: in expansion of macro ‘memcpy’ 283 | memcpy(&ehdr, output, sizeof(ehdr)); | ^~ I could not find any good workaround for this, but as this is only a warning for a failure during early boot, removing the line entirely works around the warning. Signed-off-by: Arnd Bergmann Signed-off-by: Ingo Molnar Cc: Linus Torvalds Cc: Martin Sebor Link: https://lore.kernel.org/r/20210322160253.4032422-2-a...@kernel.org --- arch/x86/boot/compressed/misc.c | 2 -- 1 file changed, 2 deletions(-) diff --git a/arch/x86/boot/compressed/misc.c b/arch/x86/boot/compressed/misc.c index 267e7f9..1945b8a 100644 --- a/arch/x86/boot/compressed/misc.c +++ b/arch/x86/boot/compressed/misc.c @@ -430,8 +430,6 @@ asmlinkage __visible void *extract_kernel(void *rmode, memptr heap, error("Destination address too large"); #endif #ifndef CONFIG_RELOCATABLE - if ((unsigned long)output != LOAD_PHYSICAL_ADDR) - error("Destination address does not match LOAD_PHYSICAL_ADDR"); if (virt_addr != LOAD_PHYSICAL_ADDR) error("Destination virtual address changed when not relocatable"); #endif
[tip: locking/core] static_call: Fix function type mismatch
The following commit has been merged into the locking/core branch of tip: Commit-ID: 335c73e7c8f7deb23537afbbbe4f8ab48bd5de52 Gitweb: https://git.kernel.org/tip/335c73e7c8f7deb23537afbbbe4f8ab48bd5de52 Author:Arnd Bergmann AuthorDate:Mon, 22 Mar 2021 22:42:24 +01:00 Committer: Ingo Molnar CommitterDate: Tue, 23 Mar 2021 00:08:53 +01:00 static_call: Fix function type mismatch The __static_call_return0() function is declared to return a 'long', while it aliases a couple of functions that all return 'int'. When building with 'make W=1', gcc warns about this: kernel/sched/core.c:5420:37: error: cast between incompatible function types from 'long int (*)(void)' to 'int (*)(void)' [-Werror=cast-function-type] 5420 | static_call_update(might_resched, (typeof(&__cond_resched)) __static_call_return0); Change all these function to return 'long' as well, but remove the cast to ensure we get a warning if any of the types ever change. Signed-off-by: Arnd Bergmann Signed-off-by: Ingo Molnar Link: https://lore.kernel.org/r/20210322214309.730556-1-a...@kernel.org --- include/linux/kernel.h | 4 ++-- include/linux/sched.h | 14 +++--- kernel/sched/core.c| 8 3 files changed, 13 insertions(+), 13 deletions(-) diff --git a/include/linux/kernel.h b/include/linux/kernel.h index 5b7ed6d..db24f8c 100644 --- a/include/linux/kernel.h +++ b/include/linux/kernel.h @@ -82,12 +82,12 @@ struct user; #ifdef CONFIG_PREEMPT_VOLUNTARY -extern int __cond_resched(void); +extern long __cond_resched(void); # define might_resched() __cond_resched() #elif defined(CONFIG_PREEMPT_DYNAMIC) -extern int __cond_resched(void); +extern long __cond_resched(void); DECLARE_STATIC_CALL(might_resched, __cond_resched); diff --git a/include/linux/sched.h b/include/linux/sched.h index ef00bb2..b08080d 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1875,20 +1875,20 @@ static inline int test_tsk_need_resched(struct task_struct *tsk) * cond_resched_lock() will drop the spinlock before scheduling, */ #if !defined(CONFIG_PREEMPTION) || defined(CONFIG_PREEMPT_DYNAMIC) -extern int __cond_resched(void); +extern long __cond_resched(void); #ifdef CONFIG_PREEMPT_DYNAMIC DECLARE_STATIC_CALL(cond_resched, __cond_resched); -static __always_inline int _cond_resched(void) +static __always_inline long _cond_resched(void) { return static_call_mod(cond_resched)(); } #else -static inline int _cond_resched(void) +static inline long _cond_resched(void) { return __cond_resched(); } @@ -1897,7 +1897,7 @@ static inline int _cond_resched(void) #else -static inline int _cond_resched(void) { return 0; } +static inline long _cond_resched(void) { return 0; } #endif /* !defined(CONFIG_PREEMPTION) || defined(CONFIG_PREEMPT_DYNAMIC) */ @@ -1906,9 +1906,9 @@ static inline int _cond_resched(void) { return 0; } _cond_resched();\ }) -extern int __cond_resched_lock(spinlock_t *lock); -extern int __cond_resched_rwlock_read(rwlock_t *lock); -extern int __cond_resched_rwlock_write(rwlock_t *lock); +extern long __cond_resched_lock(spinlock_t *lock); +extern long __cond_resched_rwlock_read(rwlock_t *lock); +extern long __cond_resched_rwlock_write(rwlock_t *lock); #define cond_resched_lock(lock) ({ \ ___might_sleep(__FILE__, __LINE__, PREEMPT_LOCK_OFFSET);\ diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 9819121..927fd82 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -6976,7 +6976,7 @@ SYSCALL_DEFINE0(sched_yield) } #if !defined(CONFIG_PREEMPTION) || defined(CONFIG_PREEMPT_DYNAMIC) -int __sched __cond_resched(void) +long __sched __cond_resched(void) { if (should_resched(0)) { preempt_schedule_common(); @@ -7006,7 +7006,7 @@ EXPORT_STATIC_CALL_TRAMP(might_resched); * operations here to prevent schedule() from being called twice (once via * spin_unlock(), once by hand). */ -int __cond_resched_lock(spinlock_t *lock) +long __cond_resched_lock(spinlock_t *lock) { int resched = should_resched(PREEMPT_LOCK_OFFSET); int ret = 0; @@ -7026,7 +7026,7 @@ int __cond_resched_lock(spinlock_t *lock) } EXPORT_SYMBOL(__cond_resched_lock); -int __cond_resched_rwlock_read(rwlock_t *lock) +long __cond_resched_rwlock_read(rwlock_t *lock) { int resched = should_resched(PREEMPT_LOCK_OFFSET); int ret = 0; @@ -7046,7 +7046,7 @@ int __cond_resched_rwlock_read(rwlock_t *lock) } EXPORT_SYMBOL(__cond_resched_rwlock_read); -int __cond_resched_rwlock_write(rwlock_t *lock) +long __cond_resched_rwlock_write(rwlock_t *lock) { int resched = should_resched(PREEMPT_LOCK_OFFSET); int ret = 0;
Re: [RFC PATCH 5/8] hugetlb: change free_pool_huge_page to remove_pool_huge_page
On 3/22/21 7:31 AM, Michal Hocko wrote: > On Fri 19-03-21 15:42:06, Mike Kravetz wrote: > [...] >> @@ -2090,9 +2084,15 @@ static void return_unused_surplus_pages(struct hstate >> *h, >> while (nr_pages--) { >> h->resv_huge_pages--; >> unused_resv_pages--; >> -if (!free_pool_huge_page(h, &node_states[N_MEMORY], 1)) >> +page = remove_pool_huge_page(h, &node_states[N_MEMORY], 1); >> +if (!page) >> goto out; >> -cond_resched_lock(&hugetlb_lock); >> + >> +/* Drop lock and free page to buddy as it could sleep */ >> +spin_unlock(&hugetlb_lock); >> +update_and_free_page(h, page); >> +cond_resched(); >> +spin_lock(&hugetlb_lock); >> } >> >> out: > > This is likely a matter of taste but the repeated pattern of unlock, > update_and_free_page, cond_resched and lock seems rather clumsy. > Would it be slightly better/nicer to remove_pool_huge_page into a > list_head under a single lock invocation and then free up the whole lot > after the lock is dropped? Yes, we can certainly do that. One downside I see is that the list can contain a bunch of pages not accounted for in hugetlb and not free in buddy (or cma). Ideally, we would want to keep those in sync if possible. Also, the commit that added the cond_resched talked about freeing up 12 TB worth of huge pages and it holding the lock for 150 seconds. The new code is not holding the lock while calling free to buddy, but I wonder how long it would take to remove 12 TB worth of huge pages and add them to a separate list? I do not know how realistic the 12 TB number is. But, I certainly am aware of pools that are a few TB in size. -- Mike Kravetz
Re: [PATCH net] net: dsa: don't assign an error value to tag_ops
On 3/22/2021 1:26 PM, George McCollister wrote: > Use a temporary variable to hold the return value from > dsa_tag_driver_get() instead of assigning it to dst->tag_ops. Leaving > an error value in dst->tag_ops can result in deferencing an invalid > pointer when a deferred switch configuration happens later. > > Fixes: 357f203bb3b5 ("net: dsa: keep a copy of the tagging protocol in the > DSA switch tree") > > Signed-off-by: George McCollister Reviewed-by: Florian Fainelli -- Florian
RE: [PATCH 2/5] cifsd: add server-side procedures for SMB3
> On Mon, Mar 22, 2021 at 09:47:13AM +0300, Dan Carpenter wrote: > > On Mon, Mar 22, 2021 at 02:13:41PM +0900, Namjae Jeon wrote: > > > +static unsigned char > > > +asn1_octet_decode(struct asn1_ctx *ctx, unsigned char *ch) { > > > + if (ctx->pointer >= ctx->end) { > > > + ctx->error = ASN1_ERR_DEC_EMPTY; > > > + return 0; > > > + } > > > + *ch = *(ctx->pointer)++; > > > + return 1; > > > +} > > > > > > Make this bool. > > > > More importantly don't add another ANS1 parser, but use the generic one in > lib/asn1_decoder.c instead. > CIFS should also really use it. Okay, Let me check it, cifs also... Thanks!
Re: [PATCH v3] usb: dwc3: gadget: Prevent EP queuing while stopping transfers
Hi Andy, On 3/22/2021 2:14 PM, Andy Shevchenko wrote: > On Mon, Mar 22, 2021 at 10:06 PM Wesley Cheng wrote: >> >> Hi Andy, >> >> On 3/22/2021 12:34 PM, Andy Shevchenko wrote: >>> On Mon, Mar 22, 2021 at 8:49 PM Wesley Cheng wrote: Hi Andy, On 3/22/2021 5:48 AM, Andy Shevchenko wrote: > On Fri, Mar 12, 2021 at 2:01 AM Wesley Cheng > wrote: >> >> In the situations where the DWC3 gadget stops active transfers, once >> calling the dwc3_gadget_giveback(), there is a chance where a function >> driver can queue a new USB request in between the time where the dwc3 >> lock has been released and re-aquired. This occurs after we've already >> issued an ENDXFER command. When the stop active transfers continues >> to remove USB requests from all dep lists, the newly added request will >> also be removed, while controller still has an active TRB for it. >> This can lead to the controller accessing an unmapped memory address. >> >> Fix this by ensuring parameters to prevent EP queuing are set before >> calling the stop active transfers API. > > > commit f09ddcfcb8c569675066337adac2ac205113471f > Author: Wesley Cheng > Date: Thu Mar 11 15:59:02 2021 -0800 > >usb: dwc3: gadget: Prevent EP queuing while stopping transfers > > effectively broke my gadget setup. > > The output of the kernel (followed by non responsive state of USB > controller): > > [ 195.228586] using random self ethernet address > [ 195.233104] using random host ethernet address > [ 195.245306] usb0: HOST MAC aa:bb:cc:dd:ee:f2 > [ 195.249732] usb0: MAC aa:bb:cc:dd:ee:f1 > # [ 195.773594] IPv6: ADDRCONF(NETDEV_CHANGE): usb0: link becomes ready > [ 195.780585] [ cut here ] > [ 195.785217] dwc3 dwc3.0.auto: No resource for ep2in > [ 195.790162] WARNING: CPU: 0 PID: 217 at > drivers/usb/dwc3/gadget.c:360 dwc3_send_gadget_ep_cmd+0x4b9/0x670 > [ 195.799760] Modules linked in: usb_f_eem u_ether libcomposite > brcmfmac brcmutil mmc_block pwm_lpss_pci pwm_lps > s snd_sof_pci_intel_tng snd_sof_pci snd_sof_acpi_intel_byt > snd_sof_intel_ipc snd_sof_acpi snd_sof snd_sof_nocodec > spi_pxa2xx_platform snd_sof_xtensa_dsp spi_pxa2xx_pci > extcon_intel_mrfld intel_mrfld_adc sdhci_pci cqhci sdhci m > mc_core intel_mrfld_pwrbtn intel_soc_pmic_mrfld hci_uart btbcm btintel > [ 195.835604] CPU: 0 PID: 217 Comm: irq/16-dwc3 Not tainted 5.12.0-rc4+ > #60 > [ 195.842403] Hardware name: Intel Corporation Merrifield/BODEGA BAY, > BIOS 542 2015.01.21:18.19.48 > [ 195.851191] RIP: 0010:dwc3_send_gadget_ep_cmd+0x4b9/0x670 > [ 195.856608] Code: cd 00 00 00 44 89 44 24 20 48 89 4c 24 18 e8 ee > f7 e4 ff 48 8b 4c 24 18 4c 89 f2 48 c7 c7 b9 > ed 4f a0 48 89 c6 e8 ef 24 43 00 <0f> 0b 41 be ea ff ff ff 44 8b 44 24 > 20 e9 80 fc ff ff 41 83 fe 92 > [ 195.875381] RSP: :a53c00373ba8 EFLAGS: 00010086 > [ 195.880617] RAX: RBX: 1387 RCX: > dfff > [ 195.887755] RDX: dfff RSI: ffea RDI: > > [ 195.894893] RBP: 9ce8c8f2b028 R08: a0732288 R09: > 9ffb > [ 195.902034] R10: e000 R11: 3fff R12: > 00041006 > [ 195.909170] R13: a53c00373c24 R14: 9ce8c11dadb0 R15: > 9ce8c2861700 > [ 195.916310] FS: () GS:9ce8fe20() > knlGS: > [ 195.924409] CS: 0010 DS: ES: CR0: 80050033 > [ 195.930161] CR2: f7f694a0 CR3: 38e0c000 CR4: > 001006f0 > [ 195.937300] Call Trace: > [ 195.939755] __dwc3_gadget_ep_enable+0x2d4/0x4e0 > [ 195.944393] ? dwc3_remove_requests.constprop.0+0x86/0x170 Odd that this change would affect the USB enablment path, as they were focused on the pullup disable path. Would you happen to have any downstream changes on top of v5.12-rc4 we could review to see if they are still required? (ie where is the dwc3_remove_requests() coming from during ep enable) >>> >>> You may check my branch [1] on GH. Basically you may be interested in >>> the commit: >>> 0f86df1294ee7523060cc16eafaf4898c693eab0 REVERTME: usb: dwc3: gadget: >>> skip endpoints ep[18]{in,out} >>> Otherwise it's a clean v5.12-rc4 with a revert and another USB PHY >>> suspend fix (which also shouldn't affect this). >> >> Can you link your GH reference? > > Oops, sorry. > Here we are: > > [1]: https://github.com/andy-shev/linux/tree/eds-acpi > Thanks, I took a look and even tried it on my device running 5.12-rc4, but wasn't able to see the same problem. Could you help collect the ftrace after enabling the tracing KCONFIG and running the below sequence? 1. Mount debugfs 2. Set up tracing instance
RE: [PATCH 2/5] cifsd: add server-side procedures for SMB3
> On Mon, Mar 22, 2021 at 02:13:41PM +0900, Namjae Jeon wrote: > > +static unsigned char > > +asn1_octet_decode(struct asn1_ctx *ctx, unsigned char *ch) { > > + if (ctx->pointer >= ctx->end) { > > + ctx->error = ASN1_ERR_DEC_EMPTY; > > + return 0; > > + } > > + *ch = *(ctx->pointer)++; > > + return 1; > > +} > > > Make this bool. Okay. > > > + > > +static unsigned char > > +asn1_tag_decode(struct asn1_ctx *ctx, unsigned int *tag) { > > + unsigned char ch; > > + > > + *tag = 0; > > + > > + do { > > + if (!asn1_octet_decode(ctx, &ch)) > > + return 0; > > + *tag <<= 7; > > + *tag |= ch & 0x7F; > > + } while ((ch & 0x80) == 0x80); > > + return 1; > > +} > > Bool. Okay. > > > + > > +static unsigned char > > +asn1_id_decode(struct asn1_ctx *ctx, > > + unsigned int *cls, unsigned int *con, unsigned int *tag) { > > + unsigned char ch; > > + > > + if (!asn1_octet_decode(ctx, &ch)) > > + return 0; > > + > > + *cls = (ch & 0xC0) >> 6; > > + *con = (ch & 0x20) >> 5; > > + *tag = (ch & 0x1F); > > + > > + if (*tag == 0x1F) { > > + if (!asn1_tag_decode(ctx, tag)) > > + return 0; > > + } > > + return 1; > > +} > > > Same. Okay. > > > + > > +static unsigned char > > +asn1_length_decode(struct asn1_ctx *ctx, unsigned int *def, unsigned > > +int *len) { > > + unsigned char ch, cnt; > > + > > + if (!asn1_octet_decode(ctx, &ch)) > > + return 0; > > + > > + if (ch == 0x80) > > + *def = 0; > > + else { > > + *def = 1; > > + > > + if (ch < 0x80) > > + *len = ch; > > + else { > > + cnt = (unsigned char) (ch & 0x7F); > > + *len = 0; > > + > > + while (cnt > 0) { > > + if (!asn1_octet_decode(ctx, &ch)) > > + return 0; > > + *len <<= 8; > > + *len |= ch; > > + cnt--; > > + } > > + } > > + } > > + > > + /* don't trust len bigger than ctx buffer */ > > + if (*len > ctx->end - ctx->pointer) > > + return 0; > > + > > + return 1; > > +} > > > Same etc for all. Okay. > > > + > > +static unsigned char > > +asn1_header_decode(struct asn1_ctx *ctx, > > + unsigned char **eoc, > > + unsigned int *cls, unsigned int *con, unsigned int *tag) { > > + unsigned int def = 0; > > + unsigned int len = 0; > > + > > + if (!asn1_id_decode(ctx, cls, con, tag)) > > + return 0; > > + > > + if (!asn1_length_decode(ctx, &def, &len)) > > + return 0; > > + > > + /* primitive shall be definite, indefinite shall be constructed */ > > + if (*con == ASN1_PRI && !def) > > + return 0; > > + > > + if (def) > > + *eoc = ctx->pointer + len; > > + else > > + *eoc = NULL; > > + return 1; > > +} > > + > > +static unsigned char > > +asn1_eoc_decode(struct asn1_ctx *ctx, unsigned char *eoc) { > > + unsigned char ch; > > + > > + if (!eoc) { > > + if (!asn1_octet_decode(ctx, &ch)) > > + return 0; > > + > > + if (ch != 0x00) { > > + ctx->error = ASN1_ERR_DEC_EOC_MISMATCH; > > + return 0; > > + } > > + > > + if (!asn1_octet_decode(ctx, &ch)) > > + return 0; > > + > > + if (ch != 0x00) { > > + ctx->error = ASN1_ERR_DEC_EOC_MISMATCH; > > + return 0; > > + } > > + } else { > > + if (ctx->pointer != eoc) { > > + ctx->error = ASN1_ERR_DEC_LENGTH_MISMATCH; > > + return 0; > > + } > > + } > > + return 1; > > +} > > + > > +static unsigned char > > +asn1_subid_decode(struct asn1_ctx *ctx, unsigned long *subid) { > > + unsigned char ch; > > + > > + *subid = 0; > > + > > + do { > > + if (!asn1_octet_decode(ctx, &ch)) > > + return 0; > > + > > + *subid <<= 7; > > + *subid |= ch & 0x7F; > > + } while ((ch & 0x80) == 0x80); > > + return 1; > > +} > > + > > +static int > > +asn1_oid_decode(struct asn1_ctx *ctx, > > + unsigned char *eoc, unsigned long **oid, unsigned int *len) { > > + unsigned long subid; > > + unsigned int size; > > + unsigned long *optr; > > + > > + size = eoc - ctx->pointer + 1; > > + > > + /* first subid actually encodes first two subids */ > > + if (size < 2 || size > UINT_MAX/sizeof(unsigned long)) > > + return 0; > > + > > + *oid = kmalloc(size * sizeof(unsigned long), GFP_KERNEL); > > + if (!*oid) > > + return 0; > > + > > + optr = *oid; > > + > > + if (!asn1_subid_decode(ctx, &subid)) { > > + kfree(*oid); > > + *oid = NULL; > > + return 0; > > +
Re: [PATCH v3 03/25] x86/sgx: Wipe out EREMOVE from sgx_free_epc_page()
On Mon, 22 Mar 2021 23:37:26 +0100 Borislav Petkov wrote: > On Tue, Mar 23, 2021 at 11:06:43AM +1300, Kai Huang wrote: > > This path is called by host SGX driver only, so yes this leaking is done by > > host enclaves only. > > Yes, so I was told. > > > This patch is purpose is to break EREMOVE out of sgx_free_epc_page() so > > virtual > > EPC code can use sgx_free_epc_page(), and handle EREMOVE logic differently. > > This patch doesn't have intention to bring functional change. I changed the > > error msg because Dave said it worth to give a message saying EPC page is > > leaked, and I thought this minor change won't break anything. > > I read that already - you don't have to repeat it. > > > btw, currently virtual EPC code (patch 5) handles in similar way: There's > > one > > EREMOVE error is expected and virtual EPC code can handle, but for other > > errors, it means kernel bug, and virtual EPC code gives a WARN(), and that > > EPC > > page is leaked too: > > > > + WARN_ONCE(ret != SGX_CHILD_PRESENT, > > + "EREMOVE (EPC page 0x%lx): unexpected error: %d\n", > > + sgx_get_epc_phys_addr(epc_page), ret); > > + return ret; > > > > So to me they are just WARN() to catch kernel bug. > > You don't care about users, do you? Because when that error happens, > they won't come crying to you to fix it. > > Lemme save you some trouble: we don't take half-baked code into the > kernel until stuff has been discussed and analyzed properly. So instead > of trying to downplay this, try answering my questions. > > Here's another one: when does EREMOVE fail? > > /me goes and looks it up > > "The instruction fails if the operand is not properly aligned or does > not refer to an EPC page or the page is in use by another thread, or > other threads are running in the enclave to which the page belongs. In > addition the instruction fails if the operand refers to an SECS with > associations." > > And I guess those conditions will become more in the future. > > Now, let's play. I'm the cloud admin and you're cloud OS customer > support. I say: > > "I got this scary error message while running enclaves on my server > > "EREMOVE returned ... . EPC page leaked. Reboot required to retrieve leaked > pages." > > but I cannot reboot that machine because there are guests running on it > and I'm getting paid for those guests and I might get sued if I do?" > > Your turn, go wild. I suppose admin can migrate those VMs, and then engineers can analyse the root cause of such failure, and then fix it.
Re: [PATCH 02/11] x86: tboot: avoid Wstringop-overread-warning
* Martin Sebor wrote: > > I.e. the real workaround might be to turn off the > > -Wstringop-overread-warning, > > until GCC-11 gets fixed? > > In GCC 10 -Wstringop-overread is a subset of -Wstringop-overflow. > GCC 11 breaks it out as a separate warning to make it easier to > control. Both warnings have caught some real bugs but they both > have a nonzero rate of false positives. Other than bug reports > we don't have enough data to say what their S/N ratio might be > but my sense is that it's fairly high in general. > > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=wstringop-overread > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=wstringop-overflow > > In GCC 11, all access warnings expect objects to be either declared > or allocated. Pointers with constant values are taken to point to > nothing valid (as Arnd mentioned above, this is to detect invalid > accesses to members of structs at address zero). > > One possible solution to the known address problem is to extend GCC > attributes address and io that pin an object to a hardwired address > to all targets (at the moment they're supported on just one or two > targets). I'm not sure this can still happen before GCC 11 releases > sometime in April or May. > > Until then, another workaround is to convert the fixed address to > a volatile pointer before using it for the access, along the lines > below. It should have only a negligible effect on efficiency. Thank you for the detailed answer! I think I'll go with Arnd's original patch - which makes the code a slightly bit cleaner by separating out the check_tboot_version() check into a standalone function. The only ugly aspect is the global nature of the 'tboot' pointer - but that's a self-inflicted wound. I'd also guess that the S/N ratio somewhat unfairly penalizes this warning right now, because the kernel had a decade of growing real fixes via other efforts such as static and dynamic instrumentation as well. So the probability of false positive remaining is in fact higher, and going forward we should see a better S/N ratio of this warning. Most of which will never be seen by upstream maintainers, as the mishaps will stay at the individual developer level. :-) Thanks, Ingo
Re: [PATCH] arm64: dts: ls1028a: fix optee node
On Thu, Mar 18, 2021 at 3:36 AM Michael Walle wrote: > > Don't enable the optee node in the SoC include. It is an optional > component and actually, if enabled, breaks boards which doesn't have it. Hi Shawn, Shall we make this a general rule? I see quite a few SoC dtsi files are having the optee node enabled by default. Regards, Leo > > This reverts commit 48787485f8de ("arm64: dts: ls1028a: enable optee > node") and enables the node per board, assuming the intend of the > original author was to enable OPTEE for the LS1028A-RDB and the > LS1028A-QDS. > > Fixes: 48787485f8de ("arm64: dts: ls1028a: enable optee node") > Reported-by: Guillaume Tucker > Reported-by: "kernelci.org bot" > Tested-by: Michael Walle > Signed-off-by: Michael Walle > --- > arch/arm64/boot/dts/freescale/fsl-ls1028a-qds.dts | 4 > arch/arm64/boot/dts/freescale/fsl-ls1028a-rdb.dts | 4 > arch/arm64/boot/dts/freescale/fsl-ls1028a.dtsi| 3 ++- > 3 files changed, 10 insertions(+), 1 deletion(-) > > diff --git a/arch/arm64/boot/dts/freescale/fsl-ls1028a-qds.dts > b/arch/arm64/boot/dts/freescale/fsl-ls1028a-qds.dts > index fbcba9cb8503..060d3c79244d 100644 > --- a/arch/arm64/boot/dts/freescale/fsl-ls1028a-qds.dts > +++ b/arch/arm64/boot/dts/freescale/fsl-ls1028a-qds.dts > @@ -327,6 +327,10 @@ > status = "okay"; > }; > > +&optee { > + status = "okay"; > +}; > + > &sai1 { > status = "okay"; > }; > diff --git a/arch/arm64/boot/dts/freescale/fsl-ls1028a-rdb.dts > b/arch/arm64/boot/dts/freescale/fsl-ls1028a-rdb.dts > index 41ae6e7675ba..1bdf0104d492 100644 > --- a/arch/arm64/boot/dts/freescale/fsl-ls1028a-rdb.dts > +++ b/arch/arm64/boot/dts/freescale/fsl-ls1028a-rdb.dts > @@ -274,6 +274,10 @@ > status = "okay"; > }; > > +&optee { > + status = "okay"; > +}; > + > &sai4 { > status = "okay"; > }; > diff --git a/arch/arm64/boot/dts/freescale/fsl-ls1028a.dtsi > b/arch/arm64/boot/dts/freescale/fsl-ls1028a.dtsi > index 50d277eb2a54..e2007ebacd69 100644 > --- a/arch/arm64/boot/dts/freescale/fsl-ls1028a.dtsi > +++ b/arch/arm64/boot/dts/freescale/fsl-ls1028a.dtsi > @@ -92,9 +92,10 @@ > }; > > firmware { > - optee { > + optee: optee { > compatible = "linaro,optee-tz"; > method = "smc"; > + status = "disabled"; > }; > }; > > -- > 2.20.1 >
Re: [RFC PATCH 2/8] hugetlb: recompute min_count when dropping hugetlb_lock
On 3/22/21 7:07 AM, Michal Hocko wrote: > On Fri 19-03-21 15:42:03, Mike Kravetz wrote: >> The routine set_max_huge_pages reduces the number of hugetlb_pages, >> by calling free_pool_huge_page in a loop. It does this as long as >> persistent_huge_pages() is above a calculated min_count value. >> However, this loop can conditionally drop hugetlb_lock and in some >> circumstances free_pool_huge_page can drop hugetlb_lock. If the >> lock is dropped, counters could change the calculated min_count >> value may no longer be valid. > > OK, this one looks like a real bug fix introduced by 55f67141a8927. > Unless I am missing something we could release pages which are reserved > already. > >> The routine try_to_free_low has the same issue. >> >> Recalculate min_count in each loop iteration as hugetlb_lock may have >> been dropped. >> >> Signed-off-by: Mike Kravetz >> --- >> mm/hugetlb.c | 25 + >> 1 file changed, 21 insertions(+), 4 deletions(-) >> >> diff --git a/mm/hugetlb.c b/mm/hugetlb.c >> index d5be25f910e8..c537274c2a38 100644 >> --- a/mm/hugetlb.c >> +++ b/mm/hugetlb.c >> @@ -2521,11 +2521,20 @@ static void __init report_hugepages(void) >> } >> } >> >> +static inline unsigned long min_hp_count(struct hstate *h, unsigned long >> count) >> +{ >> +unsigned long min_count; >> + >> +min_count = h->resv_huge_pages + h->nr_huge_pages - h->free_huge_pages; >> +return max(count, min_count); > > Just out of curiousity, is compiler allowed to inline this piece of code > and then cache the value? In other words do we need to make these > READ_ONCE or otherwise enforce the no-caching behavior? I honestly do not know if the compiler is allowed to do that. The assembly code generated by my compiler does not cache the value, but that does not guarantee anything. I can add READ_ONCE to make the function look something like: static inline unsigned long min_hp_count(struct hstate *h, unsigned long count) { unsigned long min_count; min_count = READ_ONCE(h->resv_huge_pages) + READ_ONCE(h->nr_huge_pages) - READ_ONCE(h->free_huge_pages); return max(count, min_count); } -- Mike Kravetz
[tip: timers/core] timekeeping, clocksource: Fix various typos in comments
The following commit has been merged into the timers/core branch of tip: Commit-ID: 4bf07f6562a01a488877e05267808da7147f44a5 Gitweb: https://git.kernel.org/tip/4bf07f6562a01a488877e05267808da7147f44a5 Author:Ingo Molnar AuthorDate:Mon, 22 Mar 2021 22:39:03 +01:00 Committer: Ingo Molnar CommitterDate: Mon, 22 Mar 2021 23:06:48 +01:00 timekeeping, clocksource: Fix various typos in comments Fix ~56 single-word typos in timekeeping & clocksource code comments. Signed-off-by: Ingo Molnar Cc: Thomas Gleixner Cc: John Stultz Cc: Stephen Boyd Cc: Daniel Lezcano Cc: linux-kernel@vger.kernel.org --- drivers/clocksource/clksrc-dbx500-prcmu.c | 8 ++--- drivers/clocksource/dw_apb_timer_of.c | 2 +- drivers/clocksource/hyperv_timer.c | 2 +- drivers/clocksource/timer-atmel-tcb.c | 4 +-- drivers/clocksource/timer-fsl-ftm.c | 2 +- drivers/clocksource/timer-microchip-pit64b.c| 2 +- drivers/clocksource/timer-of.c | 4 +-- drivers/clocksource/timer-ti-dm-systimer.c | 2 +- drivers/clocksource/timer-vf-pit.c | 2 +- include/linux/clocksource.h | 2 +- include/linux/timex.h | 2 +- kernel/time/alarmtimer.c| 6 ++-- kernel/time/clocksource.c | 4 +-- kernel/time/hrtimer.c | 18 ++-- kernel/time/jiffies.c | 2 +- kernel/time/ntp.c | 2 +- kernel/time/posix-cpu-timers.c | 6 ++-- kernel/time/tick-broadcast-hrtimer.c| 2 +- kernel/time/tick-broadcast.c| 4 +-- kernel/time/tick-oneshot.c | 2 +- kernel/time/tick-sched.c| 2 +- kernel/time/tick-sched.h| 2 +- kernel/time/time.c | 2 +- kernel/time/timekeeping.c | 10 +++ kernel/time/timer.c | 4 +-- kernel/time/vsyscall.c | 2 +- tools/testing/selftests/timers/clocksource-switch.c | 4 +-- tools/testing/selftests/timers/leap-a-day.c | 2 +- tools/testing/selftests/timers/leapcrash.c | 4 +-- tools/testing/selftests/timers/threadtest.c | 2 +- 30 files changed, 56 insertions(+), 56 deletions(-) diff --git a/drivers/clocksource/clksrc-dbx500-prcmu.c b/drivers/clocksource/clksrc-dbx500-prcmu.c index 996900d..2fc93e4 100644 --- a/drivers/clocksource/clksrc-dbx500-prcmu.c +++ b/drivers/clocksource/clksrc-dbx500-prcmu.c @@ -18,7 +18,7 @@ #define RATE_32K 32768 -#define TIMER_MODE_CONTINOUS 0x1 +#define TIMER_MODE_CONTINUOUS 0x1 #define TIMER_DOWNCOUNT_VAL0x #define PRCMU_TIMER_REF0 @@ -55,13 +55,13 @@ static int __init clksrc_dbx500_prcmu_init(struct device_node *node) /* * The A9 sub system expects the timer to be configured as -* a continous looping timer. +* a continuous looping timer. * The PRCMU should configure it but if it for some reason * don't we do it here. */ if (readl(clksrc_dbx500_timer_base + PRCMU_TIMER_MODE) != - TIMER_MODE_CONTINOUS) { - writel(TIMER_MODE_CONTINOUS, + TIMER_MODE_CONTINUOUS) { + writel(TIMER_MODE_CONTINUOUS, clksrc_dbx500_timer_base + PRCMU_TIMER_MODE); writel(TIMER_DOWNCOUNT_VAL, clksrc_dbx500_timer_base + PRCMU_TIMER_REF); diff --git a/drivers/clocksource/dw_apb_timer_of.c b/drivers/clocksource/dw_apb_timer_of.c index 42e7e43..2b2c3b5 100644 --- a/drivers/clocksource/dw_apb_timer_of.c +++ b/drivers/clocksource/dw_apb_timer_of.c @@ -38,7 +38,7 @@ static int __init timer_get_base_and_rate(struct device_node *np, } /* -* Not all implementations use a periphal clock, so don't panic +* Not all implementations use a peripheral clock, so don't panic * if it's not present */ pclk = of_clk_get_by_name(np, "pclk"); diff --git a/drivers/clocksource/hyperv_timer.c b/drivers/clocksource/hyperv_timer.c index 269a691..a02b0a2 100644 --- a/drivers/clocksource/hyperv_timer.c +++ b/drivers/clocksource/hyperv_timer.c @@ -457,7 +457,7 @@ void __init hv_init_clocksource(void) { /* * Try to set up the TSC page clocksource. If it succeeds, we're -* done. Otherwise, set up the MSR clocksoruce. At least one of +* done. Otherwise, set up the MSR clocksource. At least one of * these will always be available except on very old versions of * Hyper-V on x86. In that case we won't have a Hyper-V * clocksource, but Linux will still run with a
RE: [PATCH v1 1/1] i2c: drivers: Use generic definitions for bus frequencies (part 2)
Thanks. Acked-by: Khalil Blaiech > -Original Message- > From: Andy Shevchenko > Sent: Monday, March 22, 2021 12:54 PM > To: Wolfram Sang ; Khalil Blaiech ; > Loic Poulain ; linux-...@vger.kernel.org; linux- > ker...@vger.kernel.org; linux-arm-...@vger.kernel.org > Cc: Robert Foss ; Andy Shevchenko > ; Wolfram Sang dreams.de> > Subject: [PATCH v1 1/1] i2c: drivers: Use generic definitions for bus > frequencies (part 2) > > Since we have generic definitions for bus frequencies, let's use them. > > Cc: Wolfram Sang > Signed-off-by: Andy Shevchenko > --- > drivers/i2c/busses/i2c-mlxbf.c| 14 -- > drivers/i2c/busses/i2c-qcom-cci.c | 4 ++-- > 2 files changed, 6 insertions(+), 12 deletions(-) > > diff --git a/drivers/i2c/busses/i2c-mlxbf.c b/drivers/i2c/busses/i2c-mlxbf.c > index 2fb0532d8a16..80ab831df349 100644 > --- a/drivers/i2c/busses/i2c-mlxbf.c > +++ b/drivers/i2c/busses/i2c-mlxbf.c > @@ -172,12 +172,6 @@ > #define MLXBF_I2C_SMBUS_THIGH_MAX_TBUF0x14 > #define MLXBF_I2C_SMBUS_SCL_LOW_TIMEOUT 0x18 > > -enum { > - MLXBF_I2C_TIMING_100KHZ = 10, > - MLXBF_I2C_TIMING_400KHZ = 40, > - MLXBF_I2C_TIMING_1000KHZ = 100, > -}; > - > /* > * Defines SMBus operating frequency and core clock frequency. > * According to ADB files, default values are compliant to 100KHz SMBus > @@ -1202,7 +1196,7 @@ static int mlxbf_i2c_init_timings(struct > platform_device *pdev, > > ret = device_property_read_u32(dev, "clock-frequency", > &config_khz); > if (ret < 0) > - config_khz = MLXBF_I2C_TIMING_100KHZ; > + config_khz = I2C_MAX_STANDARD_MODE_FREQ; > > switch (config_khz) { > default: > @@ -1210,15 +1204,15 @@ static int mlxbf_i2c_init_timings(struct > platform_device *pdev, > pr_warn("Illegal value %d: defaulting to 100 KHz\n", > config_khz); > fallthrough; > - case MLXBF_I2C_TIMING_100KHZ: > + case I2C_MAX_STANDARD_MODE_FREQ: > config_idx = MLXBF_I2C_TIMING_CONFIG_100KHZ; > break; > > - case MLXBF_I2C_TIMING_400KHZ: > + case I2C_MAX_FAST_MODE_FREQ: > config_idx = MLXBF_I2C_TIMING_CONFIG_400KHZ; > break; > > - case MLXBF_I2C_TIMING_1000KHZ: > + case I2C_MAX_FAST_MODE_PLUS_FREQ: > config_idx = MLXBF_I2C_TIMING_CONFIG_1000KHZ; > break; > } > diff --git a/drivers/i2c/busses/i2c-qcom-cci.c b/drivers/i2c/busses/i2c-qcom- > cci.c > index 1c259b5188de..c63d5545fc2a 100644 > --- a/drivers/i2c/busses/i2c-qcom-cci.c > +++ b/drivers/i2c/busses/i2c-qcom-cci.c > @@ -569,9 +569,9 @@ static int cci_probe(struct platform_device *pdev) > cci->master[idx].mode = I2C_MODE_STANDARD; > ret = of_property_read_u32(child, "clock-frequency", &val); > if (!ret) { > - if (val == 40) > + if (val == I2C_MAX_FAST_MODE_FREQ) > cci->master[idx].mode = I2C_MODE_FAST; > - else if (val == 100) > + else if (val == I2C_MAX_FAST_MODE_PLUS_FREQ) > cci->master[idx].mode = > I2C_MODE_FAST_PLUS; > } > > -- > 2.30.2
Re: [PATCH 2/2] mm/vmalloc: Use kvmalloc to allocate the table of pages
On Mon, Mar 22, 2021 at 11:36:19PM +0100, Uladzislau Rezki wrote: > On Mon, Mar 22, 2021 at 07:38:20PM +, Matthew Wilcox (Oracle) wrote: > > If we're trying to allocate 4MB of memory, the table will be 8KiB in size > > (1024 pointers * 8 bytes per pointer), which can usually be satisfied > > by a kmalloc (which is significantly faster). Instead of changing this > > open-coded implementation, just use kvmalloc(). > > > > Signed-off-by: Matthew Wilcox (Oracle) > > --- > > mm/vmalloc.c | 7 +-- > > 1 file changed, 1 insertion(+), 6 deletions(-) > > > > diff --git a/mm/vmalloc.c b/mm/vmalloc.c > > index 96444d64129a..32b640a84250 100644 > > --- a/mm/vmalloc.c > > +++ b/mm/vmalloc.c > > @@ -2802,13 +2802,8 @@ static void *__vmalloc_area_node(struct vm_struct > > *area, gfp_t gfp_mask, > > gfp_mask |= __GFP_HIGHMEM; > > > > /* Please note that the recursion is strictly bounded. */ > > - if (array_size > PAGE_SIZE) { > > - pages = __vmalloc_node(array_size, 1, nested_gfp, node, > > + pages = kvmalloc_node_caller(array_size, nested_gfp, node, > > area->caller); > > - } else { > > - pages = kmalloc_node(array_size, nested_gfp, node); > > - } > > - > > if (!pages) { > > free_vm_area(area); > > return NULL; > > -- > > 2.30.2 > Makes sense to me. Though i expected a bigger difference: > > # patch > single CPU, 4MB allocation, loops: 100 avg: 85293854 usec > > # default > single CPU, 4MB allocation, loops: 100 avg: 89275857 usec Well, 4.5% isn't something to leave on the table ... but yeah, I was expecting more in the 10-20% range. It may be more significant if there's contention on the spinlocks (like if this crazy ksmbd is calling vmalloc(4MB) on multiple nodes simultaneously). I suspect the vast majority of the time is spent calling alloc_pages_node() 1024 times. Have you looked at Mel's patch to do ... well, exactly what vmalloc() wants? https://lore.kernel.org/linux-mm/20210322091845.16437-1-mgor...@techsingularity.net/ > One question. Should we care much about fragmentation? I mean > with the patch, allocations > 2MB will do request to SLAB bigger > then PAGE_SIZE. We're pretty good about allocating memory in larger chunks these days. Looking at my laptop's slabinfo, kmalloc-8k 219232 819248 : tunables000 : sla bdata 58 58 0 That's using 8 pages per slab, so that's order-3 allocations. There's a few more of those: $ sudo grep '8 :' /proc/slabinfo |wc 42 6724508 so I have confidence that kvmalloc() will manage to use kmalloc up to 16MB vmalloc allocations, and after that it'll tend to fall back to vmalloc.
[PATCH] rcu: Fix various typos in comments
Hi Paul, Was working on automation to make it a bit more straightforward to fix typos within comments (which we tend to reintroduce during development), and here are the ones it found in the RCU code. Thanks, Ingo => From: Ingo Molnar Date: Mon, 22 Mar 2021 23:57:26 +0100 Subject: [PATCH] rcu: Fix various typos in comments Fix ~12 single-word typos in RCU code comments. Signed-off-by: Ingo Molnar Cc: Paul E. McKenney Cc: linux-kernel@vger.kernel.org --- kernel/rcu/srcutree.c | 4 ++-- kernel/rcu/sync.c | 2 +- kernel/rcu/tasks.h | 8 kernel/rcu/tree.c | 4 ++-- kernel/rcu/tree.h | 2 +- kernel/rcu/tree_plugin.h| 2 +- tools/testing/selftests/rcutorture/formal/srcu-cbmc/src/locks.h | 2 +- 7 files changed, 12 insertions(+), 12 deletions(-) diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c index e26547b34ad3..036ff5499ad5 100644 --- a/kernel/rcu/srcutree.c +++ b/kernel/rcu/srcutree.c @@ -777,9 +777,9 @@ static bool srcu_might_be_idle(struct srcu_struct *ssp) spin_unlock_irqrestore_rcu_node(sdp, flags); /* -* No local callbacks, so probabalistically probe global state. +* No local callbacks, so probabilistically probe global state. * Exact information would require acquiring locks, which would -* kill scalability, hence the probabalistic nature of the probe. +* kill scalability, hence the probabilistic nature of the probe. */ /* First, see if enough time has passed since the last GP. */ diff --git a/kernel/rcu/sync.c b/kernel/rcu/sync.c index d4558ab7a07d..3eeb871cf0de 100644 --- a/kernel/rcu/sync.c +++ b/kernel/rcu/sync.c @@ -94,7 +94,7 @@ static void rcu_sync_func(struct rcu_head *rhp) rcu_sync_call(rsp); } else { /* -* We're at least a GP after the last rcu_sync_exit(); eveybody +* We're at least a GP after the last rcu_sync_exit(); everybody * will now have observed the write side critical section. * Let 'em rip!. */ diff --git a/kernel/rcu/tasks.h b/kernel/rcu/tasks.h index af7c19439f4e..ac3c362e08a3 100644 --- a/kernel/rcu/tasks.h +++ b/kernel/rcu/tasks.h @@ -23,7 +23,7 @@ typedef void (*postgp_func_t)(struct rcu_tasks *rtp); * Definition for a Tasks-RCU-like mechanism. * @cbs_head: Head of callback list. * @cbs_tail: Tail pointer for callback list. - * @cbs_wq: Wait queue allowning new callback to get kthread's attention. + * @cbs_wq: Wait queue allowing new callback to get kthread's attention. * @cbs_lock: Lock protecting callback list. * @kthread_ptr: This flavor's grace-period/callback-invocation kthread. * @gp_func: This flavor's grace-period-wait function. @@ -504,7 +504,7 @@ DEFINE_RCU_TASKS(rcu_tasks, rcu_tasks_wait_gp, call_rcu_tasks, "RCU Tasks"); * or transition to usermode execution. As such, there are no read-side * primitives analogous to rcu_read_lock() and rcu_read_unlock() because * this primitive is intended to determine that all tasks have passed - * through a safe state, not so much for data-strcuture synchronization. + * through a safe state, not so much for data-structure synchronization. * * See the description of call_rcu() for more detailed information on * memory ordering guarantees. @@ -637,7 +637,7 @@ DEFINE_RCU_TASKS(rcu_tasks_rude, rcu_tasks_rude_wait_gp, call_rcu_tasks_rude, * there are no read-side primitives analogous to rcu_read_lock() and * rcu_read_unlock() because this primitive is intended to determine * that all tasks have passed through a safe state, not so much for - * data-strcuture synchronization. + * data-structure synchronization. * * See the description of call_rcu() for more detailed information on * memory ordering guarantees. @@ -1127,7 +1127,7 @@ static void exit_tasks_rcu_finish_trace(struct task_struct *t) * there are no read-side primitives analogous to rcu_read_lock() and * rcu_read_unlock() because this primitive is intended to determine * that all tasks have passed through a safe state, not so much for - * data-strcuture synchronization. + * data-structure synchronization. * * See the description of call_rcu() for more detailed information on * memory ordering guarantees. diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index da6f5213fb74..ab5bd5b391e6 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -2490,7 +2490,7 @@ int rcutree_dead_cpu(unsigned int cpu) /* * Invoke any RCU callbacks that have made it to the end of their grace - * period. Thottle as specified by rdp->blimit. + * period. Throttle as specified by rdp->blimit. */ static void rcu_do_batch(struct rcu_data
Re: [PATCH v4 18/19] coresight: sink: Add TRBE driver
On 22/03/2021 21:24, Mathieu Poirier wrote: On Fri, Mar 19, 2021 at 11:55:10AM +, Mike Leach wrote: HI Suzuki, On Fri, 19 Mar 2021 at 10:30, Suzuki K Poulose wrote: Hi Mike On 8 Mar 2021, at 17:26, Mike Leach wrote: Hi Suzuki, On Thu, 25 Feb 2021 at 19:36, Suzuki K Poulose wrote: From: Anshuman Khandual Trace Buffer Extension (TRBE) implements a trace buffer per CPU which is accessible via the system registers. The TRBE supports different addressing modes including CPU virtual address and buffer modes including the circular buffer mode. The TRBE buffer is addressed by a base pointer (TRBBASER_EL1), an write pointer (TRBPTR_EL1) and a limit pointer (TRBLIMITR_EL1). But the access to the trace buffer could be prohibited by a higher exception level (EL3 or EL2), indicated by TRBIDR_EL1.P. The TRBE can also generate a CPU private interrupt (PPI) on address translation errors and when the buffer is full. Overall implementation here is inspired from the Arm SPE driver. Cc: Mathieu Poirier Cc: Mike Leach Cc: Suzuki K Poulose Signed-off-by: Anshuman Khandual Signed-off-by: Suzuki K Poulose + +static unsigned long arm_trbe_update_buffer(struct coresight_device *csdev, + struct perf_output_handle *handle, + void *config) +{ + struct trbe_drvdata *drvdata = dev_get_drvdata(csdev->dev.parent); + struct trbe_cpudata *cpudata = dev_get_drvdata(&csdev->dev); + struct trbe_buf *buf = config; + enum trbe_fault_action act; + unsigned long size, offset; + unsigned long write, base, status; + unsigned long flags; + + WARN_ON(buf->cpudata != cpudata); + WARN_ON(cpudata->cpu != smp_processor_id()); + WARN_ON(cpudata->drvdata != drvdata); + if (cpudata->mode != CS_MODE_PERF) + return 0; + + perf_aux_output_flag(handle, PERF_AUX_FLAG_CORESIGHT_FORMAT_RAW); + + /* +* We are about to disable the TRBE. And this could in turn +* fill up the buffer triggering, an IRQ. This could be consumed +* by the PE asynchronously, causing a race here against +* the IRQ handler in closing out the handle. So, let us +* make sure the IRQ can't trigger while we are collecting +* the buffer. We also make sure that a WRAP event is handled +* accordingly. +*/ + local_irq_save(flags); + + /* +* If the TRBE was disabled due to lack of space in the AUX buffer or a +* spurious fault, the driver leaves it disabled, truncating the buffer. +* Since the etm_perf driver expects to close out the AUX buffer, the +* driver skips it. Thus, just pass in 0 size here to indicate that the +* buffer was truncated. +*/ + if (!is_trbe_enabled()) { + size = 0; + goto done; + } + /* +* perf handle structure needs to be shared with the TRBE IRQ handler for +* capturing trace data and restarting the handle. There is a probability +* of an undefined reference based crash when etm event is being stopped +* while a TRBE IRQ also getting processed. This happens due the release +* of perf handle via perf_aux_output_end() in etm_event_stop(). Stopping +* the TRBE here will ensure that no IRQ could be generated when the perf +* handle gets freed in etm_event_stop(). +*/ + trbe_drain_and_disable_local(); + write = get_trbe_write_pointer(); + base = get_trbe_base_pointer(); + + /* Check if there is a pending interrupt and handle it here */ + status = read_sysreg_s(SYS_TRBSR_EL1); + if (is_trbe_irq(status)) { + + /* +* Now that we are handling the IRQ here, clear the IRQ +* from the status, to let the irq handler know that it +* is taken care of. +*/ + clr_trbe_irq(); + isb(); + + act = trbe_get_fault_act(status); + /* +* If this was not due to a WRAP event, we have some +* errors and as such buffer is empty. +*/ + if (act != TRBE_FAULT_ACT_WRAP) { + size = 0; + goto done; + } We are using TRBE FILL mode - which halts capture on a full buffer and triggers the IRQ, without disabling the source first. This means that the mode is inherently lossy (unless by some unlikely co-incidence the last byte that caused the wrap was also the last byte to be sent from an ETE that was in the process of being disabled.) Therefore we must have a perf_aux_output_flag(handle, PERF_AUX_FLAG_TRUNCATED) call in here to signal that some trace was lost, for consistence of operation with ETR etc, and intelpt. I agree that the there is a bit of loss here due to the FILL mode. But it is not
Re: [PATCH v2] audit: log nftables configuration change events once per table
On Mon, Mar 22, 2021 at 04:49:04PM -0400, Richard Guy Briggs wrote: > diff --git a/net/netfilter/nf_tables_api.c b/net/netfilter/nf_tables_api.c > index c1eb5cdb3033..42ba44890523 100644 > --- a/net/netfilter/nf_tables_api.c > +++ b/net/netfilter/nf_tables_api.c [...] > @@ -8006,12 +7938,47 @@ static void nft_commit_notify(struct net *net, u32 > portid) > WARN_ON_ONCE(!list_empty(&net->nft.notify_list)); > } > > +void nf_tables_commit_audit_collect(struct list_head *adl, > + struct nft_trans *trans) { nitpick: curly brace starts in the line. > + struct nft_audit_data *adp; > + > + list_for_each_entry(adp, adl, list) { > + if (adp->table == trans->ctx.table) > + goto found; > + } > + adp = kzalloc(sizeof(*adp), GFP_KERNEL); if (!adp) ... > + adp->table = trans->ctx.table; > + INIT_LIST_HEAD(&adp->list); > + list_add(&adp->list, adl); > +found: > + adp->entries++; > + if (!adp->op || adp->op > trans->msg_type) > + adp->op = trans->msg_type; > +} > + > +#define AUNFTABLENAMELEN (NFT_TABLE_MAXNAMELEN + 22) > + > +void nf_tables_commit_audit_log(struct list_head *adl, u32 generation) { ^ same thing here > + struct nft_audit_data *adp, *adn; > + char aubuf[AUNFTABLENAMELEN]; > + > + list_for_each_entry_safe(adp, adn, adl, list) { > + snprintf(aubuf, AUNFTABLENAMELEN, "%s:%u", adp->table->name, > + generation); > + audit_log_nfcfg(aubuf, adp->table->family, adp->entries, > + nft2audit_op[adp->op], GFP_KERNEL); > + list_del(&adp->list); > + kfree(adp); > + } > +} > + > static int nf_tables_commit(struct net *net, struct sk_buff *skb) > { > struct nft_trans *trans, *next; > struct nft_trans_elem *te; > struct nft_chain *chain; > struct nft_table *table; > + LIST_HEAD(adl); > int err; > > if (list_empty(&net->nft.commit_list)) { > @@ -8206,12 +8173,15 @@ static int nf_tables_commit(struct net *net, struct > sk_buff *skb) > } > break; > } > + nf_tables_commit_audit_collect(&adl, trans); > } > > nft_commit_notify(net, NETLINK_CB(skb).portid); > nf_tables_gen_notify(net, skb, NFT_MSG_NEWGEN); > nf_tables_commit_release(net); > > + nf_tables_commit_audit_log(&adl, net->nft.base_seq); > + This looks more self-contained and nicer, thanks.
Re: [PATCH net-next v2 2/2] net: ipa: fix IPA validation
> The solution is to create a user space tool inside the > drivers/net/ipa directory that will link with the kernel > source files and will perform all the basic one-time checks > I want to make. Hi Alex Have you found any other driver doing this? Where do they keep there code? Could this be a selftest, put somewhere in tools/testing/selftests. Or can this be a test kernel module. Eg. we have crypt/testmsg.c which runs a number of tests on the crypto subsystem, ./kernel/time/test_udelay.c which runs times on udelay. Rather than inventing something new, please follow other examples already in the kernel. Andrew
[PATCH 3/4] selftest/arm64/ptrace: add a test for orig_x0
The test creates two processes where one traces another one. The tracee executes a system call, the tracer traps it, changes orig_x0, triggers a signal and checks that the syscall is restarted with the setted argument. Test output: $ ./ptrace_restart_syscall_test 1..3 ok 1 orig_x0: 0x3 ok 2 x0: 0x5 ok 3 The child exited with code 0. # Totals: pass:3 fail:0 xfail:0 xpass:0 skip:0 error:0 Signed-off-by: Andrei Vagin --- tools/testing/selftests/arm64/ptrace/Makefile | 6 + tools/testing/selftests/arm64/ptrace/lib.h| 36 ++ .../ptrace/ptrace_restart_syscall_test.c | 122 ++ 3 files changed, 164 insertions(+) create mode 100644 tools/testing/selftests/arm64/ptrace/Makefile create mode 100644 tools/testing/selftests/arm64/ptrace/lib.h create mode 100644 tools/testing/selftests/arm64/ptrace/ptrace_restart_syscall_test.c diff --git a/tools/testing/selftests/arm64/ptrace/Makefile b/tools/testing/selftests/arm64/ptrace/Makefile new file mode 100644 index ..1bc10e2d2ac8 --- /dev/null +++ b/tools/testing/selftests/arm64/ptrace/Makefile @@ -0,0 +1,6 @@ +# SPDX-License-Identifier: GPL-2.0 + +CFLAGS += -g -I../../../../../usr/include/ +TEST_GEN_PROGS := ptrace_restart_syscall_test + +include ../../lib.mk diff --git a/tools/testing/selftests/arm64/ptrace/lib.h b/tools/testing/selftests/arm64/ptrace/lib.h new file mode 100644 index ..14f4737188a3 --- /dev/null +++ b/tools/testing/selftests/arm64/ptrace/lib.h @@ -0,0 +1,36 @@ +// SPDX-License-Identifier: GPL-2.0-only +#ifndef __PTRACE_TEST_LOG_H +#define __PTRACE_TEST_LOG_H + +#define pr_p(func, fmt, ...) func("%s:%d: " fmt ": %m", \ + __func__, __LINE__, ##__VA_ARGS__) + +#define pr_err(fmt, ...) \ + ({ \ + ksft_test_result_error(fmt "\n", ##__VA_ARGS__);\ + -1; \ + }) + +#define pr_fail(fmt, ...) \ + ({ \ + ksft_test_result_fail(fmt "\n", ##__VA_ARGS__); \ + -1; \ + }) + +#define pr_perror(fmt, ...)pr_p(pr_err, fmt, ##__VA_ARGS__) + +static inline int ptrace_and_wait(pid_t pid, int cmd, int sig) +{ + int status; + + /* Stop on syscall-exit. */ + if (ptrace(cmd, pid, 0, 0)) + return pr_perror("Can't resume the child %d", pid); + if (waitpid(pid, &status, 0) != pid) + return pr_perror("Can't wait for the child %d", pid); + if (!WIFSTOPPED(status) || WSTOPSIG(status) != sig) + return pr_err("Unexpected status: %x", status); + return 0; +} + +#endif diff --git a/tools/testing/selftests/arm64/ptrace/ptrace_restart_syscall_test.c b/tools/testing/selftests/arm64/ptrace/ptrace_restart_syscall_test.c new file mode 100644 index ..ce59657f41be --- /dev/null +++ b/tools/testing/selftests/arm64/ptrace/ptrace_restart_syscall_test.c @@ -0,0 +1,122 @@ +// SPDX-License-Identifier: GPL-2.0-only +#include +#include +#include +#include +#include +#include +#include + +#include +#include +#include +#include +#include +#include +#include +#include + +#include "../../kselftest.h" +#include "lib.h" + +static int child(int fd) +{ + char c; + + if (read(fd, &c, 1) != 1) + return 1; + + return 0; +} + +int main(int argc, void **argv) +{ + union { + struct user_regs_struct r; + struct { + char __regs[272]; + unsigned long long orig_x0; + unsigned long long orig_x7; + }; + } regs = {}; + struct iovec iov = { + .iov_base = ®s, + .iov_len = sizeof(regs), + }; + int status; + pid_t pid; + int p[2], fdzero; + + ksft_set_plan(3); + + if (pipe(p)) + return pr_perror("Can't create a pipe"); + fdzero = open("/dev/zero", O_RDONLY); + if (fdzero < 0) + return pr_perror("Can't open /dev/zero"); + + pid = fork(); + if (pid == 0) { + kill(getpid(), SIGSTOP); + return child(p[0]); + } + if (pid < 0) + return 1; + + if (ptrace(PTRACE_ATTACH, pid, 0, 0)) + return pr_perror("Can't attach to the child %d", pid); + if (waitpid(pid, &status, 0) != pid) + return pr_perror("Can't wait for the child %d", pid); + /* Skip SIGSTOP */ + if (ptrace_and_wait(pid, PTRACE_CONT, SIGSTOP)) + return 1; + + /* Resume the child to the next system call. */ + if (ptrace_and_wait(pid, PTRACE_SYSCALL, SIGTRAP)) + return 1; + + /* Send a
[PATCH 4/4] selftest/arm64/ptrace: add a test for orig_x7
In system calls, x7 is used to indicate whether a tracee has been stopped on syscall-enter or syscall-exit and the origin value of x7 is saved in orig_x7. Test output: $ ./ptrace_syscall_test 1..4 ok 1 x7: 0 ok 2 x7: 1 ok 3 x7: 686920776f726c64 ok 4 The child exited with code 0. # Totals: pass:4 fail:0 xfail:0 xpass:0 skip:0 error:0 Signed-off-by: Andrei Vagin --- tools/testing/selftests/arm64/ptrace/Makefile | 2 +- .../arm64/ptrace/ptrace_syscall_test.c| 158 ++ 2 files changed, 159 insertions(+), 1 deletion(-) create mode 100644 tools/testing/selftests/arm64/ptrace/ptrace_syscall_test.c diff --git a/tools/testing/selftests/arm64/ptrace/Makefile b/tools/testing/selftests/arm64/ptrace/Makefile index 1bc10e2d2ac8..ea02c1a63806 100644 --- a/tools/testing/selftests/arm64/ptrace/Makefile +++ b/tools/testing/selftests/arm64/ptrace/Makefile @@ -1,6 +1,6 @@ # SPDX-License-Identifier: GPL-2.0 CFLAGS += -g -I../../../../../usr/include/ -TEST_GEN_PROGS := ptrace_restart_syscall_test +TEST_GEN_PROGS := ptrace_restart_syscall_test ptrace_syscall_test include ../../lib.mk diff --git a/tools/testing/selftests/arm64/ptrace/ptrace_syscall_test.c b/tools/testing/selftests/arm64/ptrace/ptrace_syscall_test.c new file mode 100644 index ..ad55b44ae9f5 --- /dev/null +++ b/tools/testing/selftests/arm64/ptrace/ptrace_syscall_test.c @@ -0,0 +1,158 @@ +// SPDX-License-Identifier: GPL-2.0-only +#include +#include +#include +#include +#include +#include + +#include +#include +#include +#include +#include +#include +#include + +#include "../../kselftest.h" +#include "lib.h" + +#define X7_TEST_VAL 0x686920776f726c64UL + +static long test_syscall(long *x7) +{ + register long x0 __asm__("x0") = 0; + register long *x1 __asm__("x1") = x7; + register long x8 __asm__("x8") = 0x555; + + __asm__ ( + "ldr x7, [x1, 0]\n" + "svc 0\n" + "str x7, [x1, 0]\n" + : "=r"(x0) + : "r"(x0), "r"(x1), "r"(x8) + : + ); + return x0; +} + +static int child(void) +{ + long val = X7_TEST_VAL; + + if (test_syscall(&val)) { + ksft_print_msg("The test syscall failed\n"); + return 1; + } + if (val != X7_TEST_VAL) { + ksft_print_msg("Unexpected x7: %lx\n", val); + return 1; + } + + if (test_syscall(&val)) { + ksft_print_msg("The test syscall failed\n"); + return 1; + } + if (val != ~X7_TEST_VAL) { + ksft_print_msg("Unexpected x7: %lx\n", val); + return 1; + } + + return 0; +} + +#ifndef PTRACE_SYSEMU +#define PTRACE_SYSEMU 31 +#endif + +int main(int argc, void **argv) +{ + union { + struct user_regs_struct r; + struct { + char __regs[272]; + unsigned long long orig_x0; + unsigned long long orig_x7; + }; + } regs = {}; + struct iovec iov = { + .iov_base = ®s, + .iov_len = sizeof(regs), + }; + int status; + pid_t pid; + + ksft_set_plan(4); + + pid = fork(); + if (pid == 0) { + kill(getpid(), SIGSTOP); + _exit(child()); + } + if (pid < 0) + return 1; + + if (ptrace_and_wait(pid, PTRACE_ATTACH, SIGSTOP)) + return 1; + /* skip SIGSTOP */ + if (ptrace_and_wait(pid, PTRACE_CONT, SIGSTOP)) + return 1; + + /* Resume the child to the next system call. */ + if (ptrace_and_wait(pid, PTRACE_SYSCALL, SIGTRAP)) + return 1; + + /* Check that x7 is 0 on syscall-enter. */ + if (ptrace(PTRACE_GETREGSET, pid, NT_PRSTATUS, &iov)) + return pr_perror("Can't get child registers"); + if (regs.orig_x7 != X7_TEST_VAL) + return pr_fail("Unexpected orig_x7: %lx", regs.orig_x7); + if (regs.r.regs[7] != 0) + return pr_fail("Unexpected x7: %lx", regs.r.regs[7]); + ksft_test_result_pass("x7: %llx\n", regs.r.regs[7]); + + if (ptrace_and_wait(pid, PTRACE_SYSCALL, SIGTRAP)) + return 1; + + /* Check that x7 is 1 on syscall-exit. */ + if (ptrace(PTRACE_GETREGSET, pid, NT_PRSTATUS, &iov)) + return pr_perror("Can't get child registers"); + if (regs.r.regs[7] != 1) + return pr_fail("Unexpected x7: %lx", regs.r.regs[7]); + ksft_test_result_pass("x7: %llx\n", regs.r.regs[7]); + + /* Check that the child will not see a new value of x7. */ + regs.r.regs[0] = 0; + regs.r.regs[7] = ~X7_TEST_VAL; + if (ptrace(PTRACE_SETREGSET, pid, NT_PRSTATUS, &iov)) + return pr_perror("Can't set child registers"); + + /* Resume the child to th
[PATCH 2/4] arm64/ptrace: introduce orig_x7 in the user_pt_regs structure
We have some ABI weirdness in the way that we handle syscall exit stops because we indicate whether or not the stop has been signalled from syscall entry or syscall exit by clobbering a general purpose register x7 in the tracee and restoring its old value after the stop. This behavior was inherited from ARM and it isn't common for other architectures. Now, we have PTRACE_GET_SYSCALL_INFO that gives all required information about system calls, so the hack with clobbering registers isn't needed anymore. This change instroduces orig_x7 in the user_pt_regs structure that will contains an origin value of the x7 register if the tracee is stopped in a system call.. Signed-off-by: Andrei Vagin --- arch/arm64/include/asm/ptrace.h | 1 + arch/arm64/include/uapi/asm/ptrace.h | 1 + arch/arm64/kernel/ptrace.c | 18 -- 3 files changed, 14 insertions(+), 6 deletions(-) diff --git a/arch/arm64/include/asm/ptrace.h b/arch/arm64/include/asm/ptrace.h index d4cdf98ac003..1008f0fbc5ea 100644 --- a/arch/arm64/include/asm/ptrace.h +++ b/arch/arm64/include/asm/ptrace.h @@ -184,6 +184,7 @@ struct pt_regs { u64 pc; u64 pstate; u64 orig_x0; + u64 orig_x7; }; }; #ifdef __AARCH64EB__ diff --git a/arch/arm64/include/uapi/asm/ptrace.h b/arch/arm64/include/uapi/asm/ptrace.h index 3c118c5b0893..be7583ff5f4d 100644 --- a/arch/arm64/include/uapi/asm/ptrace.h +++ b/arch/arm64/include/uapi/asm/ptrace.h @@ -91,6 +91,7 @@ struct user_pt_regs { __u64 pc; __u64 pstate; __u64 orig_x0; + __u64 orig_x7; }; struct user_fpsimd_state { diff --git a/arch/arm64/kernel/ptrace.c b/arch/arm64/kernel/ptrace.c index 170f42fd6101..1ed5b4aa986b 100644 --- a/arch/arm64/kernel/ptrace.c +++ b/arch/arm64/kernel/ptrace.c @@ -1750,7 +1750,7 @@ static void tracehook_report_syscall(struct pt_regs *regs, enum ptrace_syscall_dir dir) { int regno; - unsigned long saved_reg; + u64 _saved_reg, *saved_reg; /* * We have some ABI weirdness here in the way that we handle syscall @@ -1768,19 +1768,25 @@ static void tracehook_report_syscall(struct pt_regs *regs, * - Syscall stops behave differently to seccomp and pseudo-step traps * (the latter do not nobble any registers). */ - regno = (is_compat_task() ? 12 : 7); - saved_reg = regs->regs[regno]; + if (is_compat_task()) { + regno = 12; + saved_reg = &_saved_reg; + } else { + regno = 7; + saved_reg = ®s->orig_x7; + } + *saved_reg = regs->regs[regno]; regs->regs[regno] = dir; if (dir == PTRACE_SYSCALL_ENTER) { if (tracehook_report_syscall_entry(regs)) forget_syscall(regs); - regs->regs[regno] = saved_reg; + regs->regs[regno] = *saved_reg; } else if (!test_thread_flag(TIF_SINGLESTEP)) { tracehook_report_syscall_exit(regs, 0); - regs->regs[regno] = saved_reg; + regs->regs[regno] = *saved_reg; } else { - regs->regs[regno] = saved_reg; + regs->regs[regno] = *saved_reg; /* * Signal a pseudo-step exception since we are stepping but -- 2.29.2
[PATCH 1/4] arm64: expose orig_x0 in the user_pt_regs structure
orig_x0 is recorded at the start of the syscall entry and then it is used for resetting the argument back to its original value during syscall restarts. If orig_x0 isn't available from user-space, this makes it tricky to manage arguments of restarted system calls. Cc: Keno Fischer Signed-off-by: Andrei Vagin --- arch/arm64/include/asm/ptrace.h | 2 +- arch/arm64/include/uapi/asm/ptrace.h | 1 + 2 files changed, 2 insertions(+), 1 deletion(-) diff --git a/arch/arm64/include/asm/ptrace.h b/arch/arm64/include/asm/ptrace.h index e58bca832dff..d4cdf98ac003 100644 --- a/arch/arm64/include/asm/ptrace.h +++ b/arch/arm64/include/asm/ptrace.h @@ -183,9 +183,9 @@ struct pt_regs { u64 sp; u64 pc; u64 pstate; + u64 orig_x0; }; }; - u64 orig_x0; #ifdef __AARCH64EB__ u32 unused2; s32 syscallno; diff --git a/arch/arm64/include/uapi/asm/ptrace.h b/arch/arm64/include/uapi/asm/ptrace.h index 758ae984ff97..3c118c5b0893 100644 --- a/arch/arm64/include/uapi/asm/ptrace.h +++ b/arch/arm64/include/uapi/asm/ptrace.h @@ -90,6 +90,7 @@ struct user_pt_regs { __u64 sp; __u64 pc; __u64 pstate; + __u64 orig_x0; }; struct user_fpsimd_state { -- 2.29.2
[PATCH 0/4 v3] arm64/ptrace: allow to get all registers on syscall traps
Here are two known problems with registers when a tracee is stopped in syscall traps. The first problem is about the x7 register that is used to indicate whether or not the stop has been signalled from syscall entry or syscall exit. This means that: - Any writes by the tracer to this register during the stop are ignored/discarded. - The actual value of the register is not available during the stop, so the tracer cannot save it and restore it later. For applications like the user-mode Linux or gVisor, it is critical to have access to the full set of registers in any moment. For example, they need to change values of all registers to emulate rt_sigreturn or execve and they need to have the full set of registers to build a signal frame. The second problem is that orig_x0 isn't exposed to user-space. orig_x0 is recorded at the start of the syscall entry and then it is used for resetting the argument back to its original value during syscall restarts. This series extends the user_pt_regs structure with orig_x0 and orig_x7. This doesn't break the backward compatibility. There are two interfaces how user-space can receive user_pt_regs from the kernel. The first one is PTRACE_{GET,SET}REGSET. In this case, the user-space provides a buffer and its size and the kernel fills only the part that fits the size. The second interface is a core dump file where registers are written in a separate note and the user-space can parse only the part that it knows. Cc: Catalin Marinas Cc: Dave Martin Cc: Keno Fischer Cc: Oleg Nesterov Cc: Will Deacon Andrei Vagin (4): arm64: expose orig_x0 in the user_pt_regs structure arm64/ptrace: introduce orig_x7 in the user_pt_regs structure selftest/arm64/ptrace: add a test for orig_x0 selftest/arm64/ptrace: add a test for orig_x7 v2: use the ptrace option instead of adding a new regset. v3: append orig_x0 and orig_x7 to the user_pt_regs. arch/arm64/include/asm/ptrace.h | 5 + arch/arm64/kernel/ptrace.c| 130 +++- include/uapi/linux/elf.h | 1 + tools/testing/selftests/arm64/Makefile| 2 +- tools/testing/selftests/arm64/ptrace/Makefile | 6 + .../arm64/ptrace/ptrace_syscall_regs_test.c | 142 ++ 6 files changed, 246 insertions(+), 40 deletions(-) create mode 100644 tools/testing/selftests/arm64/ptrace/Makefile create mode 100644 tools/testing/selftests/arm64/ptrace/ptrace_syscall_regs_test.c -- 2.29.2
Re: [PATCH v2 18/18] vfs: remove unused ioctl helpers
On Mon, Mar 22, 2021 at 03:49:16PM +0100, Miklos Szeredi wrote: > Remove vfs_ioc_setflags_prepare(), vfs_ioc_fssetxattr_check() and > simple_fill_fsxattr(), which are no longer used. > > Signed-off-by: Miklos Szeredi Woo hoo, so much boilerplate goes away! Reviewed-by: Darrick J. Wong --D > --- > fs/inode.c | 87 -- > include/linux/fs.h | 12 --- > 2 files changed, 99 deletions(-) > > diff --git a/fs/inode.c b/fs/inode.c > index a047ab306f9a..ae526fd9c0a4 100644 > --- a/fs/inode.c > +++ b/fs/inode.c > @@ -12,7 +12,6 @@ > #include > #include > #include > -#include > #include > #include > #include > @@ -2314,89 +2313,3 @@ struct timespec64 current_time(struct inode *inode) > return timestamp_truncate(now, inode); > } > EXPORT_SYMBOL(current_time); > - > -/* > - * Generic function to check FS_IOC_SETFLAGS values and reject any invalid > - * configurations. > - * > - * Note: the caller should be holding i_mutex, or else be sure that they have > - * exclusive access to the inode structure. > - */ > -int vfs_ioc_setflags_prepare(struct inode *inode, unsigned int oldflags, > - unsigned int flags) > -{ > - /* > - * The IMMUTABLE and APPEND_ONLY flags can only be changed by > - * the relevant capability. > - * > - * This test looks nicer. Thanks to Pauline Middelink > - */ > - if ((flags ^ oldflags) & (FS_APPEND_FL | FS_IMMUTABLE_FL) && > - !capable(CAP_LINUX_IMMUTABLE)) > - return -EPERM; > - > - return fscrypt_prepare_setflags(inode, oldflags, flags); > -} > -EXPORT_SYMBOL(vfs_ioc_setflags_prepare); > - > -/* > - * Generic function to check FS_IOC_FSSETXATTR values and reject any invalid > - * configurations. > - * > - * Note: the caller should be holding i_mutex, or else be sure that they have > - * exclusive access to the inode structure. > - */ > -int vfs_ioc_fssetxattr_check(struct inode *inode, const struct fsxattr > *old_fa, > - struct fsxattr *fa) > -{ > - /* > - * Can't modify an immutable/append-only file unless we have > - * appropriate permission. > - */ > - if ((old_fa->fsx_xflags ^ fa->fsx_xflags) & > - (FS_XFLAG_IMMUTABLE | FS_XFLAG_APPEND) && > - !capable(CAP_LINUX_IMMUTABLE)) > - return -EPERM; > - > - /* > - * Project Quota ID state is only allowed to change from within the init > - * namespace. Enforce that restriction only if we are trying to change > - * the quota ID state. Everything else is allowed in user namespaces. > - */ > - if (current_user_ns() != &init_user_ns) { > - if (old_fa->fsx_projid != fa->fsx_projid) > - return -EINVAL; > - if ((old_fa->fsx_xflags ^ fa->fsx_xflags) & > - FS_XFLAG_PROJINHERIT) > - return -EINVAL; > - } > - > - /* Check extent size hints. */ > - if ((fa->fsx_xflags & FS_XFLAG_EXTSIZE) && !S_ISREG(inode->i_mode)) > - return -EINVAL; > - > - if ((fa->fsx_xflags & FS_XFLAG_EXTSZINHERIT) && > - !S_ISDIR(inode->i_mode)) > - return -EINVAL; > - > - if ((fa->fsx_xflags & FS_XFLAG_COWEXTSIZE) && > - !S_ISREG(inode->i_mode) && !S_ISDIR(inode->i_mode)) > - return -EINVAL; > - > - /* > - * It is only valid to set the DAX flag on regular files and > - * directories on filesystems. > - */ > - if ((fa->fsx_xflags & FS_XFLAG_DAX) && > - !(S_ISREG(inode->i_mode) || S_ISDIR(inode->i_mode))) > - return -EINVAL; > - > - /* Extent size hints of zero turn off the flags. */ > - if (fa->fsx_extsize == 0) > - fa->fsx_xflags &= ~(FS_XFLAG_EXTSIZE | FS_XFLAG_EXTSZINHERIT); > - if (fa->fsx_cowextsize == 0) > - fa->fsx_xflags &= ~FS_XFLAG_COWEXTSIZE; > - > - return 0; > -} > -EXPORT_SYMBOL(vfs_ioc_fssetxattr_check); > diff --git a/include/linux/fs.h b/include/linux/fs.h > index 9e7f6a592a70..1e88ace15004 100644 > --- a/include/linux/fs.h > +++ b/include/linux/fs.h > @@ -3571,18 +3571,6 @@ extern int vfs_fadvise(struct file *file, loff_t > offset, loff_t len, > extern int generic_fadvise(struct file *file, loff_t offset, loff_t len, > int advice); > > -int vfs_ioc_setflags_prepare(struct inode *inode, unsigned int oldflags, > - unsigned int flags); > - > -int vfs_ioc_fssetxattr_check(struct inode *inode, const struct fsxattr > *old_fa, > - struct fsxattr *fa); > - > -static inline void simple_fill_fsxattr(struct fsxattr *fa, __u32 xflags) > -{ > - memset(fa, 0, sizeof(*fa)); > - fa->fsx_xflags = xflags; > -} > - > /* > * Flush file data before changing attributes. Caller must hold any locks > * required to prevent further writes to this file until we're done settin
Re: [PATCH v2 10/18] xfs: convert to miscattr
On Mon, Mar 22, 2021 at 03:49:08PM +0100, Miklos Szeredi wrote: > Use the miscattr API to let the VFS handle locking, permission checking and > conversion. > > Signed-off-by: Miklos Szeredi > Cc: Darrick J. Wong > --- > fs/xfs/libxfs/xfs_fs.h | 4 - > fs/xfs/xfs_ioctl.c | 316 - > fs/xfs/xfs_ioctl.h | 11 ++ > fs/xfs/xfs_ioctl32.c | 2 - > fs/xfs/xfs_ioctl32.h | 2 - > fs/xfs/xfs_iops.c | 7 + > 6 files changed, 107 insertions(+), 235 deletions(-) > > diff --git a/fs/xfs/libxfs/xfs_fs.h b/fs/xfs/libxfs/xfs_fs.h > index 6fad140d4c8e..6bf7d8b7d743 100644 > --- a/fs/xfs/libxfs/xfs_fs.h > +++ b/fs/xfs/libxfs/xfs_fs.h > @@ -770,8 +770,6 @@ struct xfs_scrub_metadata { > /* > * ioctl commands that are used by Linux filesystems > */ > -#define XFS_IOC_GETXFLAGSFS_IOC_GETFLAGS > -#define XFS_IOC_SETXFLAGSFS_IOC_SETFLAGS > #define XFS_IOC_GETVERSION FS_IOC_GETVERSION > > /* > @@ -782,8 +780,6 @@ struct xfs_scrub_metadata { > #define XFS_IOC_ALLOCSP _IOW ('X', 10, struct xfs_flock64) > #define XFS_IOC_FREESP _IOW ('X', 11, struct xfs_flock64) > #define XFS_IOC_DIOINFO _IOR ('X', 30, struct dioattr) > -#define XFS_IOC_FSGETXATTR FS_IOC_FSGETXATTR > -#define XFS_IOC_FSSETXATTR FS_IOC_FSSETXATTR > #define XFS_IOC_ALLOCSP64_IOW ('X', 36, struct xfs_flock64) > #define XFS_IOC_FREESP64 _IOW ('X', 37, struct xfs_flock64) > #define XFS_IOC_GETBMAP _IOWR('X', 38, struct getbmap) > diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c > index 99dfe89a8d08..e27e3ff9a651 100644 > --- a/fs/xfs/xfs_ioctl.c > +++ b/fs/xfs/xfs_ioctl.c > @@ -40,6 +40,7 @@ > > #include > #include > +#include > > /* > * xfs_find_handle maps from userspace xfs_fsop_handlereq structure to > @@ -1053,97 +1054,51 @@ xfs_ioc_ag_geometry( > * Linux extended inode flags interface. > */ > > -STATIC unsigned int > -xfs_merge_ioc_xflags( > - unsigned intflags, > - unsigned intstart) > -{ > - unsigned intxflags = start; > - > - if (flags & FS_IMMUTABLE_FL) > - xflags |= FS_XFLAG_IMMUTABLE; > - else > - xflags &= ~FS_XFLAG_IMMUTABLE; > - if (flags & FS_APPEND_FL) > - xflags |= FS_XFLAG_APPEND; > - else > - xflags &= ~FS_XFLAG_APPEND; > - if (flags & FS_SYNC_FL) > - xflags |= FS_XFLAG_SYNC; > - else > - xflags &= ~FS_XFLAG_SYNC; > - if (flags & FS_NOATIME_FL) > - xflags |= FS_XFLAG_NOATIME; > - else > - xflags &= ~FS_XFLAG_NOATIME; > - if (flags & FS_NODUMP_FL) > - xflags |= FS_XFLAG_NODUMP; > - else > - xflags &= ~FS_XFLAG_NODUMP; > - if (flags & FS_DAX_FL) > - xflags |= FS_XFLAG_DAX; > - else > - xflags &= ~FS_XFLAG_DAX; > - > - return xflags; > -} > - > -STATIC unsigned int > -xfs_di2lxflags( > - uint16_tdi_flags, > - uint64_tdi_flags2) > -{ > - unsigned intflags = 0; > - > - if (di_flags & XFS_DIFLAG_IMMUTABLE) > - flags |= FS_IMMUTABLE_FL; > - if (di_flags & XFS_DIFLAG_APPEND) > - flags |= FS_APPEND_FL; > - if (di_flags & XFS_DIFLAG_SYNC) > - flags |= FS_SYNC_FL; > - if (di_flags & XFS_DIFLAG_NOATIME) > - flags |= FS_NOATIME_FL; > - if (di_flags & XFS_DIFLAG_NODUMP) > - flags |= FS_NODUMP_FL; > - if (di_flags2 & XFS_DIFLAG2_DAX) { > - flags |= FS_DAX_FL; > - } > - return flags; > -} > - > static void > xfs_fill_fsxattr( > struct xfs_inode*ip, > boolattr, > - struct fsxattr *fa) > + struct miscattr *ma) > { > struct xfs_ifork*ifp = attr ? ip->i_afp : &ip->i_df; Hm, could you replace "bool attr" with "int whichfork"? The new signature and first line of code becomes: static void xfs_fill_fsxattr( struct xfs_inode*ip, int whichfork, struct miscattr *ma) { struct xfs_ifork*ifp = XFS_IFORK_PTR(ip, whichfork); ...and then the two wrappers of xfs_fill_fsxattr become: STATIC int xfs_ioc_fsgetxattra( struct xfs_inode*ip, void__user *arg) { struct miscattr ma; xfs_ilock(ip, XFS_ILOCK_SHARED); xfs_fill_fsxattr(ip, XFS_ATTR_FORK, &ma); xfs_iunlock(ip, XFS_ILOCK_SHARED); return fsxattr_copy_to_user(&ma, arg); } int xfs_miscattr_get( struct dentry *dentry, struct miscattr *ma) { struct xfs_inode*ip = XFS_I(d_inode(dentry)); xfs_ilock(ip, XFS_ILOCK_SHARED); xfs_fill_fsxattr(ip, XFS_DATA_FORK, ma); xfs_iunlock(ip, XFS_ILOCK_SHARED); return 0; } This makes it clearer that FSGETXATTRA reports on the extended attributes
Re: [PATCH 02/11] x86: tboot: avoid Wstringop-overread-warning
On Mon, Mar 22, 2021 at 11:09 PM Martin Sebor wrote: > On 3/22/21 2:29 PM, Ingo Molnar wrote: > > * Arnd Bergmann wrote: > > > > I.e. the real workaround might be to turn off the > > -Wstringop-overread-warning, > > until GCC-11 gets fixed? > > In GCC 10 -Wstringop-overread is a subset of -Wstringop-overflow. > GCC 11 breaks it out as a separate warning to make it easier to > control. Both warnings have caught some real bugs but they both > have a nonzero rate of false positives. Other than bug reports > we don't have enough data to say what their S/N ratio might be > but my sense is that it's fairly high in general. > >https://gcc.gnu.org/bugzilla/show_bug.cgi?id=wstringop-overread >https://gcc.gnu.org/bugzilla/show_bug.cgi?id=wstringop-overflow Unfortunately, stringop-overflow is one of the warnings that is completely disabled in the kernel at the moment, rather than enabled at one of the user-selectable higher warning levels. I have a patch series to change that and to pull some of these into the lower W=1 or W=2 levels or even enable them by default. To do this though, I first need to ensure that the actual output is empty for the normal builds. I added -Wstringop-overflow to the list of warnings I wanted to address because it is a new warning and only has a dozen or so occurrences throughout the kernel. > In GCC 11, all access warnings expect objects to be either declared > or allocated. Pointers with constant values are taken to point to > nothing valid (as Arnd mentioned above, this is to detect invalid > accesses to members of structs at address zero). > > One possible solution to the known address problem is to extend GCC > attributes address and io that pin an object to a hardwired address > to all targets (at the moment they're supported on just one or two > targets). I'm not sure this can still happen before GCC 11 releases > sometime in April or May. > > Until then, another workaround is to convert the fixed address to > a volatile pointer before using it for the access, along the lines > below. It should have only a negligible effect on efficiency. > > diff --git a/arch/x86/kernel/tboot.c b/arch/x86/kernel/tboot.c > index 4c09ba110204..76326b906010 100644 > --- a/arch/x86/kernel/tboot.c > +++ b/arch/x86/kernel/tboot.c > @@ -67,7 +67,9 @@ void __init tboot_probe(void) > /* Map and check for tboot UUID. */ > set_fixmap(FIX_TBOOT_BASE, boot_params.tboot_addr); > tboot = (struct tboot *)fix_to_virt(FIX_TBOOT_BASE); > - if (memcmp(&tboot_uuid, &tboot->uuid, sizeof(tboot->uuid))) { > + if (memcmp(&tboot_uuid, > + (*(struct tboot* volatile *)(&tboot))->uuid, > + sizeof(tboot->uuid))) { > pr_warn("tboot at 0x%llx is invalid\n", I think a stray 'volatile' would raise too many red flags here, but I checked that the RELOC_HIDE() macro has the same effect, e.g. @@ -66,7 +67,8 @@ void __init tboot_probe(void) /* Map and check for tboot UUID. */ set_fixmap(FIX_TBOOT_BASE, boot_params.tboot_addr); - tboot = (struct tboot *)fix_to_virt(FIX_TBOOT_BASE); + /* RELOC_HIDE to prevent gcc warnings about NULL pointer */ + tboot = RELOC_HIDE(NULL, fix_to_virt(FIX_TBOOT_BASE)); if (memcmp(&tboot_uuid, &tboot->uuid, sizeof(tboot->uuid))) { pr_warn("tboot at 0x%llx is invalid\n", boot_params.tboot_addr); tboot = NULL; Arnd
Re: [PATCH v4 15/19] dts: bindings: Document device tree bindings for ETE
On 22/03/2021 17:28, Rob Herring wrote: On Mon, Mar 22, 2021 at 10:53 AM Suzuki K Poulose wrote: Hi Rob On 06/03/2021 21:06, Rob Herring wrote: On Thu, Feb 25, 2021 at 07:35:39PM +, Suzuki K Poulose wrote: Document the device tree bindings for Embedded Trace Extensions. ETE can be connected to legacy coresight components and thus could optionally contain a connection graph as described by the CoreSight bindings. Cc: devicet...@vger.kernel.org Cc: Mathieu Poirier Cc: Mike Leach Cc: Rob Herring Signed-off-by: Suzuki K Poulose --- + out-ports: +description: | + Output connections from the ETE to legacy CoreSight trace bus. +$ref: /schemas/graph.yaml#/properties/port s/port/ports/ Ok. And then you need: properties: port: description: what this port is $ref: /schemas/graph.yaml#/properties/port Isn't this already covered by the definition of ports ? There are no fixed connections for ETE. It is optional and could be connected to any legacy CoreSight component. i.e, a "ports" object can have port objects inside. 'properties/ports' only defines that you have 'port' nodes within it. Given we have defined out-ports as an object "confirming to the ports" do we need to describe the individual port nodes ? Yes, you have to define what the 'port' nodes are. A port is a data stream and you should know what your hardware has. What the data stream is connected to is outside the scope of the binding. Ok, I have included the above changes for the next version. Thanks Suzuki
RE: [PATCH v2] mmc: sdhci-of-dwcmshc: add ACPI support for BlueField-3 SoC
> -Original Message- > From: Ulf Hansson > Sent: Monday, March 22, 2021 5:51 AM > To: Liming Sun > Cc: Adrian Hunter ; Khalil Blaiech > ; linux-mmc ; Linux > Kernel Mailing List > Subject: Re: [PATCH v2] mmc: sdhci-of-dwcmshc: add ACPI support for > BlueField-3 SoC > > On Fri, 19 Mar 2021 at 21:23, Liming Sun wrote: > > > > Uffe, > > > > Can I confirm whether you meant the 'master' branch or some other > branch? > > I did a rebase of master and didn't see Shawn Lin's changes in the sdhci-of- > dwcmshc.c > > git.kernel.org/pub/scm/linux/kernel/git/ulfh/mmc.git next Thanks! Rebased and posted v3. > > [...] > > Kind regards > Uffe
[PATCH v3] mmc: sdhci-of-dwcmshc: add ACPI support for BlueField-3 SoC
This commit adds ACPI support in the sdhci-of-dwcmshc driver for BlueField-3 SoC. It has changes to only use the clock hierarchy for Deviec Tree since the clk is not supported by ACPI. Instead, ACPI can define 'clock-frequency' which is parsed by existing sdhci_get_property(). This clock value will be returned in function dwcmshc_get_max_clock(). Signed-off-by: Liming Sun Reviewed-by: Khalil Blaiech --- v2->v3: Rebase to mmc next. v1->v2: Changes for comments from Adrian Hunter : - Make changes in sdhci-of-dwcmshc instead. v1: Initial version which was done in sdhci-acpi.c --- drivers/mmc/host/sdhci-of-dwcmshc.c | 50 ++--- 1 file changed, 36 insertions(+), 14 deletions(-) diff --git a/drivers/mmc/host/sdhci-of-dwcmshc.c b/drivers/mmc/host/sdhci-of-dwcmshc.c index 0687368..1113a56 100644 --- a/drivers/mmc/host/sdhci-of-dwcmshc.c +++ b/drivers/mmc/host/sdhci-of-dwcmshc.c @@ -7,6 +7,7 @@ * Author: Jisheng Zhang */ +#include #include #include #include @@ -94,6 +95,16 @@ static void dwcmshc_adma_write_desc(struct sdhci_host *host, void **desc, sdhci_adma_write_desc(host, desc, addr, len, cmd); } +static unsigned int dwcmshc_get_max_clock(struct sdhci_host *host) +{ + struct sdhci_pltfm_host *pltfm_host = sdhci_priv(host); + + if (pltfm_host->clk) + return sdhci_pltfm_clk_get_max_clock(host); + else + return pltfm_host->clock; +} + static void dwcmshc_check_auto_cmd23(struct mmc_host *mmc, struct mmc_request *mrq) { @@ -248,7 +259,7 @@ static void dwcmshc_rk3568_set_clock(struct sdhci_host *host, unsigned int clock .set_clock = sdhci_set_clock, .set_bus_width = sdhci_set_bus_width, .set_uhs_signaling = dwcmshc_set_uhs_signaling, - .get_max_clock = sdhci_pltfm_clk_get_max_clock, + .get_max_clock = dwcmshc_get_max_clock, .reset = sdhci_reset, .adma_write_desc= dwcmshc_adma_write_desc, }; @@ -323,8 +334,16 @@ static int dwcmshc_rk3568_init(struct sdhci_host *host, struct dwcmshc_priv *dwc }; MODULE_DEVICE_TABLE(of, sdhci_dwcmshc_dt_ids); +#ifdef CONFIG_ACPI +static const struct acpi_device_id sdhci_dwcmshc_acpi_ids[] = { + { .id = "MLNXBF30" }, + {} +}; +#endif + static int dwcmshc_probe(struct platform_device *pdev) { + struct device *dev = &pdev->dev; struct sdhci_pltfm_host *pltfm_host; struct sdhci_host *host; struct dwcmshc_priv *priv; @@ -347,7 +366,7 @@ static int dwcmshc_probe(struct platform_device *pdev) /* * extra adma table cnt for cross 128M boundary handling. */ - extra = DIV_ROUND_UP_ULL(dma_get_required_mask(&pdev->dev), SZ_128M); + extra = DIV_ROUND_UP_ULL(dma_get_required_mask(dev), SZ_128M); if (extra > SDHCI_MAX_SEGS) extra = SDHCI_MAX_SEGS; host->adma_table_cnt += extra; @@ -355,19 +374,21 @@ static int dwcmshc_probe(struct platform_device *pdev) pltfm_host = sdhci_priv(host); priv = sdhci_pltfm_priv(pltfm_host); - pltfm_host->clk = devm_clk_get(&pdev->dev, "core"); - if (IS_ERR(pltfm_host->clk)) { - err = PTR_ERR(pltfm_host->clk); - dev_err(&pdev->dev, "failed to get core clk: %d\n", err); - goto free_pltfm; - } - err = clk_prepare_enable(pltfm_host->clk); - if (err) - goto free_pltfm; + if (dev->of_node) { + pltfm_host->clk = devm_clk_get(dev, "core"); + if (IS_ERR(pltfm_host->clk)) { + err = PTR_ERR(pltfm_host->clk); + dev_err(dev, "failed to get core clk: %d\n", err); + goto free_pltfm; + } + err = clk_prepare_enable(pltfm_host->clk); + if (err) + goto free_pltfm; - priv->bus_clk = devm_clk_get(&pdev->dev, "bus"); - if (!IS_ERR(priv->bus_clk)) - clk_prepare_enable(priv->bus_clk); + priv->bus_clk = devm_clk_get(dev, "bus"); + if (!IS_ERR(priv->bus_clk)) + clk_prepare_enable(priv->bus_clk); + } err = mmc_of_parse(host->mmc); if (err) @@ -489,6 +510,7 @@ static int dwcmshc_resume(struct device *dev) .name = "sdhci-dwcmshc", .probe_type = PROBE_PREFER_ASYNCHRONOUS, .of_match_table = sdhci_dwcmshc_dt_ids, + .acpi_match_table = ACPI_PTR(sdhci_dwcmshc_acpi_ids), .pm = &dwcmshc_pmops, }, .probe = dwcmshc_probe, -- 1.8.3.1
[PATCH] tracing: Fix various typos in comments
Fix ~59 single-word typos in the tracing code comments. Signed-off-by: Ingo Molnar Cc: Steven Rostedt Cc: linux-kernel@vger.kernel.org --- arch/microblaze/include/asm/ftrace.h | 2 +- arch/nds32/kernel/ftrace.c | 2 +- arch/powerpc/include/asm/ftrace.h| 4 ++-- arch/sh/kernel/ftrace.c | 2 +- arch/sparc/include/asm/ftrace.h | 2 +- fs/tracefs/inode.c | 2 +- include/linux/ftrace.h | 4 ++-- include/linux/trace_events.h | 2 +- include/linux/tracepoint.h | 2 +- include/trace/events/io_uring.h | 2 +- include/trace/events/rcu.h | 2 +- include/trace/events/sched.h | 2 +- include/trace/events/timer.h | 2 +- kernel/trace/bpf_trace.c | 4 ++-- kernel/trace/fgraph.c| 4 ++-- kernel/trace/ftrace.c| 8 kernel/trace/ring_buffer.c | 2 +- kernel/trace/synth_event_gen_test.c | 2 +- kernel/trace/trace.c | 18 +- kernel/trace/trace.h | 4 ++-- kernel/trace/trace_event_perf.c | 2 +- kernel/trace/trace_events.c | 4 ++-- kernel/trace/trace_events_filter.c | 4 ++-- kernel/trace/trace_events_synth.c| 2 +- kernel/trace/trace_functions_graph.c | 2 +- kernel/trace/trace_hwlat.c | 4 ++-- kernel/trace/trace_kprobe.c | 2 +- kernel/trace/trace_probe.c | 6 +++--- kernel/trace/trace_probe.h | 2 +- kernel/trace/trace_probe_tmpl.h | 2 +- kernel/trace/trace_selftest.c| 4 ++-- kernel/trace/trace_seq.c | 12 ++-- 32 files changed, 59 insertions(+), 59 deletions(-) diff --git a/arch/microblaze/include/asm/ftrace.h b/arch/microblaze/include/asm/ftrace.h index 5db7f4489f05..6a92bed37794 100644 --- a/arch/microblaze/include/asm/ftrace.h +++ b/arch/microblaze/include/asm/ftrace.h @@ -13,7 +13,7 @@ extern void ftrace_call_graph(void); #endif #ifdef CONFIG_DYNAMIC_FTRACE -/* reloction of mcount call site is the same as the address */ +/* relocation of mcount call site is the same as the address */ static inline unsigned long ftrace_call_adjust(unsigned long addr) { return addr; diff --git a/arch/nds32/kernel/ftrace.c b/arch/nds32/kernel/ftrace.c index 414f8a780cc3..0e23e3a8df6b 100644 --- a/arch/nds32/kernel/ftrace.c +++ b/arch/nds32/kernel/ftrace.c @@ -236,7 +236,7 @@ void __naked return_to_handler(void) "bal ftrace_return_to_handler\n\t" "move $lp, $r0 \n\t" - /* restore state nedded by the ABI */ + /* restore state needed by the ABI */ "lmw.bim $r0,[$sp],$r1,#0x0 \n\t"); } diff --git a/arch/powerpc/include/asm/ftrace.h b/arch/powerpc/include/asm/ftrace.h index bc76970b6ee5..debe8c4f7062 100644 --- a/arch/powerpc/include/asm/ftrace.h +++ b/arch/powerpc/include/asm/ftrace.h @@ -12,7 +12,7 @@ #ifdef __ASSEMBLY__ -/* Based off of objdump optput from glibc */ +/* Based off of objdump output from glibc */ #define MCOUNT_SAVE_FRAME \ stwur1,-48(r1); \ @@ -52,7 +52,7 @@ extern void _mcount(void); static inline unsigned long ftrace_call_adjust(unsigned long addr) { - /* reloction of mcount call site is the same as the address */ + /* relocation of mcount call site is the same as the address */ return addr; } diff --git a/arch/sh/kernel/ftrace.c b/arch/sh/kernel/ftrace.c index 0646c5961846..295c43315bbe 100644 --- a/arch/sh/kernel/ftrace.c +++ b/arch/sh/kernel/ftrace.c @@ -67,7 +67,7 @@ static unsigned char *ftrace_call_replace(unsigned long ip, unsigned long addr) * Modifying code must take extra care. On an SMP machine, if * the code being modified is also being executed on another CPU * that CPU will have undefined results and possibly take a GPF. - * We use kstop_machine to stop other CPUS from exectuing code. + * We use kstop_machine to stop other CPUS from executing code. * But this does not stop NMIs from happening. We still need * to protect against that. We separate out the modification of * the code to take care of this. diff --git a/arch/sparc/include/asm/ftrace.h b/arch/sparc/include/asm/ftrace.h index d3aa1a524431..e284394cb3aa 100644 --- a/arch/sparc/include/asm/ftrace.h +++ b/arch/sparc/include/asm/ftrace.h @@ -17,7 +17,7 @@ void _mcount(void); #endif #ifdef CONFIG_DYNAMIC_FTRACE -/* reloction of mcount call site is the same as the address */ +/* relocation of mcount call site is the same as the address */ static inline unsigned long ftrace_call_adjust(unsigned long addr) { return addr; diff --git a/fs/tracefs/inode.c b/fs/tracefs/inode.c index 4b83cbded559..1261e8b41edb 100644 --- a/fs/tracefs/inode.c +++ b/fs/tracefs/inode.c @@ -477,7 +477,7 @@ struct dentry *tracefs_create_dir(const char *name, struct dentry *parent) * * The instances directo
Re: [PATCH] drm/amd: Fix a typo in two different sentences
On 3/22/21 2:06 PM, Bhaskar Chowdhury wrote: > > s/defintion/definition/ .two different places. > > Signed-off-by: Bhaskar Chowdhury Acked-by: Randy Dunlap > --- > drivers/gpu/drm/amd/include/atombios.h | 4 ++-- > 1 file changed, 2 insertions(+), 2 deletions(-) > > diff --git a/drivers/gpu/drm/amd/include/atombios.h > b/drivers/gpu/drm/amd/include/atombios.h > index c1d7b1d0b952..47eb84598b96 100644 > --- a/drivers/gpu/drm/amd/include/atombios.h > +++ b/drivers/gpu/drm/amd/include/atombios.h > @@ -1987,9 +1987,9 @@ typedef struct _PIXEL_CLOCK_PARAMETERS_V6 > #define PIXEL_CLOCK_V6_MISC_HDMI_BPP_MASK 0x0c > #define PIXEL_CLOCK_V6_MISC_HDMI_24BPP 0x00 > #define PIXEL_CLOCK_V6_MISC_HDMI_36BPP 0x04 > -#define PIXEL_CLOCK_V6_MISC_HDMI_36BPP_V6 0x08//for V6, the > correct defintion for 36bpp should be 2 for 36bpp(2:1) > +#define PIXEL_CLOCK_V6_MISC_HDMI_36BPP_V6 0x08//for V6, the > correct definition for 36bpp should be 2 for 36bpp(2:1) > #define PIXEL_CLOCK_V6_MISC_HDMI_30BPP 0x08 > -#define PIXEL_CLOCK_V6_MISC_HDMI_30BPP_V6 0x04//for V6, the > correct defintion for 30bpp should be 1 for 36bpp(5:4) > +#define PIXEL_CLOCK_V6_MISC_HDMI_30BPP_V6 0x04//for V6, the > correct definition for 30bpp should be 1 for 36bpp(5:4) > #define PIXEL_CLOCK_V6_MISC_HDMI_48BPP 0x0c > #define PIXEL_CLOCK_V6_MISC_REF_DIV_SRC 0x10 > #define PIXEL_CLOCK_V6_MISC_GEN_DPREFCLK0x40 > -- -- ~Randy
Re: [PATCH V4 06/10] x86/fault: Adjust WARN_ON for PKey fault
On Mon, Mar 22, 2021 at 09:05:43AM -0700, Sean Christopherson wrote: > On Sun, Mar 21, 2021, ira.we...@intel.com wrote: > > From: Ira Weiny > > > > PKey faults may now happen on kernel mappings if the feature is enabled. > > Remove the warning in the fault path if PKS is enabled. > > When/why can they happen? I read through all the changelogs, as well as the > cover letters for v1 and the RFC, and didn't see any explicit statement about > why pkey faults on supervisor accesses are now "legal". Ok, I have to admit I did not think about documenting this detail... I'll update the commit message a bit more. Prior to this series pkeys were only supported on user page mappings. Therefore seeing a X86_PF_PK error in this path was completely unexpected and warranted the extra WARN_ON to indicate that something went very wrong. > Explaining what happens > later in the page fault handler would also be helpful, e.g. is the flag simply > ignored? Ok I'll do this. But the behavior does not change. The fault is unhandled and results in an Ooops. The only difference is that if PKS is enabled and configured on a kernel mapping the oops is to be expected. > Does it lead directly to OOPS? Yes, the series concludes with it being an ooops unless the test code is running. The behavior does not change from before. I'll more clearly document that... > > Documenting what happens on a PKS #PF in the API patch would be nice to have, > too. Ok, yes, good idea. > > > Reviewed-by: Dan Williams > > Signed-off-by: Ira Weiny > > --- > > arch/x86/mm/fault.c | 9 + > > 1 file changed, 5 insertions(+), 4 deletions(-) > > > > diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c > > index a73347e2cdfc..731ec90ed413 100644 > > --- a/arch/x86/mm/fault.c > > +++ b/arch/x86/mm/fault.c > > @@ -1141,11 +1141,12 @@ do_kern_addr_fault(struct pt_regs *regs, unsigned > > long hw_error_code, > >unsigned long address) > > { > > /* > > -* Protection keys exceptions only happen on user pages. We > > -* have no user pages in the kernel portion of the address > > -* space, so do not expect them here. > > +* PF_PK is expected on kernel addresses when supervisor pkeys are > > "is expected" can be misinterpreted as "PF is expected on all kernel > addresses...". Yes the commit message was more clear by using 'may'. > > This ties in with the lack of an explanation in the changelog. > > > +* enabled. > > It'd be helpful to spell out "Protection keys exceptions" so that random > readers > don't need to search for PF_PK to understand what's up. Maybe even use it as > an > opportunity to introduce "pkeys", e.g. > > /* Protection keys (pkeys) exceptions are ... */ Fair enough. Will do. I've changed this to: /* * X86_PF_PK (Protection key exceptions) may occur on kernel addresses * when PKS (PKeys Supervisor) are enabled. * * If PKS is not enabled an exception should only happen on user pages. * Because, we have no user pages in the kernel portion of the address * space something must have gone very wrong and we should WARN. */ > > > */ > > - WARN_ON_ONCE(hw_error_code & X86_PF_PK); > > + if (!cpu_feature_enabled(X86_FEATURE_PKS)) > > + WARN_ON_ONCE(hw_error_code & X86_PF_PK); > > Does this generate the same code if the whole thing is thrown in the WARN? > E.g. > > WARN_ON_ONCE(!cpu_feature_enabled(X86_FEATURE_PKS) && >(hw_error_code & X86_PF_PK)); I don't know in the general case. But if CONFIG_BUG=n this would be better. I've changed it. Thanks! Ira
Re: [PATCH] scsi: bfa: Fix a typo in two places
On 3/22/21 1:58 PM, Bhaskar Chowdhury wrote: > > s/defintions/definitions/ two different places. > > Signed-off-by: Bhaskar Chowdhury Acked-by: Randy Dunlap > --- > drivers/scsi/bfa/bfa_fc.h | 4 ++-- > 1 file changed, 2 insertions(+), 2 deletions(-) > > diff --git a/drivers/scsi/bfa/bfa_fc.h b/drivers/scsi/bfa/bfa_fc.h > index d536270bbe9f..0314e4b9e1fb 100644 > --- a/drivers/scsi/bfa/bfa_fc.h > +++ b/drivers/scsi/bfa/bfa_fc.h > @@ -1193,7 +1193,7 @@ enum { > }; > > /* > - * defintions for CT reason code > + * definitions for CT reason code > */ > enum { > CT_RSN_INV_CMD = 0x01, > @@ -1240,7 +1240,7 @@ enum { > }; > > /* > - * defintions for the explanation code for all servers > + * definitions for the explanation code for all servers > */ > enum { > CT_EXP_AUTH_EXCEPTION = 0xF1, > -- -- ~Randy
Re: [PATCH v3 03/25] x86/sgx: Wipe out EREMOVE from sgx_free_epc_page()
On Tue, Mar 23, 2021 at 11:06:43AM +1300, Kai Huang wrote: > This path is called by host SGX driver only, so yes this leaking is done by > host enclaves only. Yes, so I was told. > This patch is purpose is to break EREMOVE out of sgx_free_epc_page() so > virtual > EPC code can use sgx_free_epc_page(), and handle EREMOVE logic differently. > This patch doesn't have intention to bring functional change. I changed the > error msg because Dave said it worth to give a message saying EPC page is > leaked, and I thought this minor change won't break anything. I read that already - you don't have to repeat it. > btw, currently virtual EPC code (patch 5) handles in similar way: There's one > EREMOVE error is expected and virtual EPC code can handle, but for other > errors, it means kernel bug, and virtual EPC code gives a WARN(), and that EPC > page is leaked too: > > + WARN_ONCE(ret != SGX_CHILD_PRESENT, > + "EREMOVE (EPC page 0x%lx): unexpected error: %d\n", > + sgx_get_epc_phys_addr(epc_page), ret); > + return ret; > > So to me they are just WARN() to catch kernel bug. You don't care about users, do you? Because when that error happens, they won't come crying to you to fix it. Lemme save you some trouble: we don't take half-baked code into the kernel until stuff has been discussed and analyzed properly. So instead of trying to downplay this, try answering my questions. Here's another one: when does EREMOVE fail? /me goes and looks it up "The instruction fails if the operand is not properly aligned or does not refer to an EPC page or the page is in use by another thread, or other threads are running in the enclave to which the page belongs. In addition the instruction fails if the operand refers to an SECS with associations." And I guess those conditions will become more in the future. Now, let's play. I'm the cloud admin and you're cloud OS customer support. I say: "I got this scary error message while running enclaves on my server "EREMOVE returned ... . EPC page leaked. Reboot required to retrieve leaked pages." but I cannot reboot that machine because there are guests running on it and I'm getting paid for those guests and I might get sued if I do?" Your turn, go wild. -- Regards/Gruss, Boris. https://people.kernel.org/tglx/notes-about-netiquette
[PATCH] x86/mce: Add Skylake quirk for patrol scrub reported errors
From: Youquan Song Skylake has a mode where the system administrator can use a BIOS setup option to request that the memory controller report uncorrected errors found by the patrol scrubber as corrected. This results in them being signalled using CMCI, which is less disruptive than a machine check. Add a quirk to detect that a "corrected" error is actually a downgraded uncorrected error with model specific checks for the "MSCOD" signature in MCi_STATUS and that the error was reported from a memory controller bank. Adjust the severity to MCE_AO_SEVERITY so that Linux will try to take the affected page offline. [Tony: Wordsmith commit comment] Signed-off-by: Youquan Song Signed-off-by: Tony Luck --- Repost ... looks like this got lost somewhere. V2: Boris: Don't optimize with pointer to quirk function. Just do the vendor/family/model check in the adjust_mce_log() function Tony: Add check for stepping >= 4 --- arch/x86/kernel/cpu/mce/core.c | 30 ++ 1 file changed, 30 insertions(+) diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c index e9265e2f28c9..2d5fe23adf29 100644 --- a/arch/x86/kernel/cpu/mce/core.c +++ b/arch/x86/kernel/cpu/mce/core.c @@ -673,6 +673,35 @@ static void mce_read_aux(struct mce *m, int i) } } +/* + * Skylake family CPUs have a mode where the user can request that + * the memory controller report uncorrected errors found by the patrol + * scrubber as corrected (MCI_STATUS_UC == 0). This results in them being + * signalled using CMCI, which is less disruptive that a machine check. + * The following quirk detects such errors and adjusts the severity. + */ + +#define MSCOD_UCE_SCRUB(0x0010 << 16) /* UnCorrected Patrol Scrub Error */ +#define MSCOD_MASK GENMASK_ULL(31, 16) + +static void adjust_mce_log(struct mce *m) +{ + struct cpuinfo_x86 *c = &boot_cpu_data; + + if (c->x86_vendor == X86_VENDOR_INTEL && c->x86 == 6 && + c->x86_model == INTEL_FAM6_SKYLAKE_X && c->x86_stepping >= 4) { + /* +* Check the error code to see if this is an uncorrected patrol +* scrub error from one of the memory controller banks. If so, +* then adjust the severity level to MCE_AO_SEVERITY +*/ + if (((m->status & MCACOD_SCRUBMSK) == MCACOD_SCRUB) && + ((m->status & MSCOD_MASK) == MSCOD_UCE_SCRUB) && + m->bank >= 13 && m->bank <= 18) + m->severity = MCE_AO_SEVERITY; + } +} + DEFINE_PER_CPU(unsigned, mce_poll_count); /* @@ -772,6 +801,7 @@ bool machine_check_poll(enum mcp_flags flags, mce_banks_t *b) if (mca_cfg.dont_log_ce && !mce_usable_address(&m)) goto clear_it; + adjust_mce_log(&m); mce_log(&m); clear_it: -- 2.21.1
Re: [PATCH 2/2] mm/vmalloc: Use kvmalloc to allocate the table of pages
On Mon, Mar 22, 2021 at 07:38:20PM +, Matthew Wilcox (Oracle) wrote: > If we're trying to allocate 4MB of memory, the table will be 8KiB in size > (1024 pointers * 8 bytes per pointer), which can usually be satisfied > by a kmalloc (which is significantly faster). Instead of changing this > open-coded implementation, just use kvmalloc(). > > Signed-off-by: Matthew Wilcox (Oracle) > --- > mm/vmalloc.c | 7 +-- > 1 file changed, 1 insertion(+), 6 deletions(-) > > diff --git a/mm/vmalloc.c b/mm/vmalloc.c > index 96444d64129a..32b640a84250 100644 > --- a/mm/vmalloc.c > +++ b/mm/vmalloc.c > @@ -2802,13 +2802,8 @@ static void *__vmalloc_area_node(struct vm_struct > *area, gfp_t gfp_mask, > gfp_mask |= __GFP_HIGHMEM; > > /* Please note that the recursion is strictly bounded. */ > - if (array_size > PAGE_SIZE) { > - pages = __vmalloc_node(array_size, 1, nested_gfp, node, > + pages = kvmalloc_node_caller(array_size, nested_gfp, node, > area->caller); > - } else { > - pages = kmalloc_node(array_size, nested_gfp, node); > - } > - > if (!pages) { > free_vm_area(area); > return NULL; > -- > 2.30.2 Makes sense to me. Though i expected a bigger difference: # patch single CPU, 4MB allocation, loops: 100 avg: 85293854 usec # default single CPU, 4MB allocation, loops: 100 avg: 89275857 usec One question. Should we care much about fragmentation? I mean with the patch, allocations > 2MB will do request to SLAB bigger then PAGE_SIZE. Thanks! -- Vlad Rezki
Re: [PATCH v2 01/18] vfs: add miscattr ops
On Mon, Mar 22, 2021 at 03:48:59PM +0100, Miklos Szeredi wrote: > There's a substantial amount of boilerplate in filesystems handling > FS_IOC_[GS]ETFLAGS/ FS_IOC_FS[GS]ETXATTR ioctls. > > Also due to userspace buffers being involved in the ioctl API this is > difficult to stack, as shown by overlayfs issues related to these ioctls. > > Introduce a new internal API named "miscattr" (fsxattr can be confused with > xattr, xflags is inappropriate, since this is more than just flags). > > There's significant overlap between flags and xflags and this API handles > the conversions automatically, so filesystems may choose which one to use. > > In ->miscattr_get() a hint is provided to the filesystem whether flags or > xattr are being requested by userspace, but in this series this hint is > ignored by all filesystems, since generating all the attributes is cheap. > > If a filesystem doesn't implemement the miscattr API, just fall back to > f_op->ioctl(). When all filesystems are converted, the fallback can be > removed. > > 32bit compat ioctls are now handled by the generic code as well. > > Signed-off-by: Miklos Szeredi > --- > Documentation/filesystems/locking.rst | 5 + > Documentation/filesystems/vfs.rst | 15 ++ > fs/ioctl.c| 329 ++ > include/linux/fs.h| 4 + > include/linux/miscattr.h | 53 + > 5 files changed, 406 insertions(+) > create mode 100644 include/linux/miscattr.h > > diff --git a/Documentation/filesystems/locking.rst > b/Documentation/filesystems/locking.rst > index b7dcc86c92a4..a5aa2046d48f 100644 > --- a/Documentation/filesystems/locking.rst > +++ b/Documentation/filesystems/locking.rst > @@ -80,6 +80,9 @@ prototypes:: > struct file *, unsigned open_flag, > umode_t create_mode); > int (*tmpfile) (struct inode *, struct dentry *, umode_t); > + int (*miscattr_set)(struct user_namespace *mnt_userns, > + struct dentry *dentry, struct miscattr *ma); > + int (*miscattr_get)(struct dentry *dentry, struct miscattr *ma); > > locking rules: > all may block > @@ -107,6 +110,8 @@ fiemap: no > update_time: no > atomic_open: shared (exclusive if O_CREAT is set in open flags) > tmpfile: no > +miscattr_get:no or exclusive > +miscattr_set:exclusive > = > > > diff --git a/Documentation/filesystems/vfs.rst > b/Documentation/filesystems/vfs.rst > index 2049bbf5e388..f125ce6c3b47 100644 > --- a/Documentation/filesystems/vfs.rst > +++ b/Documentation/filesystems/vfs.rst > @@ -441,6 +441,9 @@ As of kernel 2.6.22, the following members are defined: > unsigned open_flag, umode_t create_mode); > int (*tmpfile) (struct user_namespace *, struct inode *, struct > dentry *, umode_t); > int (*set_acl)(struct user_namespace *, struct inode *, struct > posix_acl *, int); > + int (*miscattr_set)(struct user_namespace *mnt_userns, > + struct dentry *dentry, struct miscattr *ma); > + int (*miscattr_get)(struct dentry *dentry, struct miscattr *ma); > }; > > Again, all methods are called without any locks being held, unless > @@ -588,6 +591,18 @@ otherwise noted. > atomically creating, opening and unlinking a file in given > directory. > > +``miscattr_get`` I wish this wasn't named "misc" because miscellaneous is vague. fileattr_get, perhaps? (FWIW I'm not /that/ passionate about starting a naming bikeshed, feel free to ignore.) > + called on ioctl(FS_IOC_GETFLAGS) and ioctl(FS_IOC_FSGETXATTR) to > + retrieve miscellaneous filesystem flags and attributes. Also "...miscellaneous *file* flags and attributes." > + called before the relevant SET operation to check what is being > + changed (in this case with i_rwsem locked exclusive). If unset, > + then fall back to f_op->ioctl(). > + > +``miscattr_set`` > + called on ioctl(FS_IOC_SETFLAGS) and ioctl(FS_IOC_FSSETXATTR) to > + change miscellaneous filesystem flags and attributes. Callers hold Same here. > + i_rwsem exclusive. If unset, then fall back to f_op->ioctl(). > + > > The Address Space Object > > diff --git a/fs/ioctl.c b/fs/ioctl.c > index 4e6cc0a7d69c..e5f3820809a4 100644 > --- a/fs/ioctl.c > +++ b/fs/ioctl.c > @@ -19,6 +19,9 @@ > #include > #include > #include > +#include > +#include > +#include > > #include "internal.h" > > @@ -657,6 +660,311 @@ static int ioctl_file_dedupe_range(struct file *file, > return ret; > } > > +/** > + * miscattr_fill_xflags - initialize miscattr with xflags > + * @ma: miscattr pointer > + * @xflags: FS_XFLAG_* flags > + * > + * Set ->fsx_xflags, ->xattr_valid and ->flags (translated xflags
[PATCH] hfs/hfsplus: use WARN_ON for sanity check
From: Arnd Bergmann gcc warns about a couple of instances in which a sanity check exists but the author wasn't sure how to react to it failing, which makes it look like a possible bug: fs/hfsplus/inode.c: In function 'hfsplus_cat_read_inode': fs/hfsplus/inode.c:503:37: error: suggest braces around empty body in an 'if' statement [-Werror=empty-body] 503 | /* panic? */; | ^ fs/hfsplus/inode.c:524:37: error: suggest braces around empty body in an 'if' statement [-Werror=empty-body] 524 | /* panic? */; | ^ fs/hfsplus/inode.c: In function 'hfsplus_cat_write_inode': fs/hfsplus/inode.c:582:37: error: suggest braces around empty body in an 'if' statement [-Werror=empty-body] 582 | /* panic? */; | ^ fs/hfsplus/inode.c:608:37: error: suggest braces around empty body in an 'if' statement [-Werror=empty-body] 608 | /* panic? */; | ^ fs/hfs/inode.c: In function 'hfs_write_inode': fs/hfs/inode.c:464:37: error: suggest braces around empty body in an 'if' statement [-Werror=empty-body] 464 | /* panic? */; | ^ fs/hfs/inode.c:485:37: error: suggest braces around empty body in an 'if' statement [-Werror=empty-body] 485 | /* panic? */; | ^ panic() is probably not the correct choice here, but a WARN_ON seems appropriate and avoids the compile-time warning. Signed-off-by: Arnd Bergmann --- fs/hfs/inode.c | 6 ++ fs/hfsplus/inode.c | 12 2 files changed, 6 insertions(+), 12 deletions(-) diff --git a/fs/hfs/inode.c b/fs/hfs/inode.c index 3fc5cb346586..4c5610b5356f 100644 --- a/fs/hfs/inode.c +++ b/fs/hfs/inode.c @@ -460,8 +460,7 @@ int hfs_write_inode(struct inode *inode, struct writeback_control *wbc) goto out; if (S_ISDIR(main_inode->i_mode)) { - if (fd.entrylength < sizeof(struct hfs_cat_dir)) - /* panic? */; + WARN_ON(fd.entrylength < sizeof(struct hfs_cat_dir)); hfs_bnode_read(fd.bnode, &rec, fd.entryoffset, sizeof(struct hfs_cat_dir)); if (rec.type != HFS_CDR_DIR || @@ -481,8 +480,7 @@ int hfs_write_inode(struct inode *inode, struct writeback_control *wbc) hfs_bnode_write(fd.bnode, &rec, fd.entryoffset, sizeof(struct hfs_cat_file)); } else { - if (fd.entrylength < sizeof(struct hfs_cat_file)) - /* panic? */; + WARN_ON(fd.entrylength < sizeof(struct hfs_cat_file)); hfs_bnode_read(fd.bnode, &rec, fd.entryoffset, sizeof(struct hfs_cat_file)); if (rec.type != HFS_CDR_FIL || diff --git a/fs/hfsplus/inode.c b/fs/hfsplus/inode.c index 078c5c8a5156..6bcb0d935472 100644 --- a/fs/hfsplus/inode.c +++ b/fs/hfsplus/inode.c @@ -499,8 +499,7 @@ int hfsplus_cat_read_inode(struct inode *inode, struct hfs_find_data *fd) if (type == HFSPLUS_FOLDER) { struct hfsplus_cat_folder *folder = &entry.folder; - if (fd->entrylength < sizeof(struct hfsplus_cat_folder)) - /* panic? */; + WARN_ON(fd->entrylength < sizeof(struct hfsplus_cat_folder)); hfs_bnode_read(fd->bnode, &entry, fd->entryoffset, sizeof(struct hfsplus_cat_folder)); hfsplus_get_perms(inode, &folder->permissions, 1); @@ -520,8 +519,7 @@ int hfsplus_cat_read_inode(struct inode *inode, struct hfs_find_data *fd) } else if (type == HFSPLUS_FILE) { struct hfsplus_cat_file *file = &entry.file; - if (fd->entrylength < sizeof(struct hfsplus_cat_file)) - /* panic? */; + WARN_ON(fd->entrylength < sizeof(struct hfsplus_cat_file)); hfs_bnode_read(fd->bnode, &entry, fd->entryoffset, sizeof(struct hfsplus_cat_file)); @@ -578,8 +576,7 @@ int hfsplus_cat_write_inode(struct inode *inode) if (S_ISDIR(main_inode->i_mode)) { struct hfsplus_cat_folder *folder = &entry.folder; - if (fd.entrylength < sizeof(struct hfsplus_cat_folder)) - /* panic? */; + WARN_ON(fd.entrylength < sizeof(struct hfsplus_cat_folder)); hfs_bnode_read(fd.bnode, &entry, fd.entryoffset, sizeof(struct hfsplus_cat_folder)); /* simple node checks? */ @@ -604,8 +601,7 @@ int hfsplus_cat_write_inode(struct inode *inode) } else { struct hfsplus_cat_file *file = &
Re: [PATCH 1/2] mtd: spi-nor: sfdp: save a copy of the SFDP data
Am 2021-03-22 19:42, schrieb Pratyush Yadav: On 22/03/21 04:32PM, Michael Walle wrote: Am 2021-03-22 15:21, schrieb Pratyush Yadav: > On 18/03/21 10:24AM, Michael Walle wrote: > > + > > + sfdp->num_dwords = DIV_ROUND_UP(sfdp_size, sizeof(*sfdp->dwords)); > > The SFDP spec says that Parameter Table Pointer should be DWORD aligned > and Parameter Table length is specified in number of DWORDs. So, > sfdp_size should always be a multiple of 4. Any SFDP table where this is > not true is an invalid one. > > Also, the spec says "Device behavior when the Read SFDP command crosses > the SFDP structure boundary is not defined". > > So I think this should be a check for alignment instead of a round-up. Well, that woundn't help for debugging. I.e. you also want the SFDP data in cases like this. IMHO we should try hard enough to actually get a reasonable dump. OTOH we also rely on the header and the pointers in the header. Any other ideas, but just to chicken out? Honestly, I don't think reading past the SFDP boundary would be too bad. It probably will just be some garbage data. But if you want to avoid that, you can always round down instead of up. Like I said, while the storage will be rounded up to a multiple of DWORDs, only sfdp_size is transferred. Thus it case a pointer is not DWORD aligned, we end up with zeros at the end. I'll add a comment. This way you will only miss the last DWORD at most. In either case, a warning should be printed so this problem can be brought to the user's attention. I was about to add a warning/debug message. But its the wrong place. It should really be checked in the for loop which iterates over the headers before parsing them. You could check sfdp_size but then two unaligned param pointers might cancel each other out. This can be a seperate patch, besides adding a warning, should there be any other things to do, e.g. stop parsing and error out? .. > > + goto exit; > > + } > > + > > + err = spi_nor_read_sfdp_dma_unsafe(nor, 0, sfdp_size, sfdp->dwords); Btw, this can be spi_nor_read_sfdp(). But I'm not sure, what this whole dma capable buffer should be. Is kmalloc(GFP_KERNEL) considered DMA safe? The buffer ends in spi_nor_read_data(), which is also called from mtdcore: spi_nor_read_sfdp() spi_nor_read_raw() spi_nor_read_data() mtd_read() mtd_read_oob() mtd_read_oob_std() spi_nor_read() spi_nor_read_data() Is the buffer passed from mtd_read() also DMA-safe? Doesn't the SPI drivers allocate DMA safe buffers if they need them? -michael
Re: [PATCH v4 05/19] kvm: arm64: Disable guest access to trace filter controls
Hi Marc, On 25/02/2021 19:35, Suzuki K Poulose wrote: Disable guest access to the Trace Filter control registers. We do not advertise the Trace filter feature to the guest (ID_AA64DFR0_EL1: TRACE_FILT is cleared) already, but the guest can still access the TRFCR_EL1 unless we trap it. This will also make sure that the guest cannot fiddle with the filtering controls set by a nvhe host. Cc: Marc Zyngier Cc: Will Deacon Cc: Mark Rutland Cc: Catalin Marinas Signed-off-by: Suzuki K Poulose We have already have the v8.4 self hosted tracing support in 5.12-rcX. Do you think you can pick this up for this 5.12 ? Cheers Suzuki --- New patch --- arch/arm64/include/asm/kvm_arm.h | 1 + arch/arm64/kvm/debug.c | 2 ++ 2 files changed, 3 insertions(+) diff --git a/arch/arm64/include/asm/kvm_arm.h b/arch/arm64/include/asm/kvm_arm.h index 4e90c2debf70..94d4025acc0b 100644 --- a/arch/arm64/include/asm/kvm_arm.h +++ b/arch/arm64/include/asm/kvm_arm.h @@ -278,6 +278,7 @@ #define CPTR_EL2_DEFAULT CPTR_EL2_RES1 /* Hyp Debug Configuration Register bits */ +#define MDCR_EL2_TTRF (1 << 19) #define MDCR_EL2_TPMS (1 << 14) #define MDCR_EL2_E2PB_MASK(UL(0x3)) #define MDCR_EL2_E2PB_SHIFT (UL(12)) diff --git a/arch/arm64/kvm/debug.c b/arch/arm64/kvm/debug.c index 7a7e425616b5..dbc890511631 100644 --- a/arch/arm64/kvm/debug.c +++ b/arch/arm64/kvm/debug.c @@ -89,6 +89,7 @@ void kvm_arm_reset_debug_ptr(struct kvm_vcpu *vcpu) * - Debug ROM Address (MDCR_EL2_TDRA) * - OS related registers (MDCR_EL2_TDOSA) * - Statistical profiler (MDCR_EL2_TPMS/MDCR_EL2_E2PB) + * - Self-hosted Trace Filter controls (MDCR_EL2_TTRF) * * Additionally, KVM only traps guest accesses to the debug registers if * the guest is not actively using them (see the KVM_ARM64_DEBUG_DIRTY @@ -112,6 +113,7 @@ void kvm_arm_setup_debug(struct kvm_vcpu *vcpu) vcpu->arch.mdcr_el2 = __this_cpu_read(mdcr_el2) & MDCR_EL2_HPMN_MASK; vcpu->arch.mdcr_el2 |= (MDCR_EL2_TPM | MDCR_EL2_TPMS | + MDCR_EL2_TTRF | MDCR_EL2_TPMCR | MDCR_EL2_TDRA | MDCR_EL2_TDOSA);
Re: [PATCH v3 03/25] x86/sgx: Wipe out EREMOVE from sgx_free_epc_page()
> > Btw, I probably have seen this and forgotten again so pls remind me, > is the amount of pages available for SGX use static and limited by, > I believe BIOS, or can a leakage in EPC pages cause system memory > shortage? > Yes EPC size is fixed and configured in BIOS. Leaking EPC pages may cause EPC shortage, but not system memory.
Re: [RFC PATCH v2 net-next 14/16] net: dsa: don't set skb->offload_fwd_mark when not offloading the bridge
On Mon, Mar 22, 2021 at 04:04:01PM +0800, DENG Qingfang wrote: > On Fri, Mar 19, 2021 at 6:49 PM Vladimir Oltean wrote: > > Why would you even want to look at the source net device for forwarding? > > I'd say that if dp->bridge_dev is NULL in the xmit function, you certainly > > want to bypass address learning if you can. Maybe also for link-local > > traffic. > > Also for trapped traffic (snooping, tc-flower trap action) if the CPU > sends them back. This sounds line an interesting use case, please tell me more about what commands I could run to reinject trapped packets into the hardware data path.
Re: [PATCH v2 00/10] Rid W=1 warnings from OF
On Thu, Mar 18, 2021 at 4:40 AM Lee Jones wrote: > > This set is part of a larger effort attempting to clean-up W=1 > kernel builds, which are currently overwhelmingly riddled with > niggly little warnings. > > v2: > - Provided some descriptions to exported functions > > Lee Jones (10): > of: device: Fix function name in header and provide missing > descriptions > of: dynamic: Fix incorrect parameter name and provide missing > descriptions > of: platform: Demote kernel-doc abuse > of: base: Fix some formatting issues and provide missing descriptions > of: property: Provide missing member description and remove excess > param > of: address: Provide descriptions for 'of_address_to_resource's params > of: fdt: Demote kernel-doc abuses and fix function naming > of: of_net: Provide function name and param description > of: overlay: Fix function name disparity > of: of_reserved_mem: Demote kernel-doc abuses > > drivers/of/address.c | 3 +++ > drivers/of/base.c| 16 +++- > drivers/of/device.c | 7 ++- > drivers/of/dynamic.c | 4 +++- > drivers/of/fdt.c | 23 --- > drivers/of/of_net.c | 3 +++ > drivers/of/of_reserved_mem.c | 6 +++--- > drivers/of/overlay.c | 2 +- > drivers/of/platform.c| 2 +- > drivers/of/property.c| 2 +- > 10 files changed, 44 insertions(+), 24 deletions(-) I still see some warnings (note this is with DT files added to doc build). Can you send follow-up patches: ../include/linux/of.h:1193: warning: Function parameter or member 'output' not described in 'of_property_read_string_index' ../include/linux/of.h:1193: warning: Excess function parameter 'out_string' description in 'of_property_read_string_index' ../include/linux/of.h:1461: warning: cannot understand function prototype: 'enum of_overlay_notify_action ' ../drivers/of/base.c:1781: warning: Excess function parameter 'prob' description in '__of_add_property' ../drivers/of/base.c:1804: warning: Excess function parameter 'prob' description in 'of_add_property' ../drivers/of/base.c:1855: warning: Function parameter or member 'prop' not described in 'of_remove_property' ../drivers/of/base.c:1855: warning: Excess function parameter 'prob' description in 'of_remove_property' BTW, there some more which I guess W=1 doesn't find: /home/rob/proj/git/linux-dt/Documentation/driver-api/devicetree:19: ../drivers/of/base.c:906: WARNING: Block quote ends without a blank line; unexpected unindent. /home/rob/proj/git/linux-dt/Documentation/driver-api/devicetree:19: ../drivers/of/base.c:1465: WARNING: Definition list ends without a blank line; unexpected unindent. /home/rob/proj/git/linux-dt/Documentation/driver-api/devicetree:19: ../drivers/of/base.c:1469: WARNING: Definition list ends without a blank line; unexpected unindent. /home/rob/proj/git/linux-dt/Documentation/driver-api/devicetree:19: ../drivers/of/base.c:1473: WARNING: Definition list ends without a blank line; unexpected unindent. /home/rob/proj/git/linux-dt/Documentation/driver-api/devicetree:19: ../drivers/of/base.c:1517: WARNING: Definition list ends without a blank line; unexpected unindent. /home/rob/proj/git/linux-dt/Documentation/driver-api/devicetree:19: ../drivers/of/base.c:1521: WARNING: Definition list ends without a blank line; unexpected unindent. /home/rob/proj/git/linux-dt/Documentation/driver-api/devicetree:19: ../drivers/of/base.c:1526: WARNING: Unexpected indentation. /home/rob/proj/git/linux-dt/Documentation/driver-api/devicetree:19: ../drivers/of/base.c:1528: WARNING: Block quote ends without a blank line; unexpected unindent. /home/rob/proj/git/linux-dt/Documentation/driver-api/devicetree:19: ../drivers/of/base.c:1529: WARNING: Definition list ends without a blank line; unexpected unindent. /home/rob/proj/git/linux-dt/Documentation/driver-api/devicetree:19: ../drivers/of/base.c:1533: WARNING: Definition list ends without a blank line; unexpected unindent. /home/rob/proj/git/linux-dt/Documentation/driver-api/devicetree:19: ../drivers/of/base.c:1705: WARNING: Definition list ends without a blank line; unexpected unindent. /home/rob/proj/git/linux-dt/Documentation/driver-api/devicetree:49: ../drivers/of/overlay.c:1183: WARNING: Inline emphasis start-string without end-string. Rob
Re: [syzbot] KASAN: use-after-free Read in disk_part_iter_next (2)
On 3/22/21 12:18 AM, Christoph Hellwig wrote: I've been running the reproducer on a KASAN enable VM for about 15 minutes now, but haven't been able to reproduce it. Is there a way to inject this proposed fix into the syzbot queue? diff --git a/block/partitions/core.c b/block/partitions/core.c index 1a7558917c47d6..f5d5872b89d57e 100644 --- a/block/partitions/core.c +++ b/block/partitions/core.c @@ -288,15 +288,12 @@ struct device_type part_type = { void delete_partition(struct block_device *part) { xa_erase(&part->bd_disk->part_tbl, part->bd_partno); - kobject_put(part->bd_holder_dir); - device_del(&part->bd_device); - - /* -* Remove the block device from the inode hash, so that it cannot be -* looked up any more even when openers still hold references. -*/ remove_inode_hash(part->bd_inode); + synchronize_rcu(); + + kobject_put(part->bd_holder_dir); + device_del(&part->bd_device); put_device(&part->bd_device); } Hi Christoph, disk_part_iter_next() calls bdgrab() before returning a pointer to a certain partition. 'part' is only freed if its reference count drops to zero. The function that frees the partition information, bdev_free_inode(), is invoked via call_rcu(). bdgrab() fails if the refcount of a partition is zero. Does that mean that it is not necessary to call synchronize_rcu() between xa_erase() and put_device()? Thanks, Bart.
Re: [PATCH v4 03/19] kvm: arm64: Hide system instruction access to Trace registers
Will, Catalin, On 25/02/2021 19:35, Suzuki K Poulose wrote: Currently we advertise the ID_AA6DFR0_EL1.TRACEVER for the guest, when the trace register accesses are trapped (CPTR_EL2.TTA == 1). So, the guest will get an undefined instruction, if trusts the ID registers and access one of the trace registers. Lets be nice to the guest and hide the feature to avoid unexpected behavior. Even though this can be done at KVM sysreg emulation layer, we do this by removing the TRACEVER from the sanitised feature register field. This is fine as long as the ETM drivers can handle the individual trace units separately, even when there are differences among the CPUs. Cc: Marc Zyngier Cc: Will Deacon Cc: Catalin Marinas Cc: Mark Rutland Signed-off-by: Suzuki K Poulose --- New patch --- arch/arm64/kernel/cpufeature.c | 1 - 1 file changed, 1 deletion(-) diff --git a/arch/arm64/kernel/cpufeature.c b/arch/arm64/kernel/cpufeature.c index 066030717a4c..a4698f09bf32 100644 --- a/arch/arm64/kernel/cpufeature.c +++ b/arch/arm64/kernel/cpufeature.c @@ -383,7 +383,6 @@ static const struct arm64_ftr_bits ftr_id_aa64dfr0[] = { * of support. */ S_ARM64_FTR_BITS(FTR_HIDDEN, FTR_NONSTRICT, FTR_EXACT, ID_AA64DFR0_PMUVER_SHIFT, 4, 0), - ARM64_FTR_BITS(FTR_HIDDEN, FTR_STRICT, FTR_EXACT, ID_AA64DFR0_TRACEVER_SHIFT, 4, 0), ARM64_FTR_BITS(FTR_HIDDEN, FTR_STRICT, FTR_EXACT, ID_AA64DFR0_DEBUGVER_SHIFT, 4, 0x6), ARM64_FTR_END, }; Are you happy to pick this patch for 5.12 as a fix ? Suzuki
Re: [PATCH 1/5] cifsd: add server handler and tranport layers
On Mon, Mar 22, 2021 at 02:13:40PM +0900, Namjae Jeon wrote: > +#define RESPONSE_BUF(w) ((void *)(w)->response_buf) > +#define REQUEST_BUF(w) ((void *)(w)->request_buf) Why do you do this obfuscation? > +#define RESPONSE_BUF_NEXT(w) \ > + ((void *)((w)->response_buf + (w)->next_smb2_rsp_hdr_off)) > +#define REQUEST_BUF_NEXT(w) \ > + ((void *)((w)->request_buf + (w)->next_smb2_rcv_hdr_off)) These obfuscations aren't even used; delete them > +#define RESPONSE_SZ(w) ((w)->response_sz) > + > +#define INIT_AUX_PAYLOAD(w) ((w)->aux_payload_buf = NULL) > +#define HAS_AUX_PAYLOAD(w) ((w)->aux_payload_sz != 0) I mean, do you really find it clearer to write: if (HAS_AUX_PAYLOAD(work)) than if (work->aux_payload_sz) The unobfuscated version is actually shorter!
Re: [PATCH 0/3] Apple M1 DART IOMMU driver
Hi Mark, On Sun, Mar 21, 2021, at 19:35, Mark Kettenis wrote: > > Guess we do need to understand a little bit better how the USB DART > actually works. My hypothesis (based on our discussion on #asahi) is > that the XHCI host controller and the peripheral controller of the > DWC3 block use different DMA "streams" that are handled by the > different sub-DARTs. I've done some more experiments and the situation is unfortunately more complicated: Most DMA transfers are translated with the first DART. But sometimes (and I have not been able to figure out the exact conditions) transfers instead have to go through the second DART. This happens e.g. with one of my USB keyboards after a stop EP command is issued: Suddenly the xhci_ep_ctx struct must be translated through the second DART. What this likely means is that we'll need to point both DARTs to the same pagetables and just issue the TLB maintenance operations as a group. > > The Corellium folks use a DART + sub-DART model in their driver and a > single node in the device tree that represents both. That might sense > since the error registers and interrupts are shared. Maybe it would > make sense to select the appropriate sub-DART based on the DMA stream > ID? dwc3 specifically seems to require stream id #1 from the DART at <0x5 0x02f0> and stream id #0 from the DART at <0x5 0x02f8>. Both of these only share a IRQ line but are otherwise completely independent. Each has their own error registers, etc. and we need some way to specify these two DARTs + the appropriate stream ID. Essentially we have three options to represent this now: 1) Add both DARTs as separate regs, use #iommu-cells = <2> and have the first cell select the DART and the second one the stream ID. We could allow #iommu-cells = <1> in case only one reg is specified for the PCIe DART: usb_dart1@502f0 { compatible = "apple,t8103-dart"; reg = <0x5 0x02f0 0x0 0x4000>, <0x5 0x02f8 0x0 0x4000>; #iommu-cells = <2>; ... }; usb1 { iommus = <&usb_dart1 0 1>, <&usb_dart1 1 0>; ... }; I prefer this option because we fully describe the DART in a single device node here. It also feels natural to group them like this because they need to share some properties (like dma-window and the interrupt) anyway. 2) Create two DART nodes which share the same IRQ line and attach them both to the master node: usb_dart1a@502f0 { compatible = "apple,t8103-dart"; reg = <0x5 0x02f0 0x0 0x4000>; #iommu-cells = <1>; ... }; usb_dart1b@502f8 { compatible = "apple,t8103-dart"; reg = <0x5 0x02f8 0x0 0x4000>; #iommu-cells = <1>; ... }; usb1 { iommus = <&usb_dart1a 1>, <&usb_dart1b 0>; ... }; I dislike this one because attaching those two DARTs to a single device seems rather unusual. We'd also have to duplicate the dma-window setting, make sure it's the same for both DARTs and there are probably even more complications I can't think of right now. It seems like this would also make the device tree very verbose and the implementation itself more complicated. 3) Introduce another property and let the DART driver take care of mirroring the pagetables. I believe this would be similar to the sid-remap property: usb_dart1@502f0 { compatible = "apple,t8103-dart"; reg = <0x5 0x02f0 0x0 0x4000>, <0x5 0x02f8 0x0 0x4000>; #iommu-cells = <1>; sid-remap = <0 1>; }; usb1 { iommus = <&usb_dart1 0>; }; I slightly dislike this one because we now specify which stream id to use in two places: Once in the device node and another time in the new property in the DART node. I also don't think the binding is much simpler than the first one. > > where #dma-address-cells and #dma-size-cells default to > > #address-cells and #size-cells respectively if I understand > > the code correctly. That way we could also just always use > > a 64bit address and size in the DT, e.g. > > > > pcie_dart { > > [ ... ] > > dma-window = <0 0x10 0 0x3fe0>; > > [ ... ] > > }; > > That sounds like a serious contender to me! Hopefully one of the > Linux kernel developers can give this some sort of blessing. > > I think it would make sense for us to just rely on the #address-cells > and #size-cells defaults for the M1 device tree. > Agreed. Best, Sven
Re: [PATCH 1/2] zram: fix crashes due to use of cpu hotplug multistate
On Mon, Mar 22, 2021 at 08:41:56PM +, Luis Chamberlain wrote: > On Mon, Mar 22, 2021 at 09:37:17AM -0700, Minchan Kim wrote: > > On Fri, Mar 19, 2021 at 07:09:24PM +, Luis Chamberlain wrote: > > > Indeed one issue is a consequence of the other but a bit better > > > description can be put together for both separately. > > > > > > The warning comes up when cpu hotplug detects that the callback > > > is being removed, but yet "instances" for multistate are left behind, > > > ie, not removed. CPU hotplug multistate allows us to dedicate a callback > > > for zram so that it gets called every time a CPU hot plugs or unplugs. > > > In the zram driver's case we want to trigger the callback per each > > > struct zcomp, one is used per supported and used supported compression > > > > you meant "each one is used per supported compression"? > > Yup. > > > > algorithm. The zram driver allocates a struct zcomp for an compression > > > algorithm on a need basis. The default compressor is CONFIG_ZRAM_DEF_COMP > > > which today is CONFIG_ZRAM_DEF_COMP_LZORLE ("lzo-rle"). Although we may > > > have different zram devices, zram devices which use the same compression > > > algorithm share the same struct zcomp. The "multi" in CPU hotplug > > > multstate > > > > It allocates each own zcomp, not sharing even though zram devices > > use the same compression algorithm. > > Right. > > > > refers to these different struct zcomp instances, each of these will > > > have the callback routine registered run for it. The kernel's CPU > > > hotplug multistate keeps a linked list of these different structures > > > so that it will iterate over them on CPU transitions. By default at > > > driver initialization we will create just one zram device (num_devices=1) > > > and a zcomp structure then set for the lzo-rle comrpession algorithm. > > > At driver removal we first remove each zram device, and so we destroy > > > the struct zcomp. But since we expose sysfs attributes to create new > > > devices or reset / initialize existing ones, we can easily end up > > > re-initializing a struct zcomp before the exit routine of the module > > > removes the cpu hotplug callback. When this happens the kernel's > > > CPU hotplug will detect that at least one instance (struct zcomp for > > > us) exists. > > > > It's very helpful to understand. Thanks a lot!S > > > > So IIUC, it's fundamentally race between destroy_devices(i.e., module > > unload) and every sysfs knobs in zram. Isn't it? > > I would not call it *every* syfs knob as not all deal with things which > are related to CPU hotplug multistate, right? Note that using just > try_module_get() alone (that is the second patch only, does not fix the > race I am describing above). It wouldn't be CPU hotplug multistate issue but I'd like to call it as more "zram instance race" bug. What happens in this case? CPU 1CPU 2 destroy_devices .. compact_store() zram = dev_to_zram(dev); idr_for_each(zram_remove_cb zram_remove .. kfree(zram) down_read(&zram->init_lock); CPU 1CPU 2 hot_remove_store compact_store() zram = dev_to_zram(dev); zram_remove kfree(zram) down_read(&zram->init_lock); So, for me we need to close the zram instance create/removal with sysfs rather than focusing on CPU hotplug issue. Maybe, we could reuse zram_index_mutex with modifying it with rw_semaphore. What do you think? > > There are two issues. > > > Below specific example is one of them and every sysfs code in zram > > could access freed object(e.g., zram->something). > > And you are claiming there isn't good way to fix it in kernfs(please > > add it in the description, too) even though it's general problem. > > Correct, the easier route would have been through the block layer as > its the one adding our syfs attributes for us but even it canno deal > with this race on behalf of drivers given the currently exposed > semantics on kernfs. > > > (If we had, we may detect the race and make zram_remove_cb fails > > so unloading modulies fails, finally) > > > > So, shouldn't we introcude a global rw_semaphore? > > > > destroy_devices > > class_unregister > > down_write(&zram_unload); > > idr_for_each(zram_remove_cb); > > up_write(&zram_unload); > > > > And every sysfs code hold the lock with down_read in the first place > > and release the lock right before return. So, introduce a zram sysfs > > wrapper to centralize all of locking logic. > > Actually that does work but only if we use write lock attempts so to > be able to knock two birds with one stone, so to address the deadlock > with sysfs attribute removal. We're not asking politely to read some > value off of a zram devices with th
[PATCH v4 4/4] ioctl_userfaultfd.2: Add write-protect mode docs
Userfaultfd write-protect mode is supported starting from Linux 5.7. Signed-off-by: Peter Xu --- man2/ioctl_userfaultfd.2 | 84 ++-- 1 file changed, 81 insertions(+), 3 deletions(-) diff --git a/man2/ioctl_userfaultfd.2 b/man2/ioctl_userfaultfd.2 index d4a8375b8..5419687a6 100644 --- a/man2/ioctl_userfaultfd.2 +++ b/man2/ioctl_userfaultfd.2 @@ -234,6 +234,11 @@ operation is supported. The .B UFFDIO_UNREGISTER operation is supported. +.TP +.B 1 << _UFFDIO_WRITEPROTECT +The +.B UFFDIO_WRITEPROTECT +operation is supported. .PP This .BR ioctl (2) @@ -322,9 +327,6 @@ Track page faults on missing pages. .B UFFDIO_REGISTER_MODE_WP Track page faults on write-protected pages. .PP -Currently, the only supported mode is -.BR UFFDIO_REGISTER_MODE_MISSING . -.PP If the operation is successful, the kernel modifies the .I ioctls bit-mask field to indicate which @@ -443,6 +445,16 @@ operation: .TP .B UFFDIO_COPY_MODE_DONTWAKE Do not wake up the thread that waits for page-fault resolution +.TP +.B UFFDIO_COPY_MODE_WP +Copy the page with read-only permission. +This allows the user to trap the next write to the page, +which will block and generate another write-protect userfault message. +This is only used when both +.B UFFDIO_REGISTER_MODE_MISSING +and +.B UFFDIO_REGISTER_MODE_WP +modes are enabled for the registered range. .PP The .I copy @@ -654,6 +666,72 @@ field of the structure was not a multiple of the system page size; or .I len was zero; or the specified range was otherwise invalid. +.SS UFFDIO_WRITEPROTECT (Since Linux 5.7) +Write-protect or write-unprotect an userfaultfd registered memory range +registered with mode +.BR UFFDIO_REGISTER_MODE_WP . +.PP +The +.I argp +argument is a pointer to a +.I uffdio_range +structure as shown below: +.PP +.in +4n +.EX +struct uffdio_writeprotect { +struct uffdio_range range; /* Range to change write permission */ +__u64 mode; /* Mode to change write permission */ +}; +.EE +.in +There're two mode bits that are supported in this structure: +.TP +.B UFFDIO_WRITEPROTECT_MODE_WP +When this mode bit is set, the ioctl will be a write-protect operation upon the +memory range specified by +.IR range . +Otherwise it'll be a write-unprotect operation upon the specified range, +which can be used to resolve an userfaultfd write-protect page fault. +.TP +.B UFFDIO_WRITEPROTECT_MODE_DONTWAKE +When this mode bit is set, +do not wake up any thread that waits for page-fault resolution after the operation. +This could only be specified if +.B UFFDIO_WRITEPROTECT_MODE_WP +is not specified. +.PP +This +.BR ioctl (2) +operation returns 0 on success. +On error, \-1 is returned and +.I errno +is set to indicate the error. +Possible errors include: +.TP +.B EINVAL +The +.I start +or the +.I len +field of the +.I ufdio_range +structure was not a multiple of the system page size; or +.I len +was zero; or the specified range was otherwise invalid. +.TP +.B EAGAIN +The process was interrupted and need to retry. +.TP +.B ENOENT +The range specified in +.I range +is not valid. +For example, the virtual address does not exist, +or not registered with userfaultfd write-protect mode. +.TP +.B EFAULT +Encountered a generic fault during processing. .SH RETURN VALUE See descriptions of the individual operations, above. .SH ERRORS -- 2.26.2
[PATCH v4 3/4] ioctl_userfaultfd.2: Add UFFD_FEATURE_THREAD_ID docs
UFFD_FEATURE_THREAD_ID is supported in Linux 4.14. Signed-off-by: Peter Xu --- man2/ioctl_userfaultfd.2 | 5 + 1 file changed, 5 insertions(+) diff --git a/man2/ioctl_userfaultfd.2 b/man2/ioctl_userfaultfd.2 index 47ae5f473..d4a8375b8 100644 --- a/man2/ioctl_userfaultfd.2 +++ b/man2/ioctl_userfaultfd.2 @@ -208,6 +208,11 @@ signal will be sent to the faulting process. Applications using this feature will not require the use of a userfaultfd monitor for processing memory accesses to the regions registered with userfaultfd. +.TP +.BR UFFD_FEATURE_THREAD_ID " (since Linux 4.14)" +If this feature bit is set, +.I uffd_msg.pagefault.feat.ptid +will be set to the faulted thread ID for each page fault message. .PP The returned .I ioctls -- 2.26.2
[PATCH v4 1/4] userfaultfd.2: Add UFFD_FEATURE_THREAD_ID docs
UFFD_FEATURE_THREAD_ID is supported since Linux 4.14. Signed-off-by: Peter Xu --- man2/userfaultfd.2 | 13 + 1 file changed, 13 insertions(+) diff --git a/man2/userfaultfd.2 b/man2/userfaultfd.2 index e7dc9f813..555e37409 100644 --- a/man2/userfaultfd.2 +++ b/man2/userfaultfd.2 @@ -77,6 +77,13 @@ When the last file descriptor referring to a userfaultfd object is closed, all memory ranges that were registered with the object are unregistered and unread events are flushed. .\" +.PP +Since Linux 4.14, userfaultfd page fault message can selectively embed faulting +thread ID information into the fault message. +One needs to enable this feature explicitly using the +.BR UFFD_FEATURE_THREAD_ID +feature bit when initializing the userfaultfd context. +By default, thread ID reporting is diabled. .SS Usage The userfaultfd mechanism is designed to allow a thread in a multithreaded program to perform user-space paging for the other threads in the process. @@ -229,6 +236,9 @@ struct uffd_msg { struct { __u64 flags;/* Flags describing fault */ __u64 address; /* Faulting address */ +union { +__u32 ptid; /* Thread ID of the fault */ +} feat; } pagefault; struct {/* Since Linux 4.11 */ @@ -358,6 +368,9 @@ otherwise it is a read fault. .\" UFFD_PAGEFAULT_FLAG_WP is not yet supported. .RE .TP +.I pagefault.feat.pid +The thread ID that triggered the page fault. +.TP .I fork.ufd The file descriptor associated with the userfault object created for the child created by -- 2.26.2
[PATCH v4 2/4] userfaultfd.2: Add write-protect mode
Write-protect mode is supported starting from Linux 5.7. Signed-off-by: Peter Xu --- man2/userfaultfd.2 | 104 - 1 file changed, 102 insertions(+), 2 deletions(-) diff --git a/man2/userfaultfd.2 b/man2/userfaultfd.2 index 555e37409..8ad4a71b5 100644 --- a/man2/userfaultfd.2 +++ b/man2/userfaultfd.2 @@ -78,6 +78,32 @@ all memory ranges that were registered with the object are unregistered and unread events are flushed. .\" .PP +Userfaultfd supports two modes of registration: +.TP +.BR UFFDIO_REGISTER_MODE_MISSING " (since 4.10)" +When registered with +.B UFFDIO_REGISTER_MODE_MISSING +mode, the userspace will receive a page fault message +when a missing page is accessed. +The faulted thread will be stopped from execution until the page fault is +resolved from the userspace by either an +.B UFFDIO_COPY +or an +.B UFFDIO_ZEROPAGE +ioctl. +.TP +.BR UFFDIO_REGISTER_MODE_WP " (since 5.7)" +When registered with +.B UFFDIO_REGISTER_MODE_WP +mode, the userspace will receive a page fault message +when a write-protected page is written. +The faulted thread will be stopped from execution +until the userspace un-write-protect the page using an +.B UFFDIO_WRITEPROTECT +ioctl. +.PP +Multiple modes can be enabled at the same time for the same memory range. +.PP Since Linux 4.14, userfaultfd page fault message can selectively embed faulting thread ID information into the fault message. One needs to enable this feature explicitly using the @@ -144,6 +170,17 @@ single threaded non-cooperative userfaultfd manager implementations. .\" and limitations remaining in 4.11 .\" Maybe it's worth adding a dedicated sub-section... .\" +.PP +Starting from Linux 5.7, userfaultfd is able to do +synchronous page dirty tracking using the new write-protection register mode. +One should check against the feature bit +.B UFFD_FEATURE_PAGEFAULT_FLAG_WP +before using this feature. +Similar to the original userfaultfd missing mode, the write-protect mode will +generate an userfaultfd message when the protected page is written. +The user needs to resolve the page fault by unprotecting the faulted page and +kick the faulted thread to continue. +For more information, please refer to "Userfaultfd write-protect mode" section. .SS Userfaultfd operation After the userfaultfd object is created with .BR userfaultfd (), @@ -219,6 +256,65 @@ userfaultfd can be used only with anonymous private memory mappings. Since Linux 4.11, userfaultfd can be also used with hugetlbfs and shared memory mappings. .\" +.SS Userfaultfd write-protect mode (since 5.7) +Since Linux 5.7, userfaultfd supports write-protect mode. +The user needs to first check availability of this feature using +.B UFFDIO_API +ioctl against the feature bit +.B UFFD_FEATURE_PAGEFAULT_FLAG_WP +before using this feature. +.PP +To register with userfaultfd write-protect mode, the user needs to initiate the +.B UFFDIO_REGISTER +ioctl with mode +.B UFFDIO_REGISTER_MODE_WP +set. +Note that it's legal to monitor the same memory range with multiple modes. +For example, the user can do +.B UFFDIO_REGISTER +with the mode set to +.BR "UFFDIO_REGISTER_MODE_MISSING | UFFDIO_REGISTER_MODE_WP" . +When there is only +.B UFFDIO_REGISTER_MODE_WP +registered, the userspace will +.I not +receive any message when a missing page is written. +Instead, the userspace will only receive a write-protect page fault message +when an existing but write-protected page got written. +.PP +After the +.B UFFDIO_REGISTER +ioctl completed with +.B UFFDIO_REGISTER_MODE_WP +mode set, +the user can write-protect any existing memory within the range using the ioctl +.B UFFDIO_WRITEPROTECT +where +.I uffdio_writeprotect.mode +should be set to +.BR UFFDIO_WRITEPROTECT_MODE_WP . +.PP +When a write-protect event happens, +the userspace will receive a page fault message whose +.I uffd_msg.pagefault.flags +will be with +.B UFFD_PAGEFAULT_FLAG_WP +flag set. +Note: since only writes can trigger such kind of fault, +write-protect messages will always be with +.B UFFD_PAGEFAULT_FLAG_WRITE +bit set too along with bit +.BR UFFD_PAGEFAULT_FLAG_WP . +.PP +To resolve a write-protection page fault, the user should initiate another +.B UFFDIO_WRITEPROTECT +ioctl, whose +.I uffd_msg.pagefault.flags +should have the flag +.B UFFDIO_WRITEPROTECT_MODE_WP +cleared upon the faulted page or range. +.PP +Write-protect mode only supports private anonymous memory. .SS Reading from the userfaultfd structure Each .BR read (2) @@ -364,8 +460,12 @@ flag (see .BR ioctl_userfaultfd (2)) and this flag is set, this a write fault; otherwise it is a read fault. -.\" -.\" UFFD_PAGEFAULT_FLAG_WP is not yet supported. +.TP +.B UFFD_PAGEFAULT_FLAG_WP +If the address is in a range that was registered with the +.B UFFDIO_REGISTER_MODE_WP +flag, when this bit is set it means it's a write-protect fault. +Otherwise it's a page missing fault. .RE .TP .I pagefault.feat.pid -- 2.26.2
[PATCH v4 0/4] man2: udpate mm/userfaultfd manpages to latest
v4: - Fixed a few "subordinate clauses" (SC) cases [Alex] - Reword in ioctl_userfaultfd.2 to use bold font for the two modes referenced, so as to be clear on what is "both" referring to [Alex] v3: - Don't use "Currently", instead add "(since x.y)" mark where proper [Alex] - Always use semantic newlines across the whole patchset [Alex] - Use quote when possible, rather than escapes [Alex] - Fix one missing replacement of ".BR" -> ".B" [Alex] - Some other trivial rephrases here and there when fixing up above v2 changes: - Fix wordings as suggested [MikeR] - convert ".BR" to ".B" where proper for the patchset [Alex] - rearrange a few lines in the last two patches where they got messed up - document more things, e.g. UFFDIO_COPY_MODE_WP; and also on how to resolve a wr-protect page fault. There're two features missing in current manpage, namely: (1) Userfaultfd Thread-ID feature (2) Userfaultfd write protect mode There's also a 3rd one which was just contributed from Axel - Axel, I think it would be great if you can add that part too, probably after the whole hugetlbfs/shmem minor mode reaches the linux master branch. Please review, thanks. Peter Xu (4): userfaultfd.2: Add UFFD_FEATURE_THREAD_ID docs userfaultfd.2: Add write-protect mode ioctl_userfaultfd.2: Add UFFD_FEATURE_THREAD_ID docs ioctl_userfaultfd.2: Add write-protect mode docs man2/ioctl_userfaultfd.2 | 89 - man2/userfaultfd.2 | 117 ++- 2 files changed, 201 insertions(+), 5 deletions(-) -- 2.26.2
Re: [PATCH 02/11] x86: tboot: avoid Wstringop-overread-warning
On 3/22/21 2:29 PM, Ingo Molnar wrote: * Arnd Bergmann wrote: From: Arnd Bergmann gcc-11 warns about using string operations on pointers that are defined at compile time as offsets from a NULL pointer. Unfortunately that also happens on the result of fix_to_virt(), which is a compile-time constant for a constantn input: arch/x86/kernel/tboot.c: In function 'tboot_probe': arch/x86/kernel/tboot.c:70:13: error: '__builtin_memcmp_eq' specified bound 16 exceeds source size 0 [-Werror=stringop-overread] 70 | if (memcmp(&tboot_uuid, &tboot->uuid, sizeof(tboot->uuid))) { | ^~ I hope this can get addressed in gcc-11 before the release. As a workaround, split up the tboot_probe() function in two halves to separate the pointer generation from the usage. This is a bit ugly, and hopefully gcc understands that the code is actually correct before it learns to peek into the noinline function. Link: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99578 Signed-off-by: Arnd Bergmann --- arch/x86/kernel/tboot.c | 44 - 1 file changed, 26 insertions(+), 18 deletions(-) diff --git a/arch/x86/kernel/tboot.c b/arch/x86/kernel/tboot.c index 4c09ba110204..f9af561c3cd4 100644 --- a/arch/x86/kernel/tboot.c +++ b/arch/x86/kernel/tboot.c @@ -49,6 +49,30 @@ bool tboot_enabled(void) return tboot != NULL; } +/* noinline to prevent gcc from warning about dereferencing constant fixaddr */ +static noinline __init bool check_tboot_version(void) +{ + if (memcmp(&tboot_uuid, &tboot->uuid, sizeof(tboot->uuid))) { + pr_warn("tboot at 0x%llx is invalid\n", boot_params.tboot_addr); + return false; + } + + if (tboot->version < 5) { + pr_warn("tboot version is invalid: %u\n", tboot->version); + return false; + } + + pr_info("found shared page at phys addr 0x%llx:\n", + boot_params.tboot_addr); + pr_debug("version: %d\n", tboot->version); + pr_debug("log_addr: 0x%08x\n", tboot->log_addr); + pr_debug("shutdown_entry: 0x%x\n", tboot->shutdown_entry); + pr_debug("tboot_base: 0x%08x\n", tboot->tboot_base); + pr_debug("tboot_size: 0x%x\n", tboot->tboot_size); + + return true; +} + void __init tboot_probe(void) { /* Look for valid page-aligned address for shared page. */ @@ -66,25 +90,9 @@ void __init tboot_probe(void) /* Map and check for tboot UUID. */ set_fixmap(FIX_TBOOT_BASE, boot_params.tboot_addr); - tboot = (struct tboot *)fix_to_virt(FIX_TBOOT_BASE); - if (memcmp(&tboot_uuid, &tboot->uuid, sizeof(tboot->uuid))) { - pr_warn("tboot at 0x%llx is invalid\n", boot_params.tboot_addr); + tboot = (void *)fix_to_virt(FIX_TBOOT_BASE); + if (!check_tboot_version()) tboot = NULL; - return; - } - if (tboot->version < 5) { - pr_warn("tboot version is invalid: %u\n", tboot->version); - tboot = NULL; - return; - } - - pr_info("found shared page at phys addr 0x%llx:\n", - boot_params.tboot_addr); - pr_debug("version: %d\n", tboot->version); - pr_debug("log_addr: 0x%08x\n", tboot->log_addr); - pr_debug("shutdown_entry: 0x%x\n", tboot->shutdown_entry); - pr_debug("tboot_base: 0x%08x\n", tboot->tboot_base); - pr_debug("tboot_size: 0x%x\n", tboot->tboot_size); This is indeed rather ugly - and the other patch that removes a debug check seems counterproductive as well. Do we know how many genuine bugs -Wstringop-overread-warning has caught or is about to catch? I.e. the real workaround might be to turn off the -Wstringop-overread-warning, until GCC-11 gets fixed? In GCC 10 -Wstringop-overread is a subset of -Wstringop-overflow. GCC 11 breaks it out as a separate warning to make it easier to control. Both warnings have caught some real bugs but they both have a nonzero rate of false positives. Other than bug reports we don't have enough data to say what their S/N ratio might be but my sense is that it's fairly high in general. https://gcc.gnu.org/bugzilla/show_bug.cgi?id=wstringop-overread https://gcc.gnu.org/bugzilla/show_bug.cgi?id=wstringop-overflow In GCC 11, all access warnings expect objects to be either declared or allocated. Pointers with constant values are taken to point to nothing valid (as Arnd mentioned above, this is to detect invalid accesses to members of structs at address zero). One possible solution to the known address problem is to extend GCC attributes address and io that pin an object to a hardwired address to all targets (at the moment they're supported on just one or two targets). I'm not sure this can still happen before GCC 11 releases sometime in April or May. Until then, another workaround is to convert the fixed address to a volatile pointer
Re: [PATCH v6] close_range.2: new page documenting close_range(2)
On Sun, 21 Mar 2021 16:38:59 +0100, "Michael Kerrisk (man-pages)" wrote: > On 3/9/21 8:53 PM, Stephen Kitt wrote: > > On Thu, 28 Jan 2021 21:50:23 +0100, "Michael Kerrisk (man-pages)" > > wrote: > >> Thanks for your patch revision. I've merged it, and have > >> done some light editing, but I still have a question: > > > > Does this need anything more? I don’t see it in the man-pages repo. > > Sorry, Stephen. It's just me being slow. I've made a few edits, > replaced the example program with another that more clearly allows > the user to see what's going on, and pushed to Git. Thanks, your example program is indeed much better! Regards, Stephen pgpCskU70zyVN.pgp Description: OpenPGP digital signature
Re: [PATCH v3 03/25] x86/sgx: Wipe out EREMOVE from sgx_free_epc_page()
On Mon, 22 Mar 2021 22:06:45 +0100 Borislav Petkov wrote: > On Mon, Mar 22, 2021 at 12:37:02PM -0700, Sean Christopherson wrote: > > Yes. Note, it's still true if you strike out the "too", KVM support is > > completely > > orthogonal to this code. The purpose of this patch is to separate out the > > EREMOVE > > path used for host enclaves (/dev/sgx_enclave), because EPC virtualization > > for > > KVM will have non-buggy scenarios where EREMOVE can fail. But the virt EPC > > code > > is designed to handle that gracefully. > > "gracefully" as it won't leak EPC pages which would require a host reboot? > That > leaking is done by host enclaves only? This path is called by host SGX driver only, so yes this leaking is done by host enclaves only. This patch is purpose is to break EREMOVE out of sgx_free_epc_page() so virtual EPC code can use sgx_free_epc_page(), and handle EREMOVE logic differently. This patch doesn't have intention to bring functional change. I changed the error msg because Dave said it worth to give a message saying EPC page is leaked, and I thought this minor change won't break anything. Perpahps we can avoid changing error message but stick to existing SGX driver behavior? > > > Hmm. I don't think it warrants BUG. At worst, leaking EPC pages is fatal > > only > > to SGX. > > Fatal how? If it keeps leaking, at some point it won't have any pages > for EPC pages anymore? > > Btw, I probably have seen this and forgotten again so pls remind me, > is the amount of pages available for SGX use static and limited by, > I believe BIOS, or can a leakage in EPC pages cause system memory > shortage? > > > If the underlying bug caused other fallout, e.g. didn't release a > > lock, then obviously that could be fatal to the kernel. But I don't > > think there's ever a case where SGX being unusuable would prevent the > > kernel from functioning. > > This kinda replies my question above but still... > > > Probably something in between. Odds are good SGX will eventually become > > unusuable, e.g. either kernel SGX support is completely hosted, or it will > > soon > > leak the majority of EPC pages. Something like this? > > > > "EREMOVE returned %d (0x%x), kernel bug likely. EPC page leaked, SGX may > > become unusuable. Reboot recommended to continue using SGX." > > So all this handwaving I'm doing is to provoke a proper response from > you guys as to how a EPC page leaking is supposed to be handled by the > users of the technology: > > 1. Issue a warning message and forget about it, eventual reboot This is the existing SGX driver behavior IMHO. It just gives a WARN() saying EREMOVE failed with the error code printed. The EPC page is leaked w/o any message to user. I can live with it, and it is existing code anyway. btw, currently virtual EPC code (patch 5) handles in similar way: There's one EREMOVE error is expected and virtual EPC code can handle, but for other errors, it means kernel bug, and virtual EPC code gives a WARN(), and that EPC page is leaked too: + WARN_ONCE(ret != SGX_CHILD_PRESENT, + "EREMOVE (EPC page 0x%lx): unexpected error: %d\n", + sgx_get_epc_phys_addr(epc_page), ret); + return ret; So to me they are just WARN() to catch kernel bug. > > 2. Really scary message to make users reboot sooner > > 3. Detect when host enclaves are run while guest enclaves are running > and issue a warning then. This code path has nothing to do with guest enclaves. > > 4. Fall on knees and pray to not get sued by customers because their > enclaves are not working anymore. > > > > Btw, 4. needs to be considered properly so that people can cover asses. If we are talking about CSPs being unable to provide correct services to customers due to kernel bug, I think this is a bigger question but not just related to SGX, since other kernel bug can also cause similar problem, for instance, VM or SGX process itself being killed. > > Oh and whatever we end up deciding, we should document that in > Documentation/... somewhere and point users to it in that warning > message where a longer treatise is explaining the whole deal properly. > > -- > Regards/Gruss, > Boris. > > https://people.kernel.org/tglx/notes-about-netiquette
Re: [PATCH v3 4/4] ioctl_userfaultfd.2: Add write-protect mode docs
On Fri, Mar 19, 2021 at 11:37:20PM +0100, Alejandro Colomar (man-pages) wrote: > Hi Peter, Hi, Alex, > > +generate another write-protect userfault message. > > +This is only used in conjunction with write-protect mode when both missing > > and > > "when both missing" > > both what? I modified it to: This is only used when both .B UFFDIO_REGISTER_MODE_MISSING and .B UFFDIO_REGISTER_MODE_WP modes are enabled for the registered range. And I fixed all the rest, including all the comments in patch 3. Thanks for looking, I'll repost shortly. -- Peter Xu
Re: [PATCH] of: overlay: fix for_each_child.cocci warnings
On 3/22/21 1:21 PM, Julia Lawall wrote: > From: kernel test robot > > Function "for_each_child_of_node" should have of_node_put() before goto. > > Generated by: scripts/coccinelle/iterators/for_each_child.cocci > > Fixes: 82c2d81361ec ("coccinelle: iterators: Add for_each_child.cocci script") > CC: Sumera Priyadarsini > Reported-by: kernel test robot > Signed-off-by: kernel test robot > Signed-off-by: Julia Lawall > --- > > tree: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git > master > head: 812da4d39463a060738008a46cfc9f775e4bfcf6 > commit: 82c2d81361ecd142a54e84a9da1e287113314a4f coccinelle: iterators: Add > for_each_child.cocci script > :: branch date: 13 hours ago > :: commit date: 5 months ago > > overlay.c |1 + > 1 file changed, 1 insertion(+) > > --- a/drivers/of/overlay.c > +++ b/drivers/of/overlay.c > @@ -796,6 +796,7 @@ static int init_overlay_changeset(struct > if (!fragment->target) { > of_node_put(fragment->overlay); > ret = -EINVAL; > + of_node_put(node); > goto err_free_fragments; > } > Reviewed-by: Frank Rowand Tested-by: Frank Rowand While reading through the code touched by the patch I noticed that the clean up at label err_free_fragments does not do the required of_node_put() calls. I'll add creating a patch to fix that to my todo list. -Frank
Re: [PATCH] ftrace: shut up -Wcast-function-type warning for ftrace_ops_no_ops
On Mon, 22 Mar 2021 22:49:58 +0100 Arnd Bergmann wrote: > From: Arnd Bergmann > > With 'make W=1', gcc warns about casts between incompatible function > types: > > kernel/trace/ftrace.c:128:31: error: cast between incompatible function types > from 'void (*)(long unsigned int, long unsigned int)' to 'void (*)(long > unsigned int, long unsigned int, struct ftrace_ops *, struct ftrace_regs > *)' [-Werror=cast-function-type] > 128 | #define ftrace_ops_list_func ((ftrace_func_t)ftrace_ops_no_ops) > | ^ > > As the commet here explains, this one was intentional, so shut up the > warning harder by using a double cast. Bonus points for reading the comment ;-) I'll take this patch for the next merge window, thanks! -- Steve
[PATCH] mm, thp: Relax the VM_DENYWRITE constraint on file-backed THPs
Transparent huge pages are supported for read-only non-shmem filesystems, but are only used for vmas with VM_DENYWRITE. This condition ensures that file THPs are protected from writes while an application is running (ETXTBSY). Any existing file THPs are then dropped from the page cache when a file is opened for write in do_dentry_open(). Since sys_mmap ignores MAP_DENYWRITE, this constrains the use of file THPs to vmas produced by execve(). Systems that make heavy use of shared libraries (e.g. Android) are unable to apply VM_DENYWRITE through the dynamic linker, preventing them from benefiting from the resultant reduced contention on the TLB. This patch reduces the constraint on file THPs allowing use with any executable mapping from a file not opened for write (see inode_is_open_for_write()). It also introduces additional conditions to ensure that files opened for write will never be backed by file THPs. Restricting the use of THPs to executable mappings eliminates the risk that a read-only file later opened for write would encounter significant latencies due to page cache truncation. The ld linker flag '-z max-page-size=(hugepage size)' can be used to produce executables with the necessary layout. The dynamic linker must map these file's segments at a hugepage size aligned vma for the mapping to be backed with THPs. Signed-off-by: Collin Fijalkovich --- fs/open.c | 13 +++-- mm/khugepaged.c | 16 +++- 2 files changed, 26 insertions(+), 3 deletions(-) diff --git a/fs/open.c b/fs/open.c index e53af13b5835..f76e960d10ea 100644 --- a/fs/open.c +++ b/fs/open.c @@ -852,8 +852,17 @@ static int do_dentry_open(struct file *f, * XXX: Huge page cache doesn't support writing yet. Drop all page * cache for this file before processing writes. */ - if ((f->f_mode & FMODE_WRITE) && filemap_nr_thps(inode->i_mapping)) - truncate_pagecache(inode, 0); + if (f->f_mode & FMODE_WRITE) { + /* +* Paired with smp_mb() in collapse_file() to ensure nr_thps +* is up to date and the update to i_writecount by +* get_write_access() is visible. Ensures subsequent insertion +* of THPs into the page cache will fail. +*/ + smp_mb(); + if (filemap_nr_thps(inode->i_mapping)) + truncate_pagecache(inode, 0); + } return 0; diff --git a/mm/khugepaged.c b/mm/khugepaged.c index a7d6cb912b05..4c7cc877d5e3 100644 --- a/mm/khugepaged.c +++ b/mm/khugepaged.c @@ -459,7 +459,8 @@ static bool hugepage_vma_check(struct vm_area_struct *vma, /* Read-only file mappings need to be aligned for THP to work. */ if (IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS) && vma->vm_file && - (vm_flags & VM_DENYWRITE)) { + !inode_is_open_for_write(vma->vm_file->f_inode) && + (vm_flags & VM_EXEC)) { return IS_ALIGNED((vma->vm_start >> PAGE_SHIFT) - vma->vm_pgoff, HPAGE_PMD_NR); } @@ -1872,6 +1873,19 @@ static void collapse_file(struct mm_struct *mm, else { __mod_lruvec_page_state(new_page, NR_FILE_THPS, nr); filemap_nr_thps_inc(mapping); + /* +* Paired with smp_mb() in do_dentry_open() to ensure +* i_writecount is up to date and the update to nr_thps is +* visible. Ensures the page cache will be truncated if the +* file is opened writable. +*/ + smp_mb(); + if (inode_is_open_for_write(mapping->host)) { + result = SCAN_FAIL; + __mod_lruvec_page_state(new_page, NR_FILE_THPS, -nr); + filemap_nr_thps_dec(mapping); + goto xa_locked; + } } if (nr_none) { -- 2.31.0.rc2.261.g7f71774620-goog
Re: [PATCH next v1 2/3] printk: remove safe buffers
On 2021-03-22, Petr Mladek wrote: > On Mon 2021-03-22 12:16:15, John Ogness wrote: >> On 2021-03-21, Sergey Senozhatsky wrote: >> >> @@ -369,7 +70,10 @@ __printf(1, 0) int vprintk_func(const char *fmt, >> >> va_list args) >> >>* Use the main logbuf even in NMI. But avoid calling console >> >>* drivers that might have their own locks. >> >>*/ >> >> - if ((this_cpu_read(printk_context) & PRINTK_NMI_DIRECT_CONTEXT_MASK)) { >> >> + if (this_cpu_read(printk_context) & >> >> + (PRINTK_NMI_DIRECT_CONTEXT_MASK | >> >> + PRINTK_NMI_CONTEXT_MASK | >> >> + PRINTK_SAFE_CONTEXT_MASK)) { >> > >> > Do we need printk_nmi_direct_enter/exit() and >> > PRINTK_NMI_DIRECT_CONTEXT_MASK? Seems like all printk_safe() paths >> > are now DIRECT - we store messages to the prb, but don't call console >> > drivers. >> >> I was planning on waiting until the kthreads are introduced, in which >> case printk_safe.c is completely removed. > > You want to keep printk_safe() context because it prevents calling > consoles even in normal context. Namely, it prevents deadlock by > recursively taking, for example, sem->lock in console_lock() or > console_owner_lock in console_trylock_spinning(). Am I right? Correct. >> But I suppose I could switch >> the 1 printk_nmi_direct_enter() user to printk_nmi_enter() so that >> PRINTK_NMI_DIRECT_CONTEXT_MASK can be removed now. I would do this in a >> 4th patch of the series. > > Yes, please unify the PRINTK_NMI_CONTEXT. One is enough. Agreed. (But I'll go even further. See below.) > I wonder if it would make sense to go even further at this stage. > There will still be 4 contexts that modify the printk behavior after > this patchset: > > + printk_count set by printk_enter()/exit() > + prevents: infinite recursion > + context: any context > + action: skips entire printk at 3rd recursion level > > + prink_context set by printk_safe_enter()/exit() > + prevents: dead lock caused by recursion into some > console code in any context > + context: any > + action: skips console call at 1st recursion level Technically, at this point printk_safe_enter() behavior is identical to printk_nmi_enter(). Namely, prevent any recursive printk calls from calling into the console code. > + printk_context set by printk_nmi_enter()/exit() > + prevents: dead lock caused by any console lock recursion > + context: NMI > + action: skips console calls at 0th recursion level > > + kdb_trap_printk > + redirects printk() to kdb_printk() in kdb context > > > What is possible? > > 1. We could get rid of printk_nmi_enter()/exit() and >PRINTK_NMI_CONTEXT completely already now. It is enough >to check in_nmi() in printk_func(). > >printk_nmi_enter() was added by the commit 42a0bb3f71383b457a7db362 >("printk/nmi: generic solution for safe printk in NMI"). It was >really needed to modify @printk_func pointer. > >We did not remove it later when printk_function became a real >function. The idea was to track all printk contexts in a single >variable. But we never added kdb context. > >It might make sense to remove it now. Peter Zijstra would be happy. >There already were some churns with tracking printk_context in NMI. >For example, see >https://lore.kernel.org/r/20200219150744.428764...@infradead.org > >IMHO, it does not make sense to wait until the entire console-stuff >rework is done in this case. Agreed. in_nmi() within vprintk_emit() is enough to detect if the console code should be skipped: if (!in_sched && !in_nmi()) { ... } > 2. I thought about unifying printk_safe_enter()/exit() and >printk_enter()/exit(). They both count recursion with >IRQs disabled, have similar name. But they are used >different way. > >But better might be to rename printk_safe_enter()/exit() to >console_enter()/exit() or to printk_deferred_enter()/exit(). >It would make more clear what it does now. And it might help >to better distinguish it from the new printk_enter()/exit(). > >This patchset actually splits the original printk_safe() >functionality into two: > >+ printk_count prevents infinite recursion >+ printk_deferred_enter() deffers console handling. > >I am not sure if it is worth it. But it might help people (even me) >when digging into the printk history. Different name will help to >understand the functionality at the given time. I am also not sure if it is worth the extra "noise" just to give the function a more appropriate name. The plan is to remove it completely soon anyway. My vote is to leave the name as it is. But I am willing to do the rename in an addtional patch if you want. printk_deferred_enter() sounds fine to me. Please confirm if you want me to do this. John Ogness
Re: [Ksummit-discuss] RFC: create mailing list "linux-issues" focussed on issues/bugs and regressions
On Mon, Mar 22, 2021 at 08:25:15PM +0100, Thorsten Leemhuis wrote: > I agree to the last point and yeah, maybe regressions are the more > important problem we should work on – at least from the perspective of > kernel development. But from the users perspective (and > reporting-issues.rst is written for that perspective) it feel a bit > unsatisfying to not have a solution to query for existing report, > regressions or not. H... First of all, thanks for working on reporting-issues.rst. If we can actually get users to *read* it, I think it's going to save kernel developers a huge amount of time and frustration. I wonder if it might be useful to have a form which users could be encouraged to fill out so that (a) the information is available in a structured format so it's easier for developers to find the relevant information, (b) so it is easier for programs to parse, for easier reporting or indexing, and (c) as a nudge so that users remember to report critical bits of information such as the hardware configuration, the exact kernel version, which distribution userspace was in use, etc. There could also be something in the text form which would make it easier for lore.kernel.org searches to identify bug reports. (e.g., "LINUX KERNEL BUG REPORTER TEMPLATE") - Ted
[tip:core/entry] BUILD SUCCESS 97258ce902d1e1c396a4d7c38f6ae7085adb73c5
parisc allyesconfig s390defconfig sparcallyesconfig sparc defconfig i386 tinyconfig mips allyesconfig mips allmodconfig powerpc allyesconfig powerpc allmodconfig powerpc allnoconfig i386 randconfig-a004-20210322 i386 randconfig-a003-20210322 i386 randconfig-a001-20210322 i386 randconfig-a002-20210322 i386 randconfig-a006-20210322 i386 randconfig-a005-20210322 x86_64 randconfig-a012-20210322 x86_64 randconfig-a015-20210322 x86_64 randconfig-a013-20210322 x86_64 randconfig-a014-20210322 x86_64 randconfig-a016-20210322 x86_64 randconfig-a011-20210322 i386 randconfig-a014-20210322 i386 randconfig-a011-20210322 i386 randconfig-a015-20210322 i386 randconfig-a016-20210322 i386 randconfig-a012-20210322 i386 randconfig-a013-20210322 riscvnommu_k210_defconfig riscvnommu_virt_defconfig riscv allnoconfig riscv defconfig riscv rv32_defconfig x86_64rhel-7.6-kselftests x86_64 defconfig x86_64 rhel-8.3 x86_64 rhel-8.3-kbuiltin x86_64 kexec clang tested configs: x86_64 randconfig-a002-20210322 x86_64 randconfig-a003-20210322 x86_64 randconfig-a001-20210322 x86_64 randconfig-a006-20210322 x86_64 randconfig-a004-20210322 x86_64 randconfig-a005-20210322 --- 0-DAY CI Kernel Test Service, Intel Corporation https://lists.01.org/hyperkitty/list/kbuild-...@lists.01.org
Re: [PATCH 4.4 00/14] 4.4.263-rc1 review
On Mon, Mar 22, 2021 at 01:28:54PM +0100, Greg Kroah-Hartman wrote: > This is the start of the stable review cycle for the 4.4.263 release. > There are 14 patches in this series, all will be posted as a response > to this one. If anyone has any issues with these being applied, please > let me know. > > Responses should be made by Wed, 24 Mar 2021 12:19:09 +. > Anything received after that time might be too late. > Build results: total: 163 pass: 163 fail: 0 Qemu test results: total: 329 pass: 329 fail: 0 Tested-by: Guenter Roeck Guenter
Re: [PATCH 4.9 00/25] 4.9.263-rc1 review
On Mon, Mar 22, 2021 at 01:28:50PM +0100, Greg Kroah-Hartman wrote: > This is the start of the stable review cycle for the 4.9.263 release. > There are 25 patches in this series, all will be posted as a response > to this one. If anyone has any issues with these being applied, please > let me know. > > Responses should be made by Wed, 24 Mar 2021 12:19:09 +. > Anything received after that time might be too late. > Build results: total: 166 pass: 166 fail: 0 Qemu test results: total: 382 pass: 382 fail: 0 Tested-by: Guenter Roeck Guenter
Re: [PATCH 4.14 00/43] 4.14.227-rc1 review
On Mon, Mar 22, 2021 at 01:28:41PM +0100, Greg Kroah-Hartman wrote: > This is the start of the stable review cycle for the 4.14.227 release. > There are 43 patches in this series, all will be posted as a response > to this one. If anyone has any issues with these being applied, please > let me know. > > Responses should be made by Wed, 24 Mar 2021 12:19:09 +. > Anything received after that time might be too late. > Build results: total: 168 pass: 168 fail: 0 Qemu test results: total: 406 pass: 406 fail: 0 Tested-by: Guenter Roeck Guenter