Re: [PATCH] mm/compaction:let proactive compaction order configurable
Hello. On Mon, Apr 12, 2021 at 05:05:30PM +0800, chukaiping wrote: > Currently the proactive compaction order is fixed to > COMPACTION_HPAGE_ORDER(9), it's OK in most machines with lots of > normal 4KB memory, but it's too high for the machines with small > normal memory, for example the machines with most memory configured > as 1GB hugetlbfs huge pages. In these machines the max order of > free pages is often below 9, and it's always below 9 even with hard > compaction. This will lead to proactive compaction be triggered very > frequently. In these machines we only care about order of 3 or 4. > This patch export the oder to proc and let it configurable > by user, and the default value is still COMPACTION_HPAGE_ORDER. > > Signed-off-by: chukaiping > --- > include/linux/compaction.h |1 + > kernel/sysctl.c| 10 ++ > mm/compaction.c|7 --- > 3 files changed, 15 insertions(+), 3 deletions(-) > > diff --git a/include/linux/compaction.h b/include/linux/compaction.h > index ed4070e..151ccd1 100644 > --- a/include/linux/compaction.h > +++ b/include/linux/compaction.h > @@ -83,6 +83,7 @@ static inline unsigned long compact_gap(unsigned int order) > #ifdef CONFIG_COMPACTION > extern int sysctl_compact_memory; > extern unsigned int sysctl_compaction_proactiveness; > +extern unsigned int sysctl_compaction_order; > extern int sysctl_compaction_handler(struct ctl_table *table, int write, > void *buffer, size_t *length, loff_t *ppos); > extern int sysctl_extfrag_threshold; > diff --git a/kernel/sysctl.c b/kernel/sysctl.c > index 62fbd09..277df31 100644 > --- a/kernel/sysctl.c > +++ b/kernel/sysctl.c > @@ -114,6 +114,7 @@ > static int __maybe_unused neg_one = -1; > static int __maybe_unused two = 2; > static int __maybe_unused four = 4; > +static int __maybe_unused ten = 10; ^^ does the upper limit have to be hard-coded like this? > static unsigned long zero_ul; > static unsigned long one_ul = 1; > static unsigned long long_max = LONG_MAX; > @@ -2871,6 +2872,15 @@ int proc_do_static_key(struct ctl_table *table, int > write, > .extra2 = &one_hundred, > }, > { > + .procname = "compaction_order", > + .data = &sysctl_compaction_order, > + .maxlen = sizeof(sysctl_compaction_order), > + .mode = 0644, > + .proc_handler = proc_dointvec_minmax, > + .extra1 = SYSCTL_ZERO, I wonder what happens if this knob is set to 0. Have you tested such a corner case? > + .extra2 = &ten, > + }, > + { > .procname = "extfrag_threshold", > .data = &sysctl_extfrag_threshold, > .maxlen = sizeof(int), > diff --git a/mm/compaction.c b/mm/compaction.c > index e04f447..a192996 100644 > --- a/mm/compaction.c > +++ b/mm/compaction.c > @@ -1925,16 +1925,16 @@ static bool kswapd_is_running(pg_data_t *pgdat) > > /* > * A zone's fragmentation score is the external fragmentation wrt to the > - * COMPACTION_HPAGE_ORDER. It returns a value in the range [0, 100]. > + * sysctl_compaction_order. It returns a value in the range [0, 100]. > */ > static unsigned int fragmentation_score_zone(struct zone *zone) > { > - return extfrag_for_order(zone, COMPACTION_HPAGE_ORDER); > + return extfrag_for_order(zone, sysctl_compaction_order); > } > > /* > * A weighted zone's fragmentation score is the external fragmentation > - * wrt to the COMPACTION_HPAGE_ORDER scaled by the zone's size. It > + * wrt to the sysctl_compaction_order scaled by the zone's size. It > * returns a value in the range [0, 100]. > * > * The scaling factor ensures that proactive compaction focuses on larger > @@ -2666,6 +2666,7 @@ static void compact_nodes(void) > * background. It takes values in the range [0, 100]. > */ > unsigned int __read_mostly sysctl_compaction_proactiveness = 20; > +unsigned int __read_mostly sysctl_compaction_order = COMPACTION_HPAGE_ORDER; > > /* > * This is the entry point for compacting all nodes via > -- > 1.7.1 > -- Oleksandr Natalenko (post-factum)
Re: [igb] netconsole triggers warning in netpoll_poll_dev
Hello. On Tue, Apr 06, 2021 at 11:48:02AM -0700, Jakub Kicinski wrote: > On Tue, 6 Apr 2021 14:36:19 +0200 Oleksandr Natalenko wrote: > > Hello. > > > > I've raised this here [1] first, but was suggested to engage igb devs, > > so here we are. > > > > I'm experiencing the following woes while using netconsole regularly: > > > > ``` > > [22038.710800] [ cut here ] > > [22038.710801] igb_poll+0x0/0x1440 [igb] exceeded budget in poll > > [22038.710802] WARNING: CPU: 12 PID: 40362 at net/core/netpoll.c:155 > > netpoll_poll_dev+0x18a/0x1a0 > > [22038.710802] Modules linked in: ... > > [22038.710835] CPU: 12 PID: 40362 Comm: systemd-sleep Not tainted > > 5.11.0-pf7 #1 > > [22038.710836] Hardware name: ASUS System Product Name/Pro WS X570-ACE, > > BIOS 3302 03/05/2021 > > [22038.710836] RIP: 0010:netpoll_poll_dev+0x18a/0x1a0 > > [22038.710837] Code: 6e ff 80 3d d2 9d f8 00 00 0f 85 5c ff ff ff 48 8b 73 > > 28 48 c7 c7 0c b8 21 84 89 44 24 04 c6 05 b6 9d f8 00 01 e8 84 21 1c 00 > > <0f> 0b 8b 54 24 04 e9 36 ff ff ff 66 66 2e 0f 1f 84 00 00 00 00 00 > > [22038.710838] RSP: 0018:b24106e37ba0 EFLAGS: 00010086 > > [22038.710838] RAX: RBX: 9599d2929c50 RCX: > > 959f8ed1ac30 > > [22038.710839] RDX: RSI: 0023 RDI: > > 959f8ed1ac28 > > [22038.710839] RBP: 9598981d4058 R08: 0019 R09: > > b24206e3796d > > [22038.710839] R10: R11: b24106e37968 R12: > > 959887e51ec8 > > [22038.710840] R13: 000c R14: R15: > > 9599d2929c60 > > [22038.710840] FS: 7f3ade370a40() GS:959f8ed0() > > knlGS: > > [22038.710841] CS: 0010 DS: ES: CR0: 80050033 > > [22038.710841] CR2: CR3: 0003017b CR4: > > 00350ee0 > > [22038.710841] Call Trace: > > [22038.710842] netpoll_send_skb+0x185/0x240 > > [22038.710842] write_msg+0xe5/0x100 [netconsole] > > [22038.710842] console_unlock+0x37d/0x640 > > [22038.710842] ? __schedule+0x2e5/0xc90 > > [22038.710843] suspend_devices_and_enter+0x2ac/0x7f0 > > [22038.710843] pm_suspend.cold+0x321/0x36c > > [22038.710843] state_store+0xa6/0x140 > > [22038.710844] kernfs_fop_write_iter+0x124/0x1b0 > > [22038.710844] new_sync_write+0x16a/0x200 > > [22038.710844] vfs_write+0x21c/0x2e0 > > [22038.710844] __x64_sys_write+0x6d/0xf0 > > [22038.710845] do_syscall_64+0x33/0x40 > > [22038.710845] entry_SYSCALL_64_after_hwframe+0x44/0xa9 > > [22038.710845] RIP: 0033:0x7f3adece10f7 > > [22038.710846] Code: 0d 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f > > 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 > > <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24 > > [22038.710847] RSP: 002b:7ffc51c555b8 EFLAGS: 0246 ORIG_RAX: > > 0001 > > [22038.710847] RAX: ffda RBX: 0004 RCX: > > 7f3adece10f7 > > [22038.710848] RDX: 0004 RSI: 7ffc51c556a0 RDI: > > 0004 > > [22038.710848] RBP: 7ffc51c556a0 R08: 55ea374302a0 R09: > > 7f3aded770c0 > > [22038.710849] R10: 7f3aded76fc0 R11: 0246 R12: > > 0004 > > [22038.710849] R13: 55ea3742c430 R14: 0004 R15: > > 7f3adedb3700 > > [22038.710849] ---[ end trace 6eae54fbf23807f8 ]--- > > ``` > > > > This one happened during suspend/resume cycle (on resume), followed by: > > > > ``` > > [22038.868669] igb :05:00.0 enp5s0: Reset adapter > > [22040.998673] igb :05:00.0 enp5s0: Reset adapter > > [22043.819198] igb :05:00.0 enp5s0: igb: enp5s0 NIC Link is Up 1000 > > Mbps Full Duplex, Flow Control: RX > > ``` > > > > I've bumped into a similar issue in BZ 211911 [2] (see c#16), > > and in c#17 it was suggested it was a separate unrelated issue, > > hence I'm raising a new concern. > > > > Please help in finding out why this woe happens and in fixing it. > > > > Thanks. > > > > [1] https://bugzilla.kernel.org/show_bug.cgi?id=212573 > > [2] https://bugzilla.kernel.org/show_bug.cgi?id=211911 > > Looks like igb_clean_tx_irq() should not return true if budget is 0, > ever, otherwise we risk hitting the min(work, budget - 1) which may > go negative. > > So something like this? > > diff --git a/drivers/net/ethernet/intel/igb/igb_main.c &
[igb] netconsole triggers warning in netpoll_poll_dev
9198] igb :05:00.0 enp5s0: igb: enp5s0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX ``` I've bumped into a similar issue in BZ 211911 [2] (see c#16), and in c#17 it was suggested it was a separate unrelated issue, hence I'm raising a new concern. Please help in finding out why this woe happens and in fixing it. Thanks. [1] https://bugzilla.kernel.org/show_bug.cgi?id=212573 [2] https://bugzilla.kernel.org/show_bug.cgi?id=211911 -- Oleksandr Natalenko (post-factum)
Page fault in cgroup_get_e_css
Hello. >From time to time I'm experiencing the following: ``` [64924.105071] BUG: unable to handle page fault for address: 04000190 [64924.105080] #PF: supervisor read access in kernel mode [64924.105083] #PF: error_code(0x) - not-present page [64924.105085] PGD 0 P4D 0 [64924.105088] Oops: [#1] PREEMPT SMP NOPTI [64924.105091] CPU: 3 PID: 1103 Comm: bluetoothd Tainted: GW 5.11.0-pf7 #1 [64924.105094] Hardware name: ASUS System Product Name/Pro WS X570-ACE, BIOS 3302 03/05/2021 [64924.105097] RIP: 0010:cgroup_get_e_css+0x27/0xe0 [64924.105102] Code: b9 eb da 0f 1f 44 00 00 41 54 55 48 89 f5 53 48 89 fb e8 2c 8a fb ff 49 89 dc 48 85 ed 74 10 48 63 85 94 00 00 00 48 83 c0 2e <4c> 8b 64 c3 08 4d 85 e4 74 4d 41 f6 44 24 54 01 74 0d e8 32 db fb [64924.105105] RSP: 0018:c07f02bafad8 EFLAGS: 00010012 [64924.105108] RAX: 0031 RBX: 0400 RCX: [64924.105111] RDX: 000a RSI: b93398a0 RDI: 0400 [64924.105113] RBP: b93398a0 R08: 0040 R09: c07f02bafcd8 [64924.105115] R10: 0476 R11: 000c R12: 0400 [64924.105117] R13: 9f88c538e700 R14: 9f88c538e400 R15: edfb96b629c0 [64924.105119] FS: 7f234fe867c0() GS:9f8fceac() knlGS: [64924.105122] CS: 0010 DS: ES: CR0: 80050033 [64924.105124] CR2: 04000190 CR3: 00010c89c000 CR4: 00350ee0 [64924.105127] Call Trace: [64924.105130] wb_get_create+0x8d/0x640 [64924.105136] ? xfs_bmap_add_extent_hole_real+0x60a/0x950 [xfs] [64924.105188] __inode_attach_wb+0x8c/0x250 [64924.105192] account_page_dirtied+0x16d/0x1b0 [64924.105196] __set_page_dirty+0x50/0xc0 [64924.105199] iomap_set_page_dirty+0x50/0x90 [64924.105203] iomap_write_end+0x73/0x280 [64924.105206] ? iov_iter_copy_from_user_atomic+0xc7/0x340 [64924.105210] iomap_write_actor+0xed/0x190 [64924.105213] iomap_apply+0x106/0x300 [64924.105216] ? iomap_write_begin+0x5b0/0x5b0 [64924.105271] iomap_file_buffered_write+0x5c/0x80 [64924.105274] ? iomap_write_begin+0x5b0/0x5b0 [64924.105277] xfs_file_buffered_aio_write+0xe7/0x350 [xfs] [64924.105333] new_sync_write+0x16a/0x200 [64924.105337] vfs_write+0x21c/0x2e0 [64924.105341] __x64_sys_write+0x6d/0xf0 [64924.105344] do_syscall_64+0x33/0x40 [64924.105347] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [64924.105352] RIP: 0033:0x7f23504a40f7 [64924.105355] Code: 0d 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24 [64924.105357] RSP: 002b:7ffcc0fdc988 EFLAGS: 0246 ORIG_RAX: 0001 ``` I'm not quite positive about having an exact reproducer, unfortunately. Have you got an idea on what could go wrong here? Thanks. -- Oleksandr Natalenko (post-factum)
Re: [PATCH] init: add support for zstd compressed modules
Hello. On Wed, Mar 31, 2021 at 07:21:07PM +, Nick Terrell wrote: > > > > On Mar 31, 2021, at 10:48 AM, Oleksandr Natalenko > > wrote: > > > > Hello. > > > > On Wed, Mar 31, 2021 at 05:39:25PM +, Nick Terrell wrote: > >> > >> > >>> On Mar 30, 2021, at 4:50 AM, Oleksandr Natalenko > >>> wrote: > >>> > >>> On Tue, Mar 30, 2021 at 01:32:35PM +0200, Piotr Gorski wrote: > >>>> kmod 28 supports modules compressed in zstd format so let's add this > >>>> possibility to kernel. > >>>> > >>>> Signed-off-by: Piotr Gorski > >>>> --- > >>>> Makefile | 7 +-- > >>>> init/Kconfig | 9 ++--- > >>>> 2 files changed, 11 insertions(+), 5 deletions(-) > >>>> > >>>> diff --git a/Makefile b/Makefile > >>>> index 5160ff8903c1..82f4f4cc2955 100644 > >>>> --- a/Makefile > >>>> +++ b/Makefile > >>>> @@ -1156,8 +1156,8 @@ endif # INSTALL_MOD_STRIP > >>>> export mod_strip_cmd > >>>> > >>>> # CONFIG_MODULE_COMPRESS, if defined, will cause module to be compressed > >>>> -# after they are installed in agreement with CONFIG_MODULE_COMPRESS_GZIP > >>>> -# or CONFIG_MODULE_COMPRESS_XZ. > >>>> +# after they are installed in agreement with > >>>> CONFIG_MODULE_COMPRESS_GZIP, > >>>> +# CONFIG_MODULE_COMPRESS_XZ, or CONFIG_MODULE_COMPRESS_ZSTD. > >>>> > >>>> mod_compress_cmd = true > >>>> ifdef CONFIG_MODULE_COMPRESS > >>>> @@ -1167,6 +1167,9 @@ ifdef CONFIG_MODULE_COMPRESS > >>>> ifdef CONFIG_MODULE_COMPRESS_XZ > >>>>mod_compress_cmd = $(XZ) --lzma2=dict=2MiB -f > >>>> endif # CONFIG_MODULE_COMPRESS_XZ > >>>> + ifdef CONFIG_MODULE_COMPRESS_ZSTD > >>>> +mod_compress_cmd = $(ZSTD) -T0 --rm -f -q > >> > >> This will use the default zstd level, level 3. I think it would make more > >> sense to use a high > >> compression level. Level 19 would probably be a good choice. That will > >> choose a window > >> size of up to 8MB, meaning the decompressor needs to allocate that much > >> memory. If that > >> is unacceptable, you could use `zstd -T0 --rm -f -q -19 --zstd=wlog=21`, > >> which will use a > >> window size of up to 2MB, to match the XZ command. Note that if the file > >> is smaller than > >> the window size, it will be shrunk to the smallest power of two at least > >> as large as the file. > > > > Please no. We've already done that with initramfs in Arch, and it > > increased the time to generate it enormously. > > > > I understand that building a kernel is a more rare operation than > > regenerating initramfs, but still I'd go against hard-coding the level. > > And if it should be specified anyway, I'd opt in for an explicit > > configuration option. Remember, not all the kernel are built on > > build farms... > > > > FWIW, Piotr originally used level 9 which worked okay, but I insisted > > on sending the patch initially without specifying level at all like it is > > done for other compressors. If this is a wrong approach, then oh meh, > > mea culpa ;). > > > > Whatever default non-standard compression level you choose, I'm fine > > as long as I can change it without editing Makefile. > > That makes sense to me. I have a deep seated need to compress files as > efficiently as possible for widely distributed packages. But, I understand > that > slow compression significantly impacts build times for quick iteration. I’d be > happy with a compression level parameter that defaults to a happy middle. > > I’m also fine with taking this patch as-is if it is easier, and I can put up > another > patch that adds a compression level parameter, since I don’t want to block > merging this. Well, it seems Andrew already took this into his tree, so feel free to drop another one on top of that! > > Best, > Nick Terrell > > > Thanks! > > > >> > >> Best, > >> Nick Terrell > >> > >>>> + endif # CONFIG_MODULE_COMPRESS_ZSTD > >>>> endif # CONFIG_MODULE_COMPRESS > >>>> export mod_compress_cmd > >>>> > >>>> diff --git a/init/Kconfig b/init/Kconfig > >>>> index 8c2cfd88f6ef..86a452bc2747 100644 > >>>
Re: [PATCH] init: add support for zstd compressed modules
Hello. On Wed, Mar 31, 2021 at 05:39:25PM +, Nick Terrell wrote: > > > > On Mar 30, 2021, at 4:50 AM, Oleksandr Natalenko > > wrote: > > > > On Tue, Mar 30, 2021 at 01:32:35PM +0200, Piotr Gorski wrote: > >> kmod 28 supports modules compressed in zstd format so let's add this > >> possibility to kernel. > >> > >> Signed-off-by: Piotr Gorski > >> --- > >> Makefile | 7 +-- > >> init/Kconfig | 9 ++--- > >> 2 files changed, 11 insertions(+), 5 deletions(-) > >> > >> diff --git a/Makefile b/Makefile > >> index 5160ff8903c1..82f4f4cc2955 100644 > >> --- a/Makefile > >> +++ b/Makefile > >> @@ -1156,8 +1156,8 @@ endif # INSTALL_MOD_STRIP > >> export mod_strip_cmd > >> > >> # CONFIG_MODULE_COMPRESS, if defined, will cause module to be compressed > >> -# after they are installed in agreement with CONFIG_MODULE_COMPRESS_GZIP > >> -# or CONFIG_MODULE_COMPRESS_XZ. > >> +# after they are installed in agreement with CONFIG_MODULE_COMPRESS_GZIP, > >> +# CONFIG_MODULE_COMPRESS_XZ, or CONFIG_MODULE_COMPRESS_ZSTD. > >> > >> mod_compress_cmd = true > >> ifdef CONFIG_MODULE_COMPRESS > >> @@ -1167,6 +1167,9 @@ ifdef CONFIG_MODULE_COMPRESS > >> ifdef CONFIG_MODULE_COMPRESS_XZ > >> mod_compress_cmd = $(XZ) --lzma2=dict=2MiB -f > >> endif # CONFIG_MODULE_COMPRESS_XZ > >> + ifdef CONFIG_MODULE_COMPRESS_ZSTD > >> +mod_compress_cmd = $(ZSTD) -T0 --rm -f -q > > This will use the default zstd level, level 3. I think it would make more > sense to use a high > compression level. Level 19 would probably be a good choice. That will choose > a window > size of up to 8MB, meaning the decompressor needs to allocate that much > memory. If that > is unacceptable, you could use `zstd -T0 --rm -f -q -19 --zstd=wlog=21`, > which will use a > window size of up to 2MB, to match the XZ command. Note that if the file is > smaller than > the window size, it will be shrunk to the smallest power of two at least as > large as the file. Please no. We've already done that with initramfs in Arch, and it increased the time to generate it enormously. I understand that building a kernel is a more rare operation than regenerating initramfs, but still I'd go against hard-coding the level. And if it should be specified anyway, I'd opt in for an explicit configuration option. Remember, not all the kernel are built on build farms... FWIW, Piotr originally used level 9 which worked okay, but I insisted on sending the patch initially without specifying level at all like it is done for other compressors. If this is a wrong approach, then oh meh, mea culpa ;). Whatever default non-standard compression level you choose, I'm fine as long as I can change it without editing Makefile. Thanks! > > Best, > Nick Terrell > > >> + endif # CONFIG_MODULE_COMPRESS_ZSTD > >> endif # CONFIG_MODULE_COMPRESS > >> export mod_compress_cmd > >> > >> diff --git a/init/Kconfig b/init/Kconfig > >> index 8c2cfd88f6ef..86a452bc2747 100644 > >> --- a/init/Kconfig > >> +++ b/init/Kconfig > >> @@ -2250,8 +2250,8 @@ config MODULE_COMPRESS > >>bool "Compress modules on installation" > >>help > >> > >> -Compresses kernel modules when 'make modules_install' is run; gzip or > >> -xz depending on "Compression algorithm" below. > >> +Compresses kernel modules when 'make modules_install' is run; gzip, > >> +xz, or zstd depending on "Compression algorithm" below. > >> > >> module-init-tools MAY support gzip, and kmod MAY support gzip and xz. > >> > >> @@ -2273,7 +2273,7 @@ choice > >> This determines which sort of compression will be used during > >> 'make modules_install'. > >> > >> -GZIP (default) and XZ are supported. > >> +GZIP (default), XZ, and ZSTD are supported. > >> > >> config MODULE_COMPRESS_GZIP > >> bool "GZIP" > >> @@ -2281,6 +2281,9 @@ config MODULE_COMPRESS_GZIP > >> config MODULE_COMPRESS_XZ > >>bool "XZ" > >> > >> +config MODULE_COMPRESS_ZSTD > >> + bool "ZSTD" > >> + > >> endchoice > >> > >> config MODULE_ALLOW_MISSING_NAMESPACE_IMPORTS > >> -- > >> 2.31.0.97.g1424303384 > >> > > > > Great! > > > > Reviewed-by: Oleksandr Natalenko > > > > This works perfectly fine in Arch Linux if accompanied by the > > following mkinitcpio amendment: [1]. > > > > I'm also Cc'ing other people from get_maintainers output just > > to make this submission more visible. > > > > Thanks. > > > > [1] https://github.com/archlinux/mkinitcpio/pull/43 > > > > -- > > Oleksandr Natalenko (post-factum) > -- Oleksandr Natalenko (post-factum)
Re: [GIT PULL][PATCH v9 0/3] Update to zstd-1.4.10
he > test. > > v2 -> v3: > * (3/9) Silence warnings by Kernel Test Robot: > https://github.com/facebook/zstd/pull/2324 > Stack size warnings remain, but these aren't new, and the functions it > warns on > are either unused or not in the maximum stack path. This patchset reduces > zstd > compression stack usage by 1 KB overall. I've gotten the low hanging fruit, > and > more stack reduction would require significant changes that have the > potential > to introduce new bugs. However, I do hope to continue to reduce zstd stack > usage in future versions. > > v3 -> v4: > * (3/9) Fix errors and warnings reported by Kernel Test Robot: > https://github.com/facebook/zstd/pull/2326 > - Replace mem.h with a custom kernel implementation that matches the current > lib/zstd/mem.h in the kernel. This avoids calls to __builtin_bswap*() > which > don't work on certain architectures, as exposed by the Kernel Test Robot. > - Remove ASAN/MSAN (un)poisoning code which doesn't work in the kernel, as > exposed by the Kernel Test Robot. > - I've fixed all of the valid cppcheck warnings reported, but there were > many > false positives, where cppcheck was incorrectly analyzing the situation, > which I did not fix. I don't believe it is reasonable to expect that > upstream > zstd silences all the static analyzer false positives. Upstream zstd uses > clang scan-build for its static analysis. We find that supporting multiple > static analysis tools multiplies the burden of silencing false positives, > without providing enough marginal value over running a single static > analysis > tool. > > v4 -> v5: > * Rebase onto v5.10-rc2 > * (6/9) Merge with other F2FS changes (no functional change in patch). > > v5 -> v6: > * Rebase onto v5.10-rc6. > * Switch to using a kernel style wrapper API as suggested by Cristoph. > > v6 -> v7: > * Expose the upstream library header as `include/linux/zstd_lib.h`. > Instead of creating new structs mirroring the upstream zstd structs > use upstream's structs directly with a typedef to get a kernel style name. > This removes the memcpy cruft. > * (1/3) Undo ZSTD_WINDOWLOG_MAX and handle_zstd_error changes. > * (3/3) Expose zstd_errors.h as `include/linux/zstd_errors.h` because it > is needed by the kernel wrapper API. > > v7 -> v8: > * (1/3) Fix typo in EXPORT_SYMBOL(). > * (1/3) Fix typo in zstd.h comments. > * (3/3) Update to latest zstd release: 1.4.6 -> 1.4.10 > This includes ~1KB of stack space reductions. > > v8 -> v9: > * (1/3) Rebase onto v5.12-rc5 > * (1/3) Add zstd_min_clevel() & zstd_max_clevel() and use in f2fs. > Thanks to Oleksandr Natalenko for spotting it! > * (1/3) Move lib/decompress_unzstd.c usage of ZSTD_getErrorCode() > to zstd_get_error_code(). > * (1/3) Update modified zstd headers to yearless copyright. > * (2/3) Add copyright/license header to decompress_sources.h for consistency. > * (3/3) Update to yearless copyright for all zstd files. Thanks to > Mike Dolan for spotting it! > > Nick Terrell (3): > lib: zstd: Add kernel-specific API > lib: zstd: Add decompress_sources.h for decompress_unzstd > lib: zstd: Upgrade to latest upstream zstd version 1.4.10 > > crypto/zstd.c | 28 +- > fs/btrfs/zstd.c | 68 +- > fs/f2fs/compress.c| 56 +- > fs/f2fs/super.c |2 +- > fs/pstore/platform.c |2 +- > fs/squashfs/zstd_wrapper.c| 16 +- > include/linux/zstd.h | 1252 +--- > include/linux/zstd_errors.h | 77 + > include/linux/zstd_lib.h | 2432 > lib/decompress_unzstd.c | 48 +- > lib/zstd/Makefile | 44 +- > lib/zstd/bitstream.h | 380 -- > lib/zstd/common/bitstream.h | 437 ++ > lib/zstd/common/compiler.h| 151 + > lib/zstd/common/cpu.h | 194 + > lib/zstd/common/debug.c | 24 + > lib/zstd/common/debug.h | 101 + > lib/zstd/common/entropy_common.c | 357 ++ > lib/zstd/common/error_private.c | 56 + > lib/zstd/common/error_private.h | 66 + > lib/zstd/common/fse.h | 710 +++ > lib/zstd/common/fse_decompress.c | 390 ++ > lib/zstd/common/huf.h | 356 ++ > lib/zstd/common
Re: [PATCH] init: add support for zstd compressed modules
On Tue, Mar 30, 2021 at 01:32:35PM +0200, Piotr Gorski wrote: > kmod 28 supports modules compressed in zstd format so let's add this > possibility to kernel. > > Signed-off-by: Piotr Gorski > --- > Makefile | 7 +-- > init/Kconfig | 9 ++--- > 2 files changed, 11 insertions(+), 5 deletions(-) > > diff --git a/Makefile b/Makefile > index 5160ff8903c1..82f4f4cc2955 100644 > --- a/Makefile > +++ b/Makefile > @@ -1156,8 +1156,8 @@ endif # INSTALL_MOD_STRIP > export mod_strip_cmd > > # CONFIG_MODULE_COMPRESS, if defined, will cause module to be compressed > -# after they are installed in agreement with CONFIG_MODULE_COMPRESS_GZIP > -# or CONFIG_MODULE_COMPRESS_XZ. > +# after they are installed in agreement with CONFIG_MODULE_COMPRESS_GZIP, > +# CONFIG_MODULE_COMPRESS_XZ, or CONFIG_MODULE_COMPRESS_ZSTD. > > mod_compress_cmd = true > ifdef CONFIG_MODULE_COMPRESS > @@ -1167,6 +1167,9 @@ ifdef CONFIG_MODULE_COMPRESS >ifdef CONFIG_MODULE_COMPRESS_XZ > mod_compress_cmd = $(XZ) --lzma2=dict=2MiB -f >endif # CONFIG_MODULE_COMPRESS_XZ > + ifdef CONFIG_MODULE_COMPRESS_ZSTD > +mod_compress_cmd = $(ZSTD) -T0 --rm -f -q > + endif # CONFIG_MODULE_COMPRESS_ZSTD > endif # CONFIG_MODULE_COMPRESS > export mod_compress_cmd > > diff --git a/init/Kconfig b/init/Kconfig > index 8c2cfd88f6ef..86a452bc2747 100644 > --- a/init/Kconfig > +++ b/init/Kconfig > @@ -2250,8 +2250,8 @@ config MODULE_COMPRESS > bool "Compress modules on installation" > help > > - Compresses kernel modules when 'make modules_install' is run; gzip or > - xz depending on "Compression algorithm" below. > + Compresses kernel modules when 'make modules_install' is run; gzip, > + xz, or zstd depending on "Compression algorithm" below. > > module-init-tools MAY support gzip, and kmod MAY support gzip and xz. > > @@ -2273,7 +2273,7 @@ choice > This determines which sort of compression will be used during > 'make modules_install'. > > - GZIP (default) and XZ are supported. > + GZIP (default), XZ, and ZSTD are supported. > > config MODULE_COMPRESS_GZIP > bool "GZIP" > @@ -2281,6 +2281,9 @@ config MODULE_COMPRESS_GZIP > config MODULE_COMPRESS_XZ > bool "XZ" > > +config MODULE_COMPRESS_ZSTD > + bool "ZSTD" > + > endchoice > > config MODULE_ALLOW_MISSING_NAMESPACE_IMPORTS > -- > 2.31.0.97.g1424303384 > Great! Reviewed-by: Oleksandr Natalenko This works perfectly fine in Arch Linux if accompanied by the following mkinitcpio amendment: [1]. I'm also Cc'ing other people from get_maintainers output just to make this submission more visible. Thanks. [1] https://github.com/archlinux/mkinitcpio/pull/43 -- Oleksandr Natalenko (post-factum)
Re: [PATCH v8 1/3] lib: zstd: Add kernel-specific API
Hello. On Sat, Mar 27, 2021 at 05:48:01PM +0800, kernel test robot wrote: > >> ERROR: modpost: "ZSTD_maxCLevel" [fs/f2fs/f2fs.ko] undefined! Since f2fs can be built as a module, the following correction seems to be needed: ``` diff --git a/lib/zstd/compress/zstd_compress.c b/lib/zstd/compress/zstd_compress.c index 9c998052a0e5..584c92c51169 100644 --- a/lib/zstd/compress/zstd_compress.c +++ b/lib/zstd/compress/zstd_compress.c @@ -4860,6 +4860,7 @@ size_t ZSTD_endStream(ZSTD_CStream* zcs, ZSTD_outBuffer* output) #define ZSTD_MAX_CLEVEL 22 int ZSTD_maxCLevel(void) { return ZSTD_MAX_CLEVEL; } +EXPORT_SYMBOL(ZSTD_maxCLevel); int ZSTD_minCLevel(void) { return (int)-ZSTD_TARGETLENGTH_MAX; } static const ZSTD_compressionParameters ZSTD_defaultCParameters[4][ZSTD_MAX_CLEVEL+1] = { ``` Not sure if the same should be done for `ZSTD_minCLevel()` since I don't see it being used anywhere else. -- Oleksandr Natalenko (post-factum)
Re: WARNING: AMDGPU DRM warning in 5.11.9
Hello. On Thu, Mar 25, 2021 at 07:57:33AM +0200, Ilkka Prusi wrote: > On 24.3.2021 16.16, Chris Rankin wrote: > > Hi, > > > > Theee warnings ares not present in my dmesg log from 5.11.8: > > > > [ 43.390159] [ cut here ] > > [ 43.393574] WARNING: CPU: 2 PID: 1268 at > > drivers/gpu/drm/ttm/ttm_bo.c:517 ttm_bo_release+0x172/0x282 [ttm] > > [ 43.401940] Modules linked in: nf_nat_ftp nf_conntrack_ftp cfg80211 > > Changing WARN_ON to WARN_ON_ONCE in drivers/gpu/drm/ttm/ttm_bo.c > ttm_bo_release() reduces the flood of messages into single splat. > > This warning appears to come from 57fcd550eb15bce ("drm/ttm: Warn on pinning > without holding a reference)" and reverting it might be one choice. > > > > > > There are others, but I am assuming there is a common cause here. > > > > Cheers, > > Chris > > > > diff --git a/drivers/gpu/drm/ttm/ttm_bo.c b/drivers/gpu/drm/ttm/ttm_bo.c > index a76eb2c14e8c..50b53355b265 100644 > --- a/drivers/gpu/drm/ttm/ttm_bo.c > +++ b/drivers/gpu/drm/ttm/ttm_bo.c > @@ -514,7 +514,7 @@ static void ttm_bo_release(struct kref *kref) > * shrinkers, now that they are queued for > * destruction. > */ > - if (WARN_ON(bo->pin_count)) { > + if (WARN_ON_ONCE(bo->pin_count)) { > bo->pin_count = 0; > ttm_bo_del_from_lru(bo); > ttm_bo_add_mem_to_lru(bo, &bo->mem); > > > > -- > - Ilkka > WARN_ON_ONCE() will just hide the underlying problem. Do we know why this happens at all? Same for me, BTW, with v5.11.9: ``` [~]> lspci | grep VGA 0a:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Lexa PRO [Radeon 540/540X/550/550X / RX 540X/550/550X] (rev c7) [ 3676.033140] [ cut here ] [ 3676.033153] WARNING: CPU: 7 PID: 1318 at drivers/gpu/drm/ttm/ttm_bo.c:517 ttm_bo_release+0x375/0x500 [ttm] … [ 3676.033340] Hardware name: ASUS System Product Name/Pro WS X570-ACE, BIOS 3302 03/05/2021 … [ 3676.033469] Call Trace: [ 3676.033473] ttm_bo_move_accel_cleanup+0x1ab/0x3a0 [ttm] [ 3676.033478] amdgpu_bo_move+0x334/0x860 [amdgpu] [ 3676.033580] ttm_bo_validate+0x1f1/0x2d0 [ttm] [ 3676.033585] amdgpu_cs_bo_validate+0x9b/0x1c0 [amdgpu] [ 3676.033665] amdgpu_cs_list_validate+0x115/0x150 [amdgpu] [ 3676.033743] amdgpu_cs_ioctl+0x873/0x20a0 [amdgpu] [ 3676.033960] drm_ioctl_kernel+0xb8/0x140 [drm] [ 3676.033977] drm_ioctl+0x222/0x3c0 [drm] [ 3676.034071] amdgpu_drm_ioctl+0x49/0x80 [amdgpu] [ 3676.034145] __x64_sys_ioctl+0x83/0xb0 [ 3676.034149] do_syscall_64+0x33/0x40 … [ 3676.034171] ---[ end trace 66e9865b027112f3 ]--- ``` Thanks. -- Oleksandr Natalenko (post-factum)
Re: [PATCH BUGFIX/IMPROVEMENT V2 0/6] revised version of third and last batch of patches
Hello. On Thu, Mar 04, 2021 at 06:46:21PM +0100, Paolo Valente wrote: > Hi, > this is the V2 for the third and last batches of patches that I > proposed recently [1]. > > I've tried to address all issues raised in [1]. > > In more detail, main changes for V1 are: > 1. I've improved code as requested in "block, bfq: merge bursts of > newly-created queues" > 2. I've improved comments as requested in "block, bfq: put reqs of > waker and woken in dispatch list" > > Thanks, > Paolo > > [1] https://www.spinics.net/lists/linux-block/msg64333.html > > Paolo Valente (6): > block, bfq: always inject I/O of queues blocked by wakers > block, bfq: put reqs of waker and woken in dispatch list > block, bfq: make shared queues inherit wakers > block, bfq: fix weight-raising resume with !low_latency > block, bfq: keep shared queues out of the waker mechanism > block, bfq: merge bursts of newly-created queues > > block/bfq-cgroup.c | 2 + > block/bfq-iosched.c | 399 +--- > block/bfq-iosched.h | 15 ++ > block/bfq-wf2q.c| 8 + > 4 files changed, 402 insertions(+), 22 deletions(-) > > -- > 2.20.1 I'm running the kernel with this submission applied on multiple machines for 3 weeks now and haven't encountered any visible issues. Tested-by: Oleksandr Natalenko Thanks. -- Oleksandr Natalenko (post-factum)
Re: [GIT PULL] ext4 fixes for v5.12
Hi. On Sun, Mar 21, 2021 at 11:36:10PM -0400, Theodore Ts'o wrote: > On Mon, Mar 22, 2021 at 11:05:13AM +0800, Gao Xiang wrote: > > I think the legel name would be "Zhang Yi" (family name goes first [1]) > > according to > > The Chinese phonetic alphabet spelling rules for Chinese names [2]. > > > > Indeed, that is also what the legel name is written in alphabet on our > > passport or credit/debit cards. > > > > Also, many official English-written materials use it in that way, for > > example, a somewhat famous bastetball player Yao Ming [3][4][5]. > > > > That is what I wrote my own name as this but I also noticed the western > > ordering of names is quite common for Chinese people in Linux kernel. > > Anyway, it's just my preliminary personal thought (might be just my > > own perference) according to (I think, maybe) formal occasions. > > Yeah, there doesn't seem to be a lot of consistency with the ordering > of Chinese names when they are written in Roman characters. Some > people put the family name first, and other people will put the > personal (first) name first. In some cases it may be because the > developer in question is living in America, and so they've decided to > use the American naming convention. (Two example of this are former > ext4 developers Mingming Cao and Jiaying Zhang, who live in Portland > and Los Angelos, and their family names are Cao and Zhang, > respectively.) > > My personal opinion is people should use whatever name they are > comfortable with, using whatever characters they prefer. The one > thing that would be helpful for me is for people to give a hint about > how they would prefer to be called --- for example, would you prefer > that start an e-mail with the salutation, "Hi Gao", "Hi Xiang", or "Hi > Gao Xiang"? Is there a common way to indicate that? Like, erm, SPDX, but for names? Saying, instead of just writing "Oleksandr Natalenko" in the email footer, I'll write "Oleksandr Natalenko / RMNA:NS" meaning "read my name as name-surname". Hm? > > And if I don't know, and I guess wrong, please feel free to correct > me, either privately, or publically on the e-mail list, if you think > it would be helpful for more people to understand how you'd prefer to > be called. > > Cheers, > > - Ted -- Oleksandr Natalenko (post-factum)
Re: [PATCH v24 09/10] fs/ntfs3: Add NTFS3 in fs/Kconfig and fs/Makefile
On Fri, Mar 19, 2021 at 09:52:09PM +0300, Konstantin Komarov wrote: > This adds NTFS3 in fs/Kconfig and fs/Makefile > > Signed-off-by: Konstantin Komarov > --- > fs/Kconfig | 1 + > 1 file changed, 1 insertion(+) > > diff --git a/fs/Kconfig b/fs/Kconfig > index a55bda4233bb..f61330e4efc0 100644 > --- a/fs/Kconfig > +++ b/fs/Kconfig > @@ -145,6 +145,7 @@ menu "DOS/FAT/EXFAT/NT Filesystems" > source "fs/fat/Kconfig" > source "fs/exfat/Kconfig" > source "fs/ntfs/Kconfig" > +source "fs/ntfs3/Kconfig" > > endmenu > endif # BLOCK > -- > 2.25.4 > It seems fs/Makefile modification has been dropped from this patch for some reason. Mistake? -- Oleksandr Natalenko (post-factum)
Re: [PATCH v23 02/10] fs/ntfs3: Add initialization of super block
Hi. On Mon, Mar 15, 2021 at 05:44:06PM +0300, Konstantin Komarov wrote: > This adds initialization of super block > > ...SNIP... > > + > +/* > + * Helper for ntfs_loadlog_and_replay > + * fill on-disk logfile range by (-1) > + * this means empty logfile > + */ > +int ntfs_bio_fill_1(struct ntfs_sb_info *sbi, const struct runs_tree *run) > +{ > + int err = 0; > + struct super_block *sb = sbi->sb; > + struct block_device *bdev = sb->s_bdev; > + u8 cluster_bits = sbi->cluster_bits; > + struct bio *new, *bio = NULL; > + CLST lcn, clen; > + u64 lbo, len; > + size_t run_idx; > + struct page *fill; > + void *kaddr; > + struct blk_plug plug; > + > + fill = alloc_page(GFP_KERNEL); > + if (!fill) > + return -ENOMEM; > + > + kaddr = kmap_atomic(fill); > + memset(kaddr, -1, PAGE_SIZE); > + kunmap_atomic(kaddr); > + flush_dcache_page(fill); > + lock_page(fill); > + > + if (!run_lookup_entry(run, 0, &lcn, &clen, &run_idx)) { > + err = -ENOENT; > + goto out; > + } > + > + /* > + * TODO: try blkdev_issue_write_same > + */ > + blk_start_plug(&plug); > + do { > + lbo = (u64)lcn << cluster_bits; > + len = (u64)clen << cluster_bits; > +new_bio: > + new = ntfs_alloc_bio(BIO_MAX_PAGES); ^ this was renamed to BIO_MAX_VECS recently. > + if (!new) { > + err = -ENOMEM; > + break; > + } > + if (bio) { > + bio_chain(bio, new); > + submit_bio(bio); > + } > + bio = new; > + bio_set_dev(bio, bdev); > + bio->bi_opf = REQ_OP_WRITE; > + bio->bi_iter.bi_sector = lbo >> 9; > + > + for (;;) { > + u32 add = len > PAGE_SIZE ? PAGE_SIZE : len; > + > + if (bio_add_page(bio, fill, add, 0) < add) > + goto new_bio; > + > + lbo += add; > + if (len <= add) > + break; > + len -= add; > + } > + } while (run_get_entry(run, ++run_idx, NULL, &lcn, &clen)); > + > + if (bio) { > + if (!err) > + err = submit_bio_wait(bio); > + bio_put(bio); > + } > + blk_finish_plug(&plug); > +out: > + unlock_page(fill); > + put_page(fill); > + > + return err; > +} > > ...SNIP... > -- Oleksandr Natalenko (post-factum)
Re: [PATCH v7 0/3] Update to zstd-1.4.6
de 100644 lib/zstd/decompress/zstd_ddict.h > create mode 100644 lib/zstd/decompress/zstd_decompress.c > create mode 100644 lib/zstd/decompress/zstd_decompress_block.c > create mode 100644 lib/zstd/decompress/zstd_decompress_block.h > create mode 100644 lib/zstd/decompress/zstd_decompress_internal.h > create mode 100644 lib/zstd/decompress_sources.h > delete mode 100644 lib/zstd/entropy_common.c > delete mode 100644 lib/zstd/error_private.h > delete mode 100644 lib/zstd/fse.h > delete mode 100644 lib/zstd/fse_compress.c > delete mode 100644 lib/zstd/fse_decompress.c > delete mode 100644 lib/zstd/huf.h > delete mode 100644 lib/zstd/huf_compress.c > delete mode 100644 lib/zstd/huf_decompress.c > delete mode 100644 lib/zstd/mem.h > delete mode 100644 lib/zstd/zstd_common.c > create mode 100644 lib/zstd/zstd_compress_module.c > create mode 100644 lib/zstd/zstd_decompress_module.c > delete mode 100644 lib/zstd/zstd_internal.h > delete mode 100644 lib/zstd/zstd_opt.h > > -- > 2.29.2 > So, what's the fate of this submission please? Thanks. -- Oleksandr Natalenko (post-factum)
Re: [PATCH v21 00/10] NTFS read-write driver GPL implementation by Paragon Software
Hi. On Fri, Feb 12, 2021 at 07:24:06PM +0300, Konstantin Komarov wrote: > This patch adds NTFS Read-Write driver to fs/ntfs3. > … > v21: > - fixes for clang CFI checks > - fixed sb->s_maxbytes for 32bit clusters > - user.DOSATTRIB is no more intercepted by ntfs3 > - corrected xattr limits; is used > - corrected CONFIG_NTFS3_64BIT_CLUSTER usage > - info about current build is added into module info and printing > on insmod (by Andy Lavr's request) > note: v21 is applicable for 'linux-next' not older than 2021.01.28 For those who use this on v5.10/v5.11, there's an extra patch available that applies on top of this submission: [1]. Hanabishi, babam (both in Cc), here [2] you've reported some issues with accessing some files and with hidden attributes. You may reply to this email of mine with detailed description of your issues, and maybe developers will answer you. Thanks. [1] https://gitlab.com/post-factum/pf-kernel/-/commit/e487427ef07c735fdc711a56d1ceac6629c34dcf.patch [2] https://aur.archlinux.org/packages/ntfs3-dkms/ -- Oleksandr Natalenko (post-factum)
Re: kernel BUG at mm/zswap.c:1275! (rc6 - git 61556703b610)
Hello. On Thu, Feb 11, 2021 at 10:43:18AM +, Song Bao Hua (Barry Song) wrote: > Are you using zsmalloc? There is a known bug on the combination > of zsmalloc and zswap, fixed by patches of tiantao: > > mm: set the sleep_mapped to true for zbud and z3fold > mm/zswap: fix variable 'entry' is uninitialized when used > mm/zswap: fix potential memory leak > mm/zswap: add the flag can_sleep_mapped > > at Linux-next: > https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/log/?qt=author&q=tiantao6%40hisilicon.com Is this a future stable-5.11 material (and/or, potentially, older stable branches as well)? -- Oleksandr Natalenko (post-factum)
Re: [PATCH 2/2] bfq: amend the function name of bfq_may_expire_for_budg_timeout()
On Wed, Feb 10, 2021 at 12:13:59PM +0100, Paolo Valente wrote: > > > > Il giorno 29 gen 2021, alle ore 11:51, Chunguang Xu > > ha scritto: > > > > From: Chunguang Xu > > > > The function name bfq_may_expire_for_budg_timeout() may be misspelled, > > try to fix it. > > > > Ok for me to make this name longer. > > Thanks, > Paolo > > > Signed-off-by: Chunguang Xu > > --- > > block/bfq-iosched.c | 6 +++--- > > 1 file changed, 3 insertions(+), 3 deletions(-) > > > > diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c > > index 9e4eb0f..4f40c61 100644 > > --- a/block/bfq-iosched.c > > +++ b/block/bfq-iosched.c > > @@ -4061,7 +4061,7 @@ static bool bfq_bfqq_budget_timeout(struct bfq_queue > > *bfqq) > > * condition does not hold, or if the queue is slow enough to deserve > > * only to be kicked off for preserving a high throughput. > > */ > > -static bool bfq_may_expire_for_budg_timeout(struct bfq_queue *bfqq) > > +static bool bfq_may_expire_for_budget_timeout(struct bfq_queue *bfqq) > > { > > bfq_log_bfqq(bfqq->bfqd, bfqq, > > "may_budget_timeout: wait_request %d left %d timeout %d", > > @@ -4350,7 +4350,7 @@ static struct bfq_queue *bfq_select_queue(struct > > bfq_data *bfqd) > > * on the case where bfq_bfqq_must_idle() returns true, in > > * bfq_completed_request(). > > */ > > - if (bfq_may_expire_for_budg_timeout(bfqq) && > > + if (bfq_may_expire_for_budget_timeout(bfqq) && > > !bfq_bfqq_must_idle(bfqq)) > > goto expire; > > > > @@ -5706,7 +5706,7 @@ static void bfq_completed_request(struct bfq_queue > > *bfqq, struct bfq_data *bfqd) > > * of its reserved service guarantees. > > */ > > return; > > - } else if (bfq_may_expire_for_budg_timeout(bfqq)) > > + } else if (bfq_may_expire_for_budget_timeout(bfqq)) > > bfq_bfqq_expire(bfqd, bfqq, false, > > BFQQE_BUDGET_TIMEOUT); > > else if (RB_EMPTY_ROOT(&bfqq->sort_list) && > > -- > > 1.8.3.1 > > > Was this sent to some mailing list? I don't see an original email with this patch. -- Oleksandr Natalenko (post-factum)
Re: [PATCH 1/2] bfq: remove some useless logic of bfq_update_next_in_service()
On Wed, Feb 10, 2021 at 12:13:29PM +0100, Paolo Valente wrote: > > > > Il giorno 29 gen 2021, alle ore 11:51, Chunguang Xu > > ha scritto: > > > > From: Chunguang Xu > > > > The if statement at the end of the function is obviously useless, > > maybe we can delete it. > > > > Thanks for spotting this mistake. > > Acked-by: Paolo Valente > > > Signed-off-by: Chunguang Xu > > --- > > block/bfq-wf2q.c | 3 --- > > 1 file changed, 3 deletions(-) > > > > diff --git a/block/bfq-wf2q.c b/block/bfq-wf2q.c > > index 26776bd..070e34a 100644 > > --- a/block/bfq-wf2q.c > > +++ b/block/bfq-wf2q.c > > @@ -137,9 +137,6 @@ static bool bfq_update_next_in_service(struct > > bfq_sched_data *sd, > > > > sd->next_in_service = next_in_service; > > > > - if (!next_in_service) > > - return parent_sched_may_change; > > - Unless I'm missing something, this has already been fixed here: https://git.kernel.dk/cgit/linux-block/commit/?h=for-5.12/block&id=1a23e06cdab2be07cbda460c6417d7de564c48e6 > > return parent_sched_may_change; > > } > > > > -- > > 1.8.3.1 > > > -- Oleksandr Natalenko (post-factum)
Re: [PATCH v20 00/10] NTFS read-write driver GPL implementation by Paragon Software
On Sat, Feb 06, 2021 at 05:57:45PM +0100, Oleksandr Natalenko wrote: > On Sat, Feb 06, 2021 at 01:43:10AM +0500, Hanabishi Recca wrote: > > Can't even build v20 due to compilation errors. > > Try this please: http://ix.io/2OwR Slightly reworked version: http://ix.io/2Oxa -- Oleksandr Natalenko (post-factum)
Re: [PATCH v20 00/10] NTFS read-write driver GPL implementation by Paragon Software
On Sat, Feb 06, 2021 at 01:43:10AM +0500, Hanabishi Recca wrote: > Can't even build v20 due to compilation errors. Try this please: http://ix.io/2OwR -- Oleksandr Natalenko (post-factum)
Re: [PATCH v20 00/10] NTFS read-write driver GPL implementation by Paragon Software
On Sat, Feb 06, 2021 at 01:43:10AM +0500, Hanabishi Recca wrote: > Can't even build v20 due to compilation errors. I think this submission is based against linux-next branch where idmapped mounts are introduced, hence it is not applicable to v5.10 and v5.11 any more. -- Oleksandr Natalenko (post-factum)
Re: [PATCH 5.10 637/717] drm/amd/display: Fix memory leaks in S3 resume
On Mon, Jan 04, 2021 at 09:10:17PM +0100, Oleksandr Natalenko wrote: > On Mon, Jan 04, 2021 at 08:04:08PM +0100, Andre Tomt wrote: > > On 28.12.2020 13:50, Greg Kroah-Hartman wrote: > > > From: Stylon Wang > > > > > > commit a135a1b4c4db1f3b8cbed9676a40ede39feb3362 upstream. > > > > > > EDID parsing in S3 resume pushes new display modes > > > to probed_modes list but doesn't consolidate to actual > > > mode list. This creates a race condition when > > > amdgpu_dm_connector_ddc_get_modes() re-initializes the > > > list head without walking the list and results in memory leak. > > > > This commit is causing me problems on 5.10.4: when I turn off the display (a > > LG TV in this case), and turn it back on again later there is no video > > output and I get the following in the kernel log: > > > > [ 8245.259628] [drm:dm_restore_drm_connector_state [amdgpu]] *ERROR* > > Restoring old state failed with -12 > > Uh, it seems you've just saved me a ton of gray hair. I have the very > same issue and I'm going to revert this patch now in order to check > whether it makes any difference. Confirmed, reverting this patch makes my monitor light back after turning off/on. Also, during testing, I've noticed that with the stock v5.10.4 kernel once reboot sequence is initiated and xorg gets killed, the monitor also lights back and shows the console. My HW: 0a:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Lexa PRO [Radeon 540/540X/550/550X / RX 540X/550/550X] (rev c7) > > Thanks! > > > > > I've found another report on this commit as well: > > https://bugzilla.kernel.org/show_bug.cgi?id=211033 > > > > And I suspect this is the same: > > https://bugs.archlinux.org/task/69202 > > > > Reverting it from 5.10.4 makes things behave again. > > > > Have not tested 5.4.86 or 5.11-rc. > > > > I'm using a RX570 Polaris based card. -- Oleksandr Natalenko (post-factum)
Re: [PATCH 5.10 637/717] drm/amd/display: Fix memory leaks in S3 resume
On Mon, Jan 04, 2021 at 08:04:08PM +0100, Andre Tomt wrote: > On 28.12.2020 13:50, Greg Kroah-Hartman wrote: > > From: Stylon Wang > > > > commit a135a1b4c4db1f3b8cbed9676a40ede39feb3362 upstream. > > > > EDID parsing in S3 resume pushes new display modes > > to probed_modes list but doesn't consolidate to actual > > mode list. This creates a race condition when > > amdgpu_dm_connector_ddc_get_modes() re-initializes the > > list head without walking the list and results in memory leak. > > This commit is causing me problems on 5.10.4: when I turn off the display (a > LG TV in this case), and turn it back on again later there is no video > output and I get the following in the kernel log: > > [ 8245.259628] [drm:dm_restore_drm_connector_state [amdgpu]] *ERROR* > Restoring old state failed with -12 Uh, it seems you've just saved me a ton of gray hair. I have the very same issue and I'm going to revert this patch now in order to check whether it makes any difference. Thanks! > > I've found another report on this commit as well: > https://bugzilla.kernel.org/show_bug.cgi?id=211033 > > And I suspect this is the same: > https://bugs.archlinux.org/task/69202 > > Reverting it from 5.10.4 makes things behave again. > > Have not tested 5.4.86 or 5.11-rc. > > I'm using a RX570 Polaris based card. -- Oleksandr Natalenko (post-factum)
min_filelist_kbytes vs file_is_tiny
Hello, Mandeep, Guenter et al. I came across the out-of-tree patch [1] that apparently is still alive after 10 years of residing in the Chromium OS tree, and I have a couple of questions if you don't mind spending your time answering them. 1. is this knob really necessary given there's an explicit bailout mechanism relying on `file_is_tiny` which depends on the sum of high watermarks across the zones? Wouldn't increasing `vm.min_free_kbytes` achieve basically the same? 2. if `vm.min_free_kbytes` is not an option, would setting `file_is_tiny` based on your `min_filelist_kbytes` knob achieve the same? 3. if not, is `memory.min` cgroup2 knob supposed to work in a similar manner? it looks to be unavailable for the root cgroup, though. What I'm looking for, basically, is to achieve the effect of the mentioned patch using mechanisms that are already available in the upstream kernel. Thank you. [1] https://chromium.googlesource.com/chromiumos/third_party/kernel-next/+/545e2917dbd863760a51379de8c26631e667c563^!/ -- Oleksandr Natalenko (post-factum)
Re: [Bug 202453] TRACE irq/18-i801_smb Tainted when enabled threadirqs in kernel commandline.
On Sat, Dec 05, 2020 at 05:19:18PM +0100, Thomas Gleixner wrote: > On Fri, Dec 04 2020 at 21:19, Oleksandr Natalenko wrote: > > On Thu, Dec 03, 2020 at 07:04:00PM +, > > bugzilla-dae...@bugzilla.kernel.org wrote: > >>2) Have a wrapper around handle_generic_irq() which ensures that > >> interrupts are disabled before invoking it. > > > The question is whether it's guaranteed under all circumstances > > including forced irq threading. The i801 driver has assumptions about > > this, so I wouldn't be surprised if there are more. > > Assuming that a final answer might take some time, the below which > implements #2 will make it at least work for now. > > Thanks, > > tglx > --- > Subject: genirq, i2c: Provide and use generic_dispatch_irq() > From: Thomas Gleixner > Date: Thu, 03 Dec 2020 19:12:24 +0100 > > Carlos reported that on his system booting with 'threadirqs' on the command > line result in the following warning: > > irq 31 handler irq_default_primary_handler+0x0/0x10 enabled interrupts > WARNING: CPU: 2 PID: 989 at kernel/irq/handle.c:153 > __handle_irq_event_percpu+0x19f/0x1b0 > > The reason is in the i2c stack: > > i801_isr() > i801_host_notify_isr() > i2c_handle_smbus_host_notify() > generic_handle_irq() > > and that explodes with forced interrupt threading because it's called with > interrupts enabled. > > It would be possible to set IRQF_NO_THREAD on the i801 interrupt to exclude > it from force threading, but that would break on RT and require a larger > update. > > It's also unclear whether there are other drivers which can reach that code > path via i2c_slave_host_notify_cb(). As there are enough i2c drivers which > use threaded interrupt handlers by default it seems not completely > impossible that this can happen even without force threaded interrupts. > > For a quick fix provide a wrapper around generic_handle_irq() which has a > local_irq_save/restore() around the invocation and use it in the i2c code. > > Reported-by: Carlos Jimenez > Signed-off-by: Thomas Gleixner > Link: https://bugzilla.kernel.org/show_bug.cgi?id=202453 > --- > drivers/i2c/i2c-core-base.c |2 +- > include/linux/irqdesc.h |1 + > kernel/irq/irqdesc.c| 20 > 3 files changed, 22 insertions(+), 1 deletion(-) > > --- a/drivers/i2c/i2c-core-base.c > +++ b/drivers/i2c/i2c-core-base.c > @@ -1385,7 +1385,7 @@ int i2c_handle_smbus_host_notify(struct > if (irq <= 0) > return -ENXIO; > > - generic_handle_irq(irq); > + generic_dispatch_irq(irq); > > return 0; > } > --- a/include/linux/irqdesc.h > +++ b/include/linux/irqdesc.h > @@ -153,6 +153,7 @@ static inline void generic_handle_irq_de > } > > int generic_handle_irq(unsigned int irq); > +int generic_dispatch_irq(unsigned int irq); > > #ifdef CONFIG_HANDLE_DOMAIN_IRQ > /* > --- a/kernel/irq/irqdesc.c > +++ b/kernel/irq/irqdesc.c > @@ -652,6 +652,26 @@ int generic_handle_irq(unsigned int irq) > } > EXPORT_SYMBOL_GPL(generic_handle_irq); > > +/** > + * generic_dispatch_irq - Dispatch an interrupt from an interrupt handler > + * @irq: The irq number to handle > + * > + * A wrapper around generic_handle_irq() which ensures that interrupts are > + * disabled when the primary handler of the dispatched irq is invoked. > + * This is useful for interrupt handlers with dispatching to be safe for > + * the forced threaded case. > + */ > +int generic_dispatch_irq(unsigned int irq) > +{ > + unsigned long flags; > + int ret; > + > + local_irq_save(&flags); > + ret = generic_handle_irq(irq); > + local_irq_restore(&flags); FWIW, for me &flags explodes build on v5.10-rc6. I had to change it to local_irq_save/restore(flags) (without taking an address via &). > + return ret; > +} > + > #ifdef CONFIG_HANDLE_DOMAIN_IRQ > /** > * __handle_domain_irq - Invoke the handler for a HW irq belonging to a > domain -- Oleksandr Natalenko (post-factum)
Re: [Bug 202453] TRACE irq/18-i801_smb Tainted when enabled threadirqs in kernel commandline.
On Thu, Dec 03, 2020 at 07:04:00PM +, bugzilla-dae...@bugzilla.kernel.org wrote: > https://bugzilla.kernel.org/show_bug.cgi?id=202453 > > --- Comment #7 from Thomas Gleixner (t...@linutronix.de) --- > On Tue, Aug 06 2019 at 14:07, bugzilla-daemon wrote: > > Jean Delvare (jdelv...@suse.de) changed: > > > > Is this happening with vanilla kernels or gentoo kernels? > > > > Thomas, I'm not sure if this is a bug in the i2c-i801 driver or something > > more > > general in how we handle the interrupts under threadirqs. Any suggestion how > > to > > investigate this further? > > Bah. What happens is that the i2c-i801 driver interrupt handler does: > > i801_isr() > > ... > return i801_host_notify_isr(priv); > > which invokes: > > i2c_handle_smbus_host_notify() > > which in turn invokes > > generic_handle_irq() > > and that explodes with forced interrupt threading because it's called > with interrupts enabled. > > The root of all evil is the usage of generic_handle_irq() under the > assumption that this is always called from hard interrupt context. That > assumption is not true since 8d32a307e4fa ("genirq: Provide forced > interrupt threading") which went into 2.6.39 almost 10 years ago. > > Seems you got lucky that since 10 years someone no user of this uses a > threaded interrupt handler, like some of the i2c drivers actually do. :) > > So there are a couple of options to fix this: > >1) Set the IRQF_NO_THREAD flag and replace the waitqueue as that > won't work on RT. > > Looking at the usage it's a single waiter wakeup and a single > waiter at a time because the xfer is fully serialized by the > core. So we can switch it to rcuwait, if there would be > rcu_wait_event_timeout(), but that's fixable. > >2) Have a wrapper around handle_generic_irq() which ensures that > interrupts are disabled before invoking it. > >3) Make the interrupt which is dispatched there to be requested with > [devm_]request_any_context_irq(). That also requires that the > NESTED_THREAD flag is set on the irq descriptor. > > That's exactly made for the use case where the dispatching > interrupt can be either threaded or in hard interrupt context. > > But that's lots of churn. > > And because we have so many options, here is the question: > >Is i2c_handle_smbus_host_notify() guaranteed to be called from hard >interrupt handlers (assumed that we use #1 above)? > >I can't tell because there is also i2c_slave_host_notify_cb() which >invokes it and my i2c foo is not good enough to figure that out. > > If that's the case the #1 would be the straight forward solution. If > not, then you want #2 because then the problem will just pop up via the > slave thing and that might be not as trivial to fix as this one. > > If you can answer that, I can send you the proper patch :) tglx suggested moving this to the appropriate mailing lists, so I'mm Cc'ing those. Jean, Wolfram, Benjamin, or someone else, could you please check Thomas' questions above and let us know what you think? I'll copy-paste my attempt to answer this in bugzilla below: ``` As far as I can grep through bus drivers, yes, it is called from hard interrupt handlers only. i2c_handle_smbus_host_notify() is indeed called from i2c_slave_host_notify_cb(), which, in its turn, is set to be called as ->slave_cb() via i2c_slave_event() wrapper only. Also, check [1], slide #9. I'm not sure about that "usually" word though since I couldn't find any examples of "unusual" usage. /* not an i2c guru here either, just looking around the code */ [1] https://elinux.org/images/f/f6/ELCE15-WolframSang-ShinyNewI2CSlaveFramework.pdf ``` and also tglx' follow-up question: ``` The question is whether it's guaranteed under all circumstances including forced irq threading. The i801 driver has assumptions about this, so I wouldn't be surprised if there are more. ``` Thanks. -- Oleksandr Natalenko (post-factum)
Re: scheduling while atomic in z3fold
On Mon, Nov 30, 2020 at 02:20:14PM +0100, Sebastian Andrzej Siewior wrote: > On 2020-11-29 12:41:14 [+0100], Mike Galbraith wrote: > > On Sun, 2020-11-29 at 12:29 +0100, Oleksandr Natalenko wrote: > > > > > > Ummm so do compressors explode under non-rt kernel in your tests as > > > well, or it is just -rt that triggers this? > > > > I only tested a non-rt kernel with z3fold, which worked just fine. > > I tried this and it did not not explode yet. Mike, can you please > confirm? > > diff --git a/mm/z3fold.c b/mm/z3fold.c > index 18feaa0bc5377..0bf70f624a4bd 100644 > --- a/mm/z3fold.c > +++ b/mm/z3fold.c > @@ -642,14 +642,17 @@ static inline void add_to_unbuddied(struct z3fold_pool > *pool, > { > if (zhdr->first_chunks == 0 || zhdr->last_chunks == 0 || > zhdr->middle_chunks == 0) { > - struct list_head *unbuddied = get_cpu_ptr(pool->unbuddied); > - > + struct list_head *unbuddied; > int freechunks = num_free_chunks(zhdr); > + > + migrate_disable(); > + unbuddied = this_cpu_ptr(pool->unbuddied); > + > spin_lock(&pool->lock); > list_add(&zhdr->buddy, &unbuddied[freechunks]); > spin_unlock(&pool->lock); > zhdr->cpu = smp_processor_id(); > - put_cpu_ptr(pool->unbuddied); > + migrate_enable(); > } > } > > @@ -887,7 +890,8 @@ static inline struct z3fold_header *__z3fold_alloc(struct > z3fold_pool *pool, > > lookup: > /* First, try to find an unbuddied z3fold page. */ > - unbuddied = get_cpu_ptr(pool->unbuddied); > + migrate_disable(); > + unbuddied = this_cpu_ptr(pool->unbuddied); > for_each_unbuddied_list(i, chunks) { > struct list_head *l = &unbuddied[i]; > > @@ -905,7 +909,7 @@ static inline struct z3fold_header *__z3fold_alloc(struct > z3fold_pool *pool, > !z3fold_page_trylock(zhdr)) { > spin_unlock(&pool->lock); > zhdr = NULL; > - put_cpu_ptr(pool->unbuddied); > + migrate_enable(); > if (can_sleep) > cond_resched(); > goto lookup; > @@ -919,7 +923,7 @@ static inline struct z3fold_header *__z3fold_alloc(struct > z3fold_pool *pool, > test_bit(PAGE_CLAIMED, &page->private)) { > z3fold_page_unlock(zhdr); > zhdr = NULL; > - put_cpu_ptr(pool->unbuddied); > + migrate_enable(); > if (can_sleep) > cond_resched(); > goto lookup; > @@ -934,7 +938,7 @@ static inline struct z3fold_header *__z3fold_alloc(struct > z3fold_pool *pool, > kref_get(&zhdr->refcount); > break; > } > - put_cpu_ptr(pool->unbuddied); > + migrate_enable(); > > if (!zhdr) { > int cpu; > diff --git a/mm/zswap.c b/mm/zswap.c > index 78a20f7b00f2c..b24f761b9241c 100644 > --- a/mm/zswap.c > +++ b/mm/zswap.c > @@ -394,7 +394,9 @@ struct zswap_comp { > u8 *dstmem; > }; > > -static DEFINE_PER_CPU(struct zswap_comp, zswap_comp); > +static DEFINE_PER_CPU(struct zswap_comp, zswap_comp) = { > + .lock = INIT_LOCAL_LOCK(lock), Is it a residue? > +}; > > static int zswap_dstmem_prepare(unsigned int cpu) > { > > > -Mike > > Sebastian -- Oleksandr Natalenko (post-factum)
Re: scheduling while atomic in z3fold
On Sun, Nov 29, 2020 at 11:56:55AM +0100, Mike Galbraith wrote: > On Sun, 2020-11-29 at 10:21 +0100, Mike Galbraith wrote: > > On Sun, 2020-11-29 at 08:48 +0100, Mike Galbraith wrote: > > > On Sun, 2020-11-29 at 07:41 +0100, Mike Galbraith wrote: > > > > On Sat, 2020-11-28 at 15:27 +0100, Oleksandr Natalenko wrote: > > > > > > > > > > > > Shouldn't the list manipulation be protected with > > > > > > > local_lock+this_cpu_ptr instead of get_cpu_ptr+spin_lock? > > > > > > > > > > Totally untested: > > > > > > > > Hrm, the thing doesn't seem to care deeply about preemption being > > > > disabled, so adding another lock may be overkill. It looks like you > > > > could get the job done via migrate_disable()+this_cpu_ptr(). > > > > > > There is however an ever so tiny chance that I'm wrong about that :) > > > > Or not, your local_lock+this_cpu_ptr version exploded too. > > > > Perhaps there's a bit of non-rt related racy racy going on in zswap > > thingy that makes swap an even less wonderful idea for RT than usual. > > Raciness seems to be restricted to pool compressor. "zbud" seems to be > solid, virgin "zsmalloc" explodes, as does "z3fold" regardless which of > us puts his grubby fingerprints on it. > > Exploding compressors survived zero runs of runltp -f mm, I declared > zbud to be at least kinda sorta stable after box survived five runs. Ummm so do compressors explode under non-rt kernel in your tests as well, or it is just -rt that triggers this? I've never seen that on both -rt and non-rt, thus asking. -- Oleksandr Natalenko (post-factum)
Re: scheduling while atomic in z3fold
On Sat, Nov 28, 2020 at 03:09:24PM +0100, Oleksandr Natalenko wrote: > > While running v5.10-rc5-rt11 I bumped into the following: > > > > ``` > > BUG: scheduling while atomic: git/18695/0x0002 > > Preemption disabled at: > > [] z3fold_zpool_malloc+0x463/0x6e0 > > … > > Call Trace: > > dump_stack+0x6d/0x88 > > __schedule_bug.cold+0x88/0x96 > > __schedule+0x69e/0x8c0 > > preempt_schedule_lock+0x51/0x150 > > rt_spin_lock_slowlock_locked+0x117/0x2c0 > > rt_spin_lock_slowlock+0x58/0x80 > > rt_spin_lock+0x2a/0x40 > > z3fold_zpool_malloc+0x4c1/0x6e0 > > zswap_frontswap_store+0x39c/0x980 > > __frontswap_store+0x6e/0xf0 > > swap_writepage+0x39/0x70 > > shmem_writepage+0x31b/0x490 > > pageout+0xf4/0x350 > > shrink_page_list+0xa28/0xcc0 > > shrink_inactive_list+0x300/0x690 > > shrink_lruvec+0x59a/0x770 > > shrink_node+0x2d6/0x8d0 > > do_try_to_free_pages+0xda/0x530 > > try_to_free_pages+0xff/0x260 > > __alloc_pages_slowpath.constprop.0+0x3d5/0x1230 > > __alloc_pages_nodemask+0x2f6/0x350 > > allocate_slab+0x3da/0x660 > > ___slab_alloc+0x4ff/0x760 > > __slab_alloc.constprop.0+0x7a/0x100 > > kmem_cache_alloc+0x27b/0x2c0 > > __d_alloc+0x22/0x230 > > d_alloc_parallel+0x67/0x5e0 > > __lookup_slow+0x5c/0x150 > > path_lookupat+0x2ea/0x4d0 > > filename_lookup+0xbf/0x210 > > vfs_statx.constprop.0+0x4d/0x110 > > __do_sys_newlstat+0x3d/0x80 > > do_syscall_64+0x33/0x40 > > entry_SYSCALL_64_after_hwframe+0x44/0xa9 > > ``` > > > > The preemption seems to be disabled here: > > > > ``` > > $ scripts/faddr2line mm/z3fold.o z3fold_zpool_malloc+0x463 > > z3fold_zpool_malloc+0x463/0x6e0: > > add_to_unbuddied at mm/z3fold.c:645 > > (inlined by) z3fold_alloc at mm/z3fold.c:1195 > > (inlined by) z3fold_zpool_malloc at mm/z3fold.c:1737 > > ``` > > > > The call to the rt_spin_lock() seems to be here: > > > > ``` > > $ scripts/faddr2line mm/z3fold.o z3fold_zpool_malloc+0x4c1 > > z3fold_zpool_malloc+0x4c1/0x6e0: > > add_to_unbuddied at mm/z3fold.c:649 > > (inlined by) z3fold_alloc at mm/z3fold.c:1195 > > (inlined by) z3fold_zpool_malloc at mm/z3fold.c:1737 > > ``` > > > > Or, in source code: > > > > ``` > > 639 /* Add to the appropriate unbuddied list */ > > 640 static inline void add_to_unbuddied(struct z3fold_pool *pool, > > 641 struct z3fold_header *zhdr) > > 642 { > > 643 if (zhdr->first_chunks == 0 || zhdr->last_chunks == 0 || > > 644 zhdr->middle_chunks == 0) { > > 645 struct list_head *unbuddied = get_cpu_ptr(pool->unbuddied); > > 646 > > 647 int freechunks = num_free_chunks(zhdr); > > 648 spin_lock(&pool->lock); > > 649 list_add(&zhdr->buddy, &unbuddied[freechunks]); > > 650 spin_unlock(&pool->lock); > > 651 zhdr->cpu = smp_processor_id(); > > 652 put_cpu_ptr(pool->unbuddied); > > 653 } > > 654 } > > ``` > > > > Shouldn't the list manipulation be protected with > > local_lock+this_cpu_ptr instead of get_cpu_ptr+spin_lock? Totally untested: ``` diff --git a/mm/z3fold.c b/mm/z3fold.c index 18feaa0bc537..53fcb80c6167 100644 --- a/mm/z3fold.c +++ b/mm/z3fold.c @@ -41,6 +41,7 @@ #include #include #include +#include #include #include #include @@ -156,6 +157,7 @@ struct z3fold_pool { const char *name; spinlock_t lock; spinlock_t stale_lock; + local_lock_t llock; struct list_head *unbuddied; struct list_head lru; struct list_head stale; @@ -642,14 +644,17 @@ static inline void add_to_unbuddied(struct z3fold_pool *pool, { if (zhdr->first_chunks == 0 || zhdr->last_chunks == 0 || zhdr->middle_chunks == 0) { - struct list_head *unbuddied = get_cpu_ptr(pool->unbuddied); + struct list_head *unbuddied; + int freechunks; + local_lock(&pool->llock); + unbuddied = *this_cpu_ptr(&pool->unbuddied); - int freechunks = num_free_chunks(zhdr); + freechunks = num_free_chunks(zhdr); spin_lock(&pool->lock); list_add(&zhdr->buddy, &unbuddied[freechunks]); spin_unlock(&pool->lock); zhdr->cpu = smp_processor_id(); - put_cpu_ptr(pool->unbuddied); + local_unlock(&pool->llock); } } @@ -
Re: [ANNOUNCE] v5.10-rc5-rt11
-- a/kernel/Kconfig.preempt > +++ b/kernel/Kconfig.preempt > @@ -65,6 +65,7 @@ config PREEMPT_RT > bool "Fully Preemptible Kernel (Real-Time)" > depends on EXPERT && ARCH_SUPPORTS_RT > select PREEMPTION > + select RT_MUTEXES > help > This option turns the kernel into a real-time kernel by replacing > various locking primitives (spinlocks, rwlocks, etc.) with > diff --git a/localversion-rt b/localversion-rt > index d79dde624aaac..05c35cb580779 100644 > --- a/localversion-rt > +++ b/localversion-rt > @@ -1 +1 @@ > --rt10 > +-rt11 -- Oleksandr Natalenko (post-factum)
Re: scheduling while atomic in z3fold
On Sat, Nov 28, 2020 at 03:05:24PM +0100, Oleksandr Natalenko wrote: > Hi. > > While running v5.10-rc5-rt11 I bumped into the following: > > ``` > BUG: scheduling while atomic: git/18695/0x0002 > Preemption disabled at: > [] z3fold_zpool_malloc+0x463/0x6e0 > … > Call Trace: > dump_stack+0x6d/0x88 > __schedule_bug.cold+0x88/0x96 > __schedule+0x69e/0x8c0 > preempt_schedule_lock+0x51/0x150 > rt_spin_lock_slowlock_locked+0x117/0x2c0 > rt_spin_lock_slowlock+0x58/0x80 > rt_spin_lock+0x2a/0x40 > z3fold_zpool_malloc+0x4c1/0x6e0 > zswap_frontswap_store+0x39c/0x980 > __frontswap_store+0x6e/0xf0 > swap_writepage+0x39/0x70 > shmem_writepage+0x31b/0x490 > pageout+0xf4/0x350 > shrink_page_list+0xa28/0xcc0 > shrink_inactive_list+0x300/0x690 > shrink_lruvec+0x59a/0x770 > shrink_node+0x2d6/0x8d0 > do_try_to_free_pages+0xda/0x530 > try_to_free_pages+0xff/0x260 > __alloc_pages_slowpath.constprop.0+0x3d5/0x1230 > __alloc_pages_nodemask+0x2f6/0x350 > allocate_slab+0x3da/0x660 > ___slab_alloc+0x4ff/0x760 > __slab_alloc.constprop.0+0x7a/0x100 > kmem_cache_alloc+0x27b/0x2c0 > __d_alloc+0x22/0x230 > d_alloc_parallel+0x67/0x5e0 > __lookup_slow+0x5c/0x150 > path_lookupat+0x2ea/0x4d0 > filename_lookup+0xbf/0x210 > vfs_statx.constprop.0+0x4d/0x110 > __do_sys_newlstat+0x3d/0x80 > do_syscall_64+0x33/0x40 > entry_SYSCALL_64_after_hwframe+0x44/0xa9 > ``` > > The preemption seems to be disabled here: > > ``` > $ scripts/faddr2line mm/z3fold.o z3fold_zpool_malloc+0x463 > z3fold_zpool_malloc+0x463/0x6e0: > add_to_unbuddied at mm/z3fold.c:645 > (inlined by) z3fold_alloc at mm/z3fold.c:1195 > (inlined by) z3fold_zpool_malloc at mm/z3fold.c:1737 > ``` > > The call to the rt_spin_lock() seems to be here: > > ``` > $ scripts/faddr2line mm/z3fold.o z3fold_zpool_malloc+0x4c1 > z3fold_zpool_malloc+0x4c1/0x6e0: > add_to_unbuddied at mm/z3fold.c:649 > (inlined by) z3fold_alloc at mm/z3fold.c:1195 > (inlined by) z3fold_zpool_malloc at mm/z3fold.c:1737 > ``` > > Or, in source code: > > ``` > 639 /* Add to the appropriate unbuddied list */ > 640 static inline void add_to_unbuddied(struct z3fold_pool *pool, > 641 struct z3fold_header *zhdr) > 642 { > 643 if (zhdr->first_chunks == 0 || zhdr->last_chunks == 0 || > 644 zhdr->middle_chunks == 0) { > 645 struct list_head *unbuddied = get_cpu_ptr(pool->unbuddied); > 646 > 647 int freechunks = num_free_chunks(zhdr); > 648 spin_lock(&pool->lock); > 649 list_add(&zhdr->buddy, &unbuddied[freechunks]); > 650 spin_unlock(&pool->lock); > 651 zhdr->cpu = smp_processor_id(); > 652 put_cpu_ptr(pool->unbuddied); > 653 } > 654 } > ``` > > Shouldn't the list manipulation be protected with > local_lock+this_cpu_ptr instead of get_cpu_ptr+spin_lock? > > Thanks. > > -- > Oleksandr Natalenko (post-factum) Forgot to Cc linux-rt-users@, sorry. -- Oleksandr Natalenko (post-factum)
scheduling while atomic in z3fold
Hi. While running v5.10-rc5-rt11 I bumped into the following: ``` BUG: scheduling while atomic: git/18695/0x0002 Preemption disabled at: [] z3fold_zpool_malloc+0x463/0x6e0 … Call Trace: dump_stack+0x6d/0x88 __schedule_bug.cold+0x88/0x96 __schedule+0x69e/0x8c0 preempt_schedule_lock+0x51/0x150 rt_spin_lock_slowlock_locked+0x117/0x2c0 rt_spin_lock_slowlock+0x58/0x80 rt_spin_lock+0x2a/0x40 z3fold_zpool_malloc+0x4c1/0x6e0 zswap_frontswap_store+0x39c/0x980 __frontswap_store+0x6e/0xf0 swap_writepage+0x39/0x70 shmem_writepage+0x31b/0x490 pageout+0xf4/0x350 shrink_page_list+0xa28/0xcc0 shrink_inactive_list+0x300/0x690 shrink_lruvec+0x59a/0x770 shrink_node+0x2d6/0x8d0 do_try_to_free_pages+0xda/0x530 try_to_free_pages+0xff/0x260 __alloc_pages_slowpath.constprop.0+0x3d5/0x1230 __alloc_pages_nodemask+0x2f6/0x350 allocate_slab+0x3da/0x660 ___slab_alloc+0x4ff/0x760 __slab_alloc.constprop.0+0x7a/0x100 kmem_cache_alloc+0x27b/0x2c0 __d_alloc+0x22/0x230 d_alloc_parallel+0x67/0x5e0 __lookup_slow+0x5c/0x150 path_lookupat+0x2ea/0x4d0 filename_lookup+0xbf/0x210 vfs_statx.constprop.0+0x4d/0x110 __do_sys_newlstat+0x3d/0x80 do_syscall_64+0x33/0x40 entry_SYSCALL_64_after_hwframe+0x44/0xa9 ``` The preemption seems to be disabled here: ``` $ scripts/faddr2line mm/z3fold.o z3fold_zpool_malloc+0x463 z3fold_zpool_malloc+0x463/0x6e0: add_to_unbuddied at mm/z3fold.c:645 (inlined by) z3fold_alloc at mm/z3fold.c:1195 (inlined by) z3fold_zpool_malloc at mm/z3fold.c:1737 ``` The call to the rt_spin_lock() seems to be here: ``` $ scripts/faddr2line mm/z3fold.o z3fold_zpool_malloc+0x4c1 z3fold_zpool_malloc+0x4c1/0x6e0: add_to_unbuddied at mm/z3fold.c:649 (inlined by) z3fold_alloc at mm/z3fold.c:1195 (inlined by) z3fold_zpool_malloc at mm/z3fold.c:1737 ``` Or, in source code: ``` 639 /* Add to the appropriate unbuddied list */ 640 static inline void add_to_unbuddied(struct z3fold_pool *pool, 641 struct z3fold_header *zhdr) 642 { 643 if (zhdr->first_chunks == 0 || zhdr->last_chunks == 0 || 644 zhdr->middle_chunks == 0) { 645 struct list_head *unbuddied = get_cpu_ptr(pool->unbuddied); 646 647 int freechunks = num_free_chunks(zhdr); 648 spin_lock(&pool->lock); 649 list_add(&zhdr->buddy, &unbuddied[freechunks]); 650 spin_unlock(&pool->lock); 651 zhdr->cpu = smp_processor_id(); 652 put_cpu_ptr(pool->unbuddied); 653 } 654 } ``` Shouldn't the list manipulation be protected with local_lock+this_cpu_ptr instead of get_cpu_ptr+spin_lock? Thanks. -- Oleksandr Natalenko (post-factum)
Re: Oops (probably) unmounting /oldroot/firmware/efi/efivars.
Hello. On 25.11.2020 09:32, Greg Kroah-Hartman wrote: On Tue, Nov 24, 2020 at 10:24:27PM +0100, Oleksandr Natalenko wrote: Hi. On 24.11.2020 15:23, Ard Biesheuvel wrote: > Surely caused by > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/fs/efivarfs?id=fe5186cf12e30facfe261e9be6c7904a170bd822 This also soaked through the stable queue into v5.9.11, and now the same BUG is triggered in the latest stable kernel. /cc Greg cc: me for what? /me has no context as to what to do here... This was a precursor to https://lore.kernel.org/stable/x74vbej49scpv...@kroah.com/ and you already jumped in there. Thanks. -- Oleksandr Natalenko (post-factum)
Re: Oops (probably) unmounting /oldroot/firmware/efi/efivars.
Hi. On 24.11.2020 15:23, Ard Biesheuvel wrote: Surely caused by https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/fs/efivarfs?id=fe5186cf12e30facfe261e9be6c7904a170bd822 This also soaked through the stable queue into v5.9.11, and now the same BUG is triggered in the latest stable kernel. /cc Greg -- Oleksandr Natalenko (post-factum)
Re: WARNING at kernel/sched/core.c:2013 migration_cpu_stop+0x2e3/0x330
On 16.11.2020 11:31, Valentin Schneider wrote: On 16/11/20 10:27, Oleksandr Natalenko wrote: Hi. [...] Not sure whether the check is legitimate, but FWIW I've managed to put a test task [1] (it spawns a lot of threads and applies affinity) into a permanent unkillable D state here: ``` [<0>] affine_move_task+0x2d3/0x620 [<0>] __set_cpus_allowed_ptr+0x164/0x210 [<0>] sched_setaffinity+0x21a/0x300 [<0>] __x64_sys_sched_setaffinity+0x8c/0xc0 [<0>] do_syscall_64+0x33/0x40 [<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9 ``` [...] I'm not positive about this being directly related to the original report, but I think it is still worth mentioning. Aye, thanks, that one should be fixed by: https://lore.kernel.org/r/20201113112414.2569-1-valentin.schnei...@arm.com Great, thanks, I'll pick it up now. Thanks. [1] https://gitlab.com/post-factum/burn_scheduler -- Oleksandr Natalenko (post-factum)
Re: WARNING at kernel/sched/core.c:2013 migration_cpu_stop+0x2e3/0x330
On 16.11.2020 11:27, Oleksandr Natalenko wrote: Not sure whether the check is legitimate, but FWIW I've managed to put a test task [1] (it spawns a lot of threads and applies affinity) into a permanent unkillable D state here: Just to make it clear: I neither applied your patch nor saw a splat this time, just a D state only. -- Oleksandr Natalenko (post-factum)
Re: WARNING at kernel/sched/core.c:2013 migration_cpu_stop+0x2e3/0x330
Hi. On 16.11.2020 11:00, Valentin Schneider wrote: On 15/11/20 22:32, Oleksandr Natalenko wrote: I'm running v5.10-rc3-rt7 for some time, and I came across this splat in dmesg: ``` [118769.951010] [ cut here ] [118769.951013] WARNING: CPU: 19 PID: 146 at kernel/sched/core.c:2013 Err, I didn't pick up on this back then, but isn't that check bogus? If the task is enqueued elsewhere, it's valid for it not to be affined 'here'. Also that is_migration_disabled() check within is_cpu_allowed() makes me think this isn't the best thing to call on a remote task. --- diff --git a/kernel/sched/core.c b/kernel/sched/core.c index 1218f3ce1713..47d5b677585f 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -2010,7 +2010,7 @@ static int migration_cpu_stop(void *data) * valid again. Nothing to do. */ if (!pending) { - WARN_ON_ONCE(!is_cpu_allowed(p, cpu_of(rq))); + WARN_ON_ONCE(!cpumask_test_cpu(task_cpu(p), p->cpus_ptr)); goto out; } Not sure whether the check is legitimate, but FWIW I've managed to put a test task [1] (it spawns a lot of threads and applies affinity) into a permanent unkillable D state here: ``` [<0>] affine_move_task+0x2d3/0x620 [<0>] __set_cpus_allowed_ptr+0x164/0x210 [<0>] sched_setaffinity+0x21a/0x300 [<0>] __x64_sys_sched_setaffinity+0x8c/0xc0 [<0>] do_syscall_64+0x33/0x40 [<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9 ``` I think this corresponds to something around here: ``` $ scripts/faddr2line kernel/sched/core.o affine_move_task+0x2d3 affine_move_task+0x2d3/0x620: arch_atomic_fetch_sub at /home/pf/linux-pf-edge/src/linux-5.10/./arch/x86/include/asm/atomic.h:190 (inlined by) atomic_fetch_sub_release at /home/pf/linux-pf-edge/src/linux-5.10/./include/asm-generic/atomic-instrumented.h:221 (inlined by) __refcount_sub_and_test at /home/pf/linux-pf-edge/src/linux-5.10/./include/linux/refcount.h:272 (inlined by) __refcount_dec_and_test at /home/pf/linux-pf-edge/src/linux-5.10/./include/linux/refcount.h:315 (inlined by) refcount_dec_and_test at /home/pf/linux-pf-edge/src/linux-5.10/./include/linux/refcount.h:333 (inlined by) affine_move_task at /home/pf/linux-pf-edge/src/linux-5.10/kernel/sched/core.c:2334 ``` or: ``` 2332 wait_for_completion(&pending->done); 2333 2334 if (refcount_dec_and_test(&pending->refs)) 2335 wake_up_var(&pending->refs); ``` I'm not positive about this being directly related to the original report, but I think it is still worth mentioning. Thanks. [1] https://gitlab.com/post-factum/burn_scheduler -- Oleksandr Natalenko (post-factum)
WARNING at kernel/sched/core.c:2013 migration_cpu_stop+0x2e3/0x330
P: a13d0eee99c0 R08: 002f R09: a13d0ef29af0 [118769.951124] R10: 00ec R11: 016a R12: b58c9118fdd0 [118769.951124] R13: R14: a136e7a9c820 R15: a136e7a9bf80 [118769.951126] ? set_cpus_allowed_ptr+0x10/0x10 [118769.951127] cpu_stopper_thread+0x89/0x130 [118769.951128] ? smpboot_register_percpu_thread+0xe0/0xe0 [118769.951129] smpboot_thread_fn+0x1d8/0x2c0 [118769.951130] kthread+0x190/0x1b0 [118769.951130] ? __kthread_init_worker+0x50/0x50 [118769.951131] ret_from_fork+0x22/0x30 [118769.951133] ---[ end trace 0002 ]--- ``` which corresponds to the following condition: ``` 2007 /* 2008 * When this was migrate_enable() but we no longer have an 2009 * @pending, a concurrent SCA 'fixed' things and we should be 2010 * valid again. Nothing to do. 2011 */ 2012 if (!pending) { 2013 WARN_ON_ONCE(!is_cpu_allowed(p, cpu_of(rq))); 2014 goto out; 2015 } ``` I'm not sure what triggered this, and the system still looks usable afterwards. I have no idea how to trigger it again ATM, so this is just a heads up in case you know what could go wrong. Thanks. -- Oleksandr Natalenko (post-factum)
Re: bcachefs-for-review
is going to to be bcachefs's killer feature (or at least one of them), and I'm pretty excited about it: it's a completely new approach unlike ZFS and btrfs, no write hole (we don't update existing stripes in place) and we don't have to fragment writes either like ZFS does. Add to that the caching that we already do and it's turning into a pretty amazing tool for managing a whole bunch of mixed storage. -- Oleksandr Natalenko (post-factum)
Re: [PATCH] bfq: fix blkio cgroup leakage v4
Hello. On 11.08.2020 08:43, Dmitry Monakhov wrote: Changes from v1: - update commit description with proper ref-accounting justification commit db37a34c563b ("block, bfq: get a ref to a group when adding it to a service tree") introduce leak forbfq_group and blkcg_gq objects because of get/put imbalance. In fact whole idea of original commit is wrong because bfq_group entity can not dissapear under us because it is referenced by child bfq_queue's entities from here: -> bfq_init_entity() ->bfqg_and_blkg_get(bfqg); ->entity->parent = bfqg->my_entity -> bfq_put_queue(bfqq) FINAL_PUT ->bfqg_and_blkg_put(bfqq_group(bfqq)) ->kmem_cache_free(bfq_pool, bfqq); So parent entity can not disappear while child entity is in tree, and child entities already has proper protection. This patch revert commit db37a34c563b ("block, bfq: get a ref to a group when adding it to a service tree") bfq_group leak trace caused by bad commit: -> blkg_alloc -> bfq_pq_alloc -> bfqg_get (+1) ->bfq_activate_bfqq ->bfq_activate_requeue_entity -> __bfq_activate_entity ->bfq_get_entity ->bfqg_and_blkg_get (+1) < : Note1 ->bfq_del_bfqq_busy ->bfq_deactivate_entity+0x53/0xc0 [bfq] ->__bfq_deactivate_entity+0x1b8/0x210 [bfq] -> bfq_forget_entity(is_in_service = true) entity->on_st_or_in_serv = false <=== :Note2 if (is_in_service) return; ==> do not touch reference -> blkcg_css_offline -> blkcg_destroy_blkgs -> blkg_destroy -> bfq_pd_offline -> __bfq_deactivate_entity if (!entity->on_st_or_in_serv) /* true, because (Note2) return false; -> bfq_pd_free -> bfqg_put() (-1, byt bfqg->ref == 2) because of (Note2) So bfq_group and blkcg_gq will leak forever, see test-case below. ##TESTCASE_BEGIN: #!/bin/bash max_iters=${1:-100} #prep cgroup mounts mount -t tmpfs cgroup_root /sys/fs/cgroup mkdir /sys/fs/cgroup/blkio mount -t cgroup -o blkio none /sys/fs/cgroup/blkio # Prepare blkdev grep blkio /proc/cgroups truncate -s 1M img losetup /dev/loop0 img echo bfq > /sys/block/loop0/queue/scheduler grep blkio /proc/cgroups for ((i=0;i /sys/fs/cgroup/blkio/a/cgroup.procs dd if=/dev/loop0 bs=4k count=1 of=/dev/null iflag=direct 2> /dev/null echo 0 > /sys/fs/cgroup/blkio/cgroup.procs rmdir /sys/fs/cgroup/blkio/a grep blkio /proc/cgroups done ##TESTCASE_END: Signed-off-by: Dmitry Monakhov --- block/bfq-cgroup.c | 2 +- block/bfq-iosched.h | 1 - block/bfq-wf2q.c| 12 ++-- 3 files changed, 3 insertions(+), 12 deletions(-) diff --git a/block/bfq-cgroup.c b/block/bfq-cgroup.c index 68882b9..b791e20 100644 --- a/block/bfq-cgroup.c +++ b/block/bfq-cgroup.c @@ -332,7 +332,7 @@ static void bfqg_put(struct bfq_group *bfqg) kfree(bfqg); } -void bfqg_and_blkg_get(struct bfq_group *bfqg) +static void bfqg_and_blkg_get(struct bfq_group *bfqg) { /* see comments in bfq_bic_update_cgroup for why refcounting bfqg */ bfqg_get(bfqg); diff --git a/block/bfq-iosched.h b/block/bfq-iosched.h index cd224aa..7038952 100644 --- a/block/bfq-iosched.h +++ b/block/bfq-iosched.h @@ -986,7 +986,6 @@ struct bfq_group *bfq_find_set_group(struct bfq_data *bfqd, struct blkcg_gq *bfqg_to_blkg(struct bfq_group *bfqg); struct bfq_group *bfqq_group(struct bfq_queue *bfqq); struct bfq_group *bfq_create_group_hierarchy(struct bfq_data *bfqd, int node); -void bfqg_and_blkg_get(struct bfq_group *bfqg); void bfqg_and_blkg_put(struct bfq_group *bfqg); #ifdef CONFIG_BFQ_GROUP_IOSCHED diff --git a/block/bfq-wf2q.c b/block/bfq-wf2q.c index eb0e2a6..26776bd 100644 --- a/block/bfq-wf2q.c +++ b/block/bfq-wf2q.c @@ -533,9 +533,7 @@ static void bfq_get_entity(struct bfq_entity *entity) bfqq->ref++; bfq_log_bfqq(bfqq->bfqd, bfqq, "get_entity: %p %d", bfqq, bfqq->ref); - } else - bfqg_and_blkg_get(container_of(entity, struct bfq_group, - entity)); + } } /** @@ -649,14 +647,8 @@ static void bfq_forget_entity(struct bfq_service_tree *st, entity->on_st_or_in_serv = false; st->wsum -= entity->weight; - if (is_in_service) - return; - - if (bfqq) + if (bfqq && !is_in_service) bfq_put_queue(bfqq); - else - bfqg_and_blkg_put(container_of(entity, struct bfq_group, - entity)); } /** No crashes reported this time, at least so far. Thanks. -- Oleksandr Natalenko (post-factum)
Re: [PATCH] block: bfq fix blkio cgroup leakage v3
On 27.07.2020 10:01, Dmitry Monakhov wrote: commit db37a34c563b ("block, bfq: get a ref to a group when adding it to a service tree") introduce leak forbfq_group and blkcg_gq objects because of get/put imbalance. See trace balow: -> blkg_alloc -> bfq_pq_alloc -> bfqg_get (+1) ->bfq_activate_bfqq ->bfq_activate_requeue_entity -> __bfq_activate_entity ->bfq_get_entity ->bfqg_and_blkg_get (+1) < : Note1 ->bfq_del_bfqq_busy ->bfq_deactivate_entity+0x53/0xc0 [bfq] ->__bfq_deactivate_entity+0x1b8/0x210 [bfq] -> bfq_forget_entity(is_in_service = true) entity->on_st_or_in_serv = false <=== :Note2 if (is_in_service) return; ==> do not touch reference -> blkcg_css_offline -> blkcg_destroy_blkgs -> blkg_destroy -> bfq_pd_offline -> __bfq_deactivate_entity if (!entity->on_st_or_in_serv) /* true, because (Note2) return false; -> bfq_pd_free -> bfqg_put() (-1, byt bfqg->ref == 2) because of (Note2) So bfq_group and blkcg_gq will leak forever, see test-case below. We should drop group reference once it finaly removed from service inside __bfq_bfqd_reset_in_service, as we do with queue entities. ##TESTCASE_BEGIN: #!/bin/bash max_iters=${1:-100} #prep cgroup mounts mount -t tmpfs cgroup_root /sys/fs/cgroup mkdir /sys/fs/cgroup/blkio mount -t cgroup -o blkio none /sys/fs/cgroup/blkio # Prepare blkdev grep blkio /proc/cgroups truncate -s 1M img losetup /dev/loop0 img echo bfq > /sys/block/loop0/queue/scheduler grep blkio /proc/cgroups for ((i=0;i /sys/fs/cgroup/blkio/a/cgroup.procs dd if=/dev/loop0 bs=4k count=1 of=/dev/null iflag=direct 2> /dev/null echo 0 > /sys/fs/cgroup/blkio/cgroup.procs rmdir /sys/fs/cgroup/blkio/a grep blkio /proc/cgroups done ##TESTCASE_END: changes since v2: - use safe iteration macro to prevent freed object dereference. Signed-off-by: Dmitry Monakhov --- block/bfq-wf2q.c | 24 ++-- 1 file changed, 14 insertions(+), 10 deletions(-) diff --git a/block/bfq-wf2q.c b/block/bfq-wf2q.c index 8113138..13b417a 100644 --- a/block/bfq-wf2q.c +++ b/block/bfq-wf2q.c @@ -635,14 +635,10 @@ static void bfq_idle_insert(struct bfq_service_tree *st, * @entity: the entity being removed. * @is_in_service: true if entity is currently the in-service entity. * - * Forget everything about @entity. In addition, if entity represents - * a queue, and the latter is not in service, then release the service - * reference to the queue (the one taken through bfq_get_entity). In - * fact, in this case, there is really no more service reference to - * the queue, as the latter is also outside any service tree. If, - * instead, the queue is in service, then __bfq_bfqd_reset_in_service - * will take care of putting the reference when the queue finally - * stops being served. + * Forget everything about @entity. If entity is not in service, then release + * the service reference to the entity (the one taken through bfq_get_entity). + * If the entity is in service, then __bfq_bfqd_reset_in_service will take care + * of putting the reference when the entity finally stops being served. */ static void bfq_forget_entity(struct bfq_service_tree *st, struct bfq_entity *entity, @@ -1615,6 +1611,7 @@ bool __bfq_bfqd_reset_in_service(struct bfq_data *bfqd) struct bfq_queue *in_serv_bfqq = bfqd->in_service_queue; struct bfq_entity *in_serv_entity = &in_serv_bfqq->entity; struct bfq_entity *entity = in_serv_entity; + struct bfq_entity *parent = NULL; bfq_clear_bfqq_wait_request(in_serv_bfqq); hrtimer_try_to_cancel(&bfqd->idle_slice_timer); @@ -1626,9 +1623,16 @@ bool __bfq_bfqd_reset_in_service(struct bfq_data *bfqd) * execute the final step: reset in_service_entity along the * path from entity to the root. */ - for_each_entity(entity) + for_each_entity_safe(entity, parent) { entity->sched_data->in_service_entity = NULL; - + /* +* Release bfq_groups reference if it was not released in +* bfq_forget_entity, which was taken in bfq_get_entity. +*/ + if (!bfq_entity_to_bfqq(entity) && !entity->on_st_or_in_serv) + bfqg_and_blkg_put(container_of(entity, struct bfq_group, + entity)); + } /* * in_serv_entity is no longer in service, so, if it is in no * service tree either, then release the service reference to Reportedly, this one crashes too [1], and this happens even earlier than with v2. [1] http://pix.academ.info/images/img/2020/07/27/91f656514707728730b0b67f8c9f4a04.jpg -- Oleksandr Natalenko (post-factum)
Re: [PATCH] block: bfq fix blkio cgroup leakage v2
Hello. On 20.07.2020 19:04, Dmitry Monakhov wrote: commit db37a34c563b ("block, bfq: get a ref to a group when adding it to a service tree") introduce leak forbfq_group and blkcg_gq objects because of get/put imbalance. See trace balow: -> blkg_alloc -> bfq_pq_alloc -> bfqg_get (+1) ->bfq_activate_bfqq ->bfq_activate_requeue_entity -> __bfq_activate_entity ->bfq_get_entity ->bfqg_and_blkg_get (+1) < : Note1 ->bfq_del_bfqq_busy ->bfq_deactivate_entity+0x53/0xc0 [bfq] ->__bfq_deactivate_entity+0x1b8/0x210 [bfq] -> bfq_forget_entity(is_in_service = true) entity->on_st_or_in_serv = false <=== :Note2 if (is_in_service) return; ==> do not touch reference -> blkcg_css_offline -> blkcg_destroy_blkgs -> blkg_destroy -> bfq_pd_offline -> __bfq_deactivate_entity if (!entity->on_st_or_in_serv) /* true, because (Note2) return false; -> bfq_pd_free -> bfqg_put() (-1, byt bfqg->ref == 2) because of (Note2) So bfq_group and blkcg_gq will leak forever, see test-case below. We should drop group reference once it finaly removed from service inside __bfq_bfqd_reset_in_service, as we do with queue entities. ##TESTCASE_BEGIN: #!/bin/bash max_iters=${1:-100} #prep cgroup mounts mount -t tmpfs cgroup_root /sys/fs/cgroup mkdir /sys/fs/cgroup/blkio mount -t cgroup -o blkio none /sys/fs/cgroup/blkio # Prepare blkdev grep blkio /proc/cgroups truncate -s 1M img losetup /dev/loop0 img echo bfq > /sys/block/loop0/queue/scheduler grep blkio /proc/cgroups for ((i=0;i /sys/fs/cgroup/blkio/a/cgroup.procs dd if=/dev/loop0 bs=4k count=1 of=/dev/null iflag=direct 2> /dev/null echo 0 > /sys/fs/cgroup/blkio/cgroup.procs rmdir /sys/fs/cgroup/blkio/a grep blkio /proc/cgroups done ##TESTCASE_END: Signed-off-by: Dmitry Monakhov --- block/bfq-wf2q.c | 23 +-- 1 file changed, 13 insertions(+), 10 deletions(-) diff --git a/block/bfq-wf2q.c b/block/bfq-wf2q.c index 8113138..93b236c 100644 --- a/block/bfq-wf2q.c +++ b/block/bfq-wf2q.c @@ -635,14 +635,10 @@ static void bfq_idle_insert(struct bfq_service_tree *st, * @entity: the entity being removed. * @is_in_service: true if entity is currently the in-service entity. * - * Forget everything about @entity. In addition, if entity represents - * a queue, and the latter is not in service, then release the service - * reference to the queue (the one taken through bfq_get_entity). In - * fact, in this case, there is really no more service reference to - * the queue, as the latter is also outside any service tree. If, - * instead, the queue is in service, then __bfq_bfqd_reset_in_service - * will take care of putting the reference when the queue finally - * stops being served. + * Forget everything about @entity. If entity is not in service, then release + * the service reference to the entity (the one taken through bfq_get_entity). + * If the entity is in service, then __bfq_bfqd_reset_in_service will take care + * of putting the reference when the entity finally stops being served. */ static void bfq_forget_entity(struct bfq_service_tree *st, struct bfq_entity *entity, @@ -1626,9 +1622,16 @@ bool __bfq_bfqd_reset_in_service(struct bfq_data *bfqd) * execute the final step: reset in_service_entity along the * path from entity to the root. */ - for_each_entity(entity) + for_each_entity(entity) { entity->sched_data->in_service_entity = NULL; - + /* +* Release bfq_groups reference if it was not released in +* bfq_forget_entity, which was taken in bfq_get_entity. +*/ + if (!bfq_entity_to_bfqq(entity) && !entity->on_st_or_in_serv) + bfqg_and_blkg_put(container_of(entity, struct bfq_group, + entity)); + } /* * in_serv_entity is no longer in service, so, if it is in no * service tree either, then release the service reference to As reported by one of my customers, this patch causes the following crash: [1] [1] http://pix.academ.info/images/img/2020/07/26/52d097c02b6061657443bba92de75e8a.jpg -- Oleksandr Natalenko (post-factum)
Re: [RFC v2 0/4] futex2: Add new futex interface
Hello. On 09.07.2020 19:59, André Almeida wrote: This RFC is a followup to the previous discussion initiated from my last patch "futex: Implement mechanism to wait on any of several futexes"[1]. As stated in the thread, the correct approach to move forward with the wait multiple operation would be to create a new syscall that would have all new cool features. The first patch adds the new interface and just translate the call for the old interface, without implementing new features. The goal here is to establish the interface and to check if everyone is happy with this API. The rest of patches are selftests to show the interface in action. I have the following questions: - What suggestions do you have to implement this? Start from scratch or reuse the most code possible? - The interface seems correct and implements the requirements asked by you? Those are the cool new features that this syscall should address some day: - Operate with variable bit size futexes, not restricted to 32: 8, 16 and 64 - Wait on multiple futexes, using the following semantics: struct futex_wait { void *uaddr; unsigned long val; unsigned long flags; }; sys_futex_waitv(struct futex_wait *waiters, unsigned int nr_waiters, unsigned long flags, struct __kernel_timespec *timo); - Have NUMA optimizations: if FUTEX_NUMA_FLAG is set, the `void *uaddr` argument won't be a value of type u{8, 16, 32, 64} anymore, but a struct containing a NUMA node hint: struct futex32_numa { u32 value __attribute__ ((aligned (8))); u32 hint; }; struct futex64_numa { u64 value __attribute__ ((aligned (16))); u64 hint; }; Thanks, André Changes since v1: - The timeout argument now uses __kernel_timespec as type - time32 interface was removed v1: https://lore.kernel.org/patchwork/cover/1255437/ [1] https://lore.kernel.org/patchwork/patch/1194339/ André Almeida (4): futex2: Add new futex interface selftests: futex: Add futex2 wake/wait test selftests: futex: Add futex2 timeout test selftests: futex: Add futex2 wouldblock test MAINTAINERS | 2 +- arch/x86/entry/syscalls/syscall_32.tbl| 2 + arch/x86/entry/syscalls/syscall_64.tbl| 2 + include/linux/syscalls.h | 7 ++ include/uapi/asm-generic/unistd.h | 8 +- include/uapi/linux/futex.h| 10 ++ init/Kconfig | 7 ++ kernel/Makefile | 1 + kernel/futex2.c | 73 kernel/sys_ni.c | 4 + tools/include/uapi/asm-generic/unistd.h | 7 +- .../selftests/futex/functional/.gitignore | 1 + .../selftests/futex/functional/Makefile | 4 +- .../selftests/futex/functional/futex2_wait.c | 111 ++ .../futex/functional/futex_wait_timeout.c | 38 -- .../futex/functional/futex_wait_wouldblock.c | 33 +- .../testing/selftests/futex/functional/run.sh | 3 + .../selftests/futex/include/futex2test.h | 77 18 files changed, 373 insertions(+), 17 deletions(-) create mode 100644 kernel/futex2.c create mode 100644 tools/testing/selftests/futex/functional/futex2_wait.c create mode 100644 tools/testing/selftests/futex/include/futex2test.h What branch/tag this submission is based on please? It seems it is not a 5.8 but rather 5.7 since the second patch misses faccessat2() syscall and fails to be applied cleanly. Thanks. -- Oleksandr Natalenko (post-factum)
Re: [PATCH v8 0/4] introduce memory hinting API for external process
4 +- > kernel/exit.c | 17 -- > kernel/pid.c| 17 ++ > kernel/sys_ni.c | 2 + > mm/madvise.c| 190 +--- > 28 files changed, 217 insertions(+), 46 deletions(-) > > -- > 2.27.0.111.gc72c7da667-goog > -- Best regards, Oleksandr Natalenko (post-factum) Principal Software Maintenance Engineer
Re: [PATCH v8 0/4] introduce memory hinting API for external process
Hello. On Mon, Jun 22, 2020 at 12:36:35PM -0700, Minchan Kim wrote: > On Mon, Jun 22, 2020 at 12:28:56PM -0700, Minchan Kim wrote: > > Now, we have MADV_PAGEOUT and MADV_COLD as madvise hinting API. With that, > > application could give hints to kernel what memory range are preferred to be > > reclaimed. However, in some platform(e.g., Android), the information > > required to make the hinting decision is not known to the app. > > Instead, it is known to a centralized userspace daemon(e.g., > > ActivityManagerService), > > and that daemon must be able to initiate reclaim on its own without any app > > involvement. > > > > To solve the concern, this patch introduces new syscall - > > process_madvise(2). > > Bascially, it's same with madvise(2) syscall but it has some differences. > > > > 1. It needs pidfd of target process to provide the hint > > 2. It supports only MADV_{COLD|PAGEOUT} at this moment. > >Other hints in madvise will be opened when there are explicit requests > > from > >community to prevent unexpected bugs we couldn't support. > > 3. Only privileged processes can do something for other process's address > >space. > > > > For more detail of the new API, please see "mm: introduce external memory > > hinting API" > > description in this patchset. > > > > * from v7 - > > http://lore.kernel.org/r/20200302193630.68771-1-minc...@kernel.org > > * dropping pid support from new syscall and fold releated patches into > > syscall patch > > * dropping KSM patch by discussion - Oleksandr, I lost the discussion. > > Please resend the single patch against of the patchset if you resolves > > the discussion. > > > > https://lore.kernel.org/linux-api/20200302193630.68771-8-minc...@kernel.org/ > > Oleksandr, it seems you discussed something with Vlastimil but couldn't > find conclustion yet and Since Jann put an a new note in the thread, > I detach the patch from this patchset. > > Please send the KSM patch based on this patchset if you belive there is > no need to be actionable for comments. I don't think we came to some definite conclusion, but I'll try to implement Vlastimil's suggestion and then send it separately for evaluation. Lets keep KSM bit out of your submission so I won't slow it down. -- Best regards, Oleksandr Natalenko (post-factum) Principal Software Maintenance Engineer
Re: PC speaker
Hello. On Thu, Jun 18, 2020 at 01:49:37PM -0400, R.F. Burns wrote: > Is it possible to write a kernel module which, when loaded, will blow > the PC speaker? 1) where have you been starting from 2011 till 2015? 2) why does the posting date of this letter varies a little bit? 3) why June (but July in 2008)? 4) are you a robot? 5) if you are a robot, what's your setup? 6) if you are not a robot, blink. I'm eagerly waiting for this email each year, but it's a shame you neither reveal any details nor respond to questions. We are intrigued! Please uncover this great LKML mystery. We'd like to know you closer. Thanks. -- Best regards, Oleksandr Natalenko (post-factum) Principal Software Maintenance Engineer
Re: mt7612 suspend/resume issue
On Mon, Jun 22, 2020 at 4:53 PM Lorenzo Bianconi wrote: > > On Sun, Jun 21, 2020 at 10:54 PM Lorenzo Bianconi > > wrote: > > > > > +static int __maybe_unused > > > > > +mt76x2e_suspend(struct pci_dev *pdev, pm_message_t state) > > > > > +{ > > > > > + struct mt76_dev *mdev = pci_get_drvdata(pdev); > > > > > + struct mt76x02_dev *dev = container_of(mdev, struct mt76x02_dev, > > > > > mt76); > > > > > + int i, err; > > > > > > can you please double-check what is the PCI state requested during > > > suspend? > > > > Do you mean ACPI S3 (this is the state the system enters)? If not, > > what should I check and where? > > yes, right. Just for debugging, can you please force the card in PCI_D0 > during the > suspend? Do you want me to do this: diff --git a/drivers/net/wireless/mediatek/mt76/mt76x2/pci.c b/drivers/net/wireless/mediatek/mt76/mt76x2/pci.c index 5543e242fb9b..e558342cce03 100644 --- a/drivers/net/wireless/mediatek/mt76/mt76x2/pci.c +++ b/drivers/net/wireless/mediatek/mt76/mt76x2/pci.c @@ -119,9 +119,8 @@ mt76x2e_suspend(struct pci_dev *pdev, pm_message_t state) mt76x02_dma_reset(dev); -pci_enable_wake(pdev, pci_choose_state(pdev, state), true); pci_save_state(pdev); -err = pci_set_power_state(pdev, pci_choose_state(pdev, state)); +err = pci_set_power_state(pdev, PCI_D0); if (err) goto restore; ? -- Best regards, Oleksandr Natalenko (post-factum) Principal Software Maintenance Engineer
Re: mt7612 suspend/resume issue
Hello, Lorenzo. On Sun, Jun 21, 2020 at 10:54 PM Lorenzo Bianconi wrote: > > > +static int __maybe_unused > > > +mt76x2e_suspend(struct pci_dev *pdev, pm_message_t state) > > > +{ > > > + struct mt76_dev *mdev = pci_get_drvdata(pdev); > > > + struct mt76x02_dev *dev = container_of(mdev, struct mt76x02_dev, > > > mt76); > > > + int i, err; > > can you please double-check what is the PCI state requested during suspend? Do you mean ACPI S3 (this is the state the system enters)? If not, what should I check and where? Thanks. -- Best regards, Oleksandr Natalenko (post-factum) Principal Software Maintenance Engineer
Re: mt7612 suspend/resume issue
el: R10: 0001 R11: R12: 9fe0283798d0 čen 18 23:12:02 spock kernel: R13: R14: R15: fff0 čen 18 23:12:02 spock kernel: FS: () GS:9fe02f2c() knlGS: čen 18 23:12:02 spock kernel: CS: 0010 DS: ES: CR0: 80050033 čen 18 23:12:02 spock kernel: CR2: 7f313b33a940 CR3: 00012ea0a002 CR4: 001706e0 čen 18 23:12:02 spock kernel: Call Trace: čen 18 23:12:02 spock kernel: ieee80211_restart_work+0xb7/0xe0 [mac80211] čen 18 23:12:02 spock kernel: process_one_work+0x1d4/0x3c0 čen 18 23:12:02 spock kernel: worker_thread+0x228/0x470 čen 18 23:12:02 spock kernel: ? process_one_work+0x3c0/0x3c0 čen 18 23:12:02 spock kernel: kthread+0x19c/0x1c0 čen 18 23:12:02 spock kernel: ? __kthread_init_worker+0x30/0x30 čen 18 23:12:02 spock kernel: ret_from_fork+0x35/0x40 čen 18 23:12:02 spock kernel: ---[ end trace e017bc3573bd9bf2 ]--- čen 18 23:12:02 spock kernel: [ cut here ] čen 18 23:12:02 spock kernel: wlp1s0: Failed check-sdata-in-driver check, flags: 0x0 čen 18 23:12:02 spock kernel: WARNING: CPU: 3 PID: 171 at net/mac80211/driver-ops.h:17 drv_remove_interface+0x11f/0x130 [mac80211] čen 18 23:12:02 spock kernel: Modules linked in: cmac ccm bridge stp llc nft_ct nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nf_tables msr tun nfnetlink nls_iso8859_1 nls_cp437 vfat fat mt76x2e mt76x2_common mt76x02_lib mt76 mac80211 intel_rapl_msr snd_hda_codec_hdmi snd_hda_codec_cirrus mei_hdcp snd_hda_codec_generic cfg80211 intel_rapl_common x86_pkg_temp_thermal intel_powerclamp coretemp dell_laptop iTCO_wdt snd_hda_intel kvm_intel iTCO_vendor_support dell_wmi snd_intel_dspcfg sparse_keymap snd_hda_codec ledtrig_audio wmi_bmof dell_smbios snd_hda_core kvm rtsx_usb_ms dell_wmi_descriptor memstick dcdbas snd_hwdep dell_smm_hwmon irqbypass psmouse intel_cstate snd_pcm intel_uncore joydev intel_rapl_perf mousedev mei_me alx rfkill input_leds snd_timer i2c_i801 snd mei lpc_ich libarc4 mdio soundcore battery wmi evdev dell_smo8800 mac_hid ac tcp_bbr crypto_user ip_tables x_tables xfs dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio libcrc32c crc32c_generic dm_crypt hid_logitech_hidpp hid_logitech_dj čen 18 23:12:02 spock kernel: hid_generic usbhid hid rtsx_usb_sdmmc mmc_core rtsx_usb dm_mod raid10 serio_raw atkbd libps2 md_mod crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel crypto_simd cryptd glue_helper xhci_pci xhci_hcd ehci_pci ehci_hcd i8042 serio i915 intel_gtt i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops cec rc_core drm agpgart čen 18 23:12:02 spock kernel: CPU: 3 PID: 171 Comm: kworker/3:3 Tainted: G W 5.7.0-pf3 #1 čen 18 23:12:02 spock kernel: Hardware name: Dell Inc. Vostro 3360/0F5DWF, BIOS A18 09/25/2013 čen 18 23:12:02 spock kernel: Workqueue: events_freezable ieee80211_restart_work [mac80211] čen 18 23:12:02 spock kernel: RIP: 0010:drv_remove_interface+0x11f/0x130 [mac80211] čen 18 23:12:02 spock kernel: Code: a0 57 f0 c2 e9 4b ff ff ff 48 8b 86 78 04 00 00 48 8d b6 98 04 00 00 48 c7 c7 e8 ef f8 c0 48 85 c0 48 0f 45 f0 e8 99 2e fa c2 <0f> 0b 5b 5d 41 5c c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 čen 18 23:12:02 spock kernel: RSP: 0018:a87c40403c80 EFLAGS: 00010282 čen 18 23:12:02 spock kernel: RAX: RBX: 9fe028f6e900 RCX: čen 18 23:12:02 spock kernel: RDX: 0001 RSI: 0082 RDI: čen 18 23:12:02 spock kernel: RBP: 9fe028379930 R08: 04c9 R09: 0001 čen 18 23:12:02 spock kernel: R10: 0001 R11: 6fc0 R12: 9fe028379000 čen 18 23:12:02 spock kernel: R13: 9fe028f6f4b8 R14: 9fe028378ca0 R15: 9fe0283787e0 čen 18 23:12:02 spock kernel: FS: () GS:9fe02f2c() knlGS: čen 18 23:12:02 spock kernel: CS: 0010 DS: ES: CR0: 80050033 čen 18 23:12:02 spock kernel: CR2: 7f313b33a940 CR3: 00012ea0a002 CR4: 001706e0 čen 18 23:12:02 spock kernel: Call Trace: čen 18 23:12:02 spock kernel: ieee80211_do_stop+0x5af/0x8c0 [mac80211] čen 18 23:12:02 spock kernel: ieee80211_stop+0x16/0x20 [mac80211] čen 18 23:12:02 spock kernel: __dev_close_many+0xaa/0x120 čen 18 23:12:02 spock kernel: dev_close_many+0xa1/0x2b0 čen 18 23:12:02 spock kernel: dev_close+0x6d/0x90 čen 18 23:12:02 spock kernel: cfg80211_shutdown_all_interfaces+0x71/0xd0 [cfg80211] čen 18 23:12:02 spock kernel: ieee80211_reconfig+0xa2/0x1700 [mac80211] čen 18 23:12:02 spock kernel: ieee80211_restart_work+0xb7/0xe0 [mac80211] čen 18 23:12:02 spock kernel: process_one_work+0x1d4/0x3c0 čen 18 23:12:02 spock kernel: worker_thread+0x228/0x470 čen 18 23:12:02 spock kernel: ? process_one_work+0x3c0/0x3c0 čen 18 23:12:02 spock kernel: kthread+0x19c/0x1c0 čen 18 23:12:02 spock kernel: ? __kthread_init_worker+0x30/0x30 čen 18 23:12:02 spock kernel: ret_from_fork+0x35/0x40 čen 18 23:12:02 spock kernel: ---[ end trace e017bc3573bd9bf3 ]--- === Do you still want me to try Felix's tree, or there's something else I can try? Thank you. -- Best regards, Oleksandr Natalenko (post-factum) Principal Software Maintenance Engineer
mt7612 suspend/resume issue
1:18 spock kernel: ieee80211_reconfig+0xa2/0x1700 [mac80211] čen 15 20:21:18 spock kernel: ieee80211_restart_work+0xb7/0xe0 [mac80211] čen 15 20:21:18 spock kernel: process_one_work+0x1d4/0x3c0 čen 15 20:21:18 spock kernel: worker_thread+0x228/0x470 čen 15 20:21:18 spock kernel: ? process_one_work+0x3c0/0x3c0 čen 15 20:21:18 spock kernel: kthread+0x19c/0x1c0 čen 15 20:21:18 spock kernel: ? __kthread_init_worker+0x30/0x30 čen 15 20:21:18 spock kernel: ret_from_fork+0x35/0x40 čen 15 20:21:18 spock kernel: ---[ end trace d8e4d40f48014383 ]--- === The Wi-Fi becomes unusable from this point. If I `modprobe -r` the "mt76x2e" module after this splat, the system hangs completely. If the system resumes fine, the resume itself takes quite some time (more than 10 seconds). I've found a workaround for this, though. It seems the system behaves fine if I do `modprobe -r mt76x2e` before it goes to sleep, and then `modprobe mt76x2e` after resume. Also, the resume time improves greatly. I cannot say if it is some regression or not. I've installed the card just recently, and used it with v5.7 kernel series only. Do you have any idea what could go wrong and how to approach the issue? Thanks. -- Best regards, Oleksandr Natalenko (post-factum) Principal Software Maintenance Engineer
Re: [PATCH v7] mm: Proactive compaction
command allocates 700G of Java heap using hugepages. > > - With vanilla 5.6.0-rc3 > > 17.39user 1666.48system 27:37.89elapsed > > - With 5.6.0-rc3 + this patch, with proactiveness=20 > > 8.35user 194.58system 3:19.62elapsed > > Elapsed time remains around 3:15, as proactiveness is further increased. > > Note that proactive compaction happens throughout the runtime of these > workloads. The situation of one-time compaction, sufficient to supply > hugepages for following allocation stream, can probably happen for more > extreme proactiveness values, like 80 or 90. > > In the above Java workload, proactiveness is set to 20. The test starts > with a node's score of 80 or higher, depending on the delay between the > fragmentation step and starting the benchmark, which gives more-or-less > time for the initial round of compaction. As the benchmark consumes > hugepages, node's score quickly rises above the high threshold (90) and > proactive compaction starts again, which brings down the score to the > low threshold level (80). Repeat. > > bpftrace also confirms proactive compaction running 20+ times during the > runtime of this Java benchmark. kcompactd threads consume 100% of one of > the CPUs while it tries to bring a node's score within thresholds. > > Backoff behavior > > > Above workloads produce a memory state which is easy to compact. > However, if memory is filled with unmovable pages, proactive compaction > should essentially back off. To test this aspect: > > - Created a kernel driver that allocates almost all memory as hugepages > followed by freeing first 3/4 of each hugepage. > - Set proactiveness=40 > - Note that proactive_compact_node() is deferred maximum number of times > with HPAGE_FRAG_CHECK_INTERVAL_MSEC of wait between each check > (=> ~30 seconds between retries). > > [1] https://patchwork.kernel.org/patch/11098289/ > [2] https://lore.kernel.org/linux-mm/20161230131412.gi13...@dhcp22.suse.cz/ > [3] https://lwn.net/Articles/817905/ > > Signed-off-by: Nitin Gupta > Reviewed-by: Vlastimil Babka > Reviewed-by: Khalid Aziz > To: Andrew Morton > CC: Vlastimil Babka > CC: Khalid Aziz > CC: Michal Hocko > CC: Mel Gorman > CC: Matthew Wilcox > CC: Mike Kravetz > CC: Joonsoo Kim > CC: David Rientjes > CC: Nitin Gupta > CC: Oleksandr Natalenko > CC: linux-kernel > CC: linux-mm > CC: Linux API > > --- > Changelog v7 vs v6: > - Fix compile error while THP is disabled (Oleksandr) Thank you for taking this. > > Changelog v6 vs v5: > - Fallback to HUGETLB_PAGE_ORDER if HPAGE_PMD_ORDER is not defined, and >some cleanups (Vlastimil) > - Cap min threshold to avoid excess compaction load in case user sets >extreme values like 100 for `vm.compaction_proactiveness` sysctl (Khalid) > - Add some more explanation about the effect of tunable on compaction >behavior in user guide (Khalid) > > Changelog v5 vs v4: > - Change tunable from sysfs to sysctl (Vlastimil) > - Replace HUGETLB_PAGE_ORDER with HPAGE_PMD_ORDER (Vlastimil) > - Minor cleanups (remove redundant initializations, ...) > > Changelog v4 vs v3: > - Document various functions. > - Added admin-guide for the new tunable `proactiveness`. > - Rename proactive_compaction_score to fragmentation_score for clarity. > > Changelog v3 vs v2: > - Make proactiveness a global tunable and not per-node. Also upadated > the >patch description to reflect the same (Vlastimil Babka). > - Don't start proactive compaction if kswapd is running (Vlastimil > Babka). > - Clarified in the description that compaction runs in parallel with >the workload, instead of a one-time compaction followed by a stream > of >hugepage allocations. > > Changelog v2 vs v1: > - Introduce per-node and per-zone "proactive compaction score". This >score is compared against watermarks which are set according to >user provided proactiveness value. > - Separate code-paths for proactive compaction from targeted compaction >i.e. where pgdat->kcompactd_max_order is non-zero. > - Renamed hpage_compaction_effort -> proactiveness. In future we may >use more than extfrag wrt hugepage size to determine proactive >compaction score. > --- > Documentation/admin-guide/sysctl/vm.rst | 15 ++ > include/linux/compaction.h | 2 + > kernel/sysctl.c | 9 ++ > mm/compaction.c | 183 +++- > mm/internal.h | 1 + > mm/vmstat.c | 18 +++ > 6 files changed, 223 insertions(+), 5 de
Re: [PATCH v6] mm: Proactive compaction
On Mon, Jun 15, 2020 at 10:29:01AM +0200, Oleksandr Natalenko wrote: > Just to let you know, this fails to compile for me with THP disabled on > v5.8-rc1: > > CC mm/compaction.o > In file included from ./include/linux/dev_printk.h:14, > from ./include/linux/device.h:15, > from ./include/linux/node.h:18, > from ./include/linux/cpu.h:17, > from mm/compaction.c:11: > In function ‘fragmentation_score_zone’, > inlined from ‘__compact_finished’ at mm/compaction.c:1982:11, > inlined from ‘compact_zone’ at mm/compaction.c:2062:8: > ./include/linux/compiler.h:392:38: error: call to ‘__compiletime_assert_397’ > declared with attribute error: BUILD_BUG failed > 392 | _compiletime_assert(condition, msg, __compiletime_assert_, > __COUNTER__) > | ^ > ./include/linux/compiler.h:373:4: note: in definition of macro > ‘__compiletime_assert’ > 373 |prefix ## suffix();\ > |^~ > ./include/linux/compiler.h:392:2: note: in expansion of macro > ‘_compiletime_assert’ > 392 | _compiletime_assert(condition, msg, __compiletime_assert_, > __COUNTER__) > | ^~~ > ./include/linux/build_bug.h:39:37: note: in expansion of macro > ‘compiletime_assert’ >39 | #define BUILD_BUG_ON_MSG(cond, msg) compiletime_assert(!(cond), msg) > | ^~ > ./include/linux/build_bug.h:59:21: note: in expansion of macro > ‘BUILD_BUG_ON_MSG’ >59 | #define BUILD_BUG() BUILD_BUG_ON_MSG(1, "BUILD_BUG failed") > | ^~~~ > ./include/linux/huge_mm.h:319:28: note: in expansion of macro ‘BUILD_BUG’ > 319 | #define HPAGE_PMD_SHIFT ({ BUILD_BUG(); 0; }) > |^ > ./include/linux/huge_mm.h:115:26: note: in expansion of macro > ‘HPAGE_PMD_SHIFT’ > 115 | #define HPAGE_PMD_ORDER (HPAGE_PMD_SHIFT-PAGE_SHIFT) > | ^~~ > mm/compaction.c:64:32: note: in expansion of macro ‘HPAGE_PMD_ORDER’ >64 | #define COMPACTION_HPAGE_ORDER HPAGE_PMD_ORDER > |^~~ > mm/compaction.c:1898:28: note: in expansion of macro ‘COMPACTION_HPAGE_ORDER’ > 1898 |extfrag_for_order(zone, COMPACTION_HPAGE_ORDER); > |^~ > In function ‘fragmentation_score_zone’, > inlined from ‘kcompactd’ at mm/compaction.c:1918:12: > ./include/linux/compiler.h:392:38: error: call to ‘__compiletime_assert_397’ > declared with attribute error: BUILD_BUG failed > 392 | _compiletime_assert(condition, msg, __compiletime_assert_, > __COUNTER__) > | ^ > ./include/linux/compiler.h:373:4: note: in definition of macro > ‘__compiletime_assert’ > 373 |prefix ## suffix();\ > |^~ > ./include/linux/compiler.h:392:2: note: in expansion of macro > ‘_compiletime_assert’ > 392 | _compiletime_assert(condition, msg, __compiletime_assert_, > __COUNTER__) > | ^~~ > ./include/linux/build_bug.h:39:37: note: in expansion of macro > ‘compiletime_assert’ >39 | #define BUILD_BUG_ON_MSG(cond, msg) compiletime_assert(!(cond), msg) > | ^~ > ./include/linux/build_bug.h:59:21: note: in expansion of macro > ‘BUILD_BUG_ON_MSG’ >59 | #define BUILD_BUG() BUILD_BUG_ON_MSG(1, "BUILD_BUG failed") > | ^~~~ > ./include/linux/huge_mm.h:319:28: note: in expansion of macro ‘BUILD_BUG’ > 319 | #define HPAGE_PMD_SHIFT ({ BUILD_BUG(); 0; }) > |^ > ./include/linux/huge_mm.h:115:26: note: in expansion of macro > ‘HPAGE_PMD_SHIFT’ > 115 | #define HPAGE_PMD_ORDER (HPAGE_PMD_SHIFT-PAGE_SHIFT) > | ^~~ > mm/compaction.c:64:32: note: in expansion of macro ‘HPAGE_PMD_ORDER’ >64 | #define COMPACTION_HPAGE_ORDER HPAGE_PMD_ORDER > |^~~ > mm/compaction.c:1898:28: note: in expansion of macro ‘COMPACTION_HPAGE_ORDER’ > 1898 |extfrag_for_order(zone, COMPACTION_HPAGE_ORDER); > |^~ > In function ‘fragmentation_score_zone’, > inlined from ‘kcompactd’ at mm/compaction.c:1918:12: > ./include/linux/compiler.h:392:38: error: call to ‘__compiletime_assert_397’ > declared with attribute error: BUILD_BUG failed > 392 | _compiletime_assert(condition, msg, __compiletime_assert_, > __COUNTER__) > |
Re: [PATCH v6] mm: Proactive compaction
./include/linux/huge_mm.h:319:28: note: in expansion of macro ‘BUILD_BUG’ 319 | #define HPAGE_PMD_SHIFT ({ BUILD_BUG(); 0; }) |^ ./include/linux/huge_mm.h:115:26: note: in expansion of macro ‘HPAGE_PMD_SHIFT’ 115 | #define HPAGE_PMD_ORDER (HPAGE_PMD_SHIFT-PAGE_SHIFT) | ^~~ mm/compaction.c:64:32: note: in expansion of macro ‘HPAGE_PMD_ORDER’ 64 | #define COMPACTION_HPAGE_ORDER HPAGE_PMD_ORDER |^~~ mm/compaction.c:1898:28: note: in expansion of macro ‘COMPACTION_HPAGE_ORDER’ 1898 |extfrag_for_order(zone, COMPACTION_HPAGE_ORDER); |^~ In function ‘fragmentation_score_zone’, inlined from ‘kcompactd’ at mm/compaction.c:1918:12: ./include/linux/compiler.h:392:38: error: call to ‘__compiletime_assert_397’ declared with attribute error: BUILD_BUG failed 392 | _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__) | ^ ./include/linux/compiler.h:373:4: note: in definition of macro ‘__compiletime_assert’ 373 |prefix ## suffix();\ |^~ ./include/linux/compiler.h:392:2: note: in expansion of macro ‘_compiletime_assert’ 392 | _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__) | ^~~ ./include/linux/build_bug.h:39:37: note: in expansion of macro ‘compiletime_assert’ 39 | #define BUILD_BUG_ON_MSG(cond, msg) compiletime_assert(!(cond), msg) | ^~ ./include/linux/build_bug.h:59:21: note: in expansion of macro ‘BUILD_BUG_ON_MSG’ 59 | #define BUILD_BUG() BUILD_BUG_ON_MSG(1, "BUILD_BUG failed") | ^~~~ ./include/linux/huge_mm.h:319:28: note: in expansion of macro ‘BUILD_BUG’ 319 | #define HPAGE_PMD_SHIFT ({ BUILD_BUG(); 0; }) |^ ./include/linux/huge_mm.h:115:26: note: in expansion of macro ‘HPAGE_PMD_SHIFT’ 115 | #define HPAGE_PMD_ORDER (HPAGE_PMD_SHIFT-PAGE_SHIFT) | ^~~ mm/compaction.c:64:32: note: in expansion of macro ‘HPAGE_PMD_ORDER’ 64 | #define COMPACTION_HPAGE_ORDER HPAGE_PMD_ORDER |^~~ mm/compaction.c:1898:28: note: in expansion of macro ‘COMPACTION_HPAGE_ORDER’ 1898 |extfrag_for_order(zone, COMPACTION_HPAGE_ORDER); |^~ make[1]: *** [scripts/Makefile.build:281: mm/compaction.o] Error 1 make: *** [Makefile:1764: mm] Error 2 -- Best regards, Oleksandr Natalenko (post-factum) Principal Software Maintenance Engineer
Re: [PATCH] Kconfig: default to CC_OPTIMIZE_FOR_PERFORMANCE_O3 for gcc >= 10
On Fri, May 08, 2020 at 05:21:47AM -0600, Jason A. Donenfeld wrote: > > Should we untangle -O3 from depending on ARC first maybe? > > Oh, hah, good point. Yes, I'll do that for a v2, but will wait another > day for feedback first. Just keep in mind that my previous attempt [1] failed because of too many false positive warnings despite -O3 really uncovered a couple of bugs in the codebase. Lets hope your attempt will be more successfull. I'll happily offer my review tag ;). Also Cc'ing Andrew who (IIRC) tried to took my sumbission and Arnd who tried to clean up the mess afterwards. [1] https://lore.kernel.org/lkml/20191211104619.114557-1-oleksa...@redhat.com/ -- Best regards, Oleksandr Natalenko (post-factum) Principal Software Maintenance Engineer
Re: [PATCH] Kconfig: default to CC_OPTIMIZE_FOR_PERFORMANCE_O3 for gcc >= 10
On Thu, May 07, 2020 at 04:45:30PM -0600, Jason A. Donenfeld wrote: > GCC 10 appears to have changed -O2 in order to make compilation time > faster when using -flto, seemingly at the expense of performance, in > particular with regards to how the inliner works. Since -O3 these days > shouldn't have the same set of bugs as 10 years ago, this commit > defaults new kernel compiles to -O3 when using gcc >= 10. > > Signed-off-by: Jason A. Donenfeld > --- > init/Kconfig | 3 ++- > 1 file changed, 2 insertions(+), 1 deletion(-) > > diff --git a/init/Kconfig b/init/Kconfig > index 9e22ee8fbd75..fab3f810a68d 100644 > --- a/init/Kconfig > +++ b/init/Kconfig > @@ -1245,7 +1245,8 @@ config BOOT_CONFIG > > choice > prompt "Compiler optimization level" > - default CC_OPTIMIZE_FOR_PERFORMANCE > + default CC_OPTIMIZE_FOR_PERFORMANCE_O3 if GCC_VERSION >= 10 > + default CC_OPTIMIZE_FOR_PERFORMANCE if (GCC_VERSION < 10 || > CC_IS_CLANG) > > config CC_OPTIMIZE_FOR_PERFORMANCE > bool "Optimize for performance (-O2)" > -- > 2.26.2 > Should we untangle -O3 from depending on ARC first maybe? -- Best regards, Oleksandr Natalenko (post-factum) Principal Software Maintenance Engineer
Re: mmotm 2020-04-26-00-15 uploaded (mm/madvise.c)
On Mon, Apr 27, 2020 at 04:45:12PM -0700, Minchan Kim wrote: > Hi Andrew, > > On Mon, Apr 27, 2020 at 01:50:53PM -0700, Andrew Morton wrote: > > On Sun, 26 Apr 2020 15:48:35 -0700 Randy Dunlap > > wrote: > > > > > On 4/26/20 10:26 AM, Randy Dunlap wrote: > > > > On 4/26/20 12:16 AM, a...@linux-foundation.org wrote: > > > >> The mm-of-the-moment snapshot 2020-04-26-00-15 has been uploaded to > > > >> > > > >>http://www.ozlabs.org/~akpm/mmotm/ > > > >> > > > >> mmotm-readme.txt says > > > >> > > > >> README for mm-of-the-moment: > > > >> > > > >> http://www.ozlabs.org/~akpm/mmotm/ > > > >> > > > >> This is a snapshot of my -mm patch queue. Uploaded at random hopefully > > > >> more than once a week. > > > >> > > > >> You will need quilt to apply these patches to the latest Linus release > > > >> (5.x > > > >> or 5.x-rcY). The series file is in broken-out.tar.gz and is > > > >> duplicated in > > > >> http://ozlabs.org/~akpm/mmotm/series > > > >> > > > >> The file broken-out.tar.gz contains two datestamp files: .DATE and > > > >> .DATE--mm-dd-hh-mm-ss. Both contain the string > > > >> -mm-dd-hh-mm-ss, > > > >> followed by the base kernel version against which this patch series is > > > >> to > > > >> be applied. > > > > > > > > Hi, > > > > I'm seeing lots of build failures in mm/madvise.c. > > > > > > > > Is Minchin's patch only partially applied or is it just missing some > > > > pieces? > > > > > > > > a. mm/madvise.c needs to #include > > > > > > > > b. looks like the sys_process_madvise() prototype in > > > > has not been updated: > > > > > > > > In file included from ../mm/madvise.c:11:0: > > > > ../include/linux/syscalls.h:239:18: error: conflicting types for > > > > ‘sys_process_madvise’ > > > > asmlinkage long sys##name(__MAP(x,__SC_DECL,__VA_ARGS__)) \ > > > > ^ > > > > ../include/linux/syscalls.h:225:2: note: in expansion of macro > > > > ‘__SYSCALL_DEFINEx’ > > > > __SYSCALL_DEFINEx(x, sname, __VA_ARGS__) > > > > ^ > > > > ../include/linux/syscalls.h:219:36: note: in expansion of macro > > > > ‘SYSCALL_DEFINEx’ > > > > #define SYSCALL_DEFINE6(name, ...) SYSCALL_DEFINEx(6, _##name, > > > > __VA_ARGS__) > > > > ^~~ > > > > ../mm/madvise.c:1295:1: note: in expansion of macro ‘SYSCALL_DEFINE6’ > > > > SYSCALL_DEFINE6(process_madvise, int, which, pid_t, upid, > > > > ^~~ > > > > In file included from ../mm/madvise.c:11:0: > > > > ../include/linux/syscalls.h:880:17: note: previous declaration of > > > > ‘sys_process_madvise’ was here > > > > asmlinkage long sys_process_madvise(int which, pid_t pid, unsigned > > > > long start, > > > > ^~~ > > > > > > I had to add 2 small patches to have clean madvise.c builds: > > > > > > > hm, not sure why these weren't noticed sooner, thanks. > > > > This patchset is looking a bit tired now. > > > > Things to be addressed (might be out of date): > > > > - http://lkml.kernel.org/r/293bcd25-934f-dd57-3314-bbcf00833...@redhat.com > > It seems to be not related to process_madvise. > > > > > - http://lkml.kernel.org/r/2a767d50-4034-da8c-c40c-280e0dda9...@suse.cz > > (I did this) > > Thanks! > > > > > - http://lkml.kernel.org/r/20200310222008.gb72...@google.com > > I will send foldable patches to handle comments. > > > > > - issues arising from the review of > > http://lkml.kernel.org/r/20200302193630.68771-8-minc...@kernel.org > > Oleksandr, What's the outcome of this issue? > Do we still need to change based on the comment? > My current understanding is that we do not mess with signals excessively in the given code path. -- Best regards, Oleksandr Natalenko (post-factum) Principal Software Maintenance Engineer
Re: mt76x2e hardware restart
Hi. On 23.10.2019 10:50, Lorenzo Bianconi wrote: I think I spotted the SG issue on mt76x2e. Could you please: - keep pcie_aspm patch I sent - remove the debug patch where I disabled TX Scatter-Gather on mt76x2e - apply the following patch Thanks for the patch. So far so good, I was able to start AP, connect to it and conduct a couple of simple speed tests. I'll use it more today and will let you know in case something breaks. -- Oleksandr Natalenko (post-factum)
Re: mt76x2e hardware restart
Hello. On 15.10.2019 18:52, Oleksandr Natalenko wrote: Thanks for the answer and the IRC discussion. As agreed I've applied [1] and [2], and have just swapped the card to try it again. So far, it works fine in 5 GHz band in 802.11ac mode as an AP. I'll give it more load with my phone over evening, and we can discuss what to do next (if needed) tomorrow again. Or feel free to drop me an email today. Thanks for your efforts. [1] https://github.com/LorenzoBianconi/wireless-drivers-next/commit/cf3436c42a297967235a9c9778620c585100529e.patch [2] https://github.com/LorenzoBianconi/wireless-drivers-next/commit/aad256eb62620f9646d39c1aa69234f50c89eed8.patch As agreed, here are iperf3 results, AP to STA distance is 2 meters. Client sends, TCP: [ ID] Interval Transfer Bitrate Retr [ 5] 0.00-10.00 sec 70.4 MBytes 59.0 Mbits/sec 3800 sender [ 5] 0.00-10.03 sec 70.0 MBytes 58.6 Mbits/sec receiver Client receives, TCP: [ ID] Interval Transfer Bitrate Retr [ 5] 0.00-10.06 sec 196 MBytes 163 Mbits/sec 3081 sender [ 5] 0.00-10.01 sec 191 MBytes 160 Mbits/sec receiver Client sends, UDP, 128 streams: [ ID] Interval Transfer Bitrate Jitter Lost/Total Datagrams [SUM] 0.00-10.00 sec 160 MBytes 134 Mbits/sec 0.000 ms 0/115894 (0%) sender [SUM] 0.00-10.01 sec 160 MBytes 134 Mbits/sec 0.347 ms 0/115892 (0%) receiver Client receives, UDP, 128 streams: [ ID] Interval Transfer Bitrate Jitter Lost/Total Datagrams [SUM] 0.00-10.01 sec 119 MBytes 99.4 Mbits/sec 0.000 ms 0/85888 (0%) sender [SUM] 0.00-10.00 sec 119 MBytes 99.5 Mbits/sec 0.877 ms 0/85888 (0%) receiver Given the HW is not the most powerful, the key point here is that nothing crashed after doing these tests. -- Oleksandr Natalenko (post-factum)
Re: mt76x2e hardware restart
Hey. On 12.10.2019 18:50, Lorenzo Bianconi wrote: sorry for the delay. Felix and me worked on this issue today. Could you please try if the following patch fixes your issue? Thanks for the answer and the IRC discussion. As agreed I've applied [1] and [2], and have just swapped the card to try it again. So far, it works fine in 5 GHz band in 802.11ac mode as an AP. I'll give it more load with my phone over evening, and we can discuss what to do next (if needed) tomorrow again. Or feel free to drop me an email today. Thanks for your efforts. [1] https://github.com/LorenzoBianconi/wireless-drivers-next/commit/cf3436c42a297967235a9c9778620c585100529e.patch [2] https://github.com/LorenzoBianconi/wireless-drivers-next/commit/aad256eb62620f9646d39c1aa69234f50c89eed8.patch -- Oleksandr Natalenko (post-factum)
Re: mt76x2e hardware restart
On 19.09.2019 23:22, Oleksandr Natalenko wrote: It checks for TX hang here: === mt76x02_mmio.c 557 void mt76x02_wdt_work(struct work_struct *work) 558 { ... 562 mt76x02_check_tx_hang(dev); === I've commented out the watchdog here ^^, and the card is not resetted any more, but similarly it stops working shortly after the first client connects. So, indeed, it must be some hang in the HW, and wdt seems to do a correct job. Is it even debuggable/fixable from the driver? -- Oleksandr Natalenko (post-factum)
Re: mt76x2e hardware restart
On 19.09.2019 18:24, Oleksandr Natalenko wrote: [ +9,979664] mt76x2e :01:00.0: Firmware Version: 0.0.00 [ +0,14] mt76x2e :01:00.0: Build: 1 [ +0,10] mt76x2e :01:00.0: Build Time: 201507311614 [ +0,018017] mt76x2e :01:00.0: Firmware running! [ +0,001101] ieee80211 phy4: Hardware restart was requested IIUC, this happens due to watchdog. I think the following applies. Watchdog is started here: === mt76x02_util.c 130 void mt76x02_init_device(struct mt76x02_dev *dev) 131 { ... 155 INIT_DELAYED_WORK(&dev->wdt_work, mt76x02_wdt_work); === It checks for TX hang here: === mt76x02_mmio.c 557 void mt76x02_wdt_work(struct work_struct *work) 558 { ... 562 mt76x02_check_tx_hang(dev); === Conditions: === mt76x02_mmio.c 530 static void mt76x02_check_tx_hang(struct mt76x02_dev *dev) 531 { 532 if (mt76x02_tx_hang(dev)) { 533 if (++dev->tx_hang_check >= MT_TX_HANG_TH) 534 goto restart; 535 } else { 536 dev->tx_hang_check = 0; 537 } 538 539 if (dev->mcu_timeout) 540 goto restart; 541 542 return; 543 544 restart: 545 mt76x02_watchdog_reset(dev); === Actual check: === mt76x02_mmio.c 367 static bool mt76x02_tx_hang(struct mt76x02_dev *dev) 368 { 369 u32 dma_idx, prev_dma_idx; 370 struct mt76_queue *q; 371 int i; 372 373 for (i = 0; i < 4; i++) { 374 q = dev->mt76.q_tx[i].q; 375 376 if (!q->queued) 377 continue; 378 379 prev_dma_idx = dev->mt76.tx_dma_idx[i]; 380 dma_idx = readl(&q->regs->dma_idx); 381 dev->mt76.tx_dma_idx[i] = dma_idx; 382 383 if (prev_dma_idx == dma_idx) 384 break; 385 } 386 387 return i < 4; 388 } === (I don't quite understand what it does here; why 4? does each device have 4 queues? maybe, my does not? I guess this is where watchdog is triggered, though, because otherwise I'd see mcu_timeout message like "MCU message %d (seq %d) timed out\n") Once it detects TX hang, the reset is triggered: === mt76x02_mmio.c 446 static void mt76x02_watchdog_reset(struct mt76x02_dev *dev) 447 { ... 485 if (restart) 486 mt76_mcu_restart(dev); === mt76_mcu_restart() is just a define for this series here: === mt76.h 555 #define mt76_mcu_restart(dev, ...) (dev)->mt76.mcu_ops->mcu_restart(&((dev)->mt76)) === Actual OP: === mt76x2/pci_mcu.c 188 int mt76x2_mcu_init(struct mt76x02_dev *dev) 189 { 190 static const struct mt76_mcu_ops mt76x2_mcu_ops = { 191 .mcu_restart = mt76pci_mcu_restart, 192 .mcu_send_msg = mt76x02_mcu_msg_send, 193 }; === This triggers loading the firmware: === mt76x2/pci_mcu.c 168 static int 169 mt76pci_mcu_restart(struct mt76_dev *mdev) 170 { ... 179 ret = mt76pci_load_firmware(dev); === which does the printout I observe: === mt76x2/pci_mcu.c 91 static int 92 mt76pci_load_firmware(struct mt76x02_dev *dev) 93 { ... 156 dev_info(dev->mt76.dev, "Firmware running!\n"); === Too bad it doesn't show the actual watchdog message, IOW, why the reset happens. I guess I will have to insert some pr_infos here and there. Does it make sense? Any ideas why this can happen? More info on the device during boot: === [ +0,333233] mt76x2e :01:00.0: enabling device ( -> 0002) [ +0,000571] mt76x2e :01:00.0: ASIC revision: 76120044 [ +0,017806] mt76x2e :01:00.0: ROM patch build: 20141115060606a === -- Oleksandr Natalenko (post-factum)
mt76x2e hardware restart
Hi. Recently, I've got the following card: 01:00.0 Network controller: MEDIATEK Corp. Device 7612 Subsystem: MEDIATEK Corp. Device 7612 Flags: bus master, fast devsel, latency 0, IRQ 16 Memory at 8120 (64-bit, non-prefetchable) [size=1M] Expansion ROM at 8130 [disabled] [size=64K] Capabilities: [40] Power Management version 3 Capabilities: [50] MSI: Enable- Count=1/1 Maskable- 64bit+ Capabilities: [70] Express Endpoint, MSI 00 Capabilities: [100] Advanced Error Reporting Capabilities: [148] Device Serial Number 00-00-00-00-00-00-00-00 Capabilities: [158] Latency Tolerance Reporting Capabilities: [160] L1 PM Substates Kernel driver in use: mt76x2e Kernel modules: mt76x2e I try to use it as an access point with the following configuration: interface=wlp1s0 driver=nl80211 ssid=someap channel=36 noscan=1 hw_mode=a ieee80211n=1 require_ht=1 ieee80211ac=1 require_vht=1 vht_oper_chwidth=1 vht_capab=[SHORT-GI-80][RX-STBC-1][RX-ANTENNA-PATTERN][TX-ANTENNA-PATTERN] vht_oper_centr_freq_seg0_idx=42 auth_algs=1 wpa=2 wpa_passphrase=somepswd wpa_key_mgmt=WPA-PSK rsn_pairwise=CCMP macaddr_acl=1 accept_mac_file=/etc/hostapd/hostapd.allow ctrl_interface=/run/hostapd ctrl_interface_group=0 country_code=CZ ieee80211d=1 ieee80211h=1 wmm_enabled=1 ht_capab=[GF][HT40+][SHORT-GI-20][SHORT-GI-40][RX-STBC1][DSSS_CCK-40] The hostapd daemon starts, and the AP broadcasts the beacons: zář 19 17:50:04 srv hostapd[13251]: Configuration file: /etc/hostapd/ap_5ghz.conf zář 19 17:50:05 srv hostapd[13251]: wlp1s0: interface state UNINITIALIZED->COUNTRY_UPDATE zář 19 17:50:05 srv hostapd[13251]: Using interface wlp1s0 with hwaddr xx:xx:xx:xx:xx:xx and ssid "someap" zář 19 17:50:05 srv hostapd[13251]: wlp1s0: interface state COUNTRY_UPDATE->ENABLED zář 19 17:50:05 srv hostapd[13251]: wlp1s0: AP-ENABLED zář 19 17:50:17 srv hostapd[13251]: wlp1s0: STA xx:xx:xx:xx:xx:xx IEEE 802.11: authenticated zář 19 17:50:17 srv hostapd[13251]: wlp1s0: STA xx:xx:xx:xx:xx:xx IEEE 802.11: authenticated zář 19 17:50:17 srv hostapd[13251]: wlp1s0: STA xx:xx:xx:xx:xx:xx IEEE 802.11: associated (aid 1) zář 19 17:50:17 srv hostapd[13251]: wlp1s0: STA xx:xx:xx:xx:xx:xx IEEE 802.11: associated (aid 1) zář 19 17:50:17 srv hostapd[13251]: wlp1s0: AP-STA-CONNECTED xx:xx:xx:xx:xx:xx zář 19 17:50:17 srv hostapd[13251]: wlp1s0: STA xx:xx:xx:xx:xx:xx RADIUS: starting accounting session 07E311195378B570 zář 19 17:50:17 srv hostapd[13251]: wlp1s0: STA xx:xx:xx:xx:xx:xx WPA: pairwise key handshake completed (RSN) zář 19 17:50:17 srv hostapd[13251]: wlp1s0: STA xx:xx:xx:xx:xx:xx RADIUS: starting accounting session 07E311195378B570 zář 19 17:50:17 srv hostapd[13251]: wlp1s0: STA xx:xx:xx:xx:xx:xx WPA: pairwise key handshake completed (RSN) The client is able to see it and connect to it, but after a couple of seconds the following happens on the AP: [ +9,979664] mt76x2e :01:00.0: Firmware Version: 0.0.00 [ +0,14] mt76x2e :01:00.0: Build: 1 [ +0,10] mt76x2e :01:00.0: Build Time: 201507311614 [ +0,018017] mt76x2e :01:00.0: Firmware running! [ +0,001101] ieee80211 phy4: Hardware restart was requested and the AP dies. The client cannot reconnect to it, although hostapd logs show that it tries: zář 19 17:51:15 srv hostapd[13504]: wlp1s0: STA xx:xx:xx:xx:xx:xx IEEE 802.11: authenticated zář 19 17:51:15 srv hostapd[13504]: wlp1s0: STA xx:xx:xx:xx:xx:xx IEEE 802.11: authenticated zář 19 17:51:19 srv hostapd[13504]: wlp1s0: STA xx:xx:xx:xx:xx:xx IEEE 802.11: authenticated zář 19 17:51:19 srv hostapd[13504]: wlp1s0: STA xx:xx:xx:xx:xx:xx IEEE 802.11: authenticated zář 19 17:52:54 srv hostapd[13504]: wlp1s0: STA xx:xx:xx:xx:xx:xx IEEE 802.11: authenticated zář 19 17:52:54 srv hostapd[13504]: wlp1s0: STA xx:xx:xx:xx:xx:xx IEEE 802.11: authenticated zář 19 17:52:59 srv hostapd[13504]: wlp1s0: STA xx:xx:xx:xx:xx:xx IEEE 802.11: authenticated zář 19 17:52:59 srv hostapd[13504]: wlp1s0: STA xx:xx:xx:xx:xx:xx IEEE 802.11: authenticated zář 19 17:56:14 srv hostapd[13504]: wlp1s0: STA xx:xx:xx:xx:xx:xx IEEE 802.11: deauthenticated due to inactivity (timer DEAUTH/REMOVE) AP stays completely unusable until I remove and modprobe mt76x2e module again. And then everything begins from scratch, and the AP dies within seconds. I observe this on a fresh v5.3 kernel. I haven't tried anything older. The only somewhat relevant thread I was able to found is [1], but it's not clear what's the resolution if any. Could you please suggest how to deal with this issue? Thanks. [1] https://forum.openwrt.org/t/wifi-issues-with-18-06-4-on-mt76/40537 -- Oleksandr Natalenko (post-factum)
[PATCH 1/1] shiftfs-5.2: use copy_from_user() correctly
Signed-off-by: Oleksandr Natalenko --- fs/shiftfs.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/fs/shiftfs.c b/fs/shiftfs.c index 49f6714e9f95..14f3764577d8 100644 --- a/fs/shiftfs.c +++ b/fs/shiftfs.c @@ -1526,7 +1526,7 @@ static bool in_ioctl_whitelist(int flag, unsigned long arg) case BTRFS_IOC_SUBVOL_GETFLAGS: return true; case BTRFS_IOC_SUBVOL_SETFLAGS: - if (copy_from_user(&flags, arg, sizeof(flags))) + if (copy_from_user(&flags, argp, sizeof(flags))) return false; if (flags & ~BTRFS_SUBVOL_RDONLY) -- 2.22.1
[PATCH 0/1] Small potential fix for shiftfs
Hey, people. I was lurking at shiftfs just out of curiosity and managed to bump into a compiler warning that is (as I suppose) easily fixed by the subsequent patch. Feel free to drag this into your Ubuntu tree if needed. I haven't played with it yet, just compiling (because I'm looking for something that is bindfs but in-kernel) :). Oleksandr Natalenko (1): shiftfs-5.2: use copy_from_user() correctly fs/shiftfs.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) -- 2.22.1
Re: [PATCH] hid: add another quirk for Chicony PixArt mouse
Erm. Cc: s.parscha...@gmx.de instead of inactive @suse address. On Fri, Jun 21, 2019 at 11:17:36AM +0200, Oleksandr Natalenko wrote: > I've spotted another Chicony PixArt mouse in the wild, which requires > HID_QUIRK_ALWAYS_POLL quirk, otherwise it disconnects each minute. > > USB ID of this device is 0x04f2:0x0939. > > We've introduced quirks like this for other models before, so lets add > this mouse too. > > Link: > https://github.com/sriemer/fix-linux-mouse#usb-mouse-disconnectsreconnects-every-minute-on-linux > Signed-off-by: Oleksandr Natalenko > --- > drivers/hid/hid-ids.h| 1 + > drivers/hid/hid-quirks.c | 1 + > 2 files changed, 2 insertions(+) > > diff --git a/drivers/hid/hid-ids.h b/drivers/hid/hid-ids.h > index eac0c54c5970..69f0553d9d95 100644 > --- a/drivers/hid/hid-ids.h > +++ b/drivers/hid/hid-ids.h > @@ -269,6 +269,7 @@ > #define USB_DEVICE_ID_CHICONY_MULTI_TOUCH0xb19d > #define USB_DEVICE_ID_CHICONY_WIRELESS 0x0618 > #define USB_DEVICE_ID_CHICONY_PIXART_USB_OPTICAL_MOUSE 0x1053 > +#define USB_DEVICE_ID_CHICONY_PIXART_USB_OPTICAL_MOUSE2 0x0939 > #define USB_DEVICE_ID_CHICONY_WIRELESS2 0x1123 > #define USB_DEVICE_ID_ASUS_AK1D 0x1125 > #define USB_DEVICE_ID_CHICONY_TOSHIBA_WT10A 0x1408 > diff --git a/drivers/hid/hid-quirks.c b/drivers/hid/hid-quirks.c > index e5ca6fe2ca57..671a285724f9 100644 > --- a/drivers/hid/hid-quirks.c > +++ b/drivers/hid/hid-quirks.c > @@ -42,6 +42,7 @@ static const struct hid_device_id hid_quirks[] = { > { HID_USB_DEVICE(USB_VENDOR_ID_ATEN, USB_DEVICE_ID_ATEN_UC100KM), > HID_QUIRK_NOGET }, > { HID_USB_DEVICE(USB_VENDOR_ID_CHICONY, > USB_DEVICE_ID_CHICONY_MULTI_TOUCH), HID_QUIRK_MULTI_INPUT }, > { HID_USB_DEVICE(USB_VENDOR_ID_CHICONY, > USB_DEVICE_ID_CHICONY_PIXART_USB_OPTICAL_MOUSE), HID_QUIRK_ALWAYS_POLL }, > + { HID_USB_DEVICE(USB_VENDOR_ID_CHICONY, > USB_DEVICE_ID_CHICONY_PIXART_USB_OPTICAL_MOUSE2), HID_QUIRK_ALWAYS_POLL }, > { HID_USB_DEVICE(USB_VENDOR_ID_CHICONY, > USB_DEVICE_ID_CHICONY_WIRELESS), HID_QUIRK_MULTI_INPUT }, > { HID_USB_DEVICE(USB_VENDOR_ID_CHIC, USB_DEVICE_ID_CHIC_GAMEPAD), > HID_QUIRK_BADPAD }, > { HID_USB_DEVICE(USB_VENDOR_ID_CH, > USB_DEVICE_ID_CH_3AXIS_5BUTTON_STICK), HID_QUIRK_NOGET }, > -- > 2.22.0 > -- Best regards, Oleksandr Natalenko (post-factum) Senior Software Maintenance Engineer
[PATCH] hid: add another quirk for Chicony PixArt mouse
I've spotted another Chicony PixArt mouse in the wild, which requires HID_QUIRK_ALWAYS_POLL quirk, otherwise it disconnects each minute. USB ID of this device is 0x04f2:0x0939. We've introduced quirks like this for other models before, so lets add this mouse too. Link: https://github.com/sriemer/fix-linux-mouse#usb-mouse-disconnectsreconnects-every-minute-on-linux Signed-off-by: Oleksandr Natalenko --- drivers/hid/hid-ids.h| 1 + drivers/hid/hid-quirks.c | 1 + 2 files changed, 2 insertions(+) diff --git a/drivers/hid/hid-ids.h b/drivers/hid/hid-ids.h index eac0c54c5970..69f0553d9d95 100644 --- a/drivers/hid/hid-ids.h +++ b/drivers/hid/hid-ids.h @@ -269,6 +269,7 @@ #define USB_DEVICE_ID_CHICONY_MULTI_TOUCH 0xb19d #define USB_DEVICE_ID_CHICONY_WIRELESS 0x0618 #define USB_DEVICE_ID_CHICONY_PIXART_USB_OPTICAL_MOUSE 0x1053 +#define USB_DEVICE_ID_CHICONY_PIXART_USB_OPTICAL_MOUSE20x0939 #define USB_DEVICE_ID_CHICONY_WIRELESS20x1123 #define USB_DEVICE_ID_ASUS_AK1D0x1125 #define USB_DEVICE_ID_CHICONY_TOSHIBA_WT10A0x1408 diff --git a/drivers/hid/hid-quirks.c b/drivers/hid/hid-quirks.c index e5ca6fe2ca57..671a285724f9 100644 --- a/drivers/hid/hid-quirks.c +++ b/drivers/hid/hid-quirks.c @@ -42,6 +42,7 @@ static const struct hid_device_id hid_quirks[] = { { HID_USB_DEVICE(USB_VENDOR_ID_ATEN, USB_DEVICE_ID_ATEN_UC100KM), HID_QUIRK_NOGET }, { HID_USB_DEVICE(USB_VENDOR_ID_CHICONY, USB_DEVICE_ID_CHICONY_MULTI_TOUCH), HID_QUIRK_MULTI_INPUT }, { HID_USB_DEVICE(USB_VENDOR_ID_CHICONY, USB_DEVICE_ID_CHICONY_PIXART_USB_OPTICAL_MOUSE), HID_QUIRK_ALWAYS_POLL }, + { HID_USB_DEVICE(USB_VENDOR_ID_CHICONY, USB_DEVICE_ID_CHICONY_PIXART_USB_OPTICAL_MOUSE2), HID_QUIRK_ALWAYS_POLL }, { HID_USB_DEVICE(USB_VENDOR_ID_CHICONY, USB_DEVICE_ID_CHICONY_WIRELESS), HID_QUIRK_MULTI_INPUT }, { HID_USB_DEVICE(USB_VENDOR_ID_CHIC, USB_DEVICE_ID_CHIC_GAMEPAD), HID_QUIRK_BADPAD }, { HID_USB_DEVICE(USB_VENDOR_ID_CH, USB_DEVICE_ID_CH_3AXIS_5BUTTON_STICK), HID_QUIRK_NOGET }, -- 2.22.0
[PATCH NOTFORMERGE 2/5] mm: revert madvise_inject_error line split
Just to highlight it after our conversation. Signed-off-by: Oleksandr Natalenko --- mm/madvise.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/mm/madvise.c b/mm/madvise.c index edb7184f665c..70aeb54f3e1c 100644 --- a/mm/madvise.c +++ b/mm/madvise.c @@ -1041,8 +1041,7 @@ static int madvise_common(struct task_struct *task, struct mm_struct *mm, #ifdef CONFIG_MEMORY_FAILURE if (behavior == MADV_HWPOISON || behavior == MADV_SOFT_OFFLINE) - return madvise_inject_error(behavior, - start, start + len_in); + return madvise_inject_error(behavior, start, start + len_in); #endif write = madvise_need_mmap_write(behavior); -- 2.22.0
[PATCH NOTFORMERGE 3/5] mm: include uio.h to madvise.c
I couldn't compile it w/o this header. Signed-off-by: Oleksandr Natalenko --- mm/madvise.c | 1 + 1 file changed, 1 insertion(+) diff --git a/mm/madvise.c b/mm/madvise.c index 70aeb54f3e1c..9755340da157 100644 --- a/mm/madvise.c +++ b/mm/madvise.c @@ -25,6 +25,7 @@ #include #include #include +#include #include -- 2.22.0
[PATCH NOTFORMERGE 1/5] mm: rename madvise_core to madvise_common
"core" usually means something very different within the kernel land, thus lets just follow the way it is handled in mutexes, rw_semaphores etc and name common things as "_common". Signed-off-by: Oleksandr Natalenko --- mm/madvise.c | 14 +++--- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git a/mm/madvise.c b/mm/madvise.c index 94d782097afd..edb7184f665c 100644 --- a/mm/madvise.c +++ b/mm/madvise.c @@ -998,7 +998,7 @@ process_madvise_behavior_valid(int behavior) } /* - * madvise_core - request behavior hint to address range of the target process + * madvise_common - request behavior hint to address range of the target process * * @task: task_struct got behavior hint, not giving the hint * @mm: mm_struct got behavior hint, not giving the hint @@ -1009,7 +1009,7 @@ process_madvise_behavior_valid(int behavior) * @task could be a zombie leader if it calls sys_exit so accessing mm_struct * via task->mm is prohibited. Please use @mm insetad of task->mm. */ -static int madvise_core(struct task_struct *task, struct mm_struct *mm, +static int madvise_common(struct task_struct *task, struct mm_struct *mm, unsigned long start, size_t len_in, int behavior) { unsigned long end, tmp; @@ -1132,7 +1132,7 @@ static int pr_madvise_copy_param(struct pr_madvise_param __user *u_param, return ret; } -static int process_madvise_core(struct task_struct *tsk, struct mm_struct *mm, +static int process_madvise_common(struct task_struct *tsk, struct mm_struct *mm, int *behaviors, struct iov_iter *iter, const struct iovec *range_vec, @@ -1144,7 +1144,7 @@ static int process_madvise_core(struct task_struct *tsk, struct mm_struct *mm, for (i = 0; i < riovcnt && iov_iter_count(iter); i++) { err = -EINVAL; if (process_madvise_behavior_valid(behaviors[i])) - err = madvise_core(tsk, mm, + err = madvise_common(tsk, mm, (unsigned long)range_vec[i].iov_base, range_vec[i].iov_len, behaviors[i]); @@ -1220,7 +1220,7 @@ static int process_madvise_core(struct task_struct *tsk, struct mm_struct *mm, */ SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior) { - return madvise_core(current, current->mm, start, len_in, behavior); + return madvise_common(current, current->mm, start, len_in, behavior); } @@ -1252,7 +1252,7 @@ SYSCALL_DEFINE3(process_madvise, int, pidfd, /* * We don't support cookie to gaurantee address space atomicity yet. -* Once we implment cookie, process_madvise_core need to hold mmap_sme +* Once we implment cookie, process_madvise_common need to hold mmap_sme * during entire operation to guarantee atomicity. */ if (params.cookie != 0) @@ -1316,7 +1316,7 @@ SYSCALL_DEFINE3(process_madvise, int, pidfd, goto release_task; } - ret = process_madvise_core(task, mm, behaviors, &iter, iov_r, nr_elem); + ret = process_madvise_common(task, mm, behaviors, &iter, iov_r, nr_elem); mmput(mm); release_task: put_task_struct(task); -- 2.22.0
[PATCH NOTFORMERGE 5/5] mm/madvise: allow KSM hints for remote API
It all began with the fact that KSM works only on memory that is marked by madvise(). And the only way to get around that is to either: * use LD_PRELOAD; or * patch the kernel with something like UKSM or PKSM. (i skip ptrace can of worms here intentionally) To overcome this restriction, lets employ a new remote madvise API. This can be used by some small userspace helper daemon that will do auto-KSM job for us. I think of two major consumers of remote KSM hints: * hosts, that run containers, especially similar ones and especially in a trusted environment, sharing the same runtime like Node.js; * heavy applications, that can be run in multiple instances, not limited to opensource ones like Firefox, but also those that cannot be modified since they are binary-only and, maybe, statically linked. Speaking of statistics, more numbers can be found in the very first submission, that is related to this one [1]. For my current setup with two Firefox instances I get 100 to 200 MiB saved for the second instance depending on the amount of tabs. 1 FF instance with 15 tabs: $ echo "$(cat /sys/kernel/mm/ksm/pages_sharing) * 4 / 1024" | bc 410 2 FF instances, second one has 12 tabs (all the tabs are different): $ echo "$(cat /sys/kernel/mm/ksm/pages_sharing) * 4 / 1024" | bc 592 At the very moment I do not have specific numbers for containerised workload, but those should be comparable in case the containers share similar/same runtime. [1] https://lore.kernel.org/patchwork/patch/1012142/ Signed-off-by: Oleksandr Natalenko --- mm/madvise.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/mm/madvise.c b/mm/madvise.c index 84f899b1b6da..e8f9c49794a3 100644 --- a/mm/madvise.c +++ b/mm/madvise.c @@ -991,6 +991,8 @@ process_madvise_behavior_valid(int behavior) switch (behavior) { case MADV_COLD: case MADV_PAGEOUT: + case MADV_MERGEABLE: + case MADV_UNMERGEABLE: return true; default: -- 2.22.0
[PATCH NOTFORMERGE 0/5] Extend remote madvise API to KSM hints
Hi, Minchan. This is a set of commits based on our discussion on your submission [1]. First 2 implement minor suggestions just for you to not forget to take them into account. uio.h inclusion was needed for me to be able to compile your series successfully. Also please note I had to enable "Transparent Hugepage Support" as well as "Enable idle page tracking" options, otherwise the build failed. I guess this can be addressed by you better since the errors are introduced with MADV_COLD introduction. Last 2 commits are the actual KSM hints enablement. The first one implements additional check for the case where the mmap_sem is taken for write, and the second one just allows KSM hints to be used by the remote interface. I'm not Cc'ing else anyone except two mailing lists to not distract people unnecessarily. If you are fine with this addition, please use it for your next iteration of process_madvise(), and then you'll Cc all the people needed. Thanks. [1] https://lore.kernel.org/lkml/20190531064313.193437-1-minc...@kernel.org/ Oleksandr Natalenko (5): mm: rename madvise_core to madvise_common mm: revert madvise_inject_error line split mm: include uio.h to madvise.c mm/madvise: employ mmget_still_valid for write lock mm/madvise: allow KSM hints for remote API mm/madvise.c | 23 ++- 1 file changed, 14 insertions(+), 9 deletions(-) -- 2.22.0
[PATCH NOTFORMERGE 4/5] mm/madvise: employ mmget_still_valid for write lock
Do the very same trick as we already do since 04f5866e41fb. KSM hints will require locking mmap_sem for write since they modify vm_flags, so for remote KSM hinting this additional check is needed. Signed-off-by: Oleksandr Natalenko --- mm/madvise.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/mm/madvise.c b/mm/madvise.c index 9755340da157..84f899b1b6da 100644 --- a/mm/madvise.c +++ b/mm/madvise.c @@ -1049,6 +1049,8 @@ static int madvise_common(struct task_struct *task, struct mm_struct *mm, if (write) { if (down_write_killable(&mm->mmap_sem)) return -EINTR; + if (current->mm != mm && !mmget_still_valid(mm)) + goto skip_mm; } else { down_read(&mm->mmap_sem); } @@ -1099,6 +1101,7 @@ static int madvise_common(struct task_struct *task, struct mm_struct *mm, } out: blk_finish_plug(&plug); +skip_mm: if (write) up_write(&mm->mmap_sem); else -- 2.22.0
Re: [PATCH v2 0/5] Introduce MADV_COLD and MADV_PAGEOUT
On Wed, Jun 12, 2019 at 12:59:45PM +0200, Pavel Machek wrote: > > - Problem > > > > Naturally, cached apps were dominant consumers of memory on the system. > > However, they were not significant consumers of swap even though they are > > good candidate for swap. Under investigation, swapping out only begins > > once the low zone watermark is hit and kswapd wakes up, but the overall > > allocation rate in the system might trip lmkd thresholds and cause a cached > > process to be killed(we measured performance swapping out vs. zapping the > > memory by killing a process. Unsurprisingly, zapping is 10x times faster > > even though we use zram which is much faster than real storage) so kill > > from lmkd will often satisfy the high zone watermark, resulting in very > > few pages actually being moved to swap. > > Is it still faster to swap-in the application than to restart it? It's the same type of question I was addressing earlier in the remote KSM discussion: making applications aware of all the memory management stuff or delegate the decision to some supervising task. In this case, we cannot rewrite all the application to handle imaginary SIGRESTART (or whatever you invent to handle restarts gracefully). SIGTERM may require more memory to finish stuff to not lose your data (and I guess you don't want to lose your data, right?), and SIGKILL is pretty much destructive. Offloading proactive memory management to a process that knows how to do it allows to handle not only throwaway containers/microservices, but also usual desktop/mobile workflow. > > This approach is similar in spirit to madvise(MADV_WONTNEED), but the > > information required to make the reclaim decision is not known to the app. > > Instead, it is known to a centralized userspace daemon, and that daemon > > must be able to initiate reclaim on its own without any app involvement. > > To solve the concern, this patch introduces new syscall - > > > > struct pr_madvise_param { > > int size; /* the size of this structure */ > > int cookie; /* reserved to support atomicity */ > > int nr_elem;/* count of below arrary fields */ > > int __user *hints; /* hints for each range */ > > /* to store result of each operation */ > > const struct iovec __user *results; > > /* input address ranges */ > > const struct iovec __user *ranges; > > }; > > > > int process_madvise(int pidfd, struct pr_madvise_param *u_param, > > unsigned long flags); > > That's quite a complex interface. > > Could we simply have feel_free_to_swap_out(int pid) syscall? :-). I wonder for how long we'll go on with adding new syscalls each time we need some amendment to existing interfaces. Yes, clone6(), I'm looking at you :(. In case of process_madvise() keep in mind it will be focused not only on MADV_COLD, but also, potentially, on other MADV_ flags as well. I can hardly imagine we'll add one syscall per each flag. -- Best regards, Oleksandr Natalenko (post-factum) Senior Software Maintenance Engineer
Re: 5.1 kernel: khugepaged stuck at 100%
Hi. On Fri, Jun 07, 2019 at 09:40:52AM +0200, Max Kellermann wrote: > On 2019/06/06 19:24, Max Kellermann wrote: > > I have the same problem (kernel 5.1.7), but over here, it's a PHP > > process, not khugepaged, which is looping inside compaction_alloc. > > This is what happened an hour later: > > kernel tried to execute NX-protected page - exploit attempt? (uid: 3) > BUG: unable to handle kernel paging request at c036f00f > #PF error: [PROT] [INSTR] > PGD 35fa10067 P4D 35fa10067 PUD 35fa12067 PMD 105ba71067 PTE 80022d28e061 > Oops: 0011 [#1] SMP PTI > CPU: 12 PID: 263514 Comm: php-cgi7.0 Not tainted 5.1.7-cmag1-th+ #5 > Hardware name: HP ProLiant DL380 Gen9/ProLiant DL380 Gen9, BIOS P89 > 10/17/2018 > RIP: 0010:0xc036f00f > Code: Bad RIP value. > RSP: 0018:b63c4d547928 EFLAGS: 00010216 > RAX: RBX: b63c4d547b10 RCX: ffc004d021bd > RDX: 9ac83fffc500 RSI: 7fe0026810dee7ff RDI: 7fe0026810dee400 > RBP: 7fe0026810dee400 R08: 0002 R09: 00020300 > R10: 00010642641a0d3a R11: 0001 R12: 7fe0026810dee800 > R13: 0001 R14: R15: 9ac83fffc500 > FS: 7fa5c1000740() GS:9ad01f60() knlGS: > CS: 0010 DS: ES: CR0: 80050033 > CR2: c036efe5 CR3: 0008eb8a0005 CR4: 001606e0 > Call Trace: > ? move_freelist_tail+0xd0/0xd0 > ? migrate_pages+0xaa/0x780 > ? isolate_freepages_block+0x380/0x380 > ? compact_zone+0x6ec/0xca0 > ? compact_zone_order+0xd8/0x120 > ? try_to_compact_pages+0xb1/0x260 > ? __alloc_pages_direct_compact+0x87/0x160 > ? __alloc_pages_slowpath+0x427/0xd50 > ? __alloc_pages_nodemask+0x2d6/0x310 > ? do_huge_pmd_anonymous_page+0x131/0x680 > ? vma_merge+0x24f/0x3a0 > ? __handle_mm_fault+0xbca/0x1260 > ? handle_mm_fault+0x135/0x1b0 > ? __do_page_fault+0x242/0x4b0 > ? page_fault+0x8/0x30 > ? page_fault+0x1e/0x30 > Modules linked in: > CR2: c036f00f > ---[ end trace 0f31edf3041f5d9e ]--- > RIP: 0010:0xc036f00f > Code: Bad RIP value. > RSP: 0018:b63c4d547928 EFLAGS: 00010216 > RAX: RBX: b63c4d547b10 RCX: ffc004d021bd > RDX: 9ac83fffc500 RSI: 7fe0026810dee7ff RDI: 7fe0026810dee400 > RBP: 7fe0026810dee400 R08: 0002 R09: 00020300 > R10: 00010642641a0d3a R11: 0001 R12: 7fe0026810dee800 > R13: 0001 R14: R15: 9ac83fffc500 > FS: 7fa5c1000740() GS:9ad01f60() knlGS: > CS: 0010 DS: ES: CR0: 80050033 > CR2: c036efe5 CR3: 0008eb8a0005 CR4: 0000001606e0 Make sure to check if e577c8b64d ("mm, compaction: make sure we isolate a valid PFN") fixes your issue. It is staged for 5.1.8, BTW. -- Best regards, Oleksandr Natalenko (post-factum) Senior Software Maintenance Engineer
Re: [RFCv2 4/6] mm: factor out madvise's core functionality
Hi. On Sat, Jun 01, 2019 at 08:29:59AM +0900, Minchan Kim wrote: > > > > > /* snip a lot */ > > > > > > > > > > #ifdef CONFIG_MEMORY_FAILURE > > > > > if (behavior == MADV_HWPOISON || behavior == MADV_SOFT_OFFLINE) > > > > > - return madvise_inject_error(behavior, start, start + > > > > > len_in); > > > > > + return madvise_inject_error(behavior, > > > > > + start, start + len_in); > > > > > > > > Not sure what this change is about except changing the line length. > > > > Note, madvise_inject_error() still operates on "current" through > > > > get_user_pages_fast() and gup_pgd_range(), but that was not changed > > > > here. I Know you've filtered out this hint later, so technically this > > > > is not an issue, but, maybe, this needs some attention too since we've > > > > already spotted it? > > > > > > It is leftover I had done. I actually modified it to handle remote > > > task but changed my mind not to fix it because process_madvise > > > will not support it at this moment. I'm not sure it's a good idea > > > to change it for *might-be-done-in-future* at this moment even though > > > we have spotted. > > > > I'd expect to have at least some comments in code on why other hints > > are disabled, so if we already know some shortcomings, this information > > would not be lost. > > Okay, I will add some comment but do not want to fix code piece until > someone want to expose the poisoning to external process. Fair enough. > > > > > write = madvise_need_mmap_write(behavior); > > > > > if (write) { > > > > > - if (down_write_killable(¤t->mm->mmap_sem)) > > > > > + if (down_write_killable(&mm->mmap_sem)) > > > > > return -EINTR; > > > > > > > > Do you still need that trick with mmget_still_valid() here? > > > > Something like: > > > > > > Since MADV_COLD|PAGEOUT doesn't change address space layout or > > > vma->vm_flags, technically, we don't need it if I understand > > > correctly. Right? > > > > I'd expect so, yes. But. > > > > Since we want this interface to be universal and to be able to cover > > various needs, and since my initial intention with working in this > > direction involved KSM, I'd ask you to enable KSM hints too, and once > > (and if) that happens, the work there is done under write lock, and > > you'll need this trick to be applied. > > > > Of course, I can do that myself later in a subsequent patch series once > > (and, again, if) your series is merged, but, maybe, we can cover this > > already especially given the fact that KSM hinting is a relatively easy > > task in this pile. I did some preliminary tests with it, and so far no > > dragons have started to roar. > > Then, do you mind sending a patch based upon this series to expose > MADV_MERGEABLE to process_madvise? It will have the right description > why you want to have such feature which I couldn't provide since I don't > have enough material to write the motivation. And the patch also could > include the logic to prevent coredump race, which is more proper since > finally we need to hold mmap_sem write-side lock, finally. > I will pick it up and will rebase since then. Sure, I can. Would you really like to have it being based on this exact revision, or I should wait till you deal with MADV_COLD & Co and re-iterate this part again? Thanks. -- Best regards, Oleksandr Natalenko (post-factum) Senior Software Maintenance Engineer
Re: [RFCv2 4/6] mm: factor out madvise's core functionality
On Fri, May 31, 2019 at 10:12:26PM +0900, Minchan Kim wrote: > On Fri, May 31, 2019 at 09:04:20AM +0200, Oleksandr Natalenko wrote: > > On Fri, May 31, 2019 at 03:43:11PM +0900, Minchan Kim wrote: > > > This patch factor out madvise's core functionality so that upcoming > > > patch can reuse it without duplication. It shouldn't change any behavior. > > > > > > Signed-off-by: Minchan Kim > > > --- > > > mm/madvise.c | 188 +++ > > > 1 file changed, 101 insertions(+), 87 deletions(-) > > > > > > diff --git a/mm/madvise.c b/mm/madvise.c > > > index 9d749a1420b4..466623ea8c36 100644 > > > --- a/mm/madvise.c > > > +++ b/mm/madvise.c > > > @@ -425,9 +425,10 @@ static int madvise_pageout_pte_range(pmd_t *pmd, > > > unsigned long addr, > > > struct page *page; > > > int isolated = 0; > > > struct vm_area_struct *vma = walk->vma; > > > + struct task_struct *task = walk->private; > > > unsigned long next; > > > > > > - if (fatal_signal_pending(current)) > > > + if (fatal_signal_pending(task)) > > > return -EINTR; > > > > > > next = pmd_addr_end(addr, end); > > > @@ -505,12 +506,14 @@ static int madvise_pageout_pte_range(pmd_t *pmd, > > > unsigned long addr, > > > } > > > > > > static void madvise_pageout_page_range(struct mmu_gather *tlb, > > > - struct vm_area_struct *vma, > > > - unsigned long addr, unsigned long end) > > > + struct task_struct *task, > > > + struct vm_area_struct *vma, > > > + unsigned long addr, unsigned long end) > > > { > > > struct mm_walk warm_walk = { > > > .pmd_entry = madvise_pageout_pte_range, > > > .mm = vma->vm_mm, > > > + .private = task, > > > }; > > > > > > tlb_start_vma(tlb, vma); > > > @@ -519,9 +522,9 @@ static void madvise_pageout_page_range(struct > > > mmu_gather *tlb, > > > } > > > > > > > > > -static long madvise_pageout(struct vm_area_struct *vma, > > > - struct vm_area_struct **prev, > > > - unsigned long start_addr, unsigned long end_addr) > > > +static long madvise_pageout(struct task_struct *task, > > > + struct vm_area_struct *vma, struct vm_area_struct **prev, > > > + unsigned long start_addr, unsigned long end_addr) > > > { > > > struct mm_struct *mm = vma->vm_mm; > > > struct mmu_gather tlb; > > > @@ -532,7 +535,7 @@ static long madvise_pageout(struct vm_area_struct > > > *vma, > > > > > > lru_add_drain(); > > > tlb_gather_mmu(&tlb, mm, start_addr, end_addr); > > > - madvise_pageout_page_range(&tlb, vma, start_addr, end_addr); > > > + madvise_pageout_page_range(&tlb, task, vma, start_addr, end_addr); > > > tlb_finish_mmu(&tlb, start_addr, end_addr); > > > > > > return 0; > > > @@ -744,7 +747,8 @@ static long madvise_dontneed_single_vma(struct > > > vm_area_struct *vma, > > > return 0; > > > } > > > > > > -static long madvise_dontneed_free(struct vm_area_struct *vma, > > > +static long madvise_dontneed_free(struct mm_struct *mm, > > > + struct vm_area_struct *vma, > > > struct vm_area_struct **prev, > > > unsigned long start, unsigned long end, > > > int behavior) > > > @@ -756,8 +760,8 @@ static long madvise_dontneed_free(struct > > > vm_area_struct *vma, > > > if (!userfaultfd_remove(vma, start, end)) { > > > *prev = NULL; /* mmap_sem has been dropped, prev is stale */ > > > > > > - down_read(¤t->mm->mmap_sem); > > > - vma = find_vma(current->mm, start); > > > + down_read(&mm->mmap_sem); > > > + vma = find_vma(mm, start); > > > if (!vma) > > > return -ENOMEM; > > > if (start < vma->vm_start) { > > > @@ -804,7 +808,8 @@ static long madvise_dontneed_free(struct > > > vm_area_struct *vma, > > > * Application wants to free up the pages and associated backing store. >
Re: [RFCv2 4/6] mm: factor out madvise's core functionality
> + * MADV_NORMAL - the default behavior is to read clusters. This > + * results in some read-ahead and read-behind. > + * MADV_RANDOM - the system should read the minimum amount of data > + * on any access, since it is unlikely that the appli- > + * cation will need more than what it asks for. > + * MADV_SEQUENTIAL - pages in the given range will probably be accessed > + * once, so they can be aggressively read ahead, and > + * can be freed soon after they are accessed. > + * MADV_WILLNEED - the application is notifying the system to read > + * some pages ahead. > + * MADV_DONTNEED - the application is finished with the given range, > + * so the kernel can free resources associated with it. > + * MADV_FREE - the application marks pages in the given range as lazy free, > + * where actual purges are postponed until memory pressure happens. > + * MADV_REMOVE - the application wants to free up the given range of > + * pages and associated backing store. > + * MADV_DONTFORK - omit this area from child's address space when forking: > + * typically, to avoid COWing pages pinned by get_user_pages(). > + * MADV_DOFORK - cancel MADV_DONTFORK: no longer omit this area when > forking. > + * MADV_WIPEONFORK - present the child process with zero-filled memory in > this > + * range after a fork. > + * MADV_KEEPONFORK - undo the effect of MADV_WIPEONFORK > + * MADV_HWPOISON - trigger memory error handler as if the given memory range > + * were corrupted by unrecoverable hardware memory failure. > + * MADV_SOFT_OFFLINE - try to soft-offline the given range of memory. > + * MADV_MERGEABLE - the application recommends that KSM try to merge pages > in > + * this area with pages of identical content from other such areas. > + * MADV_UNMERGEABLE- cancel MADV_MERGEABLE: no longer merge pages with > others. > + * MADV_HUGEPAGE - the application wants to back the given range by > transparent > + * huge pages in the future. Existing pages might be coalesced and > + * new pages might be allocated as THP. > + * MADV_NOHUGEPAGE - mark the given range as not worth being backed by > + * transparent huge pages so the existing pages will not be > + * coalesced into THP and new pages will not be allocated as THP. > + * MADV_DONTDUMP - the application wants to prevent pages in the given range > + * from being included in its core dump. > + * MADV_DODUMP - cancel MADV_DONTDUMP: no longer exclude from core dump. > + * > + * return values: > + * zero- success > + * -EINVAL - start + len < 0, start is not page-aligned, > + * "behavior" is not a valid value, or application > + * is attempting to release locked or shared pages, > + * or the specified address range includes file, Huge TLB, > + * MAP_SHARED or VMPFNMAP range. > + * -ENOMEM - addresses in the specified range are not currently > + * mapped, or are outside the AS of the process. > + * -EIO- an I/O error occurred while paging in data. > + * -EBADF - map exists, but area maps something that isn't a file. > + * -EAGAIN - a kernel resource was temporarily unavailable. > + */ > +SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior) > +{ > + return madvise_core(current, current->mm, start, len_in, behavior); > +} > -- > 2.22.0.rc1.257.g3120a18244-goog > -- Best regards, Oleksandr Natalenko (post-factum) Senior Software Maintenance Engineer
Re: 5.1 and 5.1.1: BUG: unable to handle kernel paging request at ffffea0002030000
On Fri, May 24, 2019 at 01:31:46PM +0100, Mel Gorman wrote: > On Fri, May 24, 2019 at 01:43:29PM +0200, Oleksandr Natalenko wrote: > > > Please use the offset 0xb9 > > > > > > addr2line -i -e /usr/src/linux/vmlinux `printf "0x%lX" $((0x$ADDR+0xb9)) > > > > > > -- > > > Mel Gorman > > > SUSE Labs > > > > Cc'ing myself since i observe such a behaviour sometimes right after KVM > > VM is launched. No luck with reproducing it on demand so far, though. > > > > It's worth testing the patch at > https://lore.kernel.org/lkml/1558689619-16891-1-git-send-email-suzuki.poul...@arm.com/T/#u It is reported [1] that this fixes the issue. Thanks. [1] https://github.com/NixOS/nixpkgs/issues/61909 > > -- > Mel Gorman > SUSE Labs -- Best regards, Oleksandr Natalenko (post-factum) Senior Software Maintenance Engineer
Re: 5.1 and 5.1.1: BUG: unable to handle kernel paging request at ffffea0002030000
On Fri, May 24, 2019 at 01:31:46PM +0100, Mel Gorman wrote: > On Fri, May 24, 2019 at 01:43:29PM +0200, Oleksandr Natalenko wrote: > > > Please use the offset 0xb9 > > > > > > addr2line -i -e /usr/src/linux/vmlinux `printf "0x%lX" $((0x$ADDR+0xb9)) > > > > > > -- > > > Mel Gorman > > > SUSE Labs > > > > Cc'ing myself since i observe such a behaviour sometimes right after KVM > > VM is launched. No luck with reproducing it on demand so far, though. > > > > It's worth testing the patch at > https://lore.kernel.org/lkml/1558689619-16891-1-git-send-email-suzuki.poul...@arm.com/T/#u On my TODO list already, thanks. > > -- > Mel Gorman > SUSE Labs -- Best regards, Oleksandr Natalenko (post-factum) Senior Software Maintenance Engineer
Re: 5.1 and 5.1.1: BUG: unable to handle kernel paging request at ffffea0002030000
On Tue, May 21, 2019 at 01:43:10PM +0100, Mel Gorman wrote: > On Tue, May 21, 2019 at 05:01:06AM -0400, Justin Piszcz wrote: > > On Mon, May 20, 2019 at 7:56 AM Mel Gorman > > wrote: > > > > > > On Sun, May 12, 2019 at 04:27:45AM -0400, Justin Piszcz wrote: > > > > Hello, > > > > > > > > I've turned off zram/zswap and I am still seeing the following during > > > > periods of heavy I/O, I am returning to 5.0.xx in the meantime. > > > > > > > > Kernel: 5.1.1 > > > > Arch: x86_64 > > > > Dist: Debian x86_64 > > > > > > > > [29967.019411] BUG: unable to handle kernel paging request at > > > > ea000203 > > > > [29967.019414] #PF error: [normal kernel read fault] > > > > [29967.019415] PGD 103ffee067 P4D 103ffee067 PUD 103ffed067 PMD 0 > > > > [29967.019417] Oops: [#1] SMP PTI > > > > [29967.019419] CPU: 10 PID: 77 Comm: khugepaged Tainted: G > > > >T 5.1.1 #4 > > > > [29967.019420] Hardware name: Supermicro X9SRL-F/X9SRL-F, BIOS 3.2 > > > > 01/16/2015 > > > > [29967.019424] RIP: 0010:isolate_freepages_block+0xb9/0x310 > > > > [29967.019425] Code: 24 28 48 c1 e0 06 40 f6 c5 1f 48 89 44 24 20 49 > > > > 8d 45 79 48 89 44 24 18 44 89 f0 4d 89 ee 45 89 fd 41 89 c7 0f 84 ef > > > > 00 00 00 <48> 8b 03 41 83 c4 01 a9 00 00 01 00 75 0c 48 8b 43 08 a8 01 > > > > 0f 84 > > > > > > If you have debugging symbols installed, can you translate the faulting > > > address with the following? > > > > > > ADDR=`nm /path/to/vmlinux-or-debuginfo-file | grep "t > > > isolate_freepages_block\$" | awk '{print $1}'` > > > addr2line -i -e vmlinux `printf "0x%lX" $((0x$ADDR+0xb9))` > > > > Another event this morning, this occurred when copying a single ~25GB > > backup file from one block device device (3ware HW RAID) to a SW > > RAID-1 (mdadm): > > > > With this event, it was a fault and khugepaged is not stuck at 100% > > but this may be related as the stack trace is similar where > > compaction_alloc is utilizing most of the CPU: > > https://lkml.org/lkml/2019/5/9/225 > > > > # ADDR=`nm /usr/src/linux/vmlinux | grep "t isolate_freepages_block\$" > > | awk '{print $1}'` > > # echo $ADDR > > 812274f0 > > # addr2line -i -e /usr/src/linux/vmlinux `printf "0x%lX" $((0x$ADDR+0x83d))` > > compaction.c:? > > # addr2line -i -e /usr/src/linux/vmlinux `printf "0x%lX" $((0x$ADDR+0x8d0))` > > compaction.c:? > > > > Please use the offset 0xb9 > > addr2line -i -e /usr/src/linux/vmlinux `printf "0x%lX" $((0x$ADDR+0xb9)) > > -- > Mel Gorman > SUSE Labs Cc'ing myself since i observe such a behaviour sometimes right after KVM VM is launched. No luck with reproducing it on demand so far, though. -- Best regards, Oleksandr Natalenko (post-factum) Senior Software Maintenance Engineer
Re: 5.1 kernel: khugepaged stuck at 100%
On Thu, May 16, 2019 at 10:27:50AM -0400, Justin Piszcz wrote: > On Thu, May 16, 2019 at 10:17 AM Justin Piszcz > wrote: > > > > On Thu, May 16, 2019 at 10:14 AM Justin Piszcz > > wrote: > > > > > > On Tue, May 14, 2019 at 9:16 AM Kirill A. Shutemov > > > wrote: > > > > > > > Could you check what khugepaged doing? > > > > > > > > cat /proc/$(pidof khugepaged)/stack > > > > > > It is doing it again, 10:12am - 2019-05-16 > > > > > > PID USER PR NIVIRTRESSHR S %CPU %MEM TIME+ > > > COMMAND > > >77 root 39 19 0 0 0 R 100.0 0.0 92:06.94 > > > khugepaged > > > > > > Kernel: 5.1.2 > > > > > > $ sudo cat /proc/$(pidof khugepaged)/stack > > > [<0>] 0x > > > > > As a workaround/in the meantime--I will add the following to lilo/grub: > transparent_hugepage=never > > Justin Cc'ing myself since i observe such a behaviour sometimes right after KVM VM is launched. No luck with reproducing it on demand so far, though. -- Best regards, Oleksandr Natalenko (post-factum) Senior Software Maintenance Engineer
Re: [RFC 0/7] introduce memory hinting API for external process
vated pages and the > >other is > >> > > > MADV_COLD which will reclaim private pages instantly. These new > >options > >> > > > complement MADV_DONTNEED and MADV_FREE by adding > >non-destructive ways to > >> > > > gain some free memory space. MADV_COLD is similar to > >MADV_DONTNEED in a way > >> > > > that it hints the kernel that memory region is not currently > >needed and > >> > > > should be reclaimed immediately; MADV_COOL is similar to > >MADV_FREE in a way > >> > > > that it hints the kernel that memory region is not currently > >needed and > >> > > > should be reclaimed when memory pressure rises. > >> > > > > >> > > > This approach is similar in spirit to madvise(MADV_WONTNEED), > >but the > >> > > > information required to make the reclaim decision is not known > >to the app. > >> > > > Instead, it is known to a centralized userspace daemon, and > >that daemon > >> > > > must be able to initiate reclaim on its own without any app > >involvement. > >> > > > To solve the concern, this patch introduces new syscall - > >> > > > > >> > > > struct pr_madvise_param { > >> > > > int size; > >> > > > const struct iovec *vec; > >> > > > } > >> > > > > >> > > > int process_madvise(int pidfd, ssize_t nr_elem, int *behavior, > >> > > > struct pr_madvise_param *restuls, > >> > > > struct pr_madvise_param *ranges, > >> > > > unsigned long flags); > >> > > > > >> > > > The syscall get pidfd to give hints to external process and > >provides > >> > > > pair of result/ranges vector arguments so that it could give > >several > >> > > > hints to each address range all at once. > >> > > > > >> > > > I guess others have different ideas about the naming of syscall > >and options > >> > > > so feel free to suggest better naming. > >> > > > >> > > Yes, all new syscalls making use of pidfds should be named > >> > > pidfd_. So please make this pidfd_madvise. > >> > > >> > I don't have any particular preference but just wondering why pidfd > >is > >> > so special to have it as prefix of system call name. > >> > >> It's a whole new API to address processes. We already have > >> clone(CLONE_PIDFD) and pidfd_send_signal() as you have seen since you > >> exported pidfd_to_pid(). And we're going to have pidfd_open(). Your > >> syscall works only with pidfds so it's tied to this api as well so it > >> should follow the naming scheme. This also makes life easier for > >> userspace and is consistent. > > > >Okay. I will change the API name at next revision. > >Thanks. > > Thanks! > Fwiw, there's been a similar patch by Oleksandr for pidfd_madvise I stumbled > upon a few days back: > https://gitlab.com/post-factum/pf-kernel/commit/0595f874a53fa898739ac315ddf208554d9dc897 > > He wanted to be cc'ed but I forgot. Thanks :). FWIW, since this submission is essentially a continuation of our discussion involving my earlier KSM submissions here, I won't move my gitlab branch forward and will be happy to assist with what we have here, be it pidfd_madvise() or a set or /proc files (or smth else). > > Christian > -- Best regards, Oleksandr Natalenko (post-factum) Senior Software Maintenance Engineer
Re: [RFC 4/7] mm: factor out madvise's core functionality
Hi. On Tue, May 21, 2019 at 08:50:00AM +0200, Michal Hocko wrote: > On Tue 21-05-19 08:36:28, Oleksandr Natalenko wrote: > [...] > > Regarding restricting the hints, I'm definitely interested in having > > remote MADV_MERGEABLE/MADV_UNMERGEABLE. But, OTOH, doing it via remote > > madvise() introduces another issue with traversing remote VMAs reliably. > > IIUC, one can do this via userspace by parsing [s]maps file only, which > > is not very consistent, and once some range is parsed, and then it is > > immediately gone, a wrong hint will be sent. > > > > Isn't this a problem we should worry about? > > See http://lkml.kernel.org/r/20190520091829.gy6...@dhcp22.suse.cz Oh, thanks for the pointer. Indeed, for my specific task with remote KSM I'd go with map_files instead. This doesn't solve the task completely in case of traversal through all the VMAs in one pass, but makes it easier comparing to a remote syscall. -- Best regards, Oleksandr Natalenko (post-factum) Senior Software Maintenance Engineer
Re: [PATCH v3] vt: Fix a missing-check bug in drivers/tty/vt/vt.c
Cc'ing Grzegorz. On Tue, May 21, 2019 at 11:21:18AM +0800, Gen Zhang wrote: > On Mon, May 20, 2019 at 10:55:40PM -0400, Nicolas Pitre wrote: > > As soon as you release the lock, another thread could come along and > > start using the memory pointed by vc_cons[currcons].d you're about to > > free here. This is unlikely for an initcall, but still. > > > > You should consider this ordering instead: > > > > err_vc_screenbuf: > > kfree(vc); > > vc_cons[currcons].d = NULL; > > err_vc: > > console_unlock(); > > return -ENOMEM; > In function con_init(), the pointer variable vc_cons[currcons].d, vc and > vc->vc_screenbuf is allocated a memory space via kzalloc(). And they are > used in the following codes. > However, when there is a memory allocation error, kzalloc() can fail. > Thus null pointer (vc_cons[currcons].d, vc and vc->vc_screenbuf) > dereference may happen. And it will cause the kernel to crash. Therefore, > we should check return value and handle the error. > Further,the loop condition MIN_NR_CONSOLES is defined as 1 in > include/uapi/linux/vt.h and it is not changed. So there is no need to > unwind the loop. > > Signed-off-by: Gen Zhang > > --- > diff --git a/drivers/tty/vt/vt.c b/drivers/tty/vt/vt.c > index fdd12f8..ea47eb3 100644 > --- a/drivers/tty/vt/vt.c > +++ b/drivers/tty/vt/vt.c > @@ -3350,10 +3350,14 @@ static int __init con_init(void) > > for (currcons = 0; currcons < MIN_NR_CONSOLES; currcons++) { > vc_cons[currcons].d = vc = kzalloc(sizeof(struct vc_data), > GFP_NOWAIT); > + if (!vc) > + goto err_vc; > INIT_WORK(&vc_cons[currcons].SAK_work, vc_SAK); > tty_port_init(&vc->port); > visual_init(vc, currcons, 1); > vc->vc_screenbuf = kzalloc(vc->vc_screenbuf_size, GFP_NOWAIT); > + if (!vc->vc_screenbuf) > + goto err_vc_screenbuf; > vc_init(vc, vc->vc_rows, vc->vc_cols, > currcons || !vc->vc_sw->con_save_screen); > } > @@ -3375,6 +3379,13 @@ static int __init con_init(void) > register_console(&vt_console_driver); > #endif > return 0; > +err_vc_screenbuf: > + kfree(vc); > + vc_cons[currcons].d = NULL; > +err_vc: > + console_unlock(); > + return -ENOMEM; > + > } > console_initcall(con_init); > --- -- Best regards, Oleksandr Natalenko (post-factum) Senior Software Maintenance Engineer
Re: [RFC 4/7] mm: factor out madvise's core functionality
Hi. On Tue, May 21, 2019 at 10:26:49AM +0900, Minchan Kim wrote: > On Mon, May 20, 2019 at 04:26:33PM +0200, Oleksandr Natalenko wrote: > > Hi. > > > > On Mon, May 20, 2019 at 12:52:51PM +0900, Minchan Kim wrote: > > > This patch factor out madvise's core functionality so that upcoming > > > patch can reuse it without duplication. > > > > > > It shouldn't change any behavior. > > > > > > Signed-off-by: Minchan Kim > > > --- > > > mm/madvise.c | 168 +++ > > > 1 file changed, 89 insertions(+), 79 deletions(-) > > > > > > diff --git a/mm/madvise.c b/mm/madvise.c > > > index 9a6698b56845..119e82e1f065 100644 > > > --- a/mm/madvise.c > > > +++ b/mm/madvise.c > > > @@ -742,7 +742,8 @@ static long madvise_dontneed_single_vma(struct > > > vm_area_struct *vma, > > > return 0; > > > } > > > > > > -static long madvise_dontneed_free(struct vm_area_struct *vma, > > > +static long madvise_dontneed_free(struct task_struct *tsk, > > > + struct vm_area_struct *vma, > > > struct vm_area_struct **prev, > > > unsigned long start, unsigned long end, > > > int behavior) > > > @@ -754,8 +755,8 @@ static long madvise_dontneed_free(struct > > > vm_area_struct *vma, > > > if (!userfaultfd_remove(vma, start, end)) { > > > *prev = NULL; /* mmap_sem has been dropped, prev is stale */ > > > > > > - down_read(¤t->mm->mmap_sem); > > > - vma = find_vma(current->mm, start); > > > + down_read(&tsk->mm->mmap_sem); > > > + vma = find_vma(tsk->mm, start); > > > if (!vma) > > > return -ENOMEM; > > > if (start < vma->vm_start) { > > > @@ -802,7 +803,8 @@ static long madvise_dontneed_free(struct > > > vm_area_struct *vma, > > > * Application wants to free up the pages and associated backing store. > > > * This is effectively punching a hole into the middle of a file. > > > */ > > > -static long madvise_remove(struct vm_area_struct *vma, > > > +static long madvise_remove(struct task_struct *tsk, > > > + struct vm_area_struct *vma, > > > struct vm_area_struct **prev, > > > unsigned long start, unsigned long end) > > > { > > > @@ -836,13 +838,13 @@ static long madvise_remove(struct vm_area_struct > > > *vma, > > > get_file(f); > > > if (userfaultfd_remove(vma, start, end)) { > > > /* mmap_sem was not released by userfaultfd_remove() */ > > > - up_read(¤t->mm->mmap_sem); > > > + up_read(&tsk->mm->mmap_sem); > > > } > > > error = vfs_fallocate(f, > > > FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, > > > offset, end - start); > > > fput(f); > > > - down_read(¤t->mm->mmap_sem); > > > + down_read(&tsk->mm->mmap_sem); > > > return error; > > > } > > > > > > @@ -916,12 +918,13 @@ static int madvise_inject_error(int behavior, > > > #endif > > > > What about madvise_inject_error() and get_user_pages_fast() in it > > please? > > Good point. Maybe, there more places where assume context is "current" so > I'm thinking to limit hints we could allow from external process. > It would be better for maintainance point of view in that we could know > the workload/usecases when someone ask new advises from external process > without making every hints works both contexts. Well, for madvise_inject_error() we still have a remote variant of get_user_pages(), and that should work, no? Regarding restricting the hints, I'm definitely interested in having remote MADV_MERGEABLE/MADV_UNMERGEABLE. But, OTOH, doing it via remote madvise() introduces another issue with traversing remote VMAs reliably. IIUC, one can do this via userspace by parsing [s]maps file only, which is not very consistent, and once some range is parsed, and then it is immediately gone, a wrong hint will be sent. Isn't this a problem we should worry about? > > Thanks. -- Best regards, Oleksandr Natalenko (post-factum) Senior Software Maintenance Engineer
Re: [PATCH] vt/fbcon: deinitialize resources in visual_init() after failed memory allocation
Hi. On Fri, Apr 26, 2019 at 04:43:57PM +0200, Grzegorz Halat wrote: > After memory allocation failure vc_allocate() doesn't clean up data > which has been initialized in visual_init(). In case of fbcon this > leads to divide-by-0 in fbcon_init() on next open of the same tty. > > memory allocation in vc_allocate() may fail here: > 1097: vc->vc_screenbuf = kzalloc(vc->vc_screenbuf_size, GFP_KERNEL); > > on next open() fbcon_init() skips vc_font.data initialization: > 1088: if (!p->fontdata) { > > division by zero in fbcon_init() happens here: > 1149: new_cols /= vc->vc_font.width; > > Additional check is needed in fbcon_deinit() to prevent > usage of uninitialized vc_screenbuf: > > 1251:if (vc->vc_hi_font_mask && vc->vc_screenbuf) > 1252:set_vc_hi_font(vc, false); > > Crash: > > #6 [c90001eafa60] divide_error at 81a00be4 > [exception RIP: fbcon_init+463] > RIP: 814b860f RSP: c90001eafb18 RFLAGS: 00010246 > ... > #7 [c90001eafb60] visual_init at 8154c36e > #8 [c90001eafb80] vc_allocate at 8154f53c > #9 [c90001eafbc8] con_install at 8154f624 > ... > > Signed-off-by: Grzegorz Halat > --- > drivers/tty/vt/vt.c | 11 +-- > drivers/video/fbdev/core/fbcon.c | 2 +- > 2 files changed, 10 insertions(+), 3 deletions(-) > > diff --git a/drivers/tty/vt/vt.c b/drivers/tty/vt/vt.c > index 650c66886c80..ec85d195678f 100644 > --- a/drivers/tty/vt/vt.c > +++ b/drivers/tty/vt/vt.c > @@ -1056,6 +1056,13 @@ static void visual_init(struct vc_data *vc, int num, > int init) > vc->vc_screenbuf_size = vc->vc_rows * vc->vc_size_row; > } > > + > +static void visual_deinit(struct vc_data *vc) > +{ > + vc->vc_sw->con_deinit(vc); > + module_put(vc->vc_sw->owner); > +} > + > int vc_allocate(unsigned int currcons) /* return 0 on success */ > { > struct vt_notifier_param param; > @@ -1103,6 +1110,7 @@ int vc_allocate(unsigned int currcons) /* return 0 on > success */ > > return 0; > err_free: > + visual_deinit(vc); > kfree(vc); > vc_cons[currcons].d = NULL; > return -ENOMEM; > @@ -1331,9 +1339,8 @@ struct vc_data *vc_deallocate(unsigned int currcons) > param.vc = vc = vc_cons[currcons].d; > atomic_notifier_call_chain(&vt_notifier_list, VT_DEALLOCATE, > ¶m); > vcs_remove_sysfs(currcons); > - vc->vc_sw->con_deinit(vc); > + visual_deinit(vc); > put_pid(vc->vt_pid); > - module_put(vc->vc_sw->owner); > vc_uniscr_set(vc, NULL); > kfree(vc->vc_screenbuf); > vc_cons[currcons].d = NULL; > diff --git a/drivers/video/fbdev/core/fbcon.c > b/drivers/video/fbdev/core/fbcon.c > index cd059a801662..c59b23f6e9ba 100644 > --- a/drivers/video/fbdev/core/fbcon.c > +++ b/drivers/video/fbdev/core/fbcon.c > @@ -1248,7 +1248,7 @@ static void fbcon_deinit(struct vc_data *vc) > if (free_font) > vc->vc_font.data = NULL; > > - if (vc->vc_hi_font_mask) > + if (vc->vc_hi_font_mask && vc->vc_screenbuf) > set_vc_hi_font(vc, false); > > if (!con_is_bound(&fb_con)) > -- > 2.20.1 > LGTM. Reviewed-by: Oleksandr Natalenko -- Best regards, Oleksandr Natalenko (post-factum) Senior Software Maintenance Engineer
Re: [RFC 0/7] introduce memory hinting API for external process
ization steps. > > This patchset is against on next-20190517. > > Minchan Kim (7): > mm: introduce MADV_COOL > mm: change PAGEREF_RECLAIM_CLEAN with PAGE_REFRECLAIM > mm: introduce MADV_COLD > mm: factor out madvise's core functionality > mm: introduce external memory hinting API > mm: extend process_madvise syscall to support vector arrary > mm: madvise support MADV_ANONYMOUS_FILTER and MADV_FILE_FILTER > > arch/x86/entry/syscalls/syscall_32.tbl | 1 + > arch/x86/entry/syscalls/syscall_64.tbl | 1 + > include/linux/page-flags.h | 1 + > include/linux/page_idle.h | 15 + > include/linux/proc_fs.h| 1 + > include/linux/swap.h | 2 + > include/linux/syscalls.h | 2 + > include/uapi/asm-generic/mman-common.h | 12 + > include/uapi/asm-generic/unistd.h | 2 + > kernel/signal.c| 2 +- > kernel/sys_ni.c| 1 + > mm/madvise.c | 600 + > mm/swap.c | 43 ++ > mm/vmscan.c| 80 +++- > 14 files changed, 680 insertions(+), 83 deletions(-) > > -- > 2.21.0.1020.gf2820cf01a-goog > Please Cc me for the next iteration since I was working on the very same thing recently [1]. Thank you. [1] https://gitlab.com/post-factum/pf-kernel/commits/remote-madvise-v3 -- Best regards, Oleksandr Natalenko (post-factum) Senior Software Maintenance Engineer
Re: [RFC 4/7] mm: factor out madvise's core functionality
end|vma->vm_end). */ > - error = madvise_vma(vma, &prev, start, tmp, behavior); > + error = madvise_vma(tsk, vma, &prev, start, tmp, behavior); > if (error) > goto out; > start = tmp; > @@ -1119,14 +1063,80 @@ SYSCALL_DEFINE3(madvise, unsigned long, start, > size_t, len_in, int, behavior) > if (prev) > vma = prev->vm_next; > else/* madvise_remove dropped mmap_sem */ > - vma = find_vma(current->mm, start); > + vma = find_vma(tsk->mm, start); > } > out: > blk_finish_plug(&plug); > if (write) > - up_write(¤t->mm->mmap_sem); > + up_write(&tsk->mm->mmap_sem); > else > - up_read(¤t->mm->mmap_sem); > + up_read(&tsk->mm->mmap_sem); > > return error; > } > + > +/* > + * The madvise(2) system call. > + * > + * Applications can use madvise() to advise the kernel how it should > + * handle paging I/O in this VM area. The idea is to help the kernel > + * use appropriate read-ahead and caching techniques. The information > + * provided is advisory only, and can be safely disregarded by the > + * kernel without affecting the correct operation of the application. > + * > + * behavior values: > + * MADV_NORMAL - the default behavior is to read clusters. This > + * results in some read-ahead and read-behind. > + * MADV_RANDOM - the system should read the minimum amount of data > + * on any access, since it is unlikely that the appli- > + * cation will need more than what it asks for. > + * MADV_SEQUENTIAL - pages in the given range will probably be accessed > + * once, so they can be aggressively read ahead, and > + * can be freed soon after they are accessed. > + * MADV_WILLNEED - the application is notifying the system to read > + * some pages ahead. > + * MADV_DONTNEED - the application is finished with the given range, > + * so the kernel can free resources associated with it. > + * MADV_FREE - the application marks pages in the given range as lazy free, > + * where actual purges are postponed until memory pressure happens. > + * MADV_REMOVE - the application wants to free up the given range of > + * pages and associated backing store. > + * MADV_DONTFORK - omit this area from child's address space when forking: > + * typically, to avoid COWing pages pinned by get_user_pages(). > + * MADV_DOFORK - cancel MADV_DONTFORK: no longer omit this area when > forking. > + * MADV_WIPEONFORK - present the child process with zero-filled memory in > this > + * range after a fork. > + * MADV_KEEPONFORK - undo the effect of MADV_WIPEONFORK > + * MADV_HWPOISON - trigger memory error handler as if the given memory range > + * were corrupted by unrecoverable hardware memory failure. > + * MADV_SOFT_OFFLINE - try to soft-offline the given range of memory. > + * MADV_MERGEABLE - the application recommends that KSM try to merge pages > in > + * this area with pages of identical content from other such areas. > + * MADV_UNMERGEABLE- cancel MADV_MERGEABLE: no longer merge pages with > others. > + * MADV_HUGEPAGE - the application wants to back the given range by > transparent > + * huge pages in the future. Existing pages might be coalesced and > + * new pages might be allocated as THP. > + * MADV_NOHUGEPAGE - mark the given range as not worth being backed by > + * transparent huge pages so the existing pages will not be > + * coalesced into THP and new pages will not be allocated as THP. > + * MADV_DONTDUMP - the application wants to prevent pages in the given range > + * from being included in its core dump. > + * MADV_DODUMP - cancel MADV_DONTDUMP: no longer exclude from core dump. > + * > + * return values: > + * zero- success > + * -EINVAL - start + len < 0, start is not page-aligned, > + * "behavior" is not a valid value, or application > + * is attempting to release locked or shared pages, > + * or the specified address range includes file, Huge TLB, > + * MAP_SHARED or VMPFNMAP range. > + * -ENOMEM - addresses in the specified range are not currently > + * mapped, or are outside the AS of the process. > + * -EIO- an I/O error occurred while paging in data. > + * -EBADF - map exists, but area maps something that isn't a file. > + * -EAGAIN - a kernel resource was temporarily unavailable. > + */ > +SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior) > +{ > + return madvise_core(current, start, len_in, behavior); > +} > -- > 2.21.0.1020.gf2820cf01a-goog > -- Best regards, Oleksandr Natalenko (post-factum) Senior Software Maintenance Engineer
Re: [PATCH RFC 4/5] mm/ksm, proc: introduce remote merge
On Thu, May 16, 2019 at 04:20:13PM +0200, Oleksandr Natalenko wrote: > > [...] > > > @@ -2960,15 +2962,63 @@ static int proc_stack_depth(struct seq_file *m, > > > struct pid_namespace *ns, > > > static ssize_t madvise_write(struct file *file, const char __user *buf, > > > size_t count, loff_t *ppos) > > > { > > > + /* For now, only KSM hints are implemented */ > > > +#ifdef CONFIG_KSM > > > + char buffer[PROC_NUMBUF]; > > > + int behaviour; > > > struct task_struct *task; > > > + struct mm_struct *mm; > > > + int err = 0; > > > + struct vm_area_struct *vma; > > > + > > > + memset(buffer, 0, sizeof(buffer)); > > > + if (count > sizeof(buffer) - 1) > > > + count = sizeof(buffer) - 1; > > > + if (copy_from_user(buffer, buf, count)) > > > + return -EFAULT; > > > + > > > + if (!memcmp("merge", buffer, min(sizeof("merge")-1, count))) > > > > This means that you also match on something like "mergeblah". Just use > > strcmp(). > > I agree. Just to make it more interesting I must say that > >/sys/kernel/mm/transparent_hugepage/enabled > > uses memcmp in the very same way, and thus echoing "always" or > "madvis" works perfectly there, and it was like that from the very > beginning, it seems. Should we fix it, or it became (zomg) a public API? Actually, maybe, the reason for using memcmp is to handle "echo" properly: by default it puts a newline character at the end, so if we use just strcmp, echo should be called with -n, otherwise strcmp won't match the string. Huh? > [...] -- Best regards, Oleksandr Natalenko (post-factum) Senior Software Maintenance Engineer
Re: [PATCH RFC 4/5] mm/ksm, proc: introduce remote merge
Hi. On Thu, May 16, 2019 at 12:00:24PM +0200, Jann Horn wrote: > On Thu, May 16, 2019 at 11:43 AM Oleksandr Natalenko > wrote: > > Use previously introduced remote madvise knob to mark task's > > anonymous memory as mergeable. > > > > To force merging task's VMAs, "merge" hint is used: > > > ># echo merge > /proc//madvise > > > > Force unmerging is done similarly: > > > ># echo unmerge > /proc//madvise > > > > To achieve this, previously introduced ksm_madvise_*() helpers > > are used. > > Why does this not require PTRACE_MODE_ATTACH_FSCREDS to the target > process? Enabling KSM on another process is hazardous because it > significantly increases the attack surface for side channels. > > (Note that if you change this to require PTRACE_MODE_ATTACH_FSCREDS, > you'll want to use mm_access() in the ->open handler and drop the mm > in ->release. mm_access() from a ->write handler is not permitted.) Sounds reasonable. So, something similar to what mem_open() & friends do now: static int madvise_open(...) ... struct task_struct *task = get_proc_task(inode); ... if (task) { mm = mm_access(task, PTRACE_MODE_ATTACH_FSCREDS); put_task_struct(task); if (!IS_ERR_OR_NULL(mm)) { mmgrab(mm); mmput(mm); ... Then: static ssize_t madvise_write(...) ... if (!mmget_not_zero(mm)) goto out; down_write(&mm->mmap_sem); if (!mmget_still_valid(mm)) goto skip_mm; ... skip_mm: up_write(&mm->mmap_sem); mmput(mm); out: return ...; And, finally: static int madvise_release(...) ... mmdrop(mm); ... Right? > [...] > > @@ -2960,15 +2962,63 @@ static int proc_stack_depth(struct seq_file *m, > > struct pid_namespace *ns, > > static ssize_t madvise_write(struct file *file, const char __user *buf, > > size_t count, loff_t *ppos) > > { > > + /* For now, only KSM hints are implemented */ > > +#ifdef CONFIG_KSM > > + char buffer[PROC_NUMBUF]; > > + int behaviour; > > struct task_struct *task; > > + struct mm_struct *mm; > > + int err = 0; > > + struct vm_area_struct *vma; > > + > > + memset(buffer, 0, sizeof(buffer)); > > + if (count > sizeof(buffer) - 1) > > + count = sizeof(buffer) - 1; > > + if (copy_from_user(buffer, buf, count)) > > + return -EFAULT; > > + > > + if (!memcmp("merge", buffer, min(sizeof("merge")-1, count))) > > This means that you also match on something like "mergeblah". Just use > strcmp(). I agree. Just to make it more interesting I must say that /sys/kernel/mm/transparent_hugepage/enabled uses memcmp in the very same way, and thus echoing "always" or "madvis" works perfectly there, and it was like that from the very beginning, it seems. Should we fix it, or it became (zomg) a public API? > > + behaviour = MADV_MERGEABLE; > > + else if (!memcmp("unmerge", buffer, min(sizeof("unmerge")-1, > > count))) > > + behaviour = MADV_UNMERGEABLE; > > + else > > + return -EINVAL; > > > > task = get_proc_task(file_inode(file)); > > if (!task) > > return -ESRCH; > > > > + mm = get_task_mm(task); > > + if (!mm) { > > + err = -EINVAL; > > + goto out_put_task_struct; > > + } > > + > > + down_write(&mm->mmap_sem); > > Should a check for mmget_still_valid(mm) be inserted here? See commit > 04f5866e41fb70690e28397487d8bd8eea7d712a. Yeah, it seems so :/. Thanks for the pointer. I've put it into the madvise_write snippet above. > > + switch (behaviour) { > > + case MADV_MERGEABLE: > > + case MADV_UNMERGEABLE: > > This switch isn't actually necessary at this point, right? Yup, but it is there to highlight a possibility of adding other, non-KSM options. So, let it be, and I'll just re-arrange CONFIG_KSM ifdef instead. Thank you. > > + vma = mm->mmap; > > + while (vma) { > > + if (behaviour == MADV_MERGEABLE) > > + ksm_madvise_merge(vma->vm_mm, vma, > > &vma->vm_flags); > > + else > > + ksm_madvise_unmerge(vma, vma->vm_start, > > vma->vm_end, &vma->vm_flags); > > + vma = vma->vm_next; > > + } > > + break; > > + } > > + up_write(&mm->mmap_sem); > > + > > + mmput(mm); > > + > > +out_put_task_struct: > > put_task_struct(task); > > > > - return count; > > + return err ? err : count; > > +#else > > + return -EINVAL; > > +#endif /* CONFIG_KSM */ > > } -- Best regards, Oleksandr Natalenko (post-factum) Senior Software Maintenance Engineer
Re: [PATCH RFC 0/5] mm/ksm, proc: introduce remote madvise
Hi. On Thu, May 16, 2019 at 12:44:12PM +0200, Michal Hocko wrote: > On Thu 16-05-19 11:42:29, Oleksandr Natalenko wrote: > [...] > > * to mark all the eligible VMAs as mergeable, use: > > > ># echo merge > /proc//madvise > > > > * to unmerge all the VMAs, use: > > > ># echo unmerge > /proc//madvise > > Please do not open a new thread until a previous one reaches some > conclusion. I have outlined some ways to go forward in > http://lkml.kernel.org/r/20190515145151.gg16...@dhcp22.suse.cz. > I haven't heard any feedback on that, yet you open a 3rd way in a > different thread. This will not help to move on with the discussion. > > Please follow up on that thread. Sure, I will follow the thread once and if there are responses. Consider this one to be an intermediate summary of current suggestions and also an indication that it is better to have the code early for public eyes. Thank you. > -- > Michal Hocko > SUSE Labs -- Best regards, Oleksandr Natalenko (post-factum) Senior Software Maintenance Engineer
[PATCH RFC 0/5] mm/ksm, proc: introduce remote madvise
It all began with the fact that KSM works only on memory that is marked by madvise(). And the only way to get around that is to either: * use LD_PRELOAD; or * patch the kernel with something like UKSM or PKSM. (i skip ptrace can of worms here intentionally) To overcome this restriction, lets implement a per-process /proc knob, which allows calling madvise remotely. This can be used manually on a task in question or by some small userspace helper daemon that will do auto-KSM job for us. Also, following the discussions from the previous submissions [2] and [3], make the interface more generic, so that it can be used for other madvise hints in the future. At this point, I'd like Android people to speak up, for instance, and clarify in which form they need page granularity or other things I've missed or have never heard about. So, I think of three major consumers of this interface: * hosts, that run containers, especially similar ones and especially in a trusted environment, sharing the same runtime like Node.js; * heavy applications, that can be run in multiple instances, not limited to opensource ones like Firefox, but also those that cannot be modified since they are binary-only and, maybe, statically linked; * Android environment that wants to do tricks with MADV_WILLNEED/DONTNEED or something similar. On to the actual implementation. The per-process knob is named "madvise", and it is write-only. It accepts a madvise hint name to be executed. Currently, only KSM hints are implemented: * to mark all the eligible VMAs as mergeable, use: # echo merge > /proc//madvise * to unmerge all the VMAs, use: # echo unmerge > /proc//madvise I've implemented address space level granularity instead of VMA/page granularity intentionally for simplicity. If the discussion goes in other directions, this can be re-implemented to act on a specific VMA (via map_files?) or page-wise. Speaking of statistics, more numbers can be found in the very first submission, that is related to this one [1]. For my current setup with two Firefox instances I get 100 to 200 MiB saved for the second instance depending on the amount of tabs. 1 FF instance with 15 tabs: $ echo "$(cat /sys/kernel/mm/ksm/pages_sharing) * 4 / 1024" | bc 410 2 FF instances, second one has 12 tabs (all the tabs are different): $ echo "$(cat /sys/kernel/mm/ksm/pages_sharing) * 4 / 1024" | bc 592 At the very moment I do not have specific numbers for containerised workload, but those should be comparable in case the containers share similar/same runtime. The history of this patchset: * [2] was based on Timofey's submission [1], but it didn't use a dedicated kthread to walk through the list of tasks/VMAs. Instead, do_anonymous_page() was amended to implement fully automatic mode, but this approach was incorrect due to improper locking and not desired due to excessive complexity and being KSM-specific; * [3] implemented KSM-specific madvise hints via sysfs, leaving traversing /proc to userspace if needed. The approach was not desired due to the fact that sysfs shouldn't implement any per-process API. Also, the interface was not generic enough to extend it for other users. I drop all the "Reviewed-by" tags from previous submissions because of code changes and because the objective of this series is now somewhat different. Please comment! Thanks. [1] https://lore.kernel.org/patchwork/patch/1012142/ [2] http://lkml.iu.edu/hypermail/linux/kernel/1905.1/02417.html [3] http://lkml.iu.edu/hypermail/linux/kernel/1905.1/05076.html Oleksandr Natalenko (5): proc: introduce madvise placeholder mm/ksm: introduce ksm_madvise_merge() helper mm/ksm: introduce ksm_madvise_unmerge() helper mm/ksm, proc: introduce remote merge mm/ksm, proc: add remote madvise documentation Documentation/filesystems/proc.txt | 13 + fs/proc/base.c | 70 +++ include/linux/ksm.h| 4 ++ mm/ksm.c | 92 +++--- 4 files changed, 145 insertions(+), 34 deletions(-) -- 2.21.0
[PATCH RFC 5/5] mm/ksm, proc: add remote madvise documentation
Document respective /proc//madvise knob. Signed-off-by: Oleksandr Natalenko --- Documentation/filesystems/proc.txt | 13 + 1 file changed, 13 insertions(+) diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt index 66cad5c86171..17106e435bba 100644 --- a/Documentation/filesystems/proc.txt +++ b/Documentation/filesystems/proc.txt @@ -45,6 +45,7 @@ Table of Contents 3.9 /proc//map_files - Information about memory mapped files 3.10 /proc//timerslack_ns - Task timerslack value 3.11 /proc//patch_state - Livepatch patch operation state + 3.12 /proc//madvise - Remote madvise 4Configuring procfs 4.1 Mount options @@ -1948,6 +1949,18 @@ patched. If the patch is being enabled, then the task has already been patched. If the patch is being disabled, then the task hasn't been unpatched yet. +3.12/proc//madvise - Remote madvise + +This write-only file allows executing madvise operation for another task. + +If CONFIG_KSM is enabled, the following actions are available: + + * marking task's memory as mergeable: +# echo merge > /proc//madvise + + * unmerging all the task's memory: +# echo unmerge > /proc//madvise + -- Configuring procfs -- 2.21.0
[PATCH RFC 1/5] proc: introduce madvise placeholder
Add a write-only /proc//madvise file to handle remote madvise operations. For now, this is just a soother that does nothing. Signed-off-by: Oleksandr Natalenko --- fs/proc/base.c | 20 1 file changed, 20 insertions(+) diff --git a/fs/proc/base.c b/fs/proc/base.c index 6a803a0b75df..f69532d6b74f 100644 --- a/fs/proc/base.c +++ b/fs/proc/base.c @@ -2957,6 +2957,25 @@ static int proc_stack_depth(struct seq_file *m, struct pid_namespace *ns, } #endif /* CONFIG_STACKLEAK_METRICS */ +static ssize_t madvise_write(struct file *file, const char __user *buf, + size_t count, loff_t *ppos) +{ + struct task_struct *task; + + task = get_proc_task(file_inode(file)); + if (!task) + return -ESRCH; + + put_task_struct(task); + + return count; +} + +static const struct file_operations proc_madvise_operations = { + .write = madvise_write, + .llseek = noop_llseek, +}; + /* * Thread groups */ @@ -3061,6 +3080,7 @@ static const struct pid_entry tgid_base_stuff[] = { #ifdef CONFIG_STACKLEAK_METRICS ONE("stack_depth", S_IRUGO, proc_stack_depth), #endif + REG("madvise", S_IRUGO|S_IWUSR, proc_madvise_operations), }; static int proc_tgid_base_readdir(struct file *file, struct dir_context *ctx) -- 2.21.0
[PATCH RFC 4/5] mm/ksm, proc: introduce remote merge
Use previously introduced remote madvise knob to mark task's anonymous memory as mergeable. To force merging task's VMAs, "merge" hint is used: # echo merge > /proc//madvise Force unmerging is done similarly: # echo unmerge > /proc//madvise To achieve this, previously introduced ksm_madvise_*() helpers are used. Signed-off-by: Oleksandr Natalenko --- fs/proc/base.c | 52 +- 1 file changed, 51 insertions(+), 1 deletion(-) diff --git a/fs/proc/base.c b/fs/proc/base.c index f69532d6b74f..6677580080ed 100644 --- a/fs/proc/base.c +++ b/fs/proc/base.c @@ -94,6 +94,8 @@ #include #include #include +#include +#include #include #include "internal.h" #include "fd.h" @@ -2960,15 +2962,63 @@ static int proc_stack_depth(struct seq_file *m, struct pid_namespace *ns, static ssize_t madvise_write(struct file *file, const char __user *buf, size_t count, loff_t *ppos) { + /* For now, only KSM hints are implemented */ +#ifdef CONFIG_KSM + char buffer[PROC_NUMBUF]; + int behaviour; struct task_struct *task; + struct mm_struct *mm; + int err = 0; + struct vm_area_struct *vma; + + memset(buffer, 0, sizeof(buffer)); + if (count > sizeof(buffer) - 1) + count = sizeof(buffer) - 1; + if (copy_from_user(buffer, buf, count)) + return -EFAULT; + + if (!memcmp("merge", buffer, min(sizeof("merge")-1, count))) + behaviour = MADV_MERGEABLE; + else if (!memcmp("unmerge", buffer, min(sizeof("unmerge")-1, count))) + behaviour = MADV_UNMERGEABLE; + else + return -EINVAL; task = get_proc_task(file_inode(file)); if (!task) return -ESRCH; + mm = get_task_mm(task); + if (!mm) { + err = -EINVAL; + goto out_put_task_struct; + } + + down_write(&mm->mmap_sem); + switch (behaviour) { + case MADV_MERGEABLE: + case MADV_UNMERGEABLE: + vma = mm->mmap; + while (vma) { + if (behaviour == MADV_MERGEABLE) + ksm_madvise_merge(vma->vm_mm, vma, &vma->vm_flags); + else + ksm_madvise_unmerge(vma, vma->vm_start, vma->vm_end, &vma->vm_flags); + vma = vma->vm_next; + } + break; + } + up_write(&mm->mmap_sem); + + mmput(mm); + +out_put_task_struct: put_task_struct(task); - return count; + return err ? err : count; +#else + return -EINVAL; +#endif /* CONFIG_KSM */ } static const struct file_operations proc_madvise_operations = { -- 2.21.0
[PATCH RFC 3/5] mm/ksm: introduce ksm_madvise_unmerge() helper
Move MADV_UNMERGEABLE part of ksm_madvise() into a dedicated helper since it will be further used for unmerging VMAs forcibly. This does not bring any functional changes. Signed-off-by: Oleksandr Natalenko --- include/linux/ksm.h | 2 ++ mm/ksm.c| 32 ++-- 2 files changed, 24 insertions(+), 10 deletions(-) diff --git a/include/linux/ksm.h b/include/linux/ksm.h index e824b3141677..a91a7cfc87a1 100644 --- a/include/linux/ksm.h +++ b/include/linux/ksm.h @@ -21,6 +21,8 @@ struct mem_cgroup; #ifdef CONFIG_KSM int ksm_madvise_merge(struct mm_struct *mm, struct vm_area_struct *vma, unsigned long *vm_flags); +int ksm_madvise_unmerge(struct vm_area_struct *vma, unsigned long start, + unsigned long end, unsigned long *vm_flags); int ksm_madvise(struct vm_area_struct *vma, unsigned long start, unsigned long end, int advice, unsigned long *vm_flags); int __ksm_enter(struct mm_struct *mm); diff --git a/mm/ksm.c b/mm/ksm.c index 1fdcf2fbd58d..e0357e25e09f 100644 --- a/mm/ksm.c +++ b/mm/ksm.c @@ -2478,6 +2478,25 @@ int ksm_madvise_merge(struct mm_struct *mm, struct vm_area_struct *vma, return 0; } +int ksm_madvise_unmerge(struct vm_area_struct *vma, unsigned long start, + unsigned long end, unsigned long *vm_flags) +{ + int err; + + if (!(*vm_flags & VM_MERGEABLE)) + return 0; /* just ignore the advice */ + + if (vma->anon_vma) { + err = unmerge_ksm_pages(vma, start, end); + if (err) + return err; + } + + *vm_flags &= ~VM_MERGEABLE; + + return 0; +} + int ksm_madvise(struct vm_area_struct *vma, unsigned long start, unsigned long end, int advice, unsigned long *vm_flags) { @@ -2492,16 +2511,9 @@ int ksm_madvise(struct vm_area_struct *vma, unsigned long start, break; case MADV_UNMERGEABLE: - if (!(*vm_flags & VM_MERGEABLE)) - return 0; /* just ignore the advice */ - - if (vma->anon_vma) { - err = unmerge_ksm_pages(vma, start, end); - if (err) - return err; - } - - *vm_flags &= ~VM_MERGEABLE; + err = ksm_madvise_unmerge(vma, start, end, vm_flags); + if (err) + return err; break; } -- 2.21.0
[PATCH RFC 2/5] mm/ksm: introduce ksm_madvise_merge() helper
Move MADV_MERGEABLE part of ksm_madvise() into a dedicated helper since it will be further used for marking VMAs to be merged forcibly. This does not bring any functional changes. Signed-off-by: Oleksandr Natalenko --- include/linux/ksm.h | 2 ++ mm/ksm.c| 60 +++-- 2 files changed, 38 insertions(+), 24 deletions(-) diff --git a/include/linux/ksm.h b/include/linux/ksm.h index e48b1e453ff5..e824b3141677 100644 --- a/include/linux/ksm.h +++ b/include/linux/ksm.h @@ -19,6 +19,8 @@ struct stable_node; struct mem_cgroup; #ifdef CONFIG_KSM +int ksm_madvise_merge(struct mm_struct *mm, struct vm_area_struct *vma, + unsigned long *vm_flags); int ksm_madvise(struct vm_area_struct *vma, unsigned long start, unsigned long end, int advice, unsigned long *vm_flags); int __ksm_enter(struct mm_struct *mm); diff --git a/mm/ksm.c b/mm/ksm.c index fc64874dc6f4..1fdcf2fbd58d 100644 --- a/mm/ksm.c +++ b/mm/ksm.c @@ -2442,41 +2442,53 @@ static int ksm_scan_thread(void *nothing) return 0; } -int ksm_madvise(struct vm_area_struct *vma, unsigned long start, - unsigned long end, int advice, unsigned long *vm_flags) +int ksm_madvise_merge(struct mm_struct *mm, struct vm_area_struct *vma, + unsigned long *vm_flags) { - struct mm_struct *mm = vma->vm_mm; int err; - switch (advice) { - case MADV_MERGEABLE: - /* -* Be somewhat over-protective for now! -*/ - if (*vm_flags & (VM_MERGEABLE | VM_SHARED | VM_MAYSHARE | -VM_PFNMAP| VM_IO | VM_DONTEXPAND | -VM_HUGETLB | VM_MIXEDMAP)) - return 0; /* just ignore the advice */ + /* +* Be somewhat over-protective for now! +*/ + if (*vm_flags & (VM_MERGEABLE | VM_SHARED | VM_MAYSHARE | +VM_PFNMAP| VM_IO | VM_DONTEXPAND | +VM_HUGETLB | VM_MIXEDMAP)) + return 0; /* just ignore the advice */ - if (vma_is_dax(vma)) - return 0; + if (vma_is_dax(vma)) + return 0; #ifdef VM_SAO - if (*vm_flags & VM_SAO) - return 0; + if (*vm_flags & VM_SAO) + return 0; #endif #ifdef VM_SPARC_ADI - if (*vm_flags & VM_SPARC_ADI) - return 0; + if (*vm_flags & VM_SPARC_ADI) + return 0; #endif - if (!test_bit(MMF_VM_MERGEABLE, &mm->flags)) { - err = __ksm_enter(mm); - if (err) - return err; - } + if (!test_bit(MMF_VM_MERGEABLE, &mm->flags)) { + err = __ksm_enter(mm); + if (err) + return err; + } + + *vm_flags |= VM_MERGEABLE; + + return 0; +} - *vm_flags |= VM_MERGEABLE; +int ksm_madvise(struct vm_area_struct *vma, unsigned long start, + unsigned long end, int advice, unsigned long *vm_flags) +{ + struct mm_struct *mm = vma->vm_mm; + int err; + + switch (advice) { + case MADV_MERGEABLE: + err = ksm_madvise_merge(mm, vma, vm_flags); + if (err) + return err; break; case MADV_UNMERGEABLE: -- 2.21.0
Re: [PATCH RFC v2 0/4] mm/ksm: add option to automerge VMAs
On Wed, May 15, 2019 at 10:33:21AM +0200, Michal Hocko wrote: > > For my current setup with 2 Firefox instances I get 100 to 200 MiB saved > > for the second instance depending on the amount of tabs. > > What does prevent Firefox (an opensource project) to be updated to use > the explicit merging? This was rather an example of a big project. Other big projects may be closed source, of course. And yes, with regard to FF specifically I think nothing prevents it from being modified appropriately. > > Answering your question regarding using existing interfaces, since > > there's only one, madvise(2), this requires modifying all the > > applications one wants to de-duplicate. In case of containers with > > arbitrary content or in case of binary-only apps this is pretty hard if > > not impossible to do properly. > > OK, this makes more sense. Please note that there are other people who > would like to see certain madvise operations to be done on a remote > process - e.g. to allow external memory management (Android would like > to control memory aging so something like MADV_DONTNEED without loosing > content and more probably) and potentially other madvise operations. > Or maybe we need a completely new interface other than madvise. I didn't know about those intentions. Could you please point me to a relevant discussion so that I can check the details? > In general, having a more generic API that would cover more usecases is > definitely much more preferable than one ad-hoc API that handles a very > specific usecase. So please try to think about a more generic Yup, I see now. Since you are aware of ongoing intentions, please do Cc those people then and/or let me know about previous discussions please. That way thinking of how a new API should be implemented (be it a sysfs file or something else) should be easier and more visible. Thanks. -- Best regards, Oleksandr Natalenko (post-factum) Senior Software Maintenance Engineer
Re: [PATCH RFC v2 0/4] mm/ksm: add option to automerge VMAs
Hi. On Wed, May 15, 2019 at 08:53:11AM +0200, Michal Hocko wrote: > On Wed 15-05-19 08:25:23, Oleksandr Natalenko wrote: > [...] > > > > Please make sure to describe a usecase that warrants adding a new > > > > interface we have to maintain for ever. > > > > I think of two major consumers of this interface: > > > > 1) hosts, that run containers, especially similar ones and especially in > > a trusted environment; > > > > 2) heavy applications, that can be run in multiple instances, not > > limited to opensource ones like Firefox, but also those that cannot be > > modified. > > This is way too generic. Please provide something more specific. Ideally > with numbers. Why those usecases cannot use an existing interfaces. > Remember you are trying to add a new user interface which we will have > to maintain for ever. For my current setup with 2 Firefox instances I get 100 to 200 MiB saved for the second instance depending on the amount of tabs. 1 FF instance with 15 tabs: $ echo "$(cat /sys/kernel/mm/ksm/pages_sharing) * 4 / 1024" | bc 410 2 FF instances, second one has 12 tabs (all the tabs are different): $ echo "$(cat /sys/kernel/mm/ksm/pages_sharing) * 4 / 1024" | bc 592 At the very moment I do not have specific numbers for containerised workload, but those should be similar in case the containers share similar/same runtime (like multiple Node.js containers etc). Answering your question regarding using existing interfaces, since there's only one, madvise(2), this requires modifying all the applications one wants to de-duplicate. In case of containers with arbitrary content or in case of binary-only apps this is pretty hard if not impossible to do properly. > I will try to comment on the interface itself later. But I have to say > that I am not impressed. Abusing sysfs for per process features is quite > gross to be honest. Sure, please do. Thanks for your time and inputs. -- Best regards, Oleksandr Natalenko (post-factum) Senior Software Maintenance Engineer