Re: [PATCH v2 4/6] virtiofs: support bounce buffer backed by scattered pages
On Sat, Mar 09, 2024 at 12:14:23PM +0800, Hou Tao wrote:
> Hi,
>
> On 2/29/2024 11:01 PM, Brian Foster wrote:
> > On Wed, Feb 28, 2024 at 10:41:24PM +0800, Hou Tao wrote:
> >> From: Hou Tao
> >>
> >> When reading a file kept in virtiofs from kernel (e.g., insmod a kernel
> >> module), if the cache of virtiofs is disabled, the read buffer will be
> >> passed to virtiofs through out_args[0].value instead of pages. Because
> >> virtiofs can't get the pages for the read buffer, virtio_fs_argbuf_new()
> >> will create a bounce buffer for the read buffer by using kmalloc() and
> >> copy the read buffer into the bounce buffer. If the read buffer is large
> >> (e.g., 1MB), the allocation will incur significant stress on the memory
> >> subsystem.
> >>
> >> So instead of allocating the bounce buffer by using kmalloc(), allocate a
> >> bounce buffer which is backed by scattered pages. The original idea was
> >> to use vmap(), but the use of GFP_ATOMIC is not possible for vmap(). To
> >> simplify the copy operations in the bounce buffer, use a bio_vec flex
> >> array to represent the argbuf. Also add an is_flat field in struct
> >> virtio_fs_argbuf to distinguish between kmalloc-ed and scattered bounce
> >> buffers.
> >>
> >> Signed-off-by: Hou Tao
> >> ---
> >>  fs/fuse/virtio_fs.c | 163
> >>  1 file changed, 149 insertions(+), 14 deletions(-)
> >>
> >> diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
> >> index f10fff7f23a0f..ffea684bd100d 100644
> >> --- a/fs/fuse/virtio_fs.c
> >> +++ b/fs/fuse/virtio_fs.c
...
> >> @@ -408,42 +425,143 @@ static void virtio_fs_request_dispatch_work(struct work_struct *work)
> >>  	}
> >>  }
> >>
...
> >> static void virtio_fs_argbuf_copy_from_in_arg(struct virtio_fs_argbuf *argbuf,
> >> 					      unsigned int offset,
> >> 					      const void *src, unsigned int len)
> >> {
> >> -	memcpy(argbuf->buf + offset, src, len);
> >> +	struct iov_iter iter;
> >> +	unsigned int copied;
> >> +
> >> +	if (argbuf->is_flat) {
> >> +		memcpy(argbuf->f.buf + offset, src, len);
> >> +		return;
> >> +	}
> >> +
> >> +	iov_iter_bvec(&iter, ITER_DEST, argbuf->s.bvec,
> >> +		      argbuf->s.nr, argbuf->s.size);
> >> +	iov_iter_advance(&iter, offset);
> > Hi Hou,
> >
> > Just a random comment, but it seems a little inefficient to reinit and
> > readvance the iter like this on every copy/call. It looks like offset is
> > already incremented in the callers of the argbuf copy helpers. Perhaps
> > iov_iter could be lifted into the callers and passed down, or even just
> > include it in the argbuf structure and init it at alloc time?
>
> Sorry for the late reply. Been busy with an off-site workshop these days.

No worries.

> I have tried a similar idea before in which the iov_iter was saved directly
> in the argbuf struct, but it didn't work out well. The reason is that for
> copying both in_args and out_args, an iov_iter is needed, but the direction
> is different. Currently a bi-directional iov_iter is not supported, so the
> code has to initialize the iov_iter twice: once for the copy from in_args
> and again for the copy to out_args.

Ok, seems reasonable enough.

> For a dio read initiated from the kernel, both its in_numargs and
> out_numargs are 1, so there will be only one iov_iter_advance() in
> virtio_fs_argbuf_copy_to_out_arg() and the offset is 64, so I think the
> overhead will be fine. For a dio write initiated from the kernel, its
> in_numargs is 2 and out_numargs is 1, so there will be two invocations
> of iov_iter_advance(). The first one with offset=64, and the other one
> with offset=round_up_page_size(64 + write_size), so the latter one may
> introduce extra overhead. But compared with the overhead of the data
> copy, I still think the overhead of calling iov_iter_advance() is fine.

I'm not claiming the overhead is some practical problem here, but rather
that we shouldn't need to be concerned with it in the first place with
some cleaner code.

It's been a bit since I first looked at this. I was originally wondering
about defining the iov_iter in the caller and passing it as a param, but
on another look, do the lowest level helpers really need to exist? If
you don't anticipate further users, IMO something like the diff below
Re: [PATCH v2 4/6] virtiofs: support bounce buffer backed by scattered pages
On Wed, Feb 28, 2024 at 10:41:24PM +0800, Hou Tao wrote:
> From: Hou Tao
>
> When reading a file kept in virtiofs from kernel (e.g., insmod a kernel
> module), if the cache of virtiofs is disabled, the read buffer will be
> passed to virtiofs through out_args[0].value instead of pages. Because
> virtiofs can't get the pages for the read buffer, virtio_fs_argbuf_new()
> will create a bounce buffer for the read buffer by using kmalloc() and
> copy the read buffer into the bounce buffer. If the read buffer is large
> (e.g., 1MB), the allocation will incur significant stress on the memory
> subsystem.
>
> So instead of allocating the bounce buffer by using kmalloc(), allocate a
> bounce buffer which is backed by scattered pages. The original idea was
> to use vmap(), but the use of GFP_ATOMIC is not possible for vmap(). To
> simplify the copy operations in the bounce buffer, use a bio_vec flex
> array to represent the argbuf. Also add an is_flat field in struct
> virtio_fs_argbuf to distinguish between kmalloc-ed and scattered bounce
> buffers.
>
> Signed-off-by: Hou Tao
> ---
>  fs/fuse/virtio_fs.c | 163
>  1 file changed, 149 insertions(+), 14 deletions(-)
>
> diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
> index f10fff7f23a0f..ffea684bd100d 100644
> --- a/fs/fuse/virtio_fs.c
> +++ b/fs/fuse/virtio_fs.c
...
> @@ -408,42 +425,143 @@ static void virtio_fs_request_dispatch_work(struct work_struct *work)
>  	}
>  }
>
...
> static void virtio_fs_argbuf_copy_from_in_arg(struct virtio_fs_argbuf *argbuf,
> 					      unsigned int offset,
> 					      const void *src, unsigned int len)
> {
> -	memcpy(argbuf->buf + offset, src, len);
> +	struct iov_iter iter;
> +	unsigned int copied;
> +
> +	if (argbuf->is_flat) {
> +		memcpy(argbuf->f.buf + offset, src, len);
> +		return;
> +	}
> +
> +	iov_iter_bvec(&iter, ITER_DEST, argbuf->s.bvec,
> +		      argbuf->s.nr, argbuf->s.size);
> +	iov_iter_advance(&iter, offset);

Hi Hou,

Just a random comment, but it seems a little inefficient to reinit and
readvance the iter like this on every copy/call. It looks like offset is
already incremented in the callers of the argbuf copy helpers. Perhaps
iov_iter could be lifted into the callers and passed down, or even just
include it in the argbuf structure and init it at alloc time?

Brian

> +
> +	copied = _copy_to_iter(src, len, &iter);
> +	WARN_ON_ONCE(copied != len);
> }
>
> static unsigned int
> @@ -451,15 +569,32 @@ virtio_fs_argbuf_out_args_offset(struct virtio_fs_argbuf *argbuf,
> 				 const struct fuse_args *args)
> {
> 	unsigned int num_in = args->in_numargs - args->in_pages;
> +	unsigned int offset = fuse_len_args(num_in,
> +					    (struct fuse_arg *)args->in_args);
>
> -	return fuse_len_args(num_in, (struct fuse_arg *)args->in_args);
> +	if (argbuf->is_flat)
> +		return offset;
> +	return round_up(offset, PAGE_SIZE);
> }
>
> static void virtio_fs_argbuf_copy_to_out_arg(struct virtio_fs_argbuf *argbuf,
> 					     unsigned int offset, void *dst,
> 					     unsigned int len)
> {
> -	memcpy(dst, argbuf->buf + offset, len);
> +	struct iov_iter iter;
> +	unsigned int copied;
> +
> +	if (argbuf->is_flat) {
> +		memcpy(dst, argbuf->f.buf + offset, len);
> +		return;
> +	}
> +
> +	iov_iter_bvec(&iter, ITER_SOURCE, argbuf->s.bvec,
> +		      argbuf->s.nr, argbuf->s.size);
> +	iov_iter_advance(&iter, offset);
> +
> +	copied = _copy_from_iter(dst, len, &iter);
> +	WARN_ON_ONCE(copied != len);
> }
>
> /*
> @@ -1154,7 +1289,7 @@ static unsigned int sg_init_fuse_args(struct scatterlist *sg,
> 	len = fuse_len_args(numargs - argpages, args);
> 	if (len)
> 		total_sgs += virtio_fs_argbuf_setup_sg(req->argbuf, *len_used,
> -						       len, &sg[total_sgs]);
> +						       &len, &sg[total_sgs]);
>
> 	if (argpages)
> 		total_sgs += sg_init_fuse_pages(&sg[total_sgs],
> @@ -1199,7 +1334,7 @@ static int virtio_fs_enqueue_req(struct virtio_fs_vq *fsvq,
> 	}
>
> 	/* Use a bounce buffer since stack args cannot be mapped */
> -	req->argbuf = virtio_fs_argbuf_new(args, GFP_ATOMIC);
> +	req->argbuf = virtio_fs_argbuf_new(args, GFP_ATOMIC, true);
> 	if (!req->argbuf) {
> 		ret = -ENOMEM;
> 		goto out;
> --
> 2.29.2
>
RE: [PATCH v3 4/5] arch,locking/atomic: hexagon: add arch_cmpxchg[64]_local
> -----Original Message-----
> From: wuqiang.matt
> Sent: Tuesday, November 21, 2023 8:24 AM
> To: ubiz...@gmail.com; mark.rutl...@arm.com; vgu...@kernel.org;
> Brian Cain ; jo...@southpole.se; stefan.kristians...@saunalahti.fi;
> sho...@gmail.com; ch...@zankel.net; jcmvb...@gmail.com;
> ge...@linux-m68k.org; andi.sh...@linux.intel.com; mi...@kernel.org;
> pal...@rivosinc.com; andrzej.ha...@intel.com; a...@arndb.de;
> pet...@infradead.org; mhira...@kernel.org
> Cc: linux-a...@vger.kernel.org; linux-snps-...@lists.infradead.org;
> linux-ker...@vger.kernel.org; linux-hexa...@vger.kernel.org;
> linux-openr...@vger.kernel.org; linux-trace-ker...@vger.kernel.org;
> mat...@163.com; li...@roeck-us.net; wuqiang.matt ; kernel test robot
> Subject: [PATCH v3 4/5] arch,locking/atomic: hexagon: add arch_cmpxchg[64]_local
>
> hexagon doesn't have arch_cmpxchg_local implemented, which causes
> build failures for any reference to try_cmpxchg_local, as
> reported by the kernel test robot.
>
> This patch implements arch_cmpxchg[64]_local with the native
> cmpxchg variant if the corresponding data size is supported,
> otherwise generic_cmpxchg[64]_local is to be used.
>
> Reported-by: kernel test robot
> Closes: https://lore.kernel.org/oe-kbuild-all/202310272207.tLPflya4-l...@intel.com/
>
> Signed-off-by: wuqiang.matt
> Reviewed-by: Masami Hiramatsu (Google)
> ---
>  arch/hexagon/include/asm/cmpxchg.h | 51 +-
>  1 file changed, 50 insertions(+), 1 deletion(-)
>
> diff --git a/arch/hexagon/include/asm/cmpxchg.h b/arch/hexagon/include/asm/cmpxchg.h
> index bf6cf5579cf4..302fa30f25aa 100644
> --- a/arch/hexagon/include/asm/cmpxchg.h
> +++ b/arch/hexagon/include/asm/cmpxchg.h
> @@ -8,6 +8,8 @@
>  #ifndef _ASM_CMPXCHG_H
>  #define _ASM_CMPXCHG_H
>
> +#include
> +
>  /*
>   * __arch_xchg - atomically exchange a register and a memory location
>   * @x: value to swap
> @@ -51,13 +53,15 @@ __arch_xchg(unsigned long x, volatile void *ptr, int size)
>   * variable casting.
>   */
>
> -#define arch_cmpxchg(ptr, old, new)				\
> +#define __cmpxchg_32(ptr, old, new)				\
>  ({								\
>  	__typeof__(ptr) __ptr = (ptr);				\
>  	__typeof__(*(ptr)) __old = (old);			\
>  	__typeof__(*(ptr)) __new = (new);			\
>  	__typeof__(*(ptr)) __oldval = 0;			\
>  								\
> +	BUILD_BUG_ON(sizeof(*(ptr)) != 4);			\
> +								\
>  	asm volatile(						\
>  		"1:	%0 = memw_locked(%1);\n"		\
>  		"	{ P0 = cmp.eq(%0,%2);\n"		\
> @@ -72,4 +76,49 @@ __arch_xchg(unsigned long x, volatile void *ptr, int size)
>  	__oldval;						\
>  })
>
> +#define __cmpxchg(ptr, old, val, size)				\
> +({								\
> +	__typeof__(*(ptr)) oldval;				\
> +								\
> +	switch (size) {						\
> +	case 4:							\
> +		oldval = __cmpxchg_32(ptr, old, val);		\
> +		break;						\
> +	default:						\
> +		BUILD_BUG();					\
> +		oldval = val;					\
> +		break;						\
> +	}							\
> +								\
> +	oldval;							\
> +})
> +
> +#define arch_cmpxchg(ptr, o, n)	__cmpxchg((ptr), (o), (n), sizeof(*(ptr)))
> +
> +/*
> + * always make arch_cmpxchg[64]_local available, native cmpxchg
> + * will be used if available, then generic_cmpxchg[64]_local
> + */
> +#include
> +
> +#define arch_cmpxchg_local(ptr, old, val)			\
> +({
Re: [PATCH 1/2] dt-bindings: mwifiex: document use with the SD8777 chipset
On Sun, Oct 29, 2023 at 12:08:16PM +0100, Karel Balej wrote:
> Document the corresponding compatible string for the use of this driver
> with the Marvell SD8777 wireless chipset.
>
> Signed-off-by: Karel Balej

FWIW, the binding looks fine from the mwifiex point of view, so:

Acked-by: Brian Norris

But see the cover letter. We can't merge driver support without a
linux-firmware-compatible (or otherwise redistributable) firmware, so
NAK for the series.
Re: [PATCH 0/2] net: mwifiex: add support for the SD8777 chipset
On Sun, Oct 29, 2023 at 12:08:15PM +0100, Karel Balej wrote:
> The driver requires proprietary firmware which is not yet part of
> linux-firmware, but it is packaged in postmarketOS.

You gotta get that done:

https://wireless.wiki.kernel.org/en/developers/documentation/submittingpatches#new_driver

"have firmware images submitted for linux-firmware with an acceptable
license allowing redistribution"

We can't have a driver requesting a mrvl/sd8777_uapsta.bin firmware that
isn't available for anyone [1]. Until that's done, NAK.

[1] I think you might be referring to this:
https://github.com/xcover3/android_vendor_samsung_xcover3lte/commit/6e324b43b32dc607327d89148dd5d83a14429ee6

But I don't see any license info, so I don't think that's going to be
appropriate for linux-firmware.
Re: [PATCH] xfs: fail dax mount if reflink is enabled on a partition
On Wed, Sep 14, 2022 at 05:38:02PM +0800, Yang, Xiao/杨 晓 wrote:
> On 2022/9/14 14:44, Yang, Xiao/杨 晓 wrote:
> > On 2022/9/9 21:01, Brian Foster wrote:
> > > Yes.. I don't recall all the internals of the tools and test, but IIRC
> > > it relied on discard to perform zeroing between checkpoints or some such
> > > and avoid spurious failures. The purpose of running on dm-thin was
> > > merely to provide reliable discard zeroing behavior on the target device
> > > and thus to allow the test to run reliably.
> >
> > Hi Brian,
> >
> > As far as I know, generic/470 was originally designed to verify
> > mmap(MAP_SYNC) on a dm-log-writes device with DAX enabled. For that
> > reason, we need to ensure that all underlying devices under the
> > dm-log-writes device support DAX. However, a dm-thin device never
> > supports DAX, so running generic/470 on a dm-thin device always
> > returns "not run".
> >
> > Please see the difference between the old and new logic:
> >
> >     old logic                  new logic
> >     ---------------------------------------------------------------
> >     log-writes device(DAX)     log-writes device(DAX)
> >         |                          |
> >     PMEM0(DAX) + PMEM1(DAX)    Thin device(non-DAX) + PMEM1(DAX)
> >                                    |
> >                                PMEM0(DAX)
> >     ---------------------------------------------------------------
> >
> > We think a dm-thin device is not a good solution for generic/470. Is
> > there any other solution to support both discard zeroing and DAX?
>
> Hi Brian,
>
> I have sent a patch[1] to revert your fix because I think it's not good
> for generic/470 to use a thin volume, as my revert patch[1] describes:
> [1] https://lore.kernel.org/fstests/20220914090625.32207-1-yangx...@fujitsu.com/T/#u
>

I think the history here is that generic/482 was changed over first in
commit 65cc9a235919 ("generic/482: use thin volume as data device"), and
then sometime later we realized generic/455,457,470 had the same general
flaw and were switched over. The dm/dax compatibility thing was probably
just an oversight, but I am a little curious about that because it
should have been obvious that the change caused the test to no longer
run. Did something change after that to trigger that change in behavior?

> With the revert, generic/470 can always run successfully in my
> environment, so I wonder how to reproduce the out-of-order replay issue
> on an XFS v5 filesystem?
>

I don't quite recall the characteristics of the failures beyond that we
were seeing spurious test failures with generic/482 that were due to
essentially putting the fs/log back in time in a way that wasn't quite
accurate, due to the clearing by the logwrites tool not taking place. If
you wanted to reproduce in order to revisit that, perhaps start with
generic/482 and let it run in a loop for a while and see if it
eventually triggers a failure/corruption..?

> PS: I want to reproduce the issue and try to find a better solution to
> fix it.
>

It's been a while since I looked at any of this tooling to semi-grok how
it works. Perhaps it could learn to rely on something more explicit like
zero range (instead of discard?) or fall back to manual zeroing? If the
eventual solution is simple and low enough overhead, it might make some
sense to replace the dmthin hack across the set of tests mentioned
above.

Brian

> Best Regards,
> Xiao Yang
>
> >
> > BTW, only log-writes, stripe and linear support DAX for now.
>
Re: [PATCH] xfs: fail dax mount if reflink is enabled on a partition
On Thu, Sep 08, 2022 at 09:46:04PM +0800, Shiyang Ruan wrote:
>
> On 2022/8/4 8:51, Darrick J. Wong wrote:
> > On Wed, Aug 03, 2022 at 06:47:24AM +0000, ruansy.f...@fujitsu.com wrote:
> > ...
> > > > > > > BTW, since these patches (dax&reflink&rmap + THIS + pmem-unbind)
> > > > > > > are waiting to be merged, is it time to think about "removing
> > > > > > > the experimental tag" again? :)
> > > > > >
> > > > > > It's probably time to take up that question again.
> > > > > >
> > > > > > Yesterday I tried running generic/470 (aka the MAP_SYNC test) and it
> > > > > > didn't succeed because it sets up dmlogwrites atop dmthinp atop pmem,
> > > > > > and at least one of those dm layers no longer allows fsdax
> > > > > > pass-through, so XFS silently turned mount -o dax into -o dax=never. :(
> > > > >
> > > > > Hi Darrick,
> > > > >
> > > > > I tried generic/470 but it didn't run:
> > > > >     [not run] Cannot use thin-pool devices on DAX capable block devices.
> > > > >
> > > > > Did you modify the _require_dm_target() in common/rc? I added thin-pool
> > > > > so as not to check dax capability:
> > > > >
> > > > >     case $target in
> > > > >     stripe|linear|log-writes|thin-pool) # add thin-pool here
> > > > >         ;;
> > > > >
> > > > > then the case finally ran and it silently turned off dax as you said.
> > > > >
> > > > > Are the steps for reproduction correct? If so, I will continue to
> > > > > investigate this problem.
> > > >
> > > > Ah, yes, I did add thin-pool to that case statement. Sorry I forgot to
> > > > mention that. I suspect that the removal of dm support for pmem is
> > > > going to force us to completely redesign this test. I can't really
> > > > think of how, though, since there's no good way that I know of to gain
> > > > a point-in-time snapshot of a pmem device.
> > >
> > > Hi Darrick,
> > >
> > > > removal of dm support for pmem
> > > I think here we are saying that it was xfstests that removed the
> > > support, not the kernel?
> > >
> > > I found some xfstests commits:
> > >   fc7b3903894a6213c765d64df91847f4460336a2 # common/rc: add the restriction.
> > >   fc5870da485aec0f9196a0f2bed32f73f6b2c664 # generic/470: use thin-pool
> > >
> > > So, this case was never able to run since the second commit? (I didn't
> > > notice the not run case. I thought it was expected to be not run.)
> > >
> > > And according to the first commit, the restriction was added because
> > > some dm devices don't support dax. So my understanding is: we should
> > > redesign the case to make it work, and firstly, we should add dax
> > > support for dm devices in the kernel.
> >
> > dm devices used to have fsdax support; I think Christoph is actively
> > removing (or already has removed) all that support.
> >
> > > In addition, does any other testcase have the same problem? So that
> > > we can deal with them together.
> >
> > The last I checked, there aren't any that require MAP_SYNC or pmem aside
> > from g/470 and the three poison notification tests that you sent a few
> > days ago.
> >
> > --D
>
> Hi Darrick, Brian
>
> I made a little investigation on generic/470.
>
> This case was able to run before introducing thin-pool[1], but since
> then it became 'Failed'/'Not Run' because thin-pool does not support
> DAX. I have checked the log of thin-pool: it never supported DAX, and
> it's not the case that someone removed the fsdax support. So, I think
> it's not correct to bypass the requirement conditions by adding
> 'thin-pool' to _require_dm_target().
>
> As far as I know, thin-pool was introduced (to provide discard zeroing)
> to prevent out-of-order replay of dm-log-write. Should we solve the
> 'out-of-order replay' issue instead of avoiding it with thin-pool?
> @Brian
>

Yes.. I don't recall all the internals of the tools and test, but IIRC
it relied on discard to perform zeroing between checkpoints or some such
and avoid spurious failures. The purpose of running on dm-thin was
merely to provide reliable discard zeroing behavior on the target device
and thus to allow the test to run reliably.

Brian

> Besides, since it's not a fsdax problem, I think there is nothing that
> needs to be fixed in fsdax. I'd like to help get it solved, but I'm
> still wondering if we could get back to the original topic ("Remove
> Experimental Tag") first? :)
>
>
> [1] fc5870da485aec0f9196a0f2bed32f73f6b2c664 generic/470: use thin
> volume for dmlogwrites target device
>
>
> --
> Thanks,
> Ruan.
>
>
Re: [PATCH bpf-next v4 0/3] add batched ops for percpu array
Sorry I missed this. I don't recall why I didn't add the array percpu
variant, but LGTM.

Acked-by: Brian Vazquez

On Fri, Apr 16, 2021 at 3:09 PM Martin KaFai Lau wrote:
>
> On Thu, Apr 15, 2021 at 02:46:16PM -0300, Pedro Tammela wrote:
> > This patchset introduces batched operations for the per-cpu variant of
> > the array map.
> >
> > It also removes the percpu macros from 'bpf_util.h'. This change was
> > suggested by Andrii in an earlier iteration of this patchset.
> >
> > The tests were updated to reflect all the new changes.
> Acked-by: Martin KaFai Lau
Re: [PATCH] wireless/marvell/mwifiex: Fix a double free in mwifiex_send_tdls_action_frame
Hi,

On Wed, Apr 14, 2021 at 12:08:32AM +0800, lyl2...@mail.ustc.edu.cn wrote:
>
> Hi, maintainers.
>
> Sorry to disturb you, but this patch seems to have been missed for more
> than two weeks. Could you help to review this patch? I am sure it won't
> take you much time.

You might take a look here:

https://wireless.wiki.kernel.org/en/developers/documentation/submittingpatches#checking_state_of_patches_from_patchwork
https://patchwork.kernel.org/project/linux-wireless/list/

Two weeks is not unheard of for waiting. Anyway, you *did* succeed in
catching my attention today, so I'll give you a review, below:

> > -----Original Message-----
> > From: "Lv Yunlong"
> > Sent: 2021-03-29 19:24:35 (Monday)
> > To: amitkar...@gmail.com, ganapathi.b...@nxp.com, huxinming...@gmail.com,
> >     kv...@codeaurora.org, da...@davemloft.net, k...@kernel.org
> > Cc: linux-wirel...@vger.kernel.org, net...@vger.kernel.org,
> >     linux-kernel@vger.kernel.org, "Lv Yunlong"
> > Subject: [PATCH] wireless/marvell/mwifiex: Fix a double free in
> >     mwifiex_send_tdls_action_frame
> >
> > In mwifiex_send_tdls_action_frame, it calls
> > mwifiex_construct_tdls_action_frame(.., skb). The skb will be freed in
> > mwifiex_construct_tdls_action_frame() when it fails. But when
> > mwifiex_construct_tdls_action_frame() returns an error, the skb will
> > be freed a second time by dev_kfree_skb_any(skb).
> >
> > My patch removes the redundant dev_kfree_skb_any(skb) when
> > mwifiex_construct_tdls_action_frame() fails.
> >
> > Fixes: b23bce2965680 ("mwifiex: add tdls_mgmt handler support")
> > Signed-off-by: Lv Yunlong
> > ---
> >  drivers/net/wireless/marvell/mwifiex/tdls.c | 1 -
> >  1 file changed, 1 deletion(-)
> >
> > diff --git a/drivers/net/wireless/marvell/mwifiex/tdls.c b/drivers/net/wireless/marvell/mwifiex/tdls.c
> > index 97bb87c3676b..8d4d0a9cf6ac 100644
> > --- a/drivers/net/wireless/marvell/mwifiex/tdls.c
> > +++ b/drivers/net/wireless/marvell/mwifiex/tdls.c
> > @@ -856,7 +856,6 @@ int mwifiex_send_tdls_action_frame(struct mwifiex_private *priv, const u8 *peer,
> >  	if (mwifiex_construct_tdls_action_frame(priv, peer, action_code,
> >  						dialog_token, status_code,
> >  						skb)) {
> > -		dev_kfree_skb_any(skb);

Good catch, and this looks correct for most cases, but I'll note that
you missed one issue: mwifiex_construct_tdls_action_frame() has a
default/error case ("Unknown TDLS action frame type") which does *not*
free this SKB, so you should patch that up at the same time. Otherwise,
you're trading a double-free for a potential memory leak.

With that fixed:

Reviewed-by: Brian Norris

Thanks.

> >  		return -EINVAL;
> >  	}
> >
> > --
> > 2.25.1
> >
Re: [PATCH] xfs: fix return of uninitialized value in variable error
On Fri, Apr 09, 2021 at 03:18:34PM +0100, Colin King wrote:
> From: Colin Ian King
>
> A previous commit removed a call to xfs_attr3_leaf_read that
> assigned an error return code to variable error. We now have
> a few early error return paths to label 'out' that return
> error if error is set; however error now is uninitialized
> so potentially garbage is being returned. Fix this by setting
> error to zero to restore the original behaviour where error
> was zero at the label 'restart'.
>
> Addresses-Coverity: ("Uninitialized scalar variable")
> Fixes: 07120f1abdff ("xfs: Add xfs_has_attr and subroutines")
> Signed-off-by: Colin Ian King
> ---
>  fs/xfs/libxfs/xfs_attr.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/fs/xfs/libxfs/xfs_attr.c b/fs/xfs/libxfs/xfs_attr.c
> index 472b3039eabb..902e5f7e6642 100644
> --- a/fs/xfs/libxfs/xfs_attr.c
> +++ b/fs/xfs/libxfs/xfs_attr.c
> @@ -928,6 +928,7 @@ xfs_attr_node_addname(
>  	 * Search to see if name already exists, and get back a pointer
>  	 * to where it should go.
>  	 */
> +	error = 0;
>  	retval = xfs_attr_node_hasname(args, &state);
>  	if (retval != -ENOATTR && retval != -EEXIST)
>  		goto out;

I think it would be nicer to initialize at the top of the function as
opposed to trying to "preserve" historical behavior, but that nit aside:

Reviewed-by: Brian Foster

> --
> 2.30.2
>
Re: [PATCH] mremap.2: MREMAP_DONTUNMAP to reflect to supported mappings
Hi Alex,

It has not landed yet; it's currently in Andrew's mm tree. I can reach
out again when it makes it into Linus' tree.

Brian

On Thu, Mar 25, 2021 at 2:34 PM Alejandro Colomar (man-pages) wrote:
>
> Hello Brian,
>
> Is this already merged in Linux? I guess not, as I've seen a patch of
> yours for the kernel, right?
>
> Thanks,
>
> Alex
>
> On 3/23/21 7:25 PM, Brian Geffon wrote:
> > mremap(2) now supports MREMAP_DONTUNMAP with mapping types other
> > than private anonymous.
> >
> > Signed-off-by: Brian Geffon
> > ---
> >  man2/mremap.2 | 13 ++---
> >  1 file changed, 2 insertions(+), 11 deletions(-)
> >
> > diff --git a/man2/mremap.2 b/man2/mremap.2
> > index 3ed0c0c0a..72acbc111 100644
> > --- a/man2/mremap.2
> > +++ b/man2/mremap.2
> > @@ -118,16 +118,6 @@ This flag, which must be used in conjunction with
> >  remaps a mapping to a new address but does not unmap the mapping at
> >  .IR old_address .
> >  .IP
> > -The
> > -.B MREMAP_DONTUNMAP
> > -flag can be used only with private anonymous mappings
> > -(see the description of
> > -.BR MAP_PRIVATE
> > -and
> > -.BR MAP_ANONYMOUS
> > -in
> > -.BR mmap (2)).
> > -.IP
> >  After completion,
> >  any access to the range specified by
> >  .IR old_address
> > @@ -227,7 +217,8 @@ was specified, but one or more pages in the range
> >  specified by
> >  .IR old_address
> >  and
> >  .IR old_size
> > -were not private anonymous;
> > +were part of a special mapping or the mapping is one that
> > +does not support merging or expanding;
> >  .IP *
> >  .B MREMAP_DONTUNMAP
> >  was specified and
>
> --
> Alejandro Colomar
> Linux man-pages comaintainer; https://www.kernel.org/doc/man-pages/
> http://www.alejandro-colomar.es/
[PATCH] mremap.2: MREMAP_DONTUNMAP to reflect to supported mappings
mremap(2) now supports MREMAP_DONTUNMAP with mapping types other
than private anonymous.

Signed-off-by: Brian Geffon
---
 man2/mremap.2 | 13 ++---
 1 file changed, 2 insertions(+), 11 deletions(-)

diff --git a/man2/mremap.2 b/man2/mremap.2
index 3ed0c0c0a..72acbc111 100644
--- a/man2/mremap.2
+++ b/man2/mremap.2
@@ -118,16 +118,6 @@ This flag, which must be used in conjunction with
 remaps a mapping to a new address but does not unmap the mapping at
 .IR old_address .
 .IP
-The
-.B MREMAP_DONTUNMAP
-flag can be used only with private anonymous mappings
-(see the description of
-.BR MAP_PRIVATE
-and
-.BR MAP_ANONYMOUS
-in
-.BR mmap (2)).
-.IP
 After completion,
 any access to the range specified by
 .IR old_address
@@ -227,7 +217,8 @@ was specified, but one or more pages in the range
 specified by
 .IR old_address
 and
 .IR old_size
-were not private anonymous;
+were part of a special mapping or the mapping is one that
+does not support merging or expanding;
 .IP *
 .B MREMAP_DONTUNMAP
 was specified and
--
2.31.0.rc2.261.g7f71774620-goog
[PATCH v5 3/3] selftests: Add a MREMAP_DONTUNMAP selftest for shmem
This test extends the current mremap tests to validate that the
MREMAP_DONTUNMAP operation can be performed on shmem mappings.

Signed-off-by: Brian Geffon
---
 tools/testing/selftests/vm/mremap_dontunmap.c | 52 +++
 1 file changed, 52 insertions(+)

diff --git a/tools/testing/selftests/vm/mremap_dontunmap.c b/tools/testing/selftests/vm/mremap_dontunmap.c
index 3a7b5ef0b0c6..f01dc4a85b0b 100644
--- a/tools/testing/selftests/vm/mremap_dontunmap.c
+++ b/tools/testing/selftests/vm/mremap_dontunmap.c
@@ -127,6 +127,57 @@ static void mremap_dontunmap_simple()
 	       "unable to unmap source mapping");
 }
 
+// This test validates that MREMAP_DONTUNMAP on a shared mapping works as expected.
+static void mremap_dontunmap_simple_shmem()
+{
+	unsigned long num_pages = 5;
+
+	int mem_fd = memfd_create("memfd", MFD_CLOEXEC);
+	BUG_ON(mem_fd < 0, "memfd_create");
+
+	BUG_ON(ftruncate(mem_fd, num_pages * page_size) < 0,
+	       "ftruncate");
+
+	void *source_mapping =
+	    mmap(NULL, num_pages * page_size, PROT_READ | PROT_WRITE,
+		 MAP_FILE | MAP_SHARED, mem_fd, 0);
+	BUG_ON(source_mapping == MAP_FAILED, "mmap");
+
+	BUG_ON(close(mem_fd) < 0, "close");
+
+	memset(source_mapping, 'a', num_pages * page_size);
+
+	// Try to just move the whole mapping anywhere (not fixed).
+	void *dest_mapping =
+	    mremap(source_mapping, num_pages * page_size, num_pages * page_size,
+		   MREMAP_DONTUNMAP | MREMAP_MAYMOVE, NULL);
+	if (dest_mapping == MAP_FAILED && errno == EINVAL) {
+		// Old kernel which doesn't support MREMAP_DONTUNMAP on shmem.
+		BUG_ON(munmap(source_mapping, num_pages * page_size) == -1,
+		       "unable to unmap source mapping");
+		return;
+	}
+
+	BUG_ON(dest_mapping == MAP_FAILED, "mremap");
+
+	// Validate that the pages have been moved, we know they were moved if
+	// the dest_mapping contains a's.
+	BUG_ON(check_region_contains_byte
+	       (dest_mapping, num_pages * page_size, 'a') != 0,
+	       "pages did not migrate");
+
+	// Because the region is backed by shmem, we will actually see the same
+	// memory at the source location still.
+	BUG_ON(check_region_contains_byte
+	       (source_mapping, num_pages * page_size, 'a') != 0,
+	       "source should have no ptes");
+
+	BUG_ON(munmap(dest_mapping, num_pages * page_size) == -1,
+	       "unable to unmap destination mapping");
+	BUG_ON(munmap(source_mapping, num_pages * page_size) == -1,
+	       "unable to unmap source mapping");
+}
+
 // This test validates MREMAP_DONTUNMAP will move page tables to a specific
 // destination using MREMAP_FIXED, also while validating that the source
 // remains intact.
@@ -300,6 +351,7 @@ int main(void)
 	BUG_ON(page_buffer == MAP_FAILED, "unable to mmap a page.");
 
 	mremap_dontunmap_simple();
+	mremap_dontunmap_simple_shmem();
 	mremap_dontunmap_simple_fixed();
 	mremap_dontunmap_partial_mapping();
 	mremap_dontunmap_partial_mapping_overwrite();
--
2.31.0.291.g576ba9dcdaf-goog
[PATCH v5 2/3] Revert "mremap: don't allow MREMAP_DONTUNMAP on special_mappings and aio"
This reverts commit cd544fd1dc9293c6702fab6effa63dac1cc67e99.

As discussed in [1] this commit was a no-op because the mapping type was
checked in vma_to_resize before move_vma is ever called. This meant that
vm_ops->mremap() would never be called on such mappings. Furthermore,
we've since expanded support of MREMAP_DONTUNMAP to non-anonymous
mappings, and these special mappings are still protected by the existing
check of !VM_DONTEXPAND and !VM_PFNMAP which will result in a -EINVAL.

1. https://lkml.org/lkml/2020/12/28/2340

Signed-off-by: Brian Geffon
Acked-by: Hugh Dickins
---
 arch/x86/kernel/cpu/resctrl/pseudo_lock.c | 2 +-
 fs/aio.c                                  | 5 +
 include/linux/mm.h                        | 2 +-
 mm/mmap.c                                 | 6 +-
 mm/mremap.c                               | 2 +-
 5 files changed, 5 insertions(+), 12 deletions(-)

diff --git a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
index e916646adc69..0daf2f1cf7a8 100644
--- a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
+++ b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
@@ -1458,7 +1458,7 @@ static int pseudo_lock_dev_release(struct inode *inode, struct file *filp)
 	return 0;
 }
 
-static int pseudo_lock_dev_mremap(struct vm_area_struct *area, unsigned long flags)
+static int pseudo_lock_dev_mremap(struct vm_area_struct *area)
 {
 	/* Not supported */
 	return -EINVAL;
diff --git a/fs/aio.c b/fs/aio.c
index 1f32da13d39e..76ce0cc3ee4e 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -323,16 +323,13 @@ static void aio_free_ring(struct kioctx *ctx)
 	}
 }
 
-static int aio_ring_mremap(struct vm_area_struct *vma, unsigned long flags)
+static int aio_ring_mremap(struct vm_area_struct *vma)
 {
 	struct file *file = vma->vm_file;
 	struct mm_struct *mm = vma->vm_mm;
 	struct kioctx_table *table;
 	int i, res = -EINVAL;
 
-	if (flags & MREMAP_DONTUNMAP)
-		return -EINVAL;
-
 	spin_lock(&mm->ioctx_lock);
 	rcu_read_lock();
 	table = rcu_dereference(mm->ioctx_table);
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 64a71bf20536..ecdc6e8dc5af 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -570,7 +570,7 @@ struct vm_operations_struct {
 	void (*close)(struct vm_area_struct * area);
 	/* Called any time before splitting to check if it's allowed */
 	int (*may_split)(struct vm_area_struct *area, unsigned long addr);
-	int (*mremap)(struct vm_area_struct *area, unsigned long flags);
+	int (*mremap)(struct vm_area_struct *area);
 	/*
 	 * Called by mprotect() to make driver-specific permission
 	 * checks before mprotect() is finalised. The VMA must not
diff --git a/mm/mmap.c b/mm/mmap.c
index 3f287599a7a3..9d7651e4e1fe 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -3403,14 +3403,10 @@ static const char *special_mapping_name(struct vm_area_struct *vma)
 	return ((struct vm_special_mapping *)vma->vm_private_data)->name;
 }
 
-static int special_mapping_mremap(struct vm_area_struct *new_vma,
-				  unsigned long flags)
+static int special_mapping_mremap(struct vm_area_struct *new_vma)
 {
 	struct vm_special_mapping *sm = new_vma->vm_private_data;
 
-	if (flags & MREMAP_DONTUNMAP)
-		return -EINVAL;
-
 	if (WARN_ON_ONCE(current->mm != new_vma->vm_mm))
 		return -EFAULT;
 
diff --git a/mm/mremap.c b/mm/mremap.c
index db5b8b28c2dd..d22629ff8f3c 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -545,7 +545,7 @@ static unsigned long move_vma(struct vm_area_struct *vma,
 	if (moved_len < old_len) {
 		err = -ENOMEM;
 	} else if (vma->vm_ops && vma->vm_ops->mremap) {
-		err = vma->vm_ops->mremap(new_vma, flags);
+		err = vma->vm_ops->mremap(new_vma);
 	}
 
 	if (unlikely(err)) {
--
2.31.0.291.g576ba9dcdaf-goog
[PATCH v5 1/3] mm: Extend MREMAP_DONTUNMAP to non-anonymous mappings
Currently MREMAP_DONTUNMAP only accepts private anonymous mappings. This
restriction was placed initially for simplicity and not because there
exists a technical reason to do so.

This change will widen the support to include any mappings which are not
VM_DONTEXPAND or VM_PFNMAP. The primary use case is to support
MREMAP_DONTUNMAP on mappings which may have been created from a memfd.
This change will result in mremap(MREMAP_DONTUNMAP) returning -EINVAL
if VM_DONTEXPAND or VM_PFNMAP mappings are specified.

Lokesh Gidra, who works on the Android JVM, provided an explanation of
how such a feature will improve Android JVM garbage collection:

"Android is developing a new garbage collector (GC), based on
userfaultfd. The garbage collector will use userfaultfd (uffd) on the
java heap during compaction. On accessing any uncompacted page, the
application threads will find it missing, at which point the thread will
create the compacted page and then use UFFDIO_COPY ioctl to get it
mapped and then resume execution. Before starting this compaction, in a
stop-the-world pause the heap will be mremap(MREMAP_DONTUNMAP) so that
the java heap is ready to receive UFFD_EVENT_PAGEFAULT events after
resuming execution.

To speedup mremap operations, pagetable movement was optimized by moving
PUD entries instead of PTE entries [1]. It was necessary as mremap of
even modest sized memory ranges also took several milliseconds, and
stopping the application for that long isn't acceptable in response-time
sensitive cases.

With UFFDIO_CONTINUE feature [2], it will be even more efficient to
implement this GC, particularly the 'non-moveable' portions of the heap.
It will also help in reducing the need to copy (UFFDIO_COPY) the pages.
However, for this to work, the java heap has to be on a 'shared' vma.
Currently MREMAP_DONTUNMAP only supports private anonymous mappings,
this patch will enable using UFFDIO_CONTINUE for the new
userfaultfd-based heap compaction."
[1] https://lore.kernel.org/linux-mm/20201215030730.nc3cu98e4%25a...@linux-foundation.org/
[2] https://lore.kernel.org/linux-mm/20210302000133.272579-1-axelrasmus...@google.com/

Signed-off-by: Brian Geffon
Acked-by: Hugh Dickins
Tested-by: Lokesh Gidra
---
 mm/mremap.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/mremap.c b/mm/mremap.c
index ec8f840399ed..db5b8b28c2dd 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -653,8 +653,8 @@ static struct vm_area_struct *vma_to_resize(unsigned long addr,
 		return ERR_PTR(-EINVAL);
 	}
 
-	if (flags & MREMAP_DONTUNMAP && (!vma_is_anonymous(vma) ||
-			vma->vm_flags & VM_SHARED))
+	if ((flags & MREMAP_DONTUNMAP) &&
+			(vma->vm_flags & (VM_DONTEXPAND | VM_PFNMAP)))
 		return ERR_PTR(-EINVAL);
 
 	if (is_vm_hugetlb_page(vma))
-- 
2.31.0.291.g576ba9dcdaf-goog
[PATCH] mremap.2: Update MREMAP_DONTUNMAP to reflect supported mappings
mremap(2) now supports MREMAP_DONTUNMAP with mapping types other than
private anonymous.

Signed-off-by: Brian Geffon
---
 man2/mremap.2 | 13 ++-----------
 1 file changed, 2 insertions(+), 11 deletions(-)

diff --git a/man2/mremap.2 b/man2/mremap.2
index 3ed0c0c0a..72acbc111 100644
--- a/man2/mremap.2
+++ b/man2/mremap.2
@@ -118,16 +118,6 @@ This flag, which must be used in conjunction with
 remaps a mapping to a new address but does not unmap the mapping at
 .IR old_address .
 .IP
-The
-.B MREMAP_DONTUNMAP
-flag can be used only with private anonymous mappings
-(see the description of
-.BR MAP_PRIVATE
-and
-.BR MAP_ANONYMOUS
-in
-.BR mmap (2)).
-.IP
 After completion,
 any access to the range specified by
 .IR old_address
@@ -227,7 +217,8 @@ was specified, but one or more pages in the range specified by
 .IR old_address
 and
 .IR old_size
-were not private anonymous;
+were part of a special mapping or the mapping is one that
+does not support merging or expanding;
 .IP *
 .B MREMAP_DONTUNMAP
 was specified and
-- 
2.31.0.rc2.261.g7f71774620-goog
[PATCH v4 1/3] mm: Extend MREMAP_DONTUNMAP to non-anonymous mappings
Currently MREMAP_DONTUNMAP only accepts private anonymous mappings. This
restriction was placed initially for simplicity and not because there
exists a technical reason to do so.

This change will widen the support to include any mappings which are not
VM_DONTEXPAND or VM_PFNMAP. The primary use case is to support
MREMAP_DONTUNMAP on mappings which may have been created from a memfd.
This change will result in mremap(MREMAP_DONTUNMAP) returning -EINVAL
if VM_DONTEXPAND or VM_PFNMAP mappings are specified.

Lokesh Gidra, who works on the Android JVM, provided an explanation of
how such a feature will improve Android JVM garbage collection:

"Android is developing a new garbage collector (GC), based on
userfaultfd. The garbage collector will use userfaultfd (uffd) on the
java heap during compaction. On accessing any uncompacted page, the
application threads will find it missing, at which point the thread will
create the compacted page and then use UFFDIO_COPY ioctl to get it
mapped and then resume execution. Before starting this compaction, in a
stop-the-world pause the heap will be mremap(MREMAP_DONTUNMAP) so that
the java heap is ready to receive UFFD_EVENT_PAGEFAULT events after
resuming execution.

To speedup mremap operations, pagetable movement was optimized by moving
PUD entries instead of PTE entries [1]. It was necessary as mremap of
even modest sized memory ranges also took several milliseconds, and
stopping the application for that long isn't acceptable in response-time
sensitive cases.

With UFFDIO_CONTINUE feature [2], it will be even more efficient to
implement this GC, particularly the 'non-moveable' portions of the heap.
It will also help in reducing the need to copy (UFFDIO_COPY) the pages.
However, for this to work, the java heap has to be on a 'shared' vma.
Currently MREMAP_DONTUNMAP only supports private anonymous mappings,
this patch will enable using UFFDIO_CONTINUE for the new
userfaultfd-based heap compaction."
[1] https://lore.kernel.org/linux-mm/20201215030730.nc3cu98e4%25a...@linux-foundation.org/
[2] https://lore.kernel.org/linux-mm/20210302000133.272579-1-axelrasmus...@google.com/

Signed-off-by: Brian Geffon
Acked-by: Hugh Dickins
Tested-by: Lokesh Gidra
---
 mm/mremap.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/mremap.c b/mm/mremap.c
index ec8f840399ed..db5b8b28c2dd 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -653,8 +653,8 @@ static struct vm_area_struct *vma_to_resize(unsigned long addr,
 		return ERR_PTR(-EINVAL);
 	}
 
-	if (flags & MREMAP_DONTUNMAP && (!vma_is_anonymous(vma) ||
-			vma->vm_flags & VM_SHARED))
+	if ((flags & MREMAP_DONTUNMAP) &&
+			(vma->vm_flags & (VM_DONTEXPAND | VM_PFNMAP)))
 		return ERR_PTR(-EINVAL);
 
 	if (is_vm_hugetlb_page(vma))
-- 
2.31.0.rc2.261.g7f71774620-goog
[PATCH v4 3/3] selftests: Add a MREMAP_DONTUNMAP selftest for shmem
This test extends the current mremap tests to validate that the
MREMAP_DONTUNMAP operation can be performed on shmem mappings.

Signed-off-by: Brian Geffon
---
 tools/testing/selftests/vm/mremap_dontunmap.c | 52 +++++++++++++++++++
 1 file changed, 52 insertions(+)

diff --git a/tools/testing/selftests/vm/mremap_dontunmap.c b/tools/testing/selftests/vm/mremap_dontunmap.c
index 3a7b5ef0b0c6..f01dc4a85b0b 100644
--- a/tools/testing/selftests/vm/mremap_dontunmap.c
+++ b/tools/testing/selftests/vm/mremap_dontunmap.c
@@ -127,6 +127,57 @@ static void mremap_dontunmap_simple()
 	       "unable to unmap source mapping");
 }
 
+// This test validates that MREMAP_DONTUNMAP on a shared mapping works as expected.
+static void mremap_dontunmap_simple_shmem()
+{
+	unsigned long num_pages = 5;
+
+	int mem_fd = memfd_create("memfd", MFD_CLOEXEC);
+	BUG_ON(mem_fd < 0, "memfd_create");
+
+	BUG_ON(ftruncate(mem_fd, num_pages * page_size) < 0,
+	       "ftruncate");
+
+	void *source_mapping =
+	    mmap(NULL, num_pages * page_size, PROT_READ | PROT_WRITE,
+		 MAP_FILE | MAP_SHARED, mem_fd, 0);
+	BUG_ON(source_mapping == MAP_FAILED, "mmap");
+
+	BUG_ON(close(mem_fd) < 0, "close");
+
+	memset(source_mapping, 'a', num_pages * page_size);
+
+	// Try to just move the whole mapping anywhere (not fixed).
+	void *dest_mapping =
+	    mremap(source_mapping, num_pages * page_size, num_pages * page_size,
+		   MREMAP_DONTUNMAP | MREMAP_MAYMOVE, NULL);
+	if (dest_mapping == MAP_FAILED && errno == EINVAL) {
+		// Old kernel which doesn't support MREMAP_DONTUNMAP on shmem.
+		BUG_ON(munmap(source_mapping, num_pages * page_size) == -1,
+		       "unable to unmap source mapping");
+		return;
+	}
+
+	BUG_ON(dest_mapping == MAP_FAILED, "mremap");
+
+	// Validate that the pages have been moved, we know they were moved if
+	// the dest_mapping contains a's.
+	BUG_ON(check_region_contains_byte
+	       (dest_mapping, num_pages * page_size, 'a') != 0,
+	       "pages did not migrate");
+
+	// Because the region is backed by shmem, we will actually see the same
+	// memory at the source location still.
+	BUG_ON(check_region_contains_byte
+	       (source_mapping, num_pages * page_size, 'a') != 0,
+	       "source should have no ptes");
+
+	BUG_ON(munmap(dest_mapping, num_pages * page_size) == -1,
+	       "unable to unmap destination mapping");
+	BUG_ON(munmap(source_mapping, num_pages * page_size) == -1,
+	       "unable to unmap source mapping");
+}
+
 // This test validates MREMAP_DONTUNMAP will move page tables to a specific
 // destination using MREMAP_FIXED, also while validating that the source
 // remains intact.
@@ -300,6 +351,7 @@ int main(void)
 	BUG_ON(page_buffer == MAP_FAILED, "unable to mmap a page.");
 
 	mremap_dontunmap_simple();
+	mremap_dontunmap_simple_shmem();
 	mremap_dontunmap_simple_fixed();
 	mremap_dontunmap_partial_mapping();
 	mremap_dontunmap_partial_mapping_overwrite();
-- 
2.31.0.rc2.261.g7f71774620-goog
[PATCH v4 2/3] Revert "mremap: don't allow MREMAP_DONTUNMAP on special_mappings and aio"
This reverts commit cd544fd1dc9293c6702fab6effa63dac1cc67e99.

As discussed in [1] this commit was a no-op because the mapping type was
checked in vma_to_resize before move_vma is ever called. This meant that
vm_ops->mremap() would never be called on such mappings. Furthermore,
we've since expanded support of MREMAP_DONTUNMAP to non-anonymous
mappings, and these special mappings are still protected by the existing
check of !VM_DONTEXPAND and !VM_PFNMAP which will result in a -EINVAL.

1. https://lkml.org/lkml/2020/12/28/2340

Signed-off-by: Brian Geffon
Acked-by: Hugh Dickins
---
 arch/x86/kernel/cpu/resctrl/pseudo_lock.c | 2 +-
 fs/aio.c                                  | 5 +----
 include/linux/mm.h                        | 2 +-
 mm/mmap.c                                 | 6 +-----
 mm/mremap.c                               | 2 +-
 5 files changed, 5 insertions(+), 12 deletions(-)

diff --git a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
index e916646adc69..0daf2f1cf7a8 100644
--- a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
+++ b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
@@ -1458,7 +1458,7 @@ static int pseudo_lock_dev_release(struct inode *inode, struct file *filp)
 	return 0;
 }
 
-static int pseudo_lock_dev_mremap(struct vm_area_struct *area, unsigned long flags)
+static int pseudo_lock_dev_mremap(struct vm_area_struct *area)
 {
 	/* Not supported */
 	return -EINVAL;
diff --git a/fs/aio.c b/fs/aio.c
index 1f32da13d39e..76ce0cc3ee4e 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -323,16 +323,13 @@ static void aio_free_ring(struct kioctx *ctx)
 	}
 }
 
-static int aio_ring_mremap(struct vm_area_struct *vma, unsigned long flags)
+static int aio_ring_mremap(struct vm_area_struct *vma)
 {
 	struct file *file = vma->vm_file;
 	struct mm_struct *mm = vma->vm_mm;
 	struct kioctx_table *table;
 	int i, res = -EINVAL;
 
-	if (flags & MREMAP_DONTUNMAP)
-		return -EINVAL;
-
 	spin_lock(&mm->ioctx_lock);
 	rcu_read_lock();
 	table = rcu_dereference(mm->ioctx_table);
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 64a71bf20536..ecdc6e8dc5af 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -570,7 +570,7 @@ struct vm_operations_struct {
 	void (*close)(struct vm_area_struct * area);
 	/* Called any time before splitting to check if it's allowed */
 	int (*may_split)(struct vm_area_struct *area, unsigned long addr);
-	int (*mremap)(struct vm_area_struct *area, unsigned long flags);
+	int (*mremap)(struct vm_area_struct *area);
 	/*
 	 * Called by mprotect() to make driver-specific permission
 	 * checks before mprotect() is finalised. The VMA must not
diff --git a/mm/mmap.c b/mm/mmap.c
index 3f287599a7a3..9d7651e4e1fe 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -3403,14 +3403,10 @@ static const char *special_mapping_name(struct vm_area_struct *vma)
 	return ((struct vm_special_mapping *)vma->vm_private_data)->name;
 }
 
-static int special_mapping_mremap(struct vm_area_struct *new_vma,
-				  unsigned long flags)
+static int special_mapping_mremap(struct vm_area_struct *new_vma)
 {
 	struct vm_special_mapping *sm = new_vma->vm_private_data;
 
-	if (flags & MREMAP_DONTUNMAP)
-		return -EINVAL;
-
 	if (WARN_ON_ONCE(current->mm != new_vma->vm_mm))
 		return -EFAULT;
 
diff --git a/mm/mremap.c b/mm/mremap.c
index db5b8b28c2dd..d22629ff8f3c 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -545,7 +545,7 @@ static unsigned long move_vma(struct vm_area_struct *vma,
 	if (moved_len < old_len) {
 		err = -ENOMEM;
 	} else if (vma->vm_ops && vma->vm_ops->mremap) {
-		err = vma->vm_ops->mremap(new_vma, flags);
+		err = vma->vm_ops->mremap(new_vma);
 	}
 
 	if (unlikely(err)) {
-- 
2.31.0.rc2.261.g7f71774620-goog
Re: [RFC 2/7] ath10k: Add support to process rx packet in thread
On Mon, Mar 22, 2021 at 4:58 PM Ben Greear wrote:
> On 7/22/20 6:00 AM, Felix Fietkau wrote:
> > On 2020-07-22 14:55, Johannes Berg wrote:
> >> On Wed, 2020-07-22 at 14:27 +0200, Felix Fietkau wrote:
> >>
> >>> I'm considering testing a different approach (with mt76 initially):
> >>> - Add a mac80211 rx function that puts processed skbs into a list
> >>>   instead of handing them to the network stack directly.
> >>
> >> Would this be *after* all the mac80211 processing, i.e. in place of
> >> the rx-up-to-stack?
> > Yes, it would run all the rx handlers normally and then put the
> > resulting skbs into a list instead of calling netif_receive_skb or
> > napi_gro_frags.
>
> Whatever came of this? I realized I'm running Felix's patch since his
> mt76 driver needs it. Any chance it will go upstream?

If you're asking about $subject (moving NAPI/RX to a thread), this
landed upstream recently:

http://git.kernel.org/linus/adbb4fb028452b1b0488a1a7b66ab856cdf20715

It needs a bit of coaxing to work on a WiFi driver (including: WiFi
drivers tend to have a different netdev for NAPI than they expose to
/sys/class/net/), but it's there.

I'm not sure if people had something else in mind in the stuff you're
quoting though.

Brian
Re: [PATCH 2/2] ibmvfc: make ibmvfc_wait_for_ops MQ aware
Reviewed-by: Brian King

-- 
Brian King
Power Linux I/O
IBM Linux Technology Center
Re: [PATCH 1/2] ibmvfc: fix potential race in ibmvfc_wait_for_ops
Reviewed-by: Brian King

-- 
Brian King
Power Linux I/O
IBM Linux Technology Center
[PATCH v3 2/2] Revert "mremap: don't allow MREMAP_DONTUNMAP on special_mappings and aio"
This reverts commit cd544fd1dc9293c6702fab6effa63dac1cc67e99.

As discussed in [1] this commit was a no-op because the mapping type was
checked in vma_to_resize before move_vma is ever called. This meant that
vm_ops->mremap() would never be called on such mappings. Furthermore,
we've since expanded support of MREMAP_DONTUNMAP to non-anonymous
mappings, and these special mappings are still protected by the existing
check of !VM_DONTEXPAND and !VM_PFNMAP which will result in a -EINVAL.

1. https://lkml.org/lkml/2020/12/28/2340

Signed-off-by: Brian Geffon
---
 arch/x86/kernel/cpu/resctrl/pseudo_lock.c | 2 +-
 fs/aio.c                                  | 5 +----
 include/linux/mm.h                        | 2 +-
 mm/mmap.c                                 | 6 +-----
 mm/mremap.c                               | 2 +-
 5 files changed, 5 insertions(+), 12 deletions(-)

diff --git a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
index e916646adc69..0daf2f1cf7a8 100644
--- a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
+++ b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
@@ -1458,7 +1458,7 @@ static int pseudo_lock_dev_release(struct inode *inode, struct file *filp)
 	return 0;
 }
 
-static int pseudo_lock_dev_mremap(struct vm_area_struct *area, unsigned long flags)
+static int pseudo_lock_dev_mremap(struct vm_area_struct *area)
 {
 	/* Not supported */
 	return -EINVAL;
diff --git a/fs/aio.c b/fs/aio.c
index 1f32da13d39e..76ce0cc3ee4e 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -323,16 +323,13 @@ static void aio_free_ring(struct kioctx *ctx)
 	}
 }
 
-static int aio_ring_mremap(struct vm_area_struct *vma, unsigned long flags)
+static int aio_ring_mremap(struct vm_area_struct *vma)
 {
 	struct file *file = vma->vm_file;
 	struct mm_struct *mm = vma->vm_mm;
 	struct kioctx_table *table;
 	int i, res = -EINVAL;
 
-	if (flags & MREMAP_DONTUNMAP)
-		return -EINVAL;
-
 	spin_lock(&mm->ioctx_lock);
 	rcu_read_lock();
 	table = rcu_dereference(mm->ioctx_table);
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 77e64e3eac80..8c3729eb3e38 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -570,7 +570,7 @@ struct vm_operations_struct {
 	void (*close)(struct vm_area_struct * area);
 	/* Called any time before splitting to check if it's allowed */
 	int (*may_split)(struct vm_area_struct *area, unsigned long addr);
-	int (*mremap)(struct vm_area_struct *area, unsigned long flags);
+	int (*mremap)(struct vm_area_struct *area);
 	/*
 	 * Called by mprotect() to make driver-specific permission
 	 * checks before mprotect() is finalised. The VMA must not
diff --git a/mm/mmap.c b/mm/mmap.c
index 3f287599a7a3..9d7651e4e1fe 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -3403,14 +3403,10 @@ static const char *special_mapping_name(struct vm_area_struct *vma)
 	return ((struct vm_special_mapping *)vma->vm_private_data)->name;
 }
 
-static int special_mapping_mremap(struct vm_area_struct *new_vma,
-				  unsigned long flags)
+static int special_mapping_mremap(struct vm_area_struct *new_vma)
 {
 	struct vm_special_mapping *sm = new_vma->vm_private_data;
 
-	if (flags & MREMAP_DONTUNMAP)
-		return -EINVAL;
-
 	if (WARN_ON_ONCE(current->mm != new_vma->vm_mm))
 		return -EFAULT;
 
diff --git a/mm/mremap.c b/mm/mremap.c
index db5b8b28c2dd..d22629ff8f3c 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -545,7 +545,7 @@ static unsigned long move_vma(struct vm_area_struct *vma,
 	if (moved_len < old_len) {
 		err = -ENOMEM;
 	} else if (vma->vm_ops && vma->vm_ops->mremap) {
-		err = vma->vm_ops->mremap(new_vma, flags);
+		err = vma->vm_ops->mremap(new_vma);
 	}
 
 	if (unlikely(err)) {
-- 
2.31.0.rc2.261.g7f71774620-goog
[PATCH v3 1/2] mm: Allow non-VM_DONTEXPAND and VM_PFNMAP mappings with MREMAP_DONTUNMAP
Currently MREMAP_DONTUNMAP only accepts private anonymous mappings.
This change will widen the support to include any mappings which are
not VM_DONTEXPAND or VM_PFNMAP. The primary use case is to support
MREMAP_DONTUNMAP on mappings which may have been created from a memfd.
This change will result in mremap(MREMAP_DONTUNMAP) returning -EINVAL
if VM_DONTEXPAND or VM_PFNMAP mappings are specified.

Lokesh Gidra, who works on the Android JVM, provided an explanation of
how such a feature will improve Android JVM garbage collection:

"Android is developing a new garbage collector (GC), based on
userfaultfd. The garbage collector will use userfaultfd (uffd) on the
java heap during compaction. On accessing any uncompacted page, the
application threads will find it missing, at which point the thread will
create the compacted page and then use UFFDIO_COPY ioctl to get it
mapped and then resume execution. Before starting this compaction, in a
stop-the-world pause the heap will be mremap(MREMAP_DONTUNMAP) so that
the java heap is ready to receive UFFD_EVENT_PAGEFAULT events after
resuming execution.

To speedup mremap operations, pagetable movement was optimized by moving
PUD entries instead of PTE entries [1]. It was necessary as mremap of
even modest sized memory ranges also took several milliseconds, and
stopping the application for that long isn't acceptable in response-time
sensitive cases.

With UFFDIO_CONTINUE feature [2], it will be even more efficient to
implement this GC, particularly the 'non-moveable' portions of the heap.
It will also help in reducing the need to copy (UFFDIO_COPY) the pages.
However, for this to work, the java heap has to be on a 'shared' vma.
Currently MREMAP_DONTUNMAP only supports private anonymous mappings,
this patch will enable using UFFDIO_CONTINUE for the new
userfaultfd-based heap compaction."
[1] https://lore.kernel.org/linux-mm/20201215030730.nc3cu98e4%25a...@linux-foundation.org/
[2] https://lore.kernel.org/linux-mm/20210302000133.272579-1-axelrasmus...@google.com/

Signed-off-by: Brian Geffon
---
 mm/mremap.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/mremap.c b/mm/mremap.c
index ec8f840399ed..db5b8b28c2dd 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -653,8 +653,8 @@ static struct vm_area_struct *vma_to_resize(unsigned long addr,
 		return ERR_PTR(-EINVAL);
 	}
 
-	if (flags & MREMAP_DONTUNMAP && (!vma_is_anonymous(vma) ||
-			vma->vm_flags & VM_SHARED))
+	if ((flags & MREMAP_DONTUNMAP) &&
+			(vma->vm_flags & (VM_DONTEXPAND | VM_PFNMAP)))
 		return ERR_PTR(-EINVAL);
 
 	if (is_vm_hugetlb_page(vma))
-- 
2.31.0.rc2.261.g7f71774620-goog
Re: [PATCH v2 1/2] mm: Allow non-VM_DONTEXPAND and VM_PFNMAP mappings with MREMAP_DONTUNMAP
You're 100% correct, I'll mail a new patch in a few.

Brian

On Wed, Mar 17, 2021 at 5:19 PM Peter Xu wrote:
>
> On Wed, Mar 17, 2021 at 04:44:25PM -0400, Brian Geffon wrote:
> > Hi Peter,
>
> Hi, Brian,
>
> > Thank you as always for taking a look. This change relies on the
> > existing check in vma_to_resize on line 686:
> > https://elixir.bootlin.com/linux/v5.12-rc3/source/mm/mremap.c#L686
> > which returns -EFAULT when the vma is VM_DONTEXPAND or VM_PFNMAP.
>
> Do you mean line 676?
>
> https://elixir.bootlin.com/linux/v5.12-rc3/source/mm/mremap.c#L676
>
> I'm not sure whether it'll work for MREMAP_DONTUNMAP, since IIUC
> MREMAP_DONTUNMAP only works for the remap case with no size change,
> however in that case in vma_to_resize() we'll bail out even earlier
> than line 676 when checking against the size:
>
> https://elixir.bootlin.com/linux/v5.12-rc3/source/mm/mremap.c#L667
>
> So IIUC we'll still need the change as Hugh suggested previously.
>
> Thanks,
>
> --
> Peter Xu
Re: [PATCH v2 1/2] mm: Allow non-VM_DONTEXPAND and VM_PFNMAP mappings with MREMAP_DONTUNMAP
Hi Peter,

Thank you as always for taking a look. This change relies on the
existing check in vma_to_resize on line 686:
https://elixir.bootlin.com/linux/v5.12-rc3/source/mm/mremap.c#L686
which returns -EFAULT when the vma is VM_DONTEXPAND or VM_PFNMAP.

Thanks
Brian

On Wed, Mar 17, 2021 at 4:40 PM Peter Xu wrote:
>
> Hi, Brian,
>
> On Wed, Mar 17, 2021 at 12:13:33PM -0700, Brian Geffon wrote:
> > Currently MREMAP_DONTUNMAP only accepts private anonymous mappings. This
> > change will widen the support to include any mappings which are not
> > VM_DONTEXPAND or VM_PFNMAP. The primary use case is to support
> > MREMAP_DONTUNMAP on mappings which may have been created from a memfd.
> >
> > This change which takes advantage of the existing check in vma_to_resize
> > for non-VM_DONTEXPAND and non-VM_PFNMAP mappings will cause
> > MREMAP_DONTUNMAP to return -EFAULT if such mappings are remapped. This
> > behavior is consistent with existing behavior when using mremap with
> > such mappings.
> >
> > Lokesh Gidra who works on the Android JVM, provided an explanation of how
> > such a feature will improve Android JVM garbage collection:
> > "Android is developing a new garbage collector (GC), based on userfaultfd.
> > The garbage collector will use userfaultfd (uffd) on the java heap during
> > compaction. On accessing any uncompacted page, the application threads will
> > find it missing, at which point the thread will create the compacted page
> > and then use UFFDIO_COPY ioctl to get it mapped and then resume execution.
> > Before starting this compaction, in a stop-the-world pause the heap will be
> > mremap(MREMAP_DONTUNMAP) so that the java heap is ready to receive
> > UFFD_EVENT_PAGEFAULT events after resuming execution.
> >
> > To speedup mremap operations, pagetable movement was optimized by moving
> > PUD entries instead of PTE entries [1]. It was necessary as mremap of even
> > modest sized memory ranges also took several milliseconds, and stopping the
> > application for that long isn't acceptable in response-time sensitive
> > cases.
> >
> > With UFFDIO_CONTINUE feature [2], it will be even more efficient to
> > implement this GC, particularly the 'non-moveable' portions of the heap.
> > It will also help in reducing the need to copy (UFFDIO_COPY) the pages.
> > However, for this to work, the java heap has to be on a 'shared' vma.
> > Currently MREMAP_DONTUNMAP only supports private anonymous mappings, this
> > patch will enable using UFFDIO_CONTINUE for the new userfaultfd-based heap
> > compaction."
> >
> > [1] https://lore.kernel.org/linux-mm/20201215030730.nc3cu98e4%25a...@linux-foundation.org/
> > [2] https://lore.kernel.org/linux-mm/20210302000133.272579-1-axelrasmus...@google.com/
> >
> > Signed-off-by: Brian Geffon
> > ---
> >  mm/mremap.c | 4 ----
> >  1 file changed, 4 deletions(-)
> >
> > diff --git a/mm/mremap.c b/mm/mremap.c
> > index ec8f840399ed..2c57dc4bc8b6 100644
> > --- a/mm/mremap.c
> > +++ b/mm/mremap.c
> > @@ -653,10 +653,6 @@ static struct vm_area_struct *vma_to_resize(unsigned long addr,
> >               return ERR_PTR(-EINVAL);
> >       }
> >
> > -     if (flags & MREMAP_DONTUNMAP && (!vma_is_anonymous(vma) ||
> > -                     vma->vm_flags & VM_SHARED))
> > -             return ERR_PTR(-EINVAL);
> > -
> >       if (is_vm_hugetlb_page(vma))
> >               return ERR_PTR(-EINVAL);
>
> The code change seems to be not aligned with what the commit message
> said. Did you perhaps forget to add the checks against
> VM_DONTEXPAND | VM_PFNMAP? I'm guessing that (instead of commit message
> to be touched up) because you still attached the revert patch, then
> that check seems to be needed.
>
> Thanks,
>
> --
> Peter Xu
[PATCH v2 2/2] Revert "mremap: don't allow MREMAP_DONTUNMAP on special_mappings and aio"
This reverts commit cd544fd1dc9293c6702fab6effa63dac1cc67e99.

As discussed in [1] this commit was a no-op because the mapping type was
checked in vma_to_resize before move_vma is ever called. This meant that
vm_ops->mremap() would never be called on such mappings. Furthermore,
we've since expanded support of MREMAP_DONTUNMAP to non-anonymous
mappings, and these special mappings are still protected by the existing
check of !VM_DONTEXPAND and !VM_PFNMAP which will result in a -EFAULT.

1. https://lkml.org/lkml/2020/12/28/2340

Signed-off-by: Brian Geffon
---
 arch/x86/kernel/cpu/resctrl/pseudo_lock.c | 2 +-
 fs/aio.c                                  | 5 +----
 include/linux/mm.h                        | 2 +-
 mm/mmap.c                                 | 6 +-----
 mm/mremap.c                               | 2 +-
 5 files changed, 5 insertions(+), 12 deletions(-)

diff --git a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
index e916646adc69..0daf2f1cf7a8 100644
--- a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
+++ b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
@@ -1458,7 +1458,7 @@ static int pseudo_lock_dev_release(struct inode *inode, struct file *filp)
 	return 0;
 }
 
-static int pseudo_lock_dev_mremap(struct vm_area_struct *area, unsigned long flags)
+static int pseudo_lock_dev_mremap(struct vm_area_struct *area)
 {
 	/* Not supported */
 	return -EINVAL;
diff --git a/fs/aio.c b/fs/aio.c
index 1f32da13d39e..76ce0cc3ee4e 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -323,16 +323,13 @@ static void aio_free_ring(struct kioctx *ctx)
 	}
 }
 
-static int aio_ring_mremap(struct vm_area_struct *vma, unsigned long flags)
+static int aio_ring_mremap(struct vm_area_struct *vma)
 {
 	struct file *file = vma->vm_file;
 	struct mm_struct *mm = vma->vm_mm;
 	struct kioctx_table *table;
 	int i, res = -EINVAL;
 
-	if (flags & MREMAP_DONTUNMAP)
-		return -EINVAL;
-
 	spin_lock(&mm->ioctx_lock);
 	rcu_read_lock();
 	table = rcu_dereference(mm->ioctx_table);
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 77e64e3eac80..8c3729eb3e38 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -570,7 +570,7 @@ struct vm_operations_struct {
 	void (*close)(struct vm_area_struct * area);
 	/* Called any time before splitting to check if it's allowed */
 	int (*may_split)(struct vm_area_struct *area, unsigned long addr);
-	int (*mremap)(struct vm_area_struct *area, unsigned long flags);
+	int (*mremap)(struct vm_area_struct *area);
 	/*
 	 * Called by mprotect() to make driver-specific permission
 	 * checks before mprotect() is finalised. The VMA must not
diff --git a/mm/mmap.c b/mm/mmap.c
index 3f287599a7a3..9d7651e4e1fe 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -3403,14 +3403,10 @@ static const char *special_mapping_name(struct vm_area_struct *vma)
 	return ((struct vm_special_mapping *)vma->vm_private_data)->name;
 }
 
-static int special_mapping_mremap(struct vm_area_struct *new_vma,
-				  unsigned long flags)
+static int special_mapping_mremap(struct vm_area_struct *new_vma)
 {
 	struct vm_special_mapping *sm = new_vma->vm_private_data;
 
-	if (flags & MREMAP_DONTUNMAP)
-		return -EINVAL;
-
 	if (WARN_ON_ONCE(current->mm != new_vma->vm_mm))
 		return -EFAULT;
 
diff --git a/mm/mremap.c b/mm/mremap.c
index 2c57dc4bc8b6..b1f7bc43ece9 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -545,7 +545,7 @@ static unsigned long move_vma(struct vm_area_struct *vma,
 	if (moved_len < old_len) {
 		err = -ENOMEM;
 	} else if (vma->vm_ops && vma->vm_ops->mremap) {
-		err = vma->vm_ops->mremap(new_vma, flags);
+		err = vma->vm_ops->mremap(new_vma);
 	}
 
 	if (unlikely(err)) {
-- 
2.31.0.rc2.261.g7f71774620-goog
[PATCH v2 1/2] mm: Allow non-VM_DONTEXPAND and VM_PFNMAP mappings with MREMAP_DONTUNMAP
Currently MREMAP_DONTUNMAP only accepts private anonymous mappings. This change will widen the support to include any mappings which are not VM_DONTEXPAND or VM_PFNMAP. The primary use case is to support MREMAP_DONTUNMAP on mappings which may have been created from a memfd. This change which takes advantage of the existing check in vma_to_resize for non-VM_DONTEXPAND and non-VM_PFNMAP mappings will cause MREMAP_DONTUNMAP to return -EFAULT if such mappings are remapped. This behavior is consistent with existing behavior when using mremap with such mappings. Lokesh Gidra who works on the Android JVM, provided an explanation of how such a feature will improve Android JVM garbage collection: "Android is developing a new garbage collector (GC), based on userfaultfd. The garbage collector will use userfaultfd (uffd) on the java heap during compaction. On accessing any uncompacted page, the application threads will find it missing, at which point the thread will create the compacted page and then use UFFDIO_COPY ioctl to get it mapped and then resume execution. Before starting this compaction, in a stop-the-world pause the heap will be mremap(MREMAP_DONTUNMAP) so that the java heap is ready to receive UFFD_EVENT_PAGEFAULT events after resuming execution. To speedup mremap operations, pagetable movement was optimized by moving PUD entries instead of PTE entries [1]. It was necessary as mremap of even modest sized memory ranges also took several milliseconds, and stopping the application for that long isn't acceptable in response-time sensitive cases. With UFFDIO_CONTINUE feature [2], it will be even more efficient to implement this GC, particularly the 'non-moveable' portions of the heap. It will also help in reducing the need to copy (UFFDIO_COPY) the pages. However, for this to work, the java heap has to be on a 'shared' vma. 
Currently MREMAP_DONTUNMAP only supports private anonymous mappings, this patch will enable using UFFDIO_CONTINUE for the new userfaultfd-based heap compaction."

[1] https://lore.kernel.org/linux-mm/20201215030730.nc3cu98e4%25a...@linux-foundation.org/
[2] https://lore.kernel.org/linux-mm/20210302000133.272579-1-axelrasmus...@google.com/

Signed-off-by: Brian Geffon
---
 mm/mremap.c | 4 ----
 1 file changed, 4 deletions(-)

diff --git a/mm/mremap.c b/mm/mremap.c
index ec8f840399ed..2c57dc4bc8b6 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -653,10 +653,6 @@ static struct vm_area_struct *vma_to_resize(unsigned long addr,
 		return ERR_PTR(-EINVAL);
 	}
 
-	if (flags & MREMAP_DONTUNMAP && (!vma_is_anonymous(vma) ||
-			vma->vm_flags & VM_SHARED))
-		return ERR_PTR(-EINVAL);
-
 	if (is_vm_hugetlb_page(vma))
 		return ERR_PTR(-EINVAL);
 
-- 
2.31.0.rc2.261.g7f71774620-goog
Re: [PATCH] mm: Allow shmem mappings with MREMAP_DONTUNMAP
Hi Hugh,

Thanks for this suggestion, responses inline.

> A better patch would say:
>
> -	if (flags & MREMAP_DONTUNMAP && (!vma_is_anonymous(vma) ||
> -			vma->vm_flags & VM_SHARED))
> +	if ((flags & MREMAP_DONTUNMAP) &&
> +	    (vma->vm_flags & (VM_DONTEXPAND | VM_PFNMAP)))
> 		return ERR_PTR(-EINVAL);
>
> VM_DONTEXPAND is what has long been used on special mappings, to prevent
> surprises from mremap changing the size of the mapping: MREMAP_DONTUNMAP
> introduced a different way of expanding the mapping, so VM_DONTEXPAND
> still seems a reasonable name (I've thrown in VM_PFNMAP there because
> it's in the VM_DONTEXPAND test lower down: for safety I guess, and best
> if both behave the same - though one says -EINVAL and the other -EFAULT).

I like this idea and am happy to mail a new patch. I think it may make sense to bring the lower block up here so that it becomes clearer that it's not duplicate code, and that the MREMAP_DONTUNMAP case returns -EINVAL while the other cases return -EFAULT. I wonder if the -EFAULT error code would have made more sense from the start for both cases; do you have any thoughts on changing the error code at this point?

> With that VM_DONTEXPAND check in, Dmitry's commit cd544fd1dc92
> ("mremap: don't allow MREMAP_DONTUNMAP on special_mappings and aio")
> can still be reverted (as you agreed on 28th December), even though
> vma_is_anonymous() will no longer protect it.

I agree, and if Dmitry does not have time I would be happy to mail a revert of cd544fd1dc92 as we discussed in [1]. Dmitry, would you like me to do that?

> Was there an mremap(2) man page update for MREMAP_DONTUNMAP?
> Whether or not there was before, it ought to get one now.

Yes, the mremap(2) man page was updated when this flag was added, and it will require a further update to reflect this expanded mapping support.

Thanks
Brian

1. https://lkml.org/lkml/2020/12/28/2340
[PATCH] eventfd: Introduce EFD_ZERO_ON_WAKE
This patch introduces a new flag to eventfd, called EFD_ZERO_ON_WAKE. This change is primarily introduced for use cases which do not care about the value stored in the eventfd itself. Such existing use cases require an additional read syscall to clear the count.

This flag provides the following guarantees:

(1) Writes can never block or return EAGAIN. The reason this is true is because we don't actually need to store the value, and as a result the internal value is only changed between 0 and 1 and back to 0. Therefore POLLERR and POLLOUT are never possible outcomes. A poll with POLLOUT or a write will always immediately return regardless of EFD_NONBLOCK.

(2) A read or POLLIN results in the internal value being reset to 0. When EFD_NONBLOCK is set, reads when the internal value is 0 will immediately return with EAGAIN, as they always have. Similarly, when a read is performed without EFD_NONBLOCK it will block until a write occurs. In both cases, after the read completes successfully the internal value is reset to 0. When polling with POLLIN, upon return of a POLLIN event the internal value will be reset to 0.

Signed-off-by: Brian Geffon
---
 fs/eventfd.c            | 39 +++++++++++++++++++++++++++++++++++++--
 include/linux/eventfd.h |  8 +++++++-
 2 files changed, 44 insertions(+), 3 deletions(-)

diff --git a/fs/eventfd.c b/fs/eventfd.c
index e265b6dd4f34..56bf04d6461e 100644
--- a/fs/eventfd.c
+++ b/fs/eventfd.c
@@ -172,8 +172,21 @@ static __poll_t eventfd_poll(struct file *file, poll_table *wait)
 	 */
 	count = READ_ONCE(ctx->count);
 
-	if (count > 0)
+	if (count > 0) {
+		if ((ctx->flags & EFD_ZERO_ON_WAKE) &&
+		    (poll_requested_events(wait) & EPOLLIN)) {
+			/*
+			 * We're going to cause a wake on EPOLLIN, we need to zero the count.
+			 * We validate that EPOLLIN is a requested event because if the user
+			 * did something odd like POLLPRI we wouldn't want to zero the count
+			 * if no wake happens.
+			 */
+			spin_lock_irq(&ctx->wqh.lock);
+			ctx->count = 0;
+			spin_unlock_irq(&ctx->wqh.lock);
+		}
 		events |= EPOLLIN;
+	}
 	if (count == ULLONG_MAX)
 		events |= EPOLLERR;
 	if (ULLONG_MAX - 1 > count)
@@ -239,8 +252,11 @@ static ssize_t eventfd_read(struct kiocb *iocb, struct iov_iter *to)
 	__add_wait_queue(&ctx->wqh, &wait);
 	for (;;) {
 		set_current_state(TASK_INTERRUPTIBLE);
-		if (ctx->count)
+		if (ctx->count) {
+			if (ctx->flags & EFD_ZERO_ON_WAKE)
+				ctx->count = 0;
 			break;
+		}
 		if (signal_pending(current)) {
 			__remove_wait_queue(&ctx->wqh, &wait);
 			__set_current_state(TASK_RUNNING);
@@ -280,6 +296,18 @@ static ssize_t eventfd_write(struct file *file, const char __user *buf, size_t c
 		return -EINVAL;
 	spin_lock_irq(&ctx->wqh.lock);
 	res = -EAGAIN;
+
+	/*
+	 * In the case of EFD_ZERO_ON_WAKE the actual count is never needed, for this
+	 * reason we only adjust it to set it from 0 to 1 or 1 to 0. This means that
+	 * write will never return EWOULDBLOCK or block, because there is always
+	 * going to be enough space to write as the amount we will increment could
+	 * be at most 1 as it's clamped below. Additionally, we know that POLLERR
+	 * cannot be returned when EFD_ZERO_ON_WAKE is used for the same reason.
+	 */
+	if (ctx->flags & EFD_ZERO_ON_WAKE)
+		ucnt = (ctx->count == 0) ? 1 : 0;
+
 	if (ULLONG_MAX - ctx->count > ucnt)
 		res = sizeof(ucnt);
 	else if (!(file->f_flags & O_NONBLOCK)) {
@@ -414,9 +442,16 @@ static int do_eventfd(unsigned int count, int flags)
 	BUILD_BUG_ON(EFD_CLOEXEC != O_CLOEXEC);
 	BUILD_BUG_ON(EFD_NONBLOCK != O_NONBLOCK);
 
+	/* O_NOFOLLOW has been repurposed as EFD_ZERO_ON_WAKE */
+	BUILD_BUG_ON(EFD_ZERO_ON_WAKE != O_NOFOLLOW);
+
 	if (flags & ~EFD_FLAGS_SET)
 		return -EINVAL;
 
+	/* The semaphore semantics would be lost if using EFD_ZERO_ON_WAKE */
+	if ((flags & EFD_ZERO_ON_WAKE) && (flags & EFD_SEMAPHORE))
+		return -EINVAL;
+
 	ctx = kmalloc(sizeof(*ctx), GFP_KERNEL);
 	if (!ctx)
 		return -ENOMEM;
diff --git a/include/linux/eventfd.h b/include
Re: [PATCH v3 05/11] x86/entry: Convert ret_from_fork to C
On Thu, Mar 4, 2021 at 2:16 PM Andy Lutomirski wrote: > > ret_from_fork is written in asm, slightly differently, for x86_32 and > x86_64. Convert it to C. > > This is a straight conversion without any particular cleverness. As a > further cleanup, the code that sets up the ret_from_fork argument registers > could be adjusted to put the arguments in the correct registers. An alternative would be to stash the function pointer and argument in the pt_regs of the new kthread task, and test for PF_KTHREAD instead. That would eliminate the need to pass those two values to the C version of ret_from_fork(). > This will cause the ORC unwinder to find pt_regs even for kernel threads on > x86_64. This seems harmless. > > The 32-bit comment above the now-deleted schedule_tail_wrapper was > obsolete: the encode_frame_pointer mechanism (see copy_thread()) solves the > same problem more cleanly. > > Cc: Josh Poimboeuf > Signed-off-by: Andy Lutomirski > --- > arch/x86/entry/common.c | 23 ++ > arch/x86/entry/entry_32.S| 51 +--- > arch/x86/entry/entry_64.S| 33 + > arch/x86/include/asm/switch_to.h | 2 +- > arch/x86/kernel/process.c| 2 +- > arch/x86/kernel/process_32.c | 2 +- > arch/x86/kernel/unwind_orc.c | 2 +- > 7 files changed, 43 insertions(+), 72 deletions(-) > > diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c > index 95776f16c1cb..ef1c65938a6b 100644 > --- a/arch/x86/entry/common.c > +++ b/arch/x86/entry/common.c > @@ -214,6 +214,29 @@ SYSCALL_DEFINE0(ni_syscall) > return -ENOSYS; > } > > +void ret_from_fork(struct task_struct *prev, > + int (*kernel_thread_fn)(void *), > + void *kernel_thread_arg, > + struct pt_regs *user_regs); > + > +__visible void noinstr ret_from_fork(struct task_struct *prev, > +int (*kernel_thread_fn)(void *), > +void *kernel_thread_arg, > +struct pt_regs *user_regs) > +{ > + instrumentation_begin(); > + > + schedule_tail(prev); > + > + if (kernel_thread_fn) { This should have an unlikely(), as kernel threads should be the rare case. 
-- Brian Gerst
Re: [PATCH] mm: Allow shmem mappings with MREMAP_DONTUNMAP
I apologize, this patch didn't include my signed off by, here it is: Signed-off-by: Brian Geffon On Wed, Mar 3, 2021 at 9:53 AM Brian Geffon wrote: > > Currently MREMAP_DONTUNMAP only accepts private anonymous mappings. This > change > will widen the support to include shmem mappings. The primary use case > is to support MREMAP_DONTUNMAP on mappings which may have been created from > a memfd. > > Lokesh Gidra who works on the Android JVM, provided an explanation of how such > a feature will improve Android JVM garbage collection: > "Android is developing a new garbage collector (GC), based on userfaultfd. The > garbage collector will use userfaultfd (uffd) on the java heap during > compaction. > On accessing any uncompacted page, the application threads will find it > missing, > at which point the thread will create the compacted page and then use > UFFDIO_COPY > ioctl to get it mapped and then resume execution. Before starting this > compaction, > in a stop-the-world pause the heap will be mremap(MREMAP_DONTUNMAP) so that > the > java heap is ready to receive UFFD_EVENT_PAGEFAULT events after resuming > execution. > > To speedup mremap operations, pagetable movement was optimized by moving PUD > entries > instead of PTE entries [1]. It was necessary as mremap of even modest sized > memory > ranges also took several milliseconds, and stopping the application for that > long > isn't acceptable in response-time sensitive cases. With UFFDIO_CONTINUE > feature [2], > it will be even more efficient to implement this GC, particularly the > 'non-moveable' > portions of the heap. It will also help in reducing the need to copy > (UFFDIO_COPY) > the pages. However, for this to work, the java heap has to be on a 'shared' > vma. > Currently MREMAP_DONTUNMAP only supports private anonymous mappings, this > patch will > enable using UFFDIO_CONTINUE for the new userfaultfd-based heap compaction." 
> > [1] > https://lore.kernel.org/linux-mm/20201215030730.nc3cu98e4%25a...@linux-foundation.org/ > [2] > https://lore.kernel.org/linux-mm/20210302000133.272579-1-axelrasmus...@google.com/ > --- > mm/mremap.c | 3 +-- > 1 file changed, 1 insertion(+), 2 deletions(-) > > diff --git a/mm/mremap.c b/mm/mremap.c > index ec8f840399ed..6934d199da54 100644 > --- a/mm/mremap.c > +++ b/mm/mremap.c > @@ -653,8 +653,7 @@ static struct vm_area_struct *vma_to_resize(unsigned long > addr, > return ERR_PTR(-EINVAL); > } > > - if (flags & MREMAP_DONTUNMAP && (!vma_is_anonymous(vma) || > - vma->vm_flags & VM_SHARED)) > + if (flags & MREMAP_DONTUNMAP && !(vma_is_anonymous(vma) || > vma_is_shmem(vma))) > return ERR_PTR(-EINVAL); > > if (is_vm_hugetlb_page(vma)) > -- > 2.31.0.rc0.254.gbdcc3b1a9d-goog >
[PATCH] mm: Allow shmem mappings with MREMAP_DONTUNMAP
Currently MREMAP_DONTUNMAP only accepts private anonymous mappings. This change will widen the support to include shmem mappings. The primary use case is to support MREMAP_DONTUNMAP on mappings which may have been created from a memfd. Lokesh Gidra who works on the Android JVM, provided an explanation of how such a feature will improve Android JVM garbage collection: "Android is developing a new garbage collector (GC), based on userfaultfd. The garbage collector will use userfaultfd (uffd) on the java heap during compaction. On accessing any uncompacted page, the application threads will find it missing, at which point the thread will create the compacted page and then use UFFDIO_COPY ioctl to get it mapped and then resume execution. Before starting this compaction, in a stop-the-world pause the heap will be mremap(MREMAP_DONTUNMAP) so that the java heap is ready to receive UFFD_EVENT_PAGEFAULT events after resuming execution. To speedup mremap operations, pagetable movement was optimized by moving PUD entries instead of PTE entries [1]. It was necessary as mremap of even modest sized memory ranges also took several milliseconds, and stopping the application for that long isn't acceptable in response-time sensitive cases. With UFFDIO_CONTINUE feature [2], it will be even more efficient to implement this GC, particularly the 'non-moveable' portions of the heap. It will also help in reducing the need to copy (UFFDIO_COPY) the pages. However, for this to work, the java heap has to be on a 'shared' vma. Currently MREMAP_DONTUNMAP only supports private anonymous mappings, this patch will enable using UFFDIO_CONTINUE for the new userfaultfd-based heap compaction." 
[1] https://lore.kernel.org/linux-mm/20201215030730.nc3cu98e4%25a...@linux-foundation.org/
[2] https://lore.kernel.org/linux-mm/20210302000133.272579-1-axelrasmus...@google.com/
---
 mm/mremap.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/mm/mremap.c b/mm/mremap.c
index ec8f840399ed..6934d199da54 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -653,8 +653,7 @@ static struct vm_area_struct *vma_to_resize(unsigned long addr,
 		return ERR_PTR(-EINVAL);
 	}
 
-	if (flags & MREMAP_DONTUNMAP && (!vma_is_anonymous(vma) ||
-			vma->vm_flags & VM_SHARED))
+	if (flags & MREMAP_DONTUNMAP && !(vma_is_anonymous(vma) || vma_is_shmem(vma)))
 		return ERR_PTR(-EINVAL);
 
 	if (is_vm_hugetlb_page(vma))
 		return ERR_PTR(-EINVAL);
-- 
2.31.0.rc0.254.gbdcc3b1a9d-goog
Re: [PATCH v4 5/5] ibmvfc: reinitialize sub-CRQs and perform channel enquiry after LPM
Reviewed-by: Brian King -- Brian King Power Linux I/O IBM Linux Technology Center
Re: [PATCH v4 2/5] ibmvfc: fix invalid sub-CRQ handles after hard reset
Reviewed-by: Brian King -- Brian King Power Linux I/O IBM Linux Technology Center
Re: [PATCH v2 5/5] ibmvfc: reinitialize sub-CRQs and perform channel enquiry after LPM
On 2/25/21 2:48 PM, Tyrel Datwyler wrote:
> A live partition migration (LPM) results in a CRQ disconnect similar to
> a hard reset. In this LPM case the hypervisor mostly preserves the CRQ
> transport such that it simply needs to be reenabled. However, the
> capabilities may have changed, such as fewer channels, or no channels at
> all. Further, it's possible that there may be sub-CRQ support, but no
> channel support. The CRQ reenable path currently doesn't take any of
> this into consideration.
>
> For simplicity, release and reinitialize sub-CRQs during reenable, and
> set do_enquiry and using_channels with the appropriate values to trigger
> channel renegotiation.
>
> Signed-off-by: Tyrel Datwyler
> ---
>  drivers/scsi/ibmvscsi/ibmvfc.c | 13 +++++++++++++
>  1 file changed, 13 insertions(+)
>
> diff --git a/drivers/scsi/ibmvscsi/ibmvfc.c b/drivers/scsi/ibmvscsi/ibmvfc.c
> index 4ac2c442e1e2..9ae6be56e375 100644
> --- a/drivers/scsi/ibmvscsi/ibmvfc.c
> +++ b/drivers/scsi/ibmvscsi/ibmvfc.c
> @@ -903,6 +903,9 @@ static int ibmvfc_reenable_crq_queue(struct ibmvfc_host *vhost)
>  {
>  	int rc = 0;
>  	struct vio_dev *vdev = to_vio_dev(vhost->dev);
> +	unsigned long flags;
> +
> +	ibmvfc_release_sub_crqs(vhost);
>  
>  	/* Re-enable the CRQ */
>  	do {
> @@ -914,6 +917,16 @@ static int ibmvfc_reenable_crq_queue(struct ibmvfc_host *vhost)
>  	if (rc)
>  		dev_err(vhost->dev, "Error enabling adapter (rc=%d)\n", rc);
>  
> +	spin_lock_irqsave(vhost->host->host_lock, flags);
> +	spin_lock(vhost->crq.q_lock);
> +	vhost->do_enquiry = 1;
> +	vhost->using_channels = 0;
> +
> +	ibmvfc_init_sub_crqs(vhost);
> +
> +	spin_unlock(vhost->crq.q_lock);
> +	spin_unlock_irqrestore(vhost->host->host_lock, flags);

ibmvfc_init_sub_crqs() can sleep, for multiple reasons, so you can't hold a lock when you call it. There is a GFP_KERNEL allocation in it, and the patch before this one adds an msleep in an error path.

Thanks,

Brian

-- 
Brian King
Power Linux I/O
IBM Linux Technology Center
Re: [PATCH v2 4/5] ibmvfc: store return code of H_FREE_SUB_CRQ during cleanup
Reviewed-by: Brian King -- Brian King Power Linux I/O IBM Linux Technology Center
[PATCH] mwifiex: don't print SSID to logs
There are a few reasons not to dump SSIDs as-is in kernel logs:

1) they're not guaranteed to be any particular text encoding (UTF-8,
   ASCII, ...) in general
2) it's somewhat redundant; the BSSID should be enough to uniquely
   identify the AP/STA to which we're connecting
3) BSSIDs have an easily-recognized format, whereas SSIDs do not (they
   are free-form)
4) other common drivers (e.g., everything based on mac80211) get along
   just fine by only including BSSIDs when logging state transitions

Additional notes on reason #3: this is important for the privacy-conscious, especially when providing tools that convey kernel logs on behalf of a user -- e.g., when reporting bugs. So for example, it's easy to automatically filter logs for MAC addresses, but it's much harder to filter SSIDs out of unstructured text.

Signed-off-by: Brian Norris
---
 drivers/net/wireless/marvell/mwifiex/cfg80211.c | 11 ++++-------
 1 file changed, 4 insertions(+), 7 deletions(-)

diff --git a/drivers/net/wireless/marvell/mwifiex/cfg80211.c b/drivers/net/wireless/marvell/mwifiex/cfg80211.c
index a2ed268ce0da..0961f4a5e415 100644
--- a/drivers/net/wireless/marvell/mwifiex/cfg80211.c
+++ b/drivers/net/wireless/marvell/mwifiex/cfg80211.c
@@ -2300,8 +2300,7 @@ mwifiex_cfg80211_assoc(struct mwifiex_private *priv, size_t ssid_len,
 			is_scanning_required = 1;
 		} else {
 			mwifiex_dbg(priv->adapter, MSG,
-				    "info: trying to associate to '%.*s' bssid %pM\n",
-				    req_ssid.ssid_len, (char *)req_ssid.ssid,
+				    "info: trying to associate to bssid %pM\n",
 				    bss->bssid);
 			memcpy(&priv->cfg_bssid, bss->bssid, ETH_ALEN);
 			break;
@@ -2378,8 +2377,7 @@ mwifiex_cfg80211_connect(struct wiphy *wiphy, struct net_device *dev,
 	}
 
 	mwifiex_dbg(adapter, INFO,
-		    "info: Trying to associate to %.*s and bssid %pM\n",
-		    (int)sme->ssid_len, (char *)sme->ssid, sme->bssid);
+		    "info: Trying to associate to bssid %pM\n", sme->bssid);
 
 	if (!mwifiex_stop_bg_scan(priv))
 		cfg80211_sched_scan_stopped_locked(priv->wdev.wiphy, 0);
@@ -2512,9 +2510,8 @@ mwifiex_cfg80211_join_ibss(struct wiphy *wiphy, struct net_device *dev,
 		goto done;
 	}
 
-	mwifiex_dbg(priv->adapter, MSG,
-		    "info: trying to join to %.*s and bssid %pM\n",
-		    params->ssid_len, (char *)params->ssid, params->bssid);
+	mwifiex_dbg(priv->adapter, MSG, "info: trying to join to bssid %pM\n",
+		    params->bssid);
 
 	mwifiex_set_ibss_params(priv, params);
 
-- 
2.30.0.617.g56c4b15f3c-goog
Re: [PATCH v2] nand: brcmnand: fix OOB R/W with Hamming ECC
On Wed, Feb 24, 2021 at 12:02 AM Álvaro Fernández Rojas wrote: > Fixes: 27c5b17cd1b1 ("mtd: nand: add NAND driver "library" for Broadcom STB > NAND controller") FWIW, I could believe this was broken. We weren't testing Hamming ECC (nor JFFS2) at the time, so it could easily have obvious bugs like this. And since I got this far, and I'm still in MAINTAINERS, I guess: Acked-by: Brian Norris
Re: [PATCH 4/4] ibmvfc: store return code of H_FREE_SUB_CRQ during cleanup
On 2/11/21 12:57 PM, Tyrel Datwyler wrote:
> diff --git a/drivers/scsi/ibmvscsi/ibmvfc.c b/drivers/scsi/ibmvscsi/ibmvfc.c
> index ba6fcf9cbc57..23b803ac4a13 100644
> --- a/drivers/scsi/ibmvscsi/ibmvfc.c
> +++ b/drivers/scsi/ibmvscsi/ibmvfc.c
> @@ -5670,7 +5670,7 @@ static int ibmvfc_register_scsi_channel(struct ibmvfc_host *vhost,
>  
>  irq_failed:
>  	do {
> -		plpar_hcall_norets(H_FREE_SUB_CRQ, vdev->unit_address, scrq->cookie);
> +		rc = plpar_hcall_norets(H_FREE_SUB_CRQ, vdev->unit_address, scrq->cookie);
>  	} while (rc == H_BUSY || H_IS_LONG_BUSY(rc));

In other places in the driver where we get a busy return code back, we have an msleep(100). Should we be doing that here as well?

Thanks,

Brian

-- 
Brian King
Power Linux I/O
IBM Linux Technology Center
Re: [PATCH 3/4] ibmvfc: treat H_CLOSED as success during sub-CRQ registration
Reviewed-by: Brian King -- Brian King Power Linux I/O IBM Linux Technology Center
Re: [PATCH 2/4] ibmvfc: fix invalid sub-CRQ handles after hard reset
Reviewed-by: Brian King -- Brian King Power Linux I/O IBM Linux Technology Center
Re: [PATCH] xfs: Wake CIL push waiters more reliably
On Mon, Feb 15, 2021 at 02:36:38PM +0100, Donald Buczek wrote: > On 13.01.21 22:53, Dave Chinner wrote: > > [...] > > I agree that a throttling fix is needed, but I'm trying to > > understand the scope and breadth of the problem first instead of > > jumping the gun and making the wrong fix for the wrong reasons that > > just papers over the underlying problems that the throttling bug has > > made us aware of... > > Are you still working on this? > > If it takes more time to understand the potential underlying problem, the fix > for the problem at hand should be applied. > > This is a real world problem, accidentally found in the wild. It appears very > rarely, but it freezes a filesystem or the whole system. It exists in 5.7 , > 5.8 , 5.9 , 5.10 and 5.11 and is caused by c7f87f3984cf ("xfs: fix > use-after-free on CIL context on shutdown") which silently added a condition > to the wakeup. The condition is based on a wrong assumption. > > Why is this "papering over"? If a reminder was needed, there were better ways > than randomly hanging the system. > > Why is > > if (ctx->space_used >= XLOG_CIL_BLOCKING_SPACE_LIMIT(log)) > wake_up_all(&cil->xc_push_wait); > > , which doesn't work reliably, preferable to > > if (waitqueue_active(&cil->xc_push_wait)) > wake_up_all(&cil->xc_push_wait); > > which does? > JFYI, Dave followed up with a patch a couple weeks or so ago: https://lore.kernel.org/linux-xfs/20210128044154.806715-5-da...@fromorbit.com/ Brian > Best > Donald > > > Cheers, > > > > Dave >
Re: [PATCH 1/4] ibmvfc: simplify handling of sub-CRQ initialization
Reviewed-by: Brian King -- Brian King Power Linux I/O IBM Linux Technology Center
Re: [PATCH] xfs: fix boolreturn.cocci warnings
On Thu, Feb 11, 2021 at 04:09:16AM +0800, kernel test robot wrote:
> From: kernel test robot
>
> fs/xfs/xfs_log.c:1062:9-10: WARNING: return of 0/1 in function
> 'xfs_log_need_covered' with return type bool
>
> Return statements in functions returning bool should use
> true/false instead of 1/0.
> Generated by: scripts/coccinelle/misc/boolreturn.cocci
>
> Fixes: 37444fc4cc39 ("xfs: lift writable fs check up into log worker task")
> CC: Brian Foster
> Reported-by: kernel test robot
> Signed-off-by: kernel test robot
> ---

Reviewed-by: Brian Foster

>
> tree: https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git xfs-5.12-merge
> head: 560ab6c0d12ebccabb83638abe23a7875b946f9a
> commit: 37444fc4cc398266fe0f71a9c0925620d44fb76a [25/36] xfs: lift writable fs check up into log worker task
>
>  xfs_log.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> --- a/fs/xfs/xfs_log.c
> +++ b/fs/xfs/xfs_log.c
> @@ -1059,7 +1059,7 @@ xfs_log_need_covered(
>  	bool	needed = false;
>  
>  	if (!xlog_cil_empty(log))
> -		return 0;
> +		return false;
>  
>  	spin_lock(&log->l_icloglock);
>  	switch (log->l_covered_state) {
>
Re: rcu: INFO: rcu_sched self-detected stall on CPU: Workqueue: xfs-conv/md0 xfs_end_io
On Tue, Feb 09, 2021 at 07:43:14AM +1100, Dave Chinner wrote: > On Mon, Feb 08, 2021 at 09:28:24AM -0800, Darrick J. Wong wrote: > > On Mon, Feb 09, 2021 at 09:11:40AM -0800, Paul E. McKenney wrote: > > > On Mon, Feb 08, 2021 at 10:44:58AM -0500, Brian Foster wrote: > > > > There was a v2 inline that incorporated some directed feedback. > > > > Otherwise there were questions and ideas about making the whole thing > > > > faster, but I've no idea if that addresses the problem or not (if so, > > > > that would be an entirely different set of patches). I'll wait and see > > > > what Darrick thinks about this and rebase/repost if the approach is > > > > agreeable.. > > > > > > There is always the school of thought that says that the best way to > > > get people to focus on this is to rebase and repost. Otherwise, they > > > are all too likely to assume that you lost interest in this. > > > > I was hoping that a better solution would emerge for clearing > > PageWriteback on hundreds of thousands of pages, but nothing easy popped > > out. > > > > The hardcoded threshold in "[PATCH v2 2/2] xfs: kick extra large ioends > > to completion workqueue" gives me unease because who's to say if marking > > 262,144 pages on a particular CPU will actually stall it long enough to > > trip the hangcheck? Is the number lower on (say) some pokey NAS box > > with a lot of storage but a slow CPU? > > It's also not the right thing to do given the IO completion > workqueue is a bound workqueue. Anything that is doing large amounts > of CPU intensive work should be on a unbound workqueue so that the > scheduler can bounce it around different CPUs as needed. > > Quite frankly, the problem is a huge long ioend chain being built by > the submission code. We need to keep ioend completion overhead down. > It runs in either softirq or bound workqueue context and so > individual items of work that are performed in this context must not > be -unbounded- in size or time. 
Unbounded ioend chains are bad for > IO latency, they are bad for memory reclaim and they are bad for CPU > scheduling. > > As I've said previously, we gain nothing by aggregating ioends past > a few tens of megabytes of submitted IO. The batching gains are > completely diminished once we've got enough IO in flight to keep the > submission queue full. We're talking here about gigabytes of > sequential IOs in a single ioend chain which are 2-3 orders of > magnitude larger than needed for optimal background IO submission > and completion efficiency and throughput. IOWs, we really should be > limiting the ioend chain length at submission time, not trying to > patch over bad completion behaviour that results from sub-optimal IO > submission behaviour... > That was the patch I posted prior to the aforementioned set. Granted, it was an RFC, but for reference: https://lore.kernel.org/linux-fsdevel/20200825144917.GA321765@bfoster/ (IIRC, you also had a variant that was essentially the same change.) The discussion that followed in that thread was around the preference to move completion of large chains into workqueue context instead of breaking up the chains. The series referenced in my first reply fell out of that as a targeted fix for the stall warning. > > That said, /some/ threshold is probably better than no threshold. Could > > someone try to confirm if that series of Brian's fixes this problem too? > > 262144 pages is still too much work to be doing in a single softirq > IO completion callback. It's likely to be too much work for a bound > workqueue, too, especially when you consider that the workqueue > completion code will merge sequential ioends into one ioend, hence > making the IO completion loop counts bigger and latency problems worse > rather than better... > That was just a conservative number picked based on observation of the original report (10+ GB ioends IIRC). 
I figured the review cycle would involve narrowing it down to something more generically reasonable (10s-100s of MB?) once we found an acceptable approach (and hopefully received some testing feedback), but we've never really got to that point.

Brian

> Cheers,
>
> Dave.
> --
> Dave Chinner
> da...@fromorbit.com
>
Re: rcu: INFO: rcu_sched self-detected stall on CPU: Workqueue: xfs-conv/md0 xfs_end_io
On Mon, Feb 08, 2021 at 09:28:24AM -0800, Darrick J. Wong wrote: > On Mon, Feb 09, 2021 at 09:11:40AM -0800, Paul E. McKenney wrote: > > On Mon, Feb 08, 2021 at 10:44:58AM -0500, Brian Foster wrote: > > > On Mon, Feb 08, 2021 at 06:57:24AM -0800, Paul E. McKenney wrote: > > > > On Mon, Feb 08, 2021 at 09:07:24AM -0500, Brian Foster wrote: > > > > > On Fri, Feb 05, 2021 at 09:12:40AM -0800, Paul E. McKenney wrote: > > > > > > On Fri, Feb 05, 2021 at 08:29:06AM +0100, Paul Menzel wrote: > > > > > > > Dear Linux folks, > > > > > > > > > > > > > > > > > > > > > On a Dell PowerEdge T630/0NT78X, BIOS 2.8.0 05/23/2018 with Linux > > > > > > > 5.4.57, we > > > > > > > twice saw a self-detected stall on a CPU (October 27th, 2020, > > > > > > > January 18th, > > > > > > > 2021). > > > > > > > > > > > > > > Both times, the workqueue is `xfs-conv/md0 xfs_end_io`. > > > > > > > > > > > > > > ``` > > > > > > > [0.00] Linux version 5.4.57.mx64.340 > > > > > > > (r...@theinternet.molgen.mpg.de) (gcc version 7.5.0 (GCC)) #1 SMP > > > > > > > Tue Aug 11 > > > > > > > 13:20:33 CEST 2020 > > > > > > > […] > > > > > > > [48962.981257] rcu: INFO: rcu_sched self-detected stall on CPU > > > > > > > [48962.987511] rcu: 4-: (20999 ticks this GP) > > > > > > > idle=fe6/1/0x4002 softirq=3630188/3630188 fqs=4696 > > > > > > > [48962.998805](t=21017 jiffies g=14529009 q=32263) > > > > > > > [48963.004074] Task dump for CPU 4: > > > > > > > [48963.007689] kworker/4:2 R running task0 25587 > > > > > > > 2 > > > > > > > 0x80004008 > > > > > > > [48963.015591] Workqueue: xfs-conv/md0 xfs_end_io > > > > > > > [48963.020570] Call Trace: > > > > > > > [48963.023311] > > > > > > > [48963.025560] sched_show_task+0x11e/0x150 > > > > > > > [48963.029957] rcu_dump_cpu_stacks+0x70/0xa0 > > > > > > > [48963.034545] rcu_sched_clock_irq+0x502/0x770 > > > > > > > [48963.039322] ? 
tick_sched_do_timer+0x60/0x60 > > > > > > > [48963.044106] update_process_times+0x24/0x60 > > > > > > > [48963.048791] tick_sched_timer+0x37/0x70 > > > > > > > [48963.053089] __hrtimer_run_queues+0x11f/0x2b0 > > > > > > > [48963.057960] ? recalibrate_cpu_khz+0x10/0x10 > > > > > > > [48963.062744] hrtimer_interrupt+0xe5/0x240 > > > > > > > [48963.067235] smp_apic_timer_interrupt+0x6f/0x130 > > > > > > > [48963.072407] apic_timer_interrupt+0xf/0x20 > > > > > > > [48963.076994] > > > > > > > [48963.079347] RIP: 0010:_raw_spin_unlock_irqrestore+0xa/0x10 > > > > > > > [48963.085491] Code: f3 90 83 e8 01 75 e8 65 8b 3d 42 0f 56 7e e8 > > > > > > > ed ea 5e > > > > > > > ff 48 29 e8 4c 39 e8 76 cf 80 0b 08 eb 8c 0f 1f 44 00 00 c6 07 00 > > > > > > > 56 9d > > > > > > > 0f 1f 44 00 00 0f 1f 44 00 00 b8 00 fe ff ff f0 0f c1 07 56 9d > > > > > > > [48963.106524] RSP: 0018:c9000738fd40 EFLAGS: 0202 > > > > > > > ORIG_RAX: > > > > > > > ff13 > > > > > > > [48963.115003] RAX: 82407588 RBX: 82407580 RCX: > > > > > > > 82407588 > > > > > > > [48963.122994] RDX: 82407588 RSI: 0202 RDI: > > > > > > > 82407580 > > > > > > > [48963.130989] RBP: 0202 R08: 8203ea00 R09: > > > > > > > 0001 > > > > > > > [48963.138982] R10: c9000738fbb8 R11: 0001 R12: > > > > > > > 82407588 > > > > > > > [48963.146976] R13: ea005e7ae600 R14: 8897b7e5a040 R15: > > > > > > > ea005e7ae600 > > > > > > > [48963.154971] wake_up_page_bit+0xe0/0x100 > > > > > > > [48963.159366] xfs_destroy_ioend+0xce/0x1c0 > > > > > > > [48963.163857] xfs_end_ioend+0xcf/0x1a0 > &
Re: rcu: INFO: rcu_sched self-detected stall on CPU: Workqueue: xfs-conv/md0 xfs_end_io
On Mon, Feb 08, 2021 at 06:57:24AM -0800, Paul E. McKenney wrote: > On Mon, Feb 08, 2021 at 09:07:24AM -0500, Brian Foster wrote: > > On Fri, Feb 05, 2021 at 09:12:40AM -0800, Paul E. McKenney wrote: > > > On Fri, Feb 05, 2021 at 08:29:06AM +0100, Paul Menzel wrote: > > > > Dear Linux folks, > > > > > > > > > > > > On a Dell PowerEdge T630/0NT78X, BIOS 2.8.0 05/23/2018 with Linux > > > > 5.4.57, we > > > > twice saw a self-detected stall on a CPU (October 27th, 2020, January > > > > 18th, > > > > 2021). > > > > > > > > Both times, the workqueue is `xfs-conv/md0 xfs_end_io`. > > > > > > > > ``` > > > > [0.00] Linux version 5.4.57.mx64.340 > > > > (r...@theinternet.molgen.mpg.de) (gcc version 7.5.0 (GCC)) #1 SMP Tue > > > > Aug 11 > > > > 13:20:33 CEST 2020 > > > > […] > > > > [48962.981257] rcu: INFO: rcu_sched self-detected stall on CPU > > > > [48962.987511] rcu: 4-: (20999 ticks this GP) > > > > idle=fe6/1/0x4002 softirq=3630188/3630188 fqs=4696 > > > > [48962.998805] (t=21017 jiffies g=14529009 q=32263) > > > > [48963.004074] Task dump for CPU 4: > > > > [48963.007689] kworker/4:2 R running task0 25587 2 > > > > 0x80004008 > > > > [48963.015591] Workqueue: xfs-conv/md0 xfs_end_io > > > > [48963.020570] Call Trace: > > > > [48963.023311] > > > > [48963.025560] sched_show_task+0x11e/0x150 > > > > [48963.029957] rcu_dump_cpu_stacks+0x70/0xa0 > > > > [48963.034545] rcu_sched_clock_irq+0x502/0x770 > > > > [48963.039322] ? tick_sched_do_timer+0x60/0x60 > > > > [48963.044106] update_process_times+0x24/0x60 > > > > [48963.048791] tick_sched_timer+0x37/0x70 > > > > [48963.053089] __hrtimer_run_queues+0x11f/0x2b0 > > > > [48963.057960] ? 
recalibrate_cpu_khz+0x10/0x10 > > > > [48963.062744] hrtimer_interrupt+0xe5/0x240 > > > > [48963.067235] smp_apic_timer_interrupt+0x6f/0x130 > > > > [48963.072407] apic_timer_interrupt+0xf/0x20 > > > > [48963.076994] > > > > [48963.079347] RIP: 0010:_raw_spin_unlock_irqrestore+0xa/0x10 > > > > [48963.085491] Code: f3 90 83 e8 01 75 e8 65 8b 3d 42 0f 56 7e e8 ed ea > > > > 5e > > > > ff 48 29 e8 4c 39 e8 76 cf 80 0b 08 eb 8c 0f 1f 44 00 00 c6 07 00 56 9d > > > > > > > > 0f 1f 44 00 00 0f 1f 44 00 00 b8 00 fe ff ff f0 0f c1 07 56 9d > > > > [48963.106524] RSP: 0018:c9000738fd40 EFLAGS: 0202 ORIG_RAX: > > > > ff13 > > > > [48963.115003] RAX: 82407588 RBX: 82407580 RCX: > > > > 82407588 > > > > [48963.122994] RDX: 82407588 RSI: 0202 RDI: > > > > 82407580 > > > > [48963.130989] RBP: 0202 R08: 8203ea00 R09: > > > > 0001 > > > > [48963.138982] R10: c9000738fbb8 R11: 0001 R12: > > > > 82407588 > > > > [48963.146976] R13: ea005e7ae600 R14: 8897b7e5a040 R15: > > > > ea005e7ae600 > > > > [48963.154971] wake_up_page_bit+0xe0/0x100 > > > > [48963.159366] xfs_destroy_ioend+0xce/0x1c0 > > > > [48963.163857] xfs_end_ioend+0xcf/0x1a0 > > > > [48963.167958] xfs_end_io+0xa4/0xd0 > > > > [48963.171672] process_one_work+0x1e5/0x410 > > > > [48963.176163] worker_thread+0x2d/0x3c0 > > > > [48963.180265] ? cancel_delayed_work+0x90/0x90 > > > > [48963.185048] kthread+0x117/0x130 > > > > [48963.188663] ? kthread_create_worker_on_cpu+0x70/0x70 > > > > [48963.194321] ret_from_fork+0x35/0x40 > > > > ``` > > > > > > > > As it’s just log level INFO, is there anything what should be done, or > > > > was > > > > the system probably just “overloaded”? > > > > > > I am assuming that you are building your kernel with CONFIG_PREEMPT_NONE=y > > > rather than CONFIG_PREEMPT_VOLUNTARY=y. 
> > > > > > If so, and if the problem is that you are temporarily overdriving xfs I/O, > > > one approach would be as follows: > > > > > > > > > > > > diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c > > > index f16d5f1..06be426 100644 > > > --- a/fs/xfs/xfs_aops.c &
Re: rcu: INFO: rcu_sched self-detected stall on CPU: Workqueue: xfs-conv/md0 xfs_end_io
On Fri, Feb 05, 2021 at 09:12:40AM -0800, Paul E. McKenney wrote: > On Fri, Feb 05, 2021 at 08:29:06AM +0100, Paul Menzel wrote: > > Dear Linux folks, > > > > > > On a Dell PowerEdge T630/0NT78X, BIOS 2.8.0 05/23/2018 with Linux 5.4.57, we > > twice saw a self-detected stall on a CPU (October 27th, 2020, January 18th, > > 2021). > > > > Both times, the workqueue is `xfs-conv/md0 xfs_end_io`. > > > > ``` > > [0.00] Linux version 5.4.57.mx64.340 > > (r...@theinternet.molgen.mpg.de) (gcc version 7.5.0 (GCC)) #1 SMP Tue Aug 11 > > 13:20:33 CEST 2020 > > […] > > [48962.981257] rcu: INFO: rcu_sched self-detected stall on CPU > > [48962.987511] rcu: 4-: (20999 ticks this GP) > > idle=fe6/1/0x4002 softirq=3630188/3630188 fqs=4696 > > [48962.998805] (t=21017 jiffies g=14529009 q=32263) > > [48963.004074] Task dump for CPU 4: > > [48963.007689] kworker/4:2 R running task0 25587 2 > > 0x80004008 > > [48963.015591] Workqueue: xfs-conv/md0 xfs_end_io > > [48963.020570] Call Trace: > > [48963.023311] > > [48963.025560] sched_show_task+0x11e/0x150 > > [48963.029957] rcu_dump_cpu_stacks+0x70/0xa0 > > [48963.034545] rcu_sched_clock_irq+0x502/0x770 > > [48963.039322] ? tick_sched_do_timer+0x60/0x60 > > [48963.044106] update_process_times+0x24/0x60 > > [48963.048791] tick_sched_timer+0x37/0x70 > > [48963.053089] __hrtimer_run_queues+0x11f/0x2b0 > > [48963.057960] ? 
recalibrate_cpu_khz+0x10/0x10 > > [48963.062744] hrtimer_interrupt+0xe5/0x240 > > [48963.067235] smp_apic_timer_interrupt+0x6f/0x130 > > [48963.072407] apic_timer_interrupt+0xf/0x20 > > [48963.076994] > > [48963.079347] RIP: 0010:_raw_spin_unlock_irqrestore+0xa/0x10 > > [48963.085491] Code: f3 90 83 e8 01 75 e8 65 8b 3d 42 0f 56 7e e8 ed ea 5e > > ff 48 29 e8 4c 39 e8 76 cf 80 0b 08 eb 8c 0f 1f 44 00 00 c6 07 00 56 9d > > 0f 1f 44 00 00 0f 1f 44 00 00 b8 00 fe ff ff f0 0f c1 07 56 9d > > [48963.106524] RSP: 0018:c9000738fd40 EFLAGS: 0202 ORIG_RAX: > > ff13 > > [48963.115003] RAX: 82407588 RBX: 82407580 RCX: > > 82407588 > > [48963.122994] RDX: 82407588 RSI: 0202 RDI: > > 82407580 > > [48963.130989] RBP: 0202 R08: 8203ea00 R09: > > 0001 > > [48963.138982] R10: c9000738fbb8 R11: 0001 R12: > > 82407588 > > [48963.146976] R13: ea005e7ae600 R14: 8897b7e5a040 R15: > > ea005e7ae600 > > [48963.154971] wake_up_page_bit+0xe0/0x100 > > [48963.159366] xfs_destroy_ioend+0xce/0x1c0 > > [48963.163857] xfs_end_ioend+0xcf/0x1a0 > > [48963.167958] xfs_end_io+0xa4/0xd0 > > [48963.171672] process_one_work+0x1e5/0x410 > > [48963.176163] worker_thread+0x2d/0x3c0 > > [48963.180265] ? cancel_delayed_work+0x90/0x90 > > [48963.185048] kthread+0x117/0x130 > > [48963.188663] ? kthread_create_worker_on_cpu+0x70/0x70 > > [48963.194321] ret_from_fork+0x35/0x40 > > ``` > > > > As it’s just log level INFO, is there anything what should be done, or was > > the system probably just “overloaded”? > > I am assuming that you are building your kernel with CONFIG_PREEMPT_NONE=y > rather than CONFIG_PREEMPT_VOLUNTARY=y. 
> > If so, and if the problem is that you are temporarily overdriving xfs I/O, > one approach would be as follows: > > > > diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c > index f16d5f1..06be426 100644 > --- a/fs/xfs/xfs_aops.c > +++ b/fs/xfs/xfs_aops.c > @@ -390,6 +390,7 @@ xfs_end_io( > list_del_init(&ioend->io_list); > xfs_ioend_try_merge(ioend, &completion_list); > xfs_end_ioend(ioend); > + cond_resched(); > } > } > > FWIW, this looks quite similar to the problem I attempted to fix with these patches: https://lore.kernel.org/linux-xfs/20201002153357.56409-1-bfos...@redhat.com/ Brian > > If you have instead built with CONFIG_PREEMPT_VOLUNTARY=y, then your > problem is likely massive lock contention in wake_up_page_bit(), or > perhaps someone having failed to release that lock. The usual way to > work this out is by enabling lockdep (CONFIG_PROVE_LOCKING=y), but this > is often not what you want enabled in production. > > Darrick, thoughts from an xfs perspective? > > Thanx, Paul >
[PATCH net-next 1/2] net: add EXPORT_INDIRECT_CALLABLE wrapper
When a static function is annotated with INDIRECT_CALLABLE_SCOPE and CONFIG_RETPOLINE is set, the static keyword is removed. Sometimes the function needs to be exported but EXPORT_SYMBOL can't be used because if CONFIG_RETPOLINE is not set, we will attempt to export a static symbol. This patch introduces a new indirect call wrapper: EXPORT_INDIRECT_CALLABLE. This basically does EXPORT_SYMBOL when CONFIG_RETPOLINE is set, but does nothing when it's not. Reported-by: Stephen Rothwell Signed-off-by: Brian Vazquez --- include/linux/indirect_call_wrapper.h | 2 ++ 1 file changed, 2 insertions(+) diff --git a/include/linux/indirect_call_wrapper.h b/include/linux/indirect_call_wrapper.h index 54c02c84906a..a8345c8a613d 100644 --- a/include/linux/indirect_call_wrapper.h +++ b/include/linux/indirect_call_wrapper.h @@ -36,6 +36,7 @@ #define INDIRECT_CALLABLE_DECLARE(f) f #define INDIRECT_CALLABLE_SCOPE +#define EXPORT_INDIRECT_CALLABLE(f) EXPORT_SYMBOL(f) #else #define INDIRECT_CALL_1(f, f1, ...) f(__VA_ARGS__) @@ -44,6 +45,7 @@ #define INDIRECT_CALL_4(f, f4, f3, f2, f1, ...) f(__VA_ARGS__) #define INDIRECT_CALLABLE_DECLARE(f) #define INDIRECT_CALLABLE_SCOPE static +#define EXPORT_INDIRECT_CALLABLE(f) #endif /* -- 2.30.0.365.g02bc693789-goog
[PATCH net-next 2/2] net: fix building errors on powerpc when CONFIG_RETPOLINE is not set
This commit fixes the errors reported when building for powerpc: ERROR: modpost: "ip6_dst_check" [vmlinux] is a static EXPORT_SYMBOL ERROR: modpost: "ipv4_dst_check" [vmlinux] is a static EXPORT_SYMBOL ERROR: modpost: "ipv4_mtu" [vmlinux] is a static EXPORT_SYMBOL ERROR: modpost: "ip6_mtu" [vmlinux] is a static EXPORT_SYMBOL Fixes: f67fbeaebdc0 ("net: use indirect call helpers for dst_mtu") Fixes: bbd807dfbf20 ("net: indirect call helpers for ipv4/ipv6 dst_check functions") Reported-by: Stephen Rothwell Signed-off-by: Brian Vazquez --- net/ipv4/route.c | 4 ++-- net/ipv6/route.c | 4 ++-- 2 files changed, 4 insertions(+), 4 deletions(-) diff --git a/net/ipv4/route.c b/net/ipv4/route.c index 9e6537709794..be31e2446470 100644 --- a/net/ipv4/route.c +++ b/net/ipv4/route.c @@ -1206,7 +1206,7 @@ INDIRECT_CALLABLE_SCOPE struct dst_entry *ipv4_dst_check(struct dst_entry *dst, return NULL; return dst; } -EXPORT_SYMBOL(ipv4_dst_check); +EXPORT_INDIRECT_CALLABLE(ipv4_dst_check); static void ipv4_send_dest_unreach(struct sk_buff *skb) { @@ -1337,7 +1337,7 @@ INDIRECT_CALLABLE_SCOPE unsigned int ipv4_mtu(const struct dst_entry *dst) return mtu - lwtunnel_headroom(dst->lwtstate, mtu); } -EXPORT_SYMBOL(ipv4_mtu); +EXPORT_INDIRECT_CALLABLE(ipv4_mtu); static void ip_del_fnhe(struct fib_nh_common *nhc, __be32 daddr) { @@ diff --git a/net/ipv6/route.c b/net/ipv6/route.c index 8d9e053dc071..0d1784b0d65d 100644 --- a/net/ipv6/route.c +++ b/net/ipv6/route.c @@ -2644,7 +2644,7 @@ INDIRECT_CALLABLE_SCOPE struct dst_entry *ip6_dst_check(struct dst_entry *dst, return dst_ret; } -EXPORT_SYMBOL(ip6_dst_check); +EXPORT_INDIRECT_CALLABLE(ip6_dst_check); static struct dst_entry *ip6_negative_advice(struct dst_entry *dst) { @@ -3115,7 +3115,7 @@ INDIRECT_CALLABLE_SCOPE unsigned int ip6_mtu(const struct dst_entry *dst) return mtu - lwtunnel_headroom(dst->lwtstate, mtu); } -EXPORT_SYMBOL(ip6_mtu); +EXPORT_INDIRECT_CALLABLE(ip6_mtu); /* MTU selection: * 1.
mtu on route is locked - use it -- 2.30.0.365.g02bc693789-goog
Re: [PATCH] xfs: fix unused variable warning
On Thu, Feb 04, 2021 at 05:03:44PM +0100, Arnd Bergmann wrote: > From: Arnd Bergmann > > When debugging is disabled, the ASSERT() is left out and > the 'log' variable becomes unused: > > fs/xfs/xfs_log.c::16: error: unused variable 'log' > [-Werror,-Wunused-variable] > > Remove the variable declaration and open-code it inside > of the assertion. > > Fixes: 303591a0a947 ("xfs: cover the log during log quiesce") > Signed-off-by: Arnd Bergmann > --- I sent basically the same patch[1] about a week ago, but either one is fine with me: Reviewed-by: Brian Foster [1] https://lore.kernel.org/linux-xfs/20210125132616.GB2047559@bfoster/ > fs/xfs/xfs_log.c | 5 ++--- > 1 file changed, 2 insertions(+), 3 deletions(-) > > diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c > index 58699881c100..d8b814227734 100644 > --- a/fs/xfs/xfs_log.c > +++ b/fs/xfs/xfs_log.c > @@ -1108,12 +1108,11 @@ static int > xfs_log_cover( > struct xfs_mount*mp) > { > - struct xlog *log = mp->m_log; > int error = 0; > boolneed_covered; > > - ASSERT((xlog_cil_empty(log) && xlog_iclogs_empty(log) && > - !xfs_ail_min_lsn(log->l_ailp)) || > + ASSERT((xlog_cil_empty(mp->m_log) && xlog_iclogs_empty(mp->m_log) && > + !xfs_ail_min_lsn(mp->m_log->l_ailp)) || > XFS_FORCED_SHUTDOWN(mp)); > > if (!xfs_log_writable(mp)) > -- > 2.29.2 >
Re: [PATCH net-next 1/2] net: add EXPORT_INDIRECT_CALLABLE wrapper
Yeah, I'm also not seeing it on patchwork. But I did get the email on both corp and personal email. So maybe something is failing at patchwork? On Thu, Feb 4, 2021 at 9:50 AM George McCollister wrote: > > I don't see the second patch. > > Regards, > George McCollister
Re: linux-next: build failure after merge of the net-next tree
Hi all, I've just sent the patch series: https://patchwork.kernel.org/project/netdevbpf/list/?series=428099. Thanks, Brian On Thu, Feb 4, 2021 at 7:36 AM Vladimir Oltean wrote: > > Hi Brian, > > On Wed, Feb 03, 2021 at 07:52:08PM -0800, Brian Vazquez wrote: > > Hi Stephen, thanks for the report. I'm having trouble trying to > > compile for ppc, but I believe this should fix the problem, could you > > test this patch, please? Thanks! > > Could you please submit the patch formally to net-next? Thanks.
Re: linux-next: build failure after merge of the net-next tree
Hi Stephen, thanks for the report. I'm having trouble trying to compile for ppc, but I believe this should fix the problem, could you test this patch, please? Thanks! diff --git a/include/linux/indirect_call_wrapper.h b/include/linux/indirect_call_wrapper.h index 54c02c84906a..077f96be65c6 100644 --- a/include/linux/indirect_call_wrapper.h +++ b/include/linux/indirect_call_wrapper.h @@ -36,6 +36,7 @@ #define INDIRECT_CALLABLE_DECLARE(f) f #define INDIRECT_CALLABLE_SCOPE +#define INDIRECT_CALLABLE_EXPORT(f)EXPORT_SYMBOL(f) #else #define INDIRECT_CALL_1(f, f1, ...) f(__VA_ARGS__) @@ -44,6 +45,7 @@ #define INDIRECT_CALL_4(f, f4, f3, f2, f1, ...) f(__VA_ARGS__) #define INDIRECT_CALLABLE_DECLARE(f) #define INDIRECT_CALLABLE_SCOPEstatic +#define INDIRECT_CALLABLE_EXPORT(f) #endif /* diff --git a/net/ipv4/route.c b/net/ipv4/route.c index 9e6537709794..9dd8ff3887b7 100644 --- a/net/ipv4/route.c +++ b/net/ipv4/route.c @@ -1206,7 +1206,7 @@ INDIRECT_CALLABLE_SCOPE struct dst_entry *ipv4_dst_check(struct dst_entry *dst, return NULL; return dst; } -EXPORT_SYMBOL(ipv4_dst_check); +INDIRECT_CALLABLE_EXPORT(ipv4_dst_check); static void ipv4_send_dest_unreach(struct sk_buff *skb) { @@ -1337,7 +1337,7 @@ INDIRECT_CALLABLE_SCOPE unsigned int ipv4_mtu(const struct dst_entry *dst) return mtu - lwtunnel_headroom(dst->lwtstate, mtu); } -EXPORT_SYMBOL(ipv4_mtu); +INDIRECT_CALLABLE_EXPORT(ipv4_mtu); static void ip_del_fnhe(struct fib_nh_common *nhc, __be32 daddr) { diff --git a/net/ipv6/route.c b/net/ipv6/route.c index f447f82e6819..75d6a0db1fa6 100644 --- a/net/ipv6/route.c +++ b/net/ipv6/route.c @@ -2644,7 +2644,7 @@ INDIRECT_CALLABLE_SCOPE struct dst_entry *ip6_dst_check(struct dst_entry *dst, return dst_ret; } -EXPORT_SYMBOL(ip6_dst_check); +INDIRECT_CALLABLE_EXPORT(ip6_dst_check); static struct dst_entry *ip6_negative_advice(struct dst_entry *dst) { @@ -3115,7 +3115,7 @@ INDIRECT_CALLABLE_SCOPE unsigned int ip6_mtu(const struct dst_entry *dst) return mtu - 
lwtunnel_headroom(dst->lwtstate, mtu); } -EXPORT_SYMBOL(ip6_mtu); +INDIRECT_CALLABLE_EXPORT(ip6_mtu); /* MTU selection: * 1. mtu on route is locked - use it On Wed, Feb 3, 2021 at 5:33 PM Stephen Rothwell wrote: > > Hi all, > > After merging the net-next tree, today's linux-next build (powerpc > ppc64_defconfig) failed like this: > > ERROR: modpost: "ip6_dst_check" [vmlinux] is a static EXPORT_SYMBOL > ERROR: modpost: "ipv4_dst_check" [vmlinux] is a static EXPORT_SYMBOL > ERROR: modpost: "ipv4_mtu" [vmlinux] is a static EXPORT_SYMBOL > ERROR: modpost: "ip6_mtu" [vmlinux] is a static EXPORT_SYMBOL > > Caused by commits > > f67fbeaebdc0 ("net: use indirect call helpers for dst_mtu") > bbd807dfbf20 ("net: indirect call helpers for ipv4/ipv6 dst_check > functions") > > I have used the net-next tree from next-20210203 fot today. > > -- > Cheers, > Stephen Rothwell
Re: [PATCH] mwifiex: Report connected BSS with cfg80211_connect_bss()
On Sun, Jan 31, 2021 at 11:07 PM Yen-lin Lai wrote: > When a network is moved or reconfigured on a different channel, there > can be multiple BSSes with the same BSSID and SSID in scan result > before the old one expires. Then, it can cause cfg80211_connect_result > to map current_bss to a bss with the wrong channel. > > Let mwifiex_cfg80211_assoc return the selected BSS and then the caller > can report it to cfg80211_connect_bss. > > Signed-off-by: Yen-lin Lai This seems sane to me: Reviewed-by: Brian Norris
[PATCH net-next v3 2/4] net: use indirect call helpers for dst_output
This patch avoids the indirect call for the common case: ip6_output and ip_output Signed-off-by: Brian Vazquez --- include/net/dst.h | 8 +++- net/ipv4/ip_output.c | 1 + net/ipv6/ip6_output.c | 1 + 3 files changed, 9 insertions(+), 1 deletion(-) diff --git a/include/net/dst.h b/include/net/dst.h index 98cf6e8c06c4..3932e9931f08 100644 --- a/include/net/dst.h +++ b/include/net/dst.h @@ -436,10 +436,16 @@ static inline void dst_set_expires(struct dst_entry *dst, int timeout) dst->expires = expires; } +INDIRECT_CALLABLE_DECLARE(int ip6_output(struct net *, struct sock *, +struct sk_buff *)); +INDIRECT_CALLABLE_DECLARE(int ip_output(struct net *, struct sock *, +struct sk_buff *)); /* Output packet to network from transport. */ static inline int dst_output(struct net *net, struct sock *sk, struct sk_buff *skb) { - return skb_dst(skb)->output(net, sk, skb); + return INDIRECT_CALL_INET(skb_dst(skb)->output, + ip6_output, ip_output, + net, sk, skb); } INDIRECT_CALLABLE_DECLARE(int ip6_input(struct sk_buff *)); diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c index 959b94e32f2b..3aab53beb4ea 100644 --- a/net/ipv4/ip_output.c +++ b/net/ipv4/ip_output.c @@ -434,6 +434,7 @@ int ip_output(struct net *net, struct sock *sk, struct sk_buff *skb) ip_finish_output, !(IPCB(skb)->flags & IPSKB_REROUTED)); } +EXPORT_SYMBOL(ip_output); /* * copy saddr and daddr, possibly using 64bit load/stores diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c index 117cd95df213..ff4f9ebcf7f6 100644 --- a/net/ipv6/ip6_output.c +++ b/net/ipv6/ip6_output.c @@ -217,6 +217,7 @@ int ip6_output(struct net *net, struct sock *sk, struct sk_buff *skb) ip6_finish_output, !(IP6CB(skb)->flags & IP6SKB_REROUTED)); } +EXPORT_SYMBOL(ip6_output); bool ip6_autoflowlabel(struct net *net, const struct ipv6_pinfo *np) { -- 2.30.0.365.g02bc693789-goog
[PATCH net-next v3 3/4] net: use indirect call helpers for dst_mtu
This patch avoids the indirect call for the common case: ip6_mtu and ipv4_mtu Signed-off-by: Brian Vazquez --- include/net/dst.h | 4 +++- net/ipv4/route.c | 6 -- net/ipv6/route.c | 6 -- 3 files changed, 11 insertions(+), 5 deletions(-) diff --git a/include/net/dst.h b/include/net/dst.h index 3932e9931f08..9f474a79ed7d 100644 --- a/include/net/dst.h +++ b/include/net/dst.h @@ -194,9 +194,11 @@ dst_feature(const struct dst_entry *dst, u32 feature) return dst_metric(dst, RTAX_FEATURES) & feature; } +INDIRECT_CALLABLE_DECLARE(unsigned int ip6_mtu(const struct dst_entry *)); +INDIRECT_CALLABLE_DECLARE(unsigned int ipv4_mtu(const struct dst_entry *)); static inline u32 dst_mtu(const struct dst_entry *dst) { - return dst->ops->mtu(dst); + return INDIRECT_CALL_INET(dst->ops->mtu, ip6_mtu, ipv4_mtu, dst); } /* RTT metrics are stored in milliseconds for user ABI, but used as jiffies */ diff --git a/net/ipv4/route.c b/net/ipv4/route.c index e26652ff7059..4fac91f8bd6c 100644 --- a/net/ipv4/route.c +++ b/net/ipv4/route.c @@ -135,7 +135,8 @@ static int ip_rt_gc_timeout __read_mostly = RT_GC_TIMEOUT; static struct dst_entry *ipv4_dst_check(struct dst_entry *dst, u32 cookie); static unsigned int ipv4_default_advmss(const struct dst_entry *dst); -static unsigned int ipv4_mtu(const struct dst_entry *dst); +INDIRECT_CALLABLE_SCOPE +unsigned int ipv4_mtu(const struct dst_entry *dst); static struct dst_entry *ipv4_negative_advice(struct dst_entry *dst); static void ipv4_link_failure(struct sk_buff *skb); static void ip_rt_update_pmtu(struct dst_entry *dst, struct sock *sk, @@ -1311,7 +1312,7 @@ static unsigned int ipv4_default_advmss(const struct dst_entry *dst) return min(advmss, IPV4_MAX_PMTU - header_size); } -static unsigned int ipv4_mtu(const struct dst_entry *dst) +INDIRECT_CALLABLE_SCOPE unsigned int ipv4_mtu(const struct dst_entry *dst) { const struct rtable *rt = (const struct rtable *)dst; unsigned int mtu = rt->rt_pmtu; @@ -1333,6 +1334,7 @@ static unsigned int 
ipv4_mtu(const struct dst_entry *dst) return mtu - lwtunnel_headroom(dst->lwtstate, mtu); } +EXPORT_SYMBOL(ipv4_mtu); static void ip_del_fnhe(struct fib_nh_common *nhc, __be32 daddr) { diff --git a/net/ipv6/route.c b/net/ipv6/route.c index 41d8f801b75f..4d83700d5602 100644 --- a/net/ipv6/route.c +++ b/net/ipv6/route.c @@ -83,7 +83,8 @@ enum rt6_nud_state { static struct dst_entry*ip6_dst_check(struct dst_entry *dst, u32 cookie); static unsigned int ip6_default_advmss(const struct dst_entry *dst); -static unsigned int ip6_mtu(const struct dst_entry *dst); +INDIRECT_CALLABLE_SCOPE +unsigned int ip6_mtu(const struct dst_entry *dst); static struct dst_entry *ip6_negative_advice(struct dst_entry *); static voidip6_dst_destroy(struct dst_entry *); static voidip6_dst_ifdown(struct dst_entry *, @@ -3089,7 +3090,7 @@ static unsigned int ip6_default_advmss(const struct dst_entry *dst) return mtu; } -static unsigned int ip6_mtu(const struct dst_entry *dst) +INDIRECT_CALLABLE_SCOPE unsigned int ip6_mtu(const struct dst_entry *dst) { struct inet6_dev *idev; unsigned int mtu; @@ -3111,6 +3112,7 @@ static unsigned int ip6_mtu(const struct dst_entry *dst) return mtu - lwtunnel_headroom(dst->lwtstate, mtu); } +EXPORT_SYMBOL(ip6_mtu); /* MTU selection: * 1. mtu on route is locked - use it -- 2.30.0.365.g02bc693789-goog
[PATCH net-next v3 4/4] net: indirect call helpers for ipv4/ipv6 dst_check functions
This patch avoids the indirect call for the common case: ip6_dst_check and ipv4_dst_check Signed-off-by: Brian Vazquez --- include/net/dst.h | 7 ++- net/core/sock.c | 12 ++-- net/ipv4/route.c| 7 +-- net/ipv4/tcp_ipv4.c | 5 - net/ipv6/route.c| 7 +-- net/ipv6/tcp_ipv6.c | 5 - 6 files changed, 34 insertions(+), 9 deletions(-) diff --git a/include/net/dst.h b/include/net/dst.h index 9f474a79ed7d..26f134ad3a25 100644 --- a/include/net/dst.h +++ b/include/net/dst.h @@ -459,10 +459,15 @@ static inline int dst_input(struct sk_buff *skb) ip6_input, ip_local_deliver, skb); } +INDIRECT_CALLABLE_DECLARE(struct dst_entry *ip6_dst_check(struct dst_entry *, + u32)); +INDIRECT_CALLABLE_DECLARE(struct dst_entry *ipv4_dst_check(struct dst_entry *, + u32)); static inline struct dst_entry *dst_check(struct dst_entry *dst, u32 cookie) { if (dst->obsolete) - dst = dst->ops->check(dst, cookie); + dst = INDIRECT_CALL_INET(dst->ops->check, ip6_dst_check, +ipv4_dst_check, dst, cookie); return dst; } diff --git a/net/core/sock.c b/net/core/sock.c index 648a5cb46209..0ed98f20448a 100644 --- a/net/core/sock.c +++ b/net/core/sock.c @@ -526,11 +526,17 @@ int __sk_receive_skb(struct sock *sk, struct sk_buff *skb, } EXPORT_SYMBOL(__sk_receive_skb); +INDIRECT_CALLABLE_DECLARE(struct dst_entry *ip6_dst_check(struct dst_entry *, + u32)); +INDIRECT_CALLABLE_DECLARE(struct dst_entry *ipv4_dst_check(struct dst_entry *, + u32)); struct dst_entry *__sk_dst_check(struct sock *sk, u32 cookie) { struct dst_entry *dst = __sk_dst_get(sk); - if (dst && dst->obsolete && dst->ops->check(dst, cookie) == NULL) { + if (dst && dst->obsolete && + INDIRECT_CALL_INET(dst->ops->check, ip6_dst_check, ipv4_dst_check, + dst, cookie) == NULL) { sk_tx_queue_clear(sk); sk->sk_dst_pending_confirm = 0; RCU_INIT_POINTER(sk->sk_dst_cache, NULL); @@ -546,7 +552,9 @@ struct dst_entry *sk_dst_check(struct sock *sk, u32 cookie) { struct dst_entry *dst = sk_dst_get(sk); - if (dst && dst->obsolete && dst->ops->check(dst, cookie) == 
NULL) { + if (dst && dst->obsolete && + INDIRECT_CALL_INET(dst->ops->check, ip6_dst_check, ipv4_dst_check, + dst, cookie) == NULL) { sk_dst_reset(sk); dst_release(dst); return NULL; diff --git a/net/ipv4/route.c b/net/ipv4/route.c index 4fac91f8bd6c..9e6537709794 100644 --- a/net/ipv4/route.c +++ b/net/ipv4/route.c @@ -133,7 +133,8 @@ static int ip_rt_gc_timeout __read_mostly = RT_GC_TIMEOUT; * Interface to generic destination cache. */ -static struct dst_entry *ipv4_dst_check(struct dst_entry *dst, u32 cookie); +INDIRECT_CALLABLE_SCOPE +struct dst_entry *ipv4_dst_check(struct dst_entry *dst, u32 cookie); static unsigned int ipv4_default_advmss(const struct dst_entry *dst); INDIRECT_CALLABLE_SCOPE unsigned int ipv4_mtu(const struct dst_entry *dst); @@ -1188,7 +1189,8 @@ void ipv4_sk_redirect(struct sk_buff *skb, struct sock *sk) } EXPORT_SYMBOL_GPL(ipv4_sk_redirect); -static struct dst_entry *ipv4_dst_check(struct dst_entry *dst, u32 cookie) +INDIRECT_CALLABLE_SCOPE struct dst_entry *ipv4_dst_check(struct dst_entry *dst, +u32 cookie) { struct rtable *rt = (struct rtable *) dst; @@ -1204,6 +1206,7 @@ static struct dst_entry *ipv4_dst_check(struct dst_entry *dst, u32 cookie) return NULL; return dst; } +EXPORT_SYMBOL(ipv4_dst_check); static void ipv4_send_dest_unreach(struct sk_buff *skb) { diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c index 777306b5bc22..611039207d30 100644 --- a/net/ipv4/tcp_ipv4.c +++ b/net/ipv4/tcp_ipv4.c @@ -1649,6 +1649,8 @@ u16 tcp_v4_get_syncookie(struct sock *sk, struct iphdr *iph, return mss; } +INDIRECT_CALLABLE_DECLARE(struct dst_entry *ipv4_dst_check(struct dst_entry *, + u32)); /* The socket must have it's spinlock held when we get * here, unless it is a TCP_LISTEN socket. * @@ -1668,7 +1670,8 @@ int tcp_v4_do_rcv(struct sock *sk, struct sk_buff *skb) sk_mark_napi_id(sk, skb); if (dst) { if (inet_sk(sk)->rx_dst_ifindex != skb->skb_iif || - !dst->ops->check(ds
[PATCH net-next v3 0/4] net: use INDIRECT_CALL in some dst_ops
This patch series uses the INDIRECT_CALL wrappers in some dst_ops functions to mitigate retpoline costs. Benefits depend on the platform as described below. Background: The kernel rewrites the retpoline code at __x86_indirect_thunk_r11 depending on the CPU's requirements. The INDIRECT_CALL wrappers provide hints on possible targets and save the retpoline overhead using a direct call in case the target matches one of the hints. The retpoline overhead for the following three cases has been measured by Luigi Rizzo in microbenchmarks, using CPU performance counters, and covers reasonably well the range of possible retpoline overheads compared to a plain indirect call (in equal conditions, specifically with predicted branch, hot cache): - just "jmp *(%r11)" on modern platforms like Intel Cascadelake. In this case the overhead is just 2 clock cycles. - "lfence; jmp *(%r11)" on e.g. some recent AMD CPUs. In this case the lfence is blocked until pending reads complete, so the actual overhead depends on previous instructions. In the best case we measured 15 clock cycles of overhead. - worst case, e.g. skylake, the full retpoline is used: __x86_indirect_thunk_r11: call set_up_target capture_speculation: pause lfence jmp capture_speculation .align 16 set_up_target: mov %r11, (%rsp) ret In this case the overhead has been measured in 35-40 clock cycles. The actual time saved hence depends on the platform and current clock speed (which varies heavily, especially when C-states are active). Also note that actual benefit might be lower than expected if the longer retpoline overlaps with some pending memory read. MEASUREMENTS: The INDIRECT_CALL wrappers in this patchset involve the processing of incoming SYN and generation of syncookies. Hence, the test has been run by configuring a receiving host with a single NIC rx queue, disabling RPS and RFS so that all processing occurs on the same core. An external source generates SYN fast enough to saturate the receiving CPU.
We ran two sets of experiments, with and without the dst_output patch, comparing the number of syncookies generated over a 20s period in multiple runs. Assuming the CPU is saturated, the time per packet is t = number_of_packets/total_time and if the two datasets have statistically meaningful difference, the difference in times between the two cases gives an estimate of the benefits from one INDIRECT_CALL. Here are the experimental results: Skylake Syncookies over 20s (5 tests) --- indirect9166325 9182023 9170093 9134014 9171082 retpoline 9099308 9126350 9154841 9056377 9122376 Computing the stats on the ns_pkt = 20e6/total_packets gives the following: $ ministat -c 95 -w 70 /tmp/sk-indirect /tmp/sk-retp x /tmp/sk-indirect + /tmp/sk-retp +--+ |x xx x + x+ + + +| ||__M__A___|_|M_A___| | +--+ N Min MaxMedian AvgStddev x 5 2.17817e-06 2.18962e-06 2.181e-06 2.182292e-06 4.3252133e-09 + 5 2.18464e-06 2.20839e-06 2.19241e-06 2.194974e-06 8.8695958e-09 Difference at 95.0% confidence 1.2682e-08 +/- 1.01766e-08 0.581132% +/- 0.466326% (Student's t, pooled s = 6.97772e-09) This suggests a difference of 13ns +/- 10ns Our expectation from microbenchmarks was 35-40 cycles per call, but part of the gains may be eaten by stalls from pending memory reads. 
For Cascadelake: Cascadelake Syncookies over 20s (5 tests) - indirect 10339797 10297547 10366826 10378891 10384854 retpoline10332674 10366805 10320374 10334272 10374087 Computing the stats on the ns_pkt = 20e6/total_packets gives no meaningful difference even at just 80% (this was expected): $ ministat -c 80 -w 70 /tmp/cl-indirect /tmp/cl-retp x /tmp/cl-indirect + /tmp/cl-retp +--+ | xx + * x + ++x| ||__|_M_A_A___M|___| | +--+ N Min MaxMedian AvgStddev x 5 1.92588e-06 1.94221e-06 1.92923e-06 1.931716e-06 6.6936746e-09 + 5 1.92788e-06 1.93791e-06 1.93531e-06 1.933188e-06 4.3734106e-09 No difference proven at 80.0% confidence Changed in v3: - fix From: tag - provide measurements Changed in v2: -fix build issues reported by kernel test robot Brian Vazquez (4): net:
[PATCH net-next v3 1/4] net: use indirect call helpers for dst_input
This patch avoids the indirect call for the common case:
ip_local_deliver and ip6_input

Signed-off-by: Brian Vazquez
---
 include/net/dst.h   | 6 +-
 net/ipv4/ip_input.c | 1 +
 2 files changed, 6 insertions(+), 1 deletion(-)

diff --git a/include/net/dst.h b/include/net/dst.h
index 10f0a8399867..98cf6e8c06c4 100644
--- a/include/net/dst.h
+++ b/include/net/dst.h
@@ -18,6 +18,7 @@
 #include
 #include
 #include
+#include

 struct sk_buff;

@@ -441,10 +442,13 @@ static inline int dst_output(struct net *net, struct sock *sk, struct sk_buff *s
 	return skb_dst(skb)->output(net, sk, skb);
 }

+INDIRECT_CALLABLE_DECLARE(int ip6_input(struct sk_buff *));
+INDIRECT_CALLABLE_DECLARE(int ip_local_deliver(struct sk_buff *));
 /* Input packet from network to transport.  */
 static inline int dst_input(struct sk_buff *skb)
 {
-	return skb_dst(skb)->input(skb);
+	return INDIRECT_CALL_INET(skb_dst(skb)->input,
+				  ip6_input, ip_local_deliver, skb);
 }

 static inline struct dst_entry *dst_check(struct dst_entry *dst, u32 cookie)

diff --git a/net/ipv4/ip_input.c b/net/ipv4/ip_input.c
index b0c244af1e4d..3a025c011971 100644
--- a/net/ipv4/ip_input.c
+++ b/net/ipv4/ip_input.c
@@ -253,6 +253,7 @@ int ip_local_deliver(struct sk_buff *skb)
 		       net, NULL, skb, skb->dev, NULL,
 		       ip_local_deliver_finish);
 }
+EXPORT_SYMBOL(ip_local_deliver);

 static inline bool ip_rcv_options(struct sk_buff *skb, struct net_device *dev)
 {
-- 
2.30.0.365.g02bc693789-goog
Re: [PATCH] drm/msm/mdp5: Fix wait-for-commit for cmd panels
On Wed, Jan 27, 2021 at 05:24:40PM +0200, Iskren Chernev wrote: > Before the offending commit in msm_atomic_commit_tail wait_flush was > called once per frame, after the commit was submitted. After it > wait_flush is also called at the beginning to ensure previous > potentially async commits are done. > > For cmd panels the source of wait_flush is a ping-pong irq notifying > a completion. The completion needs to be notified with complete_all so > multiple waiting parties (new async committers) can proceed. > > Signed-off-by: Iskren Chernev > Suggested-by: Rob Clark > Fixes: 2d99ced787e3d ("drm/msm: async commit support") Nice job tracking down this fix! Reviewed-by: Brian Masney Tested-by: Brian Masney
Re: [PATCH] soc: qcom: ocmem: don't return NULL in of_get_ocmem
On Sat, Jan 30, 2021 at 03:23:49PM +0100, Luca Weiss wrote: > If ocmem probe fails for whatever reason, of_get_ocmem returned NULL. > Without this, users must check for both NULL and IS_ERR on the returned > pointer - which didn't happen in drivers/gpu/drm/msm/adreno/adreno_gpu.c > leading to a NULL pointer dereference. > > Fixes: 88c1e9404f1d ("soc: qcom: add OCMEM driver") > Signed-off-by: Luca Weiss Reviewed-by: Brian Masney
Re: [PATCH V2 3/6] x86_32/sysenter: switch to the task stack without emptying the entry stack
On Mon, Jan 25, 2021 at 11:35 AM Lai Jiangshan wrote:
>
> From: Lai Jiangshan
>
> Like the way x86_64 uses the "old" stack, we can save the entry stack
> pointer to a register and switch to the task stack. So that we have
> space on the "old" stack to save more things or scratch registers.
>
> Signed-off-by: Lai Jiangshan
> ---
>  arch/x86/entry/entry_32.S | 11 +--
>  1 file changed, 5 insertions(+), 6 deletions(-)
>
> diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S
> index 3b4d1a63d1f0..4513702ba45d 100644
> --- a/arch/x86/entry/entry_32.S
> +++ b/arch/x86/entry/entry_32.S
> @@ -905,19 +905,18 @@ SYM_FUNC_START(entry_SYSENTER_32)
> 	pushl	%eax
> 	BUG_IF_WRONG_CR3 no_user_check=1
> 	SWITCH_TO_KERNEL_CR3 scratch_reg=%eax
> -	popl	%eax
> -	popfl
>
> -	/* Stack empty again, switch to task stack */
> -	movl	TSS_entry2task_stack(%esp), %esp
> +	/* Switch to task stack */
> +	movl	%esp, %eax
> +	movl	(2*4+TSS_entry2task_stack)(%esp), %esp
>
> .Lsysenter_past_esp:
> 	pushl	$__USER_DS		/* pt_regs->ss */
> 	pushl	$0			/* pt_regs->sp (placeholder) */
> -	pushfl				/* pt_regs->flags (except IF = 0) */
> +	pushl	4(%eax)			/* pt_regs->flags (except IF = 0) */

__KERNEL_DS isn't loaded at this point, so this needs an explicit %ss:
override. You probably didn't catch this because the default __USER_DS
was still loaded.

> 	pushl	$__USER_CS		/* pt_regs->cs */
> 	pushl	$0			/* pt_regs->ip = 0 (placeholder) */
> -	pushl	%eax			/* pt_regs->orig_ax */
> +	pushl	(%eax)			/* pt_regs->orig_ax */

Add an %ss: override here too.

> 	SAVE_ALL pt_regs_ax=$-ENOSYS	/* save rest, stack already switched */
>
> 	/*
> --
> 2.19.1.6.gb485710b
>

--
Brian Gerst
Re: linux-next: build warning after merge of the xfs tree
On Mon, Jan 25, 2021 at 09:55:32AM +1100, Stephen Rothwell wrote:
> Hi all,
>
> After merging the xfs tree, today's linux-next build (powerpc
> ppc64_defconfig) produced this warning:
>
> fs/xfs/xfs_log.c: In function 'xfs_log_cover':
> fs/xfs/xfs_log.c::16: warning: unused variable 'log' [-Wunused-variable]
>  |  struct xlog *log = mp->m_log;
>  |               ^~~
>
> Introduced by commit
>
>   303591a0a947 ("xfs: cover the log during log quiesce")
>

Oops, patch below. Feel free to apply or squash into the original
commit.

Brian

--- 8< ---

From 6078f06e2bd4c82111a85a2032c39a56654b0be6 Mon Sep 17 00:00:00 2001
From: Brian Foster
Date: Mon, 25 Jan 2021 08:22:56 -0500
Subject: [PATCH] xfs: fix unused log variable in xfs_log_cover()

The log variable is only used in kernels with asserts enabled. Remove
it and open code the dereference to avoid unused variable warnings.

Signed-off-by: Brian Foster
---
 fs/xfs/xfs_log.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
index 58699881c100..d8b814227734 100644
--- a/fs/xfs/xfs_log.c
+++ b/fs/xfs/xfs_log.c
@@ -1108,12 +1108,11 @@ static int
 xfs_log_cover(
 	struct xfs_mount	*mp)
 {
-	struct xlog		*log = mp->m_log;
 	int			error = 0;
 	bool			need_covered;
 
-	ASSERT((xlog_cil_empty(log) && xlog_iclogs_empty(log) &&
-	        !xfs_ail_min_lsn(log->l_ailp)) ||
+	ASSERT((xlog_cil_empty(mp->m_log) && xlog_iclogs_empty(mp->m_log) &&
+	        !xfs_ail_min_lsn(mp->m_log->l_ailp)) ||
 	       XFS_FORCED_SHUTDOWN(mp));
 
 	if (!xfs_log_writable(mp))
-- 
2.26.2
Re: [PATCH 1/4] ARM: dts: qcom: msm8974: add gpu support
On Sun, Jan 24, 2021 at 03:56:06PM +0100, Konrad Dybcio wrote:
> Hi,
>
> > +		gpu_opp_table: opp_table {
> > +			status = "disabled";
>
> Is there a good reason to disable this?
>
> > +			opp-8 {
> > +				opp-hz = /bits/ 64 <8>;
> > +			};
>
> No. A330 can't go lightspeed (unless there's some secret ultra-binned
> msm8974pro-v4-ad-5g). [1]

Hi Iskren,

I believe that the Nexus 5 GPU frequencies are documented in the
downstream kernel at:

https://github.com/AICP/kernel_lge_hammerhead/blob/n7.1/arch/arm/boot/dts/msm8974-gpu.dtsi#L67

I am fairly certain that the qcom,bus-freq property is an index into
the qcom,msm-bus,vectors-KBps property above. This will map to the
interconnect and operating points in the upstream kernel. Note that the
actual implementation in a3xx_gpu.c and a4xx_gpu.c currently has this
snippet to set the bus speed:

	/*
	 * Set the ICC path to maximum speed for now by multiplying the
	 * fastest frequency by the bus width (8). We'll want to scale this
	 * later on to improve battery life.
	 */
	icc_set_bw(icc_path, 0, Bps_to_icc(gpu->fast_rate) * 8);
	icc_set_bw(ocmem_icc_path, 0, Bps_to_icc(gpu->fast_rate) * 8);

This should be fine for the time being. You'll want to document it
correctly in device tree though.

If the v2 changes too much, then feel free to drop my name from the
patch. I thought that I had made these changes already but apparently
not. :/

Brian
Re: [PATCH] xfs: set inode size after creating symlink
On Thu, Jan 21, 2021 at 09:19:12AM -0600, Jeffrey Mitchell wrote: > When XFS creates a new symlink, it writes its size to disk but not to the > VFS inode. This causes i_size_read() to return 0 for that symlink until > it is re-read from disk, for example when the system is rebooted. > > I found this inconsistency while protecting directories with eCryptFS. > The command "stat path/to/symlink/in/ecryptfs" will report "Size: 0" if > the symlink was created after the last reboot on an XFS root. > > Call i_size_write() in xfs_symlink() > > Signed-off-by: Jeffrey Mitchell > --- Reviewed-by: Brian Foster > fs/xfs/xfs_symlink.c | 1 + > 1 file changed, 1 insertion(+) > > diff --git a/fs/xfs/xfs_symlink.c b/fs/xfs/xfs_symlink.c > index 1f43fd7f3209..c835827ae389 100644 > --- a/fs/xfs/xfs_symlink.c > +++ b/fs/xfs/xfs_symlink.c > @@ -300,6 +300,7 @@ xfs_symlink( > } > ASSERT(pathlen == 0); > } > + i_size_write(VFS_I(ip), ip->i_d.di_size); > > /* >* Create the directory entry for the symlink. > -- > 2.25.1 >
Re: [PATCH] marvell/mwifiex: replace one-element array with flexible-array member.
One more thing, for context: On Tue, Jan 19, 2021 at 11:11 AM Brian Norris wrote: > On Fri, Jan 15, 2021 at 1:39 AM Jiapeng Zhong > wrote: > > > > Fix the follow coccicheck warnings: > > > > ./drivers/net/wireless/marvell/mwifiex/fw.h: WARNING use flexible-array > > member instead(https://www.kernel.org/doc/html/latest/process/ > > deprecated.html#zero-length-and-one-element-arrays) > > > > Reported-by: Abaci Robot > > Signed-off-by: Jiapeng Zhong > > Past experience unfortunately requires me to ask: did you test your > changes? I understand that's a mostly legit warning, and a good > deprecation notice, but that doesn't mean this is the right fix. One > probably should instead audit the usage sites to see if they *are* > already making proper sizeof (or other) comparisons, and if not, fix > those first. And if any sites *are* doing correct sizeof computations > using the existing struct layouts, then you are probably breaking > them. > > Or if you have audited the usage of these structs, it's nice to make a > small explanation of why this is correct, so I (and other readers) > don't have to ask these questions :) FYI, there are others who I believe are making similar blind changes to this code: [PATCH 1/1] mwifiex: Fix possible buffer overflows in mwifiex_config_scan https://lore.kernel.org/linux-wireless/CA+ASDXPkLg2GGFJTt25YO7wae==yahftf8jxu520pl_vzat...@mail.gmail.com/ For that particular case (the 'ssid' field in mwifiex_ie_types_wildcard_ssid_params), the previous patch-er was incorrect, and I believe your change is a better one. But neither of you provided useful analysis. Brian
Re: [PATCH] marvell/mwifiex: replace one-element array with flexible-array member.
Hi, On Fri, Jan 15, 2021 at 1:39 AM Jiapeng Zhong wrote: > > Fix the follow coccicheck warnings: > > ./drivers/net/wireless/marvell/mwifiex/fw.h: WARNING use flexible-array > member instead(https://www.kernel.org/doc/html/latest/process/ > deprecated.html#zero-length-and-one-element-arrays) > > Reported-by: Abaci Robot > Signed-off-by: Jiapeng Zhong Past experience unfortunately requires me to ask: did you test your changes? I understand that's a mostly legit warning, and a good deprecation notice, but that doesn't mean this is the right fix. One probably should instead audit the usage sites to see if they *are* already making proper sizeof (or other) comparisons, and if not, fix those first. And if any sites *are* doing correct sizeof computations using the existing struct layouts, then you are probably breaking them. Or if you have audited the usage of these structs, it's nice to make a small explanation of why this is correct, so I (and other readers) don't have to ask these questions :) Regards, Brian
RE: [PATCH 04/18] arch: hexagon: Don't select HAVE_OPROFILE
> -Original Message- > From: Viresh Kumar > Sent: Thursday, January 14, 2021 5:35 AM ... > The "oprofile" user-space tools don't use the kernel OPROFILE support any > more, and haven't in a long time. User-space has been converted to the perf > interfaces. > > Don't select HAVE_OPROFILE for hexagon anymore. Acked-by: Brian Cain > Suggested-by: Christoph Hellwig > Suggested-by: Linus Torvalds > Signed-off-by: Viresh Kumar > --- > arch/hexagon/Kconfig | 1 - > 1 file changed, 1 deletion(-) > > diff --git a/arch/hexagon/Kconfig b/arch/hexagon/Kconfig index > 6e00c16a36b5..44a409967af1 100644 > --- a/arch/hexagon/Kconfig > +++ b/arch/hexagon/Kconfig > @@ -7,7 +7,6 @@ config HEXAGON > select ARCH_32BIT_OFF_T > select ARCH_HAS_SYNC_DMA_FOR_DEVICE > select ARCH_NO_PREEMPT > - select HAVE_OPROFILE > # Other pending projects/to-do items. > # select HAVE_REGS_AND_STACK_ACCESS_API > # select HAVE_HW_BREAKPOINT if PERF_EVENTS > -- > 2.25.0.rc1.19.g042ed3e048af
RE: [PATCH] hexagon: remove CONFIG_EXPERIMENTAL from defconfigs
> -Original Message- > From: Randy Dunlap ... > Since CONFIG_EXPERIMENTAL was removed in 2013, go ahead and drop it > from any defconfig files. Acked-by: Brian Cain > Fixes: 3d374d09f16f ("final removal of CONFIG_EXPERIMENTAL") > Signed-off-by: Randy Dunlap > Cc: Kees Cook > Cc: Greg Kroah-Hartman > Cc: Brian Cain > Cc: linux-hexa...@vger.kernel.org > Cc: Andrew Morton > --- > arch/hexagon/configs/comet_defconfig |1 - > 1 file changed, 1 deletion(-) > > --- linux-next-20210114.orig/arch/hexagon/configs/comet_defconfig > +++ linux-next-20210114/arch/hexagon/configs/comet_defconfig > @@ -1,7 +1,6 @@ > CONFIG_SMP=y > CONFIG_DEFAULT_MMAP_MIN_ADDR=0 > CONFIG_HZ_100=y > -CONFIG_EXPERIMENTAL=y > CONFIG_CROSS_COMPILE="hexagon-" > CONFIG_LOCALVERSION="-smp" > # CONFIG_LOCALVERSION_AUTO is not set
Re: [PATCH v5 00/21] ibmvfc: initial MQ development/enablement
Tyrel, I think this patch series is looking pretty good. I don't think we need to wait for resolution of the can_queue issue being discussed, since that is an issue that exists prior to this patch series and this patch series doesn't make the issue any worse. Let's work that separately. Thanks, Brian -- Brian King Power Linux I/O IBM Linux Technology Center
Re: [PATCH v5 21/21] ibmvfc: provide modules parameters for MQ settings
Reviewed-by: Brian King -- Brian King Power Linux I/O IBM Linux Technology Center
Re: [PATCH v5 18/21] ibmvfc: send Cancel MAD down each hw scsi channel
Reviewed-by: Brian King -- Brian King Power Linux I/O IBM Linux Technology Center
Re: [PATCH v4 01/21] ibmvfc: add vhost fields and defaults for MQ enablement
On 1/13/21 7:27 PM, Ming Lei wrote: > On Wed, Jan 13, 2021 at 11:13:07AM -0600, Brian King wrote: >> On 1/12/21 6:33 PM, Tyrel Datwyler wrote: >>> On 1/12/21 2:54 PM, Brian King wrote: >>>> On 1/11/21 5:12 PM, Tyrel Datwyler wrote: >>>>> Introduce several new vhost fields for managing MQ state of the adapter >>>>> as well as initial defaults for MQ enablement. >>>>> >>>>> Signed-off-by: Tyrel Datwyler >>>>> --- >>>>> drivers/scsi/ibmvscsi/ibmvfc.c | 8 >>>>> drivers/scsi/ibmvscsi/ibmvfc.h | 9 + >>>>> 2 files changed, 17 insertions(+) >>>>> >>>>> diff --git a/drivers/scsi/ibmvscsi/ibmvfc.c >>>>> b/drivers/scsi/ibmvscsi/ibmvfc.c >>>>> index ba95438a8912..9200fe49c57e 100644 >>>>> --- a/drivers/scsi/ibmvscsi/ibmvfc.c >>>>> +++ b/drivers/scsi/ibmvscsi/ibmvfc.c >>>>> @@ -3302,6 +3302,7 @@ static struct scsi_host_template driver_template = { >>>>> .max_sectors = IBMVFC_MAX_SECTORS, >>>>> .shost_attrs = ibmvfc_attrs, >>>>> .track_queue_depth = 1, >>>>> + .host_tagset = 1, >>>> >>>> This doesn't seem right. You are setting host_tagset, which means you want >>>> a >>>> shared, host wide, tag set for commands. It also means that the total >>>> queue depth for the host is can_queue. However, it looks like you are >>>> allocating >>>> max_requests events for each sub crq, which means you are over allocating >>>> memory. >>> >>> With the shared tagset yes the queue depth for the host is can_queue, but >>> this >>> also implies that the max queue depth for each hw queue is also can_queue. >>> So, >>> in the worst case that all commands are queued down the same hw queue we >>> need an >>> event pool with can_queue commands. >>> >>>> >>>> Looking at this closer, we might have bigger problems. There is a host wide >>>> max number of commands that the VFC host supports, which gets returned on >>>> NPIV Login. This value can change across a live migration event. >>> >>> From what I understand the max commands can only become less. 
>>> >>>> >>>> The ibmvfc driver, which does the same thing the lpfc driver does, modifies >>>> can_queue on the scsi_host *after* the tag set has been allocated. This >>>> looks >>>> to be a concern with ibmvfc, not sure about lpfc, as it doesn't look like >>>> we look at can_queue once the tag set is setup, and I'm not seeing a good >>>> way >>>> to dynamically change the host queue depth once the tag set is setup. >>>> >>>> Unless I'm missing something, our best options appear to either be to >>>> implement >>>> our own host wide busy reference counting, which doesn't sound very good, >>>> or >>>> we need to add some API to block / scsi that allows us to dynamically >>>> change >>>> can_queue. >>> >>> Changing can_queue won't do use any good with the shared tagset becasue each >>> queue still needs to be able to queue can_queue number of commands in the >>> worst >>> case. >> >> The issue I'm trying to highlight here is the following scenario: >> >> 1. We set shost->can_queue, then call scsi_add_host, which allocates the tag >> set. >> >> 2. On our NPIV login response from the VIOS, we might get a lower value than >> we >> initially set in shost->can_queue, so we update it, but nobody ever looks at >> it >> again, and we don't have any protection against sending too many commands to >> the host. >> >> >> Basically, we no longer have any code that ensures we don't send more >> commands to the VIOS than we are told it supports. According to the >> architecture, >> if we actually do this, the VIOS will do an h_free_crq, which would be a bit >> of a bug on our part. >> >> I don't think it was ever clearly defined in the API that a driver can >> change shost->can_queue after calling scsi_add_host, but up until >> commit 6eb045e092efefafc6687409a6fa6d1dabf0fb69, this worked and now >> it doesn't. > > Actually it isn't related with commit 6eb045e092ef, because > blk_mq_alloc_tag_set() > uses .can_queue to create driver tag sbitmap and requ
Re: [PATCH v4 00/21] ibmvfc: initial MQ development
With the exception of the few comments I've shared, the rest of this looks good to me and you can add my: Reviewed-by: Brian King Thanks, Brian -- Brian King Power Linux I/O IBM Linux Technology Center
Re: [PATCH v4 21/21] ibmvfc: provide modules parameters for MQ settings
On 1/11/21 5:12 PM, Tyrel Datwyler wrote: > @@ -5880,12 +5936,13 @@ static int ibmvfc_probe(struct vio_dev *vdev, const > struct vio_device_id *id) > > shost->transportt = ibmvfc_transport_template; > shost->can_queue = max_requests; > + shost->can_queue = (max_requests / nr_scsi_hw_queues); This looks to be in conflict with the fact that the first patch requested a shared tag set, right? > shost->max_lun = max_lun; > shost->max_id = max_targets; > shost->max_sectors = IBMVFC_MAX_SECTORS; > shost->max_cmd_len = IBMVFC_MAX_CDB_LEN; > shost->unique_id = shost->host_no; > - shost->nr_hw_queues = IBMVFC_MQ ? IBMVFC_SCSI_HW_QUEUES : 1; > + shost->nr_hw_queues = mq_enabled ? nr_scsi_hw_queues : 1; You might want to range check this, to make sure its sane. > > vhost = shost_priv(shost); > INIT_LIST_HEAD(&vhost->targets); > @@ -5897,8 +5954,8 @@ static int ibmvfc_probe(struct vio_dev *vdev, const > struct vio_device_id *id) > vhost->log_level = log_level; > vhost->task_set = 1; > > - vhost->mq_enabled = IBMVFC_MQ; > - vhost->client_scsi_channels = IBMVFC_SCSI_CHANNELS; > + vhost->mq_enabled = mq_enabled; > + vhost->client_scsi_channels = min(nr_scsi_hw_queues, nr_scsi_channels); > vhost->using_channels = 0; > vhost->do_enquiry = 1; > > -- Brian King Power Linux I/O IBM Linux Technology Center
Re: [PATCH v4 01/21] ibmvfc: add vhost fields and defaults for MQ enablement
On 1/12/21 6:33 PM, Tyrel Datwyler wrote: > On 1/12/21 2:54 PM, Brian King wrote: >> On 1/11/21 5:12 PM, Tyrel Datwyler wrote: >>> Introduce several new vhost fields for managing MQ state of the adapter >>> as well as initial defaults for MQ enablement. >>> >>> Signed-off-by: Tyrel Datwyler >>> --- >>> drivers/scsi/ibmvscsi/ibmvfc.c | 8 >>> drivers/scsi/ibmvscsi/ibmvfc.h | 9 + >>> 2 files changed, 17 insertions(+) >>> >>> diff --git a/drivers/scsi/ibmvscsi/ibmvfc.c b/drivers/scsi/ibmvscsi/ibmvfc.c >>> index ba95438a8912..9200fe49c57e 100644 >>> --- a/drivers/scsi/ibmvscsi/ibmvfc.c >>> +++ b/drivers/scsi/ibmvscsi/ibmvfc.c >>> @@ -3302,6 +3302,7 @@ static struct scsi_host_template driver_template = { >>> .max_sectors = IBMVFC_MAX_SECTORS, >>> .shost_attrs = ibmvfc_attrs, >>> .track_queue_depth = 1, >>> + .host_tagset = 1, >> >> This doesn't seem right. You are setting host_tagset, which means you want a >> shared, host wide, tag set for commands. It also means that the total >> queue depth for the host is can_queue. However, it looks like you are >> allocating >> max_requests events for each sub crq, which means you are over allocating >> memory. > > With the shared tagset yes the queue depth for the host is can_queue, but this > also implies that the max queue depth for each hw queue is also can_queue. So, > in the worst case that all commands are queued down the same hw queue we need > an > event pool with can_queue commands. > >> >> Looking at this closer, we might have bigger problems. There is a host wide >> max number of commands that the VFC host supports, which gets returned on >> NPIV Login. This value can change across a live migration event. > > From what I understand the max commands can only become less. > >> >> The ibmvfc driver, which does the same thing the lpfc driver does, modifies >> can_queue on the scsi_host *after* the tag set has been allocated. 
This looks >> to be a concern with ibmvfc, not sure about lpfc, as it doesn't look like >> we look at can_queue once the tag set is setup, and I'm not seeing a good way >> to dynamically change the host queue depth once the tag set is setup. >> >> Unless I'm missing something, our best options appear to either be to >> implement >> our own host wide busy reference counting, which doesn't sound very good, or >> we need to add some API to block / scsi that allows us to dynamically change >> can_queue. > > Changing can_queue won't do use any good with the shared tagset becasue each > queue still needs to be able to queue can_queue number of commands in the > worst > case. The issue I'm trying to highlight here is the following scenario: 1. We set shost->can_queue, then call scsi_add_host, which allocates the tag set. 2. On our NPIV login response from the VIOS, we might get a lower value than we initially set in shost->can_queue, so we update it, but nobody ever looks at it again, and we don't have any protection against sending too many commands to the host. Basically, we no longer have any code that ensures we don't send more commands to the VIOS than we are told it supports. According to the architecture, if we actually do this, the VIOS will do an h_free_crq, which would be a bit of a bug on our part. I don't think it was ever clearly defined in the API that a driver can change shost->can_queue after calling scsi_add_host, but up until commit 6eb045e092efefafc6687409a6fa6d1dabf0fb69, this worked and now it doesn't. I started looking through drivers that do this, and so far, it looks like the following drivers do: ibmvfc, lpfc, aix94xx, libfc, BusLogic, and likely others... We probably need an API that lets us change shost->can_queue dynamically. Thanks, Brian -- Brian King Power Linux I/O IBM Linux Technology Center
Re: [PATCH v4 01/21] ibmvfc: add vhost fields and defaults for MQ enablement
On 1/11/21 5:12 PM, Tyrel Datwyler wrote: > Introduce several new vhost fields for managing MQ state of the adapter > as well as initial defaults for MQ enablement. > > Signed-off-by: Tyrel Datwyler > --- > drivers/scsi/ibmvscsi/ibmvfc.c | 8 > drivers/scsi/ibmvscsi/ibmvfc.h | 9 + > 2 files changed, 17 insertions(+) > > diff --git a/drivers/scsi/ibmvscsi/ibmvfc.c b/drivers/scsi/ibmvscsi/ibmvfc.c > index ba95438a8912..9200fe49c57e 100644 > --- a/drivers/scsi/ibmvscsi/ibmvfc.c > +++ b/drivers/scsi/ibmvscsi/ibmvfc.c > @@ -3302,6 +3302,7 @@ static struct scsi_host_template driver_template = { > .max_sectors = IBMVFC_MAX_SECTORS, > .shost_attrs = ibmvfc_attrs, > .track_queue_depth = 1, > + .host_tagset = 1, This doesn't seem right. You are setting host_tagset, which means you want a shared, host wide, tag set for commands. It also means that the total queue depth for the host is can_queue. However, it looks like you are allocating max_requests events for each sub crq, which means you are over allocating memory. Looking at this closer, we might have bigger problems. There is a host wide max number of commands that the VFC host supports, which gets returned on NPIV Login. This value can change across a live migration event. The ibmvfc driver, which does the same thing the lpfc driver does, modifies can_queue on the scsi_host *after* the tag set has been allocated. This looks to be a concern with ibmvfc, not sure about lpfc, as it doesn't look like we look at can_queue once the tag set is setup, and I'm not seeing a good way to dynamically change the host queue depth once the tag set is setup. Unless I'm missing something, our best options appear to either be to implement our own host wide busy reference counting, which doesn't sound very good, or we need to add some API to block / scsi that allows us to dynamically change can_queue. Thanks, Brian -- Brian King Power Linux I/O IBM Linux Technology Center
Re: [PATCH v4 18/21] ibmvfc: send Cancel MAD down each hw scsi channel
On 1/11/21 5:12 PM, Tyrel Datwyler wrote: > diff --git a/drivers/scsi/ibmvscsi/ibmvfc.c b/drivers/scsi/ibmvscsi/ibmvfc.c > index b0b0212344f3..24e1278acfeb 100644 > --- a/drivers/scsi/ibmvscsi/ibmvfc.c > +++ b/drivers/scsi/ibmvscsi/ibmvfc.c > @@ -2418,18 +2418,79 @@ static struct ibmvfc_event *ibmvfc_init_tmf(struct > ibmvfc_queue *queue, > return evt; > } > > -/** > - * ibmvfc_cancel_all - Cancel all outstanding commands to the device > - * @sdev:scsi device to cancel commands > - * @type:type of error recovery being performed > - * > - * This sends a cancel to the VIOS for the specified device. This does > - * NOT send any abort to the actual device. That must be done separately. > - * > - * Returns: > - * 0 on success / other on failure > - **/ > -static int ibmvfc_cancel_all(struct scsi_device *sdev, int type) > +static int ibmvfc_cancel_all_mq(struct scsi_device *sdev, int type) > +{ > + struct ibmvfc_host *vhost = shost_priv(sdev->host); > + struct ibmvfc_event *evt, *found_evt, *temp; > + struct ibmvfc_queue *queues = vhost->scsi_scrqs.scrqs; > + unsigned long flags; > + int num_hwq, i; > + LIST_HEAD(cancelq); > + u16 status; > + > + ENTER; > + spin_lock_irqsave(vhost->host->host_lock, flags); > + num_hwq = vhost->scsi_scrqs.active_queues; > + for (i = 0; i < num_hwq; i++) { > + spin_lock(queues[i].q_lock); > + spin_lock(&queues[i].l_lock); > + found_evt = NULL; > + list_for_each_entry(evt, &queues[i].sent, queue_list) { > + if (evt->cmnd && evt->cmnd->device == sdev) { > + found_evt = evt; > + break; > + } > + } > + spin_unlock(&queues[i].l_lock); > + I really don't like the way the first for loop grabs all the q_locks, and then there is a second for loop that drops all these locks... 
I think this code would be cleaner if it looked like:

		if (found_evt && vhost->logged_in) {
			evt = ibmvfc_init_tmf(&queues[i], sdev, type);
			evt->sync_iu = &queues[i].cancel_rsp;
			ibmvfc_send_event(evt, vhost, default_timeout);
			list_add_tail(&evt->cancel, &cancelq);
		}
		spin_unlock(queues[i].q_lock);
	}

> +		if (!found_evt)
> +			continue;
> +
> +		if (vhost->logged_in) {
> +			evt = ibmvfc_init_tmf(&queues[i], sdev, type);
> +			evt->sync_iu = &queues[i].cancel_rsp;
> +			ibmvfc_send_event(evt, vhost, default_timeout);
> +			list_add_tail(&evt->cancel, &cancelq);
> +		}
> +	}
> +
> +	for (i = 0; i < num_hwq; i++)
> +		spin_unlock(queues[i].q_lock);
> +	spin_unlock_irqrestore(vhost->host->host_lock, flags);
> +
> +	if (list_empty(&cancelq)) {
> +		if (vhost->log_level > IBMVFC_DEFAULT_LOG_LEVEL)
> +			sdev_printk(KERN_INFO, sdev, "No events found to cancel\n");
> +		return 0;
> +	}
> +
> +	sdev_printk(KERN_INFO, sdev, "Cancelling outstanding commands.\n");
> +
> +	list_for_each_entry_safe(evt, temp, &cancelq, cancel) {
> +		wait_for_completion(&evt->comp);
> +		status = be16_to_cpu(evt->queue->cancel_rsp.mad_common.status);

You probably want a list_del(&evt->cancel) here.

> +		ibmvfc_free_event(evt);
> +
> +		if (status != IBMVFC_MAD_SUCCESS) {
> +			sdev_printk(KERN_WARNING, sdev, "Cancel failed with rc=%x\n", status);
> +			switch (status) {
> +			case IBMVFC_MAD_DRIVER_FAILED:
> +			case IBMVFC_MAD_CRQ_ERROR:
> +				/* Host adapter most likely going through reset, return success to
> +				 * the caller will wait for the command being cancelled to get returned
> +				 */
> +				break;
> +			default:
> +				break;

I think this default break needs to be a return -EIO.
> + } > + } > + } > + > + sdev_printk(KERN_INFO, sdev, "Successfully cancelled outstanding > commands\n"); > + return 0; > +} > + > +static int ibmvfc_cancel_all_sq(struct scsi_device *sdev, int type) > { > struct ibmvfc_host *vhost = shost_priv(sdev->host); > struct ibmvfc_event *evt, *found_evt; > @@ -2498,6 +2559,27 @@ static int ibmvfc_cancel_all(struct scsi_device *sdev, > int type) > return 0; > } > -- Brian King Power Linux I/O IBM Linux Technology Center
Re: [PATCH 1/1] mwifiex: Fix possible buffer overflows in mwifiex_config_scan
(Note: this is version 1; there's a later version posted, which does not have a v2 tag...) https://lore.kernel.org/linux-wireless/20201208150951.35866-1-ruc_zhangxiao...@163.com/ On Sat, Jan 9, 2021 at 7:11 AM Peter Seiderer wrote: > On Tue, 8 Dec 2020 20:45:23 +0800, Xiaohui Zhang > wrote: > > From: Zhang Xiaohui > > mwifiex_config_scan() calls memcpy() without checking > > the destination size may trigger a buffer overflower, > > which a local user could use to cause denial of service > > or the execution of arbitrary code. > > Fix it by putting the length check before calling memcpy(). > > > > Signed-off-by: Zhang Xiaohui > > --- > > drivers/net/wireless/marvell/mwifiex/scan.c | 2 ++ > > 1 file changed, 2 insertions(+) > > > > diff --git a/drivers/net/wireless/marvell/mwifiex/scan.c > > b/drivers/net/wireless/marvell/mwifiex/scan.c > > index c2a685f63..b1d90678f 100644 > > --- a/drivers/net/wireless/marvell/mwifiex/scan.c > > +++ b/drivers/net/wireless/marvell/mwifiex/scan.c > > @@ -930,6 +930,8 @@ mwifiex_config_scan(struct mwifiex_private *priv, > > "DIRECT-", 7)) > > wildcard_ssid_tlv->max_ssid_length = 0xfe; > > > > + if (ssid_len > 1) > > + ssid_len = 1; > > Why do your believe the available size is only '1'? A SSID is expected > to be of size IEEE80211_MAX_SSID_LE/32 and the wildcard_ssid_tlv pointer > is casted from tlv_pos (some lines above) which is a pointer/index into > scan_cfg_out->tlv_buf... I pointed out this discrepancy already, taking a slightly different approach: https://lore.kernel.org/linux-wireless/ca+asdxpvu5s0vm0aocyqln090u3bwa_nv358ywkpxuu223u...@mail.gmail.com/ > And the define (line 44) indicates there should be enough space for a SSID: > > 42 /* Memory needed to store a max number/size WildCard SSID TLV for a > firmware > 43 scan */ > 44 #define WILDCARD_SSID_TLV_MAX_SIZE \ > 45 (MWIFIEX_MAX_SSID_LIST_LENGTH * \ > 46 (sizeof(struct mwifiex_ie_types_wildcard_ssid_params) \ > 47 + IEEE80211_MAX_SSID_LEN)) Ah, good catch. 
So this may not be a true overflow issue at all, even if it's confusing and bad code. The "problem" is that this piece of the struct is variable-length, and unless we're going to dynamically resize/reallocate the whole buffer, we have to assume the initial allocation was large enough (and per your note, it is). > For sure something to improve here instead of using a confusing 'u8 ssid[1]' > in struct mwifiex_ie_types_wildcard_ssid_params... Yep. Brian
Re: [PATCH] xfs: Wake CIL push waiters more reliably
On Fri, Jan 08, 2021 at 11:56:57AM -0500, Brian Foster wrote: > On Fri, Jan 08, 2021 at 08:54:44AM +1100, Dave Chinner wrote: > > On Mon, Jan 04, 2021 at 11:23:53AM -0500, Brian Foster wrote: > > > On Thu, Dec 31, 2020 at 09:16:11AM +1100, Dave Chinner wrote: > > > > On Wed, Dec 30, 2020 at 12:56:27AM +0100, Donald Buczek wrote: > > > > > If the value goes below the limit while some threads are > > > > > already waiting but before the push worker gets to it, these threads > > > > > are > > > > > not woken. > > > > > > > > > > Always wake all CIL push waiters. Test with waitqueue_active() as an > > > > > optimization. This is possible, because we hold the xc_push_lock > > > > > spinlock, which prevents additions to the waitqueue. > > > > > > > > > > Signed-off-by: Donald Buczek > > > > > --- > > > > > fs/xfs/xfs_log_cil.c | 2 +- > > > > > 1 file changed, 1 insertion(+), 1 deletion(-) > > > > > > > > > > diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c > > > > > index b0ef071b3cb5..d620de8e217c 100644 > > > > > --- a/fs/xfs/xfs_log_cil.c > > > > > +++ b/fs/xfs/xfs_log_cil.c > > > > > @@ -670,7 +670,7 @@ xlog_cil_push_work( > > > > > /* > > > > >* Wake up any background push waiters now this context is > > > > > being pushed. > > > > >*/ > > > > > - if (ctx->space_used >= XLOG_CIL_BLOCKING_SPACE_LIMIT(log)) > > > > > + if (waitqueue_active(&cil->xc_push_wait)) > > > > > wake_up_all(&cil->xc_push_wait); > > > > > > > > That just smells wrong to me. It *might* be correct, but this > > > > condition should pair with the sleep condition, as space used by a > > > > CIL context should never actually decrease > > > > > > > > > > ... but I'm a little confused by this assertion. The shadow buffer > > > allocation code refers to the possibility of shadow buffers falling out > > > that are smaller than currently allocated buffers. 
Further, the > > > _insert_format_items() code appears to explicitly optimize for this > > > possibility by reusing the active buffer, subtracting the old size/count > > > values from the diff variables and then reformatting the latest > > > (presumably smaller) item to the lv. > > > > Individual items might shrink, but the overall transaction should > > grow. Think of a extent to btree conversion of an inode fork. THe > > data in the inode fork decreases from a list of extents to a btree > > root block pointer, so the inode item shrinks. But then we add a new > > btree root block that contains all the extents + the btree block > > header, and it gets rounded up to ithe 128 byte buffer logging chunk > > size. > > > > IOWs, while the inode item has decreased in size, the overall > > space consumed by the transaction has gone up and so the CIL ctx > > used_space should increase. Hence we can't just look at individual > > log items and whether they have decreased in size - we have to look > > at all the items in the transaction to understand how the space used > > in that transaction has changed. i.e. it's the aggregation of all > > items in the transaction that matter here, not so much the > > individual items. > > > > Ok, that makes more sense... > > > > Of course this could just be implementation detail. I haven't dug into > > > the details in the remainder of this thread and I don't have specific > > > examples off the top of my head, but perhaps based on the ability of > > > various structures to change formats and the ability of log vectors to > > > shrink in size, shouldn't we expect the possibility of a CIL context to > > > shrink in size as well? Just from poking around the CIL it seems like > > > the surrounding code supports it (xlog_cil_insert_items() checks len > 0 > > > for recalculating split res as well)... > > > > Yes, there may be situations where it decreases. 
It may be this is > > fine, but the assumption *I've made* in lots of the CIL push code is > > that ctx->used_space rarely, if ever, will go backwards. > > > > ... and rarely seems a bit more pragmatic than never. > FWIW, a cursory look at the inode size/format code (motivated by Donald's recent log dump that appear
Re: [PATCH] x86/vm86/32: Remove VM86_SCREEN_BITMAP support
On Fri, Jan 8, 2021 at 1:59 PM Andy Lutomirski wrote: > > The implementation was rather buggy. It unconditionally marked PTEs > read-only, even for VM_SHARED mappings. I'm not sure whether this is > actually a problem, but it certainly seems unwise. More importantly, it > released the mmap lock before flushing the TLB, which could allow a racing > CoW operation to falsely believe that the underlying memory was not > writable. > > I can't find any users at all of this mechanism, so just remove it. > > Cc: Andrea Arcangeli > Cc: Linux-MM > Cc: Jason Gunthorpe > Cc: x...@kernel.org > Cc: Linus Torvalds > Cc: Matthew Wilcox > Cc: Jann Horn > Cc: Jan Kara > Cc: Yu Zhao > Cc: Peter Xu > Signed-off-by: Andy Lutomirski > --- > arch/x86/include/uapi/asm/vm86.h | 2 +- > arch/x86/kernel/vm86_32.c| 55 ++-- > 2 files changed, 10 insertions(+), 47 deletions(-) > > diff --git a/arch/x86/include/uapi/asm/vm86.h > b/arch/x86/include/uapi/asm/vm86.h > index d2ee4e307ef8..50004fb4590d 100644 > --- a/arch/x86/include/uapi/asm/vm86.h > +++ b/arch/x86/include/uapi/asm/vm86.h > @@ -106,7 +106,7 @@ struct vm86_struct { > /* > * flags masks > */ > -#define VM86_SCREEN_BITMAP 0x0001 > +#define VM86_SCREEN_BITMAP 0x0001/* no longer supported */ > > struct vm86plus_info_struct { > unsigned long force_return_for_pic:1; > diff --git a/arch/x86/kernel/vm86_32.c b/arch/x86/kernel/vm86_32.c > index 764573de3996..28b9e8d511e1 100644 > --- a/arch/x86/kernel/vm86_32.c > +++ b/arch/x86/kernel/vm86_32.c > @@ -160,49 +160,6 @@ void save_v86_state(struct kernel_vm86_regs *regs, int > retval) > do_exit(SIGSEGV); > } > > -static void mark_screen_rdonly(struct mm_struct *mm) > -{ > - struct vm_area_struct *vma; > - spinlock_t *ptl; > - pgd_t *pgd; > - p4d_t *p4d; > - pud_t *pud; > - pmd_t *pmd; > - pte_t *pte; > - int i; > - > - mmap_write_lock(mm); > - pgd = pgd_offset(mm, 0xA); > - if (pgd_none_or_clear_bad(pgd)) > - goto out; > - p4d = p4d_offset(pgd, 0xA); > - if (p4d_none_or_clear_bad(p4d)) > - goto 
out; > - pud = pud_offset(p4d, 0xA); > - if (pud_none_or_clear_bad(pud)) > - goto out; > - pmd = pmd_offset(pud, 0xA); > - > - if (pmd_trans_huge(*pmd)) { > - vma = find_vma(mm, 0xA); > - split_huge_pmd(vma, pmd, 0xA); > - } > - if (pmd_none_or_clear_bad(pmd)) > - goto out; > - pte = pte_offset_map_lock(mm, pmd, 0xA, &ptl); > - for (i = 0; i < 32; i++) { > - if (pte_present(*pte)) > - set_pte(pte, pte_wrprotect(*pte)); > - pte++; > - } > - pte_unmap_unlock(pte, ptl); > -out: > - mmap_write_unlock(mm); > - flush_tlb_mm_range(mm, 0xA, 0xA + 32*PAGE_SIZE, PAGE_SHIFT, > false); > -} > - > - > - > static int do_vm86_irq_handling(int subfunction, int irqnumber); > static long do_sys_vm86(struct vm86plus_struct __user *user_vm86, bool plus); > > @@ -282,6 +239,15 @@ static long do_sys_vm86(struct vm86plus_struct __user > *user_vm86, bool plus) > offsetof(struct vm86_struct, int_revectored))) > return -EFAULT; > > + > + /* VM86_SCREEN_BITMAP had numerous bugs and appears to have no users. > */ > + if (v.flags & VM86_SCREEN_BITMAP) { > + char comm[TASK_COMM_LEN]; > + > + pr_info_once("vm86: '%s' uses VM86_SCREEN_BITMAP, which is no > longer supported\n", get_task_comm(comm, current)); > + return -EINVAL; > + } > + > memset(&vm86regs, 0, sizeof(vm86regs)); > > vm86regs.pt.bx = v.regs.ebx; > @@ -370,9 +336,6 @@ static long do_sys_vm86(struct vm86plus_struct __user > *user_vm86, bool plus) > update_task_stack(tsk); > preempt_enable(); > > - if (vm86->flags & VM86_SCREEN_BITMAP) > - mark_screen_rdonly(tsk->mm); > - > memcpy((struct kernel_vm86_regs *)regs, &vm86regs, sizeof(vm86regs)); > return regs->ax; > } You can also remove screen_bitmap from struct vm86 (the kernel internal structure), and the check_v8086_mode() function. -- Brian Gerst
Re: [PATCH] xfs: Wake CIL push waiters more reliably
On Fri, Jan 08, 2021 at 08:54:44AM +1100, Dave Chinner wrote: > On Mon, Jan 04, 2021 at 11:23:53AM -0500, Brian Foster wrote: > > On Thu, Dec 31, 2020 at 09:16:11AM +1100, Dave Chinner wrote: > > > On Wed, Dec 30, 2020 at 12:56:27AM +0100, Donald Buczek wrote: > > > > If the value goes below the limit while some threads are > > > > already waiting but before the push worker gets to it, these threads are > > > > not woken. > > > > > > > > Always wake all CIL push waiters. Test with waitqueue_active() as an > > > > optimization. This is possible, because we hold the xc_push_lock > > > > spinlock, which prevents additions to the waitqueue. > > > > > > > > Signed-off-by: Donald Buczek > > > > --- > > > > fs/xfs/xfs_log_cil.c | 2 +- > > > > 1 file changed, 1 insertion(+), 1 deletion(-) > > > > > > > > diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c > > > > index b0ef071b3cb5..d620de8e217c 100644 > > > > --- a/fs/xfs/xfs_log_cil.c > > > > +++ b/fs/xfs/xfs_log_cil.c > > > > @@ -670,7 +670,7 @@ xlog_cil_push_work( > > > > /* > > > > * Wake up any background push waiters now this context is > > > > being pushed. > > > > */ > > > > - if (ctx->space_used >= XLOG_CIL_BLOCKING_SPACE_LIMIT(log)) > > > > + if (waitqueue_active(&cil->xc_push_wait)) > > > > wake_up_all(&cil->xc_push_wait); > > > > > > That just smells wrong to me. It *might* be correct, but this > > > condition should pair with the sleep condition, as space used by a > > > CIL context should never actually decrease > > > > > > > ... but I'm a little confused by this assertion. The shadow buffer > > allocation code refers to the possibility of shadow buffers falling out > > that are smaller than currently allocated buffers. Further, the > > _insert_format_items() code appears to explicitly optimize for this > > possibility by reusing the active buffer, subtracting the old size/count > > values from the diff variables and then reformatting the latest > > (presumably smaller) item to the lv. 
> > Individual items might shrink, but the overall transaction should > grow. Think of an extent to btree conversion of an inode fork. The > data in the inode fork decreases from a list of extents to a btree > root block pointer, so the inode item shrinks. But then we add a new > btree root block that contains all the extents + the btree block > header, and it gets rounded up to the 128 byte buffer logging chunk > size. > > IOWs, while the inode item has decreased in size, the overall > space consumed by the transaction has gone up and so the CIL ctx > used_space should increase. Hence we can't just look at individual > log items and whether they have decreased in size - we have to look > at all the items in the transaction to understand how the space used > in that transaction has changed. i.e. it's the aggregation of all > items in the transaction that matter here, not so much the > individual items. > Ok, that makes more sense... > > Of course this could just be implementation detail. I haven't dug into > > the details in the remainder of this thread and I don't have specific > > examples off the top of my head, but perhaps based on the ability of > > various structures to change formats and the ability of log vectors to > > shrink in size, shouldn't we expect the possibility of a CIL context to > > shrink in size as well? Just from poking around the CIL it seems like > > the surrounding code supports it (xlog_cil_insert_items() checks len > 0 > > for recalculating split res as well)... > > Yes, there may be situations where it decreases. It may be this is > fine, but the assumption *I've made* in lots of the CIL push code is > that ctx->used_space rarely, if ever, will go backwards. > ... and rarely seems a bit more pragmatic than never. > e.g. we run the first transaction into the CIL, it steals the space > needed for the cil checkpoint headers for the transaction.
Then if > the space returned by the item formatting is negative (because it is > in the AIL and being relogged), the CIL checkpoint now doesn't have > the space reserved it needs to run a checkpoint. That transaction is > a sync transaction, so it forces the log, and now we push the CIL > without sufficient reservation to write out the log headers and the > items we just formatted > Hmmm... that seem
Re: [PATCH] xfs: Wake CIL push waiters more reliably
On Thu, Dec 31, 2020 at 09:16:11AM +1100, Dave Chinner wrote: > On Wed, Dec 30, 2020 at 12:56:27AM +0100, Donald Buczek wrote: > > Threads, which committed items to the CIL, wait in the xc_push_wait > > waitqueue when used_space in the push context goes over a limit. These > > threads need to be woken when the CIL is pushed. > > > > The CIL push worker tries to avoid the overhead of calling wake_all() > > when there are no waiters waiting. It does so by checking the same > > condition which caused the waits to happen. This, however, is > > unreliable, because ctx->space_used can actually decrease when items are > > recommitted. > > When does this happen? > > Do you have tracing showing the operation where the relogged item > has actually gotten smaller? By definition, relogging in the CIL > should only grow the size of the object in the CIL because it must > relog all the existing changes on top of the new changed being made > to the object. Hence the CIL reservation should only ever grow. > > IOWs, returning negative lengths from the formatting code is > unexpected and probably a bug and requires further investigation, > not papering over the occurrence with broadcast wakeups... > I agree that this warrants a bit more explanation and analysis before changing the current code... > > If the value goes below the limit while some threads are > > already waiting but before the push worker gets to it, these threads are > > not woken. > > > > Always wake all CIL push waiters. Test with waitqueue_active() as an > > optimization. This is possible, because we hold the xc_push_lock > > spinlock, which prevents additions to the waitqueue. 
> > > > Signed-off-by: Donald Buczek > > --- > > fs/xfs/xfs_log_cil.c | 2 +- > > 1 file changed, 1 insertion(+), 1 deletion(-) > > > > diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c > > index b0ef071b3cb5..d620de8e217c 100644 > > --- a/fs/xfs/xfs_log_cil.c > > +++ b/fs/xfs/xfs_log_cil.c > > @@ -670,7 +670,7 @@ xlog_cil_push_work( > > /* > > * Wake up any background push waiters now this context is being pushed. > > */ > > - if (ctx->space_used >= XLOG_CIL_BLOCKING_SPACE_LIMIT(log)) > > + if (waitqueue_active(&cil->xc_push_wait)) > > wake_up_all(&cil->xc_push_wait); > > That just smells wrong to me. It *might* be correct, but this > condition should pair with the sleep condition, as space used by a > CIL context should never actually decrease > ... but I'm a little confused by this assertion. The shadow buffer allocation code refers to the possibility of shadow buffers falling out that are smaller than currently allocated buffers. Further, the _insert_format_items() code appears to explicitly optimize for this possibility by reusing the active buffer, subtracting the old size/count values from the diff variables and then reformatting the latest (presumably smaller) item to the lv. Of course this could just be implementation detail. I haven't dug into the details in the remainder of this thread and I don't have specific examples off the top of my head, but perhaps based on the ability of various structures to change formats and the ability of log vectors to shrink in size, shouldn't we expect the possibility of a CIL context to shrink in size as well? Just from poking around the CIL it seems like the surrounding code supports it (xlog_cil_insert_items() checks len > 0 for recalculating split res as well)... Brian > Cheers, > > Dave. > -- > Dave Chinner > da...@fromorbit.com >
Re: [PATCH 1/4] ARM: dts: qcom: msm8974: add gpu support
On Wed, Dec 30, 2020 at 05:51:29PM +0200, Iskren Chernev wrote: > From: Brian Masney > > Add support for the a3xx GPU > > Signed-off-by: Brian Masney Reviewed-by: Brian Masney
Re: [PATCH 1/2] drm/msm: Call msm_init_vram before binding the gpu
On Wed, Dec 30, 2020 at 05:29:42PM +0200, Iskren Chernev wrote: > From: Craig Tatlor > > vram.size is needed when binding a gpu without an iommu and is defined > in msm_init_vram(), so run that before binding it. > > Signed-off-by: Craig Tatlor For the series: Reviewed-by: Brian Masney Next time, please include a cover letter so that tags added to the cover letter can be applied to all patches in the series via patchwork. Brian
Re: [PATCH 2/6] mm/mremap: For MREMAP_DONTUNMAP check security_vm_enough_memory_mm()
Ah, you're right. This individual patch looks good to me. Brian On Mon, Dec 28, 2020 at 11:12 AM Dmitry Safonov wrote: > > On 12/28/20 6:21 PM, Brian Geffon wrote: > > This looks good to me with a small comment. > > > >> if (do_munmap(mm, old_addr, old_len, uf_unmap) < 0) { > >> /* OOM: unable to split vma, just get accounts right */ > >> - if (vm_flags & VM_ACCOUNT) > >> + if (vm_flags & VM_ACCOUNT && !(flags & MREMAP_DONTUNMAP)) > >> vm_acct_memory(new_len >> PAGE_SHIFT); > > > > Checking MREMAP_DONTUNMAP in the do_munmap path is unnecessary as > > MREMAP_DONTUNMAP will have already returned by this point. > > In this code it is also used as err-path. In case move_page_tables() > fails to move all page tables or .mremap() callback fails, the new VMA > is unmapped. > > IOW, MREMAP_DONTUNMAP returns under: > : if (unlikely(!err && (flags & MREMAP_DONTUNMAP))) { > > -- > Dima
Re: [PATCH 2/6] mm/mremap: For MREMAP_DONTUNMAP check security_vm_enough_memory_mm()
This looks good to me with a small comment. > if (do_munmap(mm, old_addr, old_len, uf_unmap) < 0) { > /* OOM: unable to split vma, just get accounts right */ > - if (vm_flags & VM_ACCOUNT) > + if (vm_flags & VM_ACCOUNT && !(flags & MREMAP_DONTUNMAP)) > vm_acct_memory(new_len >> PAGE_SHIFT); Checking MREMAP_DONTUNMAP in the do_munmap path is unnecessary as MREMAP_DONTUNMAP will have already returned by this point.
Re: [PATCH 3/6] mremap: Don't allow MREMAP_DONTUNMAP on special_mappings and aio
I don't think this situation can ever happen MREMAP_DONTUNMAP is already restricted to anonymous mappings (defined as not having vm_ops) and vma_to_resize checks that the mapping is anonymous before move_vma is called. On Mon, Oct 12, 2020 at 6:34 PM Dmitry Safonov wrote: > > As kernel expect to see only one of such mappings, any further > operations on the VMA-copy may be unexpected by the kernel. > Maybe it's being on the safe side, but there doesn't seem to be any > expected use-case for this, so restrict it now. > > Fixes: commit e346b3813067 ("mm/mremap: add MREMAP_DONTUNMAP to mremap()") > Signed-off-by: Dmitry Safonov > --- > arch/x86/kernel/cpu/resctrl/pseudo_lock.c | 2 +- > fs/aio.c | 5 - > include/linux/mm.h| 2 +- > mm/mmap.c | 6 +- > mm/mremap.c | 2 +- > 5 files changed, 12 insertions(+), 5 deletions(-) > > diff --git a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c > b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c > index 0daf2f1cf7a8..e916646adc69 100644 > --- a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c > +++ b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c > @@ -1458,7 +1458,7 @@ static int pseudo_lock_dev_release(struct inode *inode, > struct file *filp) > return 0; > } > > -static int pseudo_lock_dev_mremap(struct vm_area_struct *area) > +static int pseudo_lock_dev_mremap(struct vm_area_struct *area, unsigned long > flags) > { > /* Not supported */ > return -EINVAL; > diff --git a/fs/aio.c b/fs/aio.c > index d5ec30385566..3be3c0f77548 100644 > --- a/fs/aio.c > +++ b/fs/aio.c > @@ -324,13 +324,16 @@ static void aio_free_ring(struct kioctx *ctx) > } > } > > -static int aio_ring_mremap(struct vm_area_struct *vma) > +static int aio_ring_mremap(struct vm_area_struct *vma, unsigned long flags) > { > struct file *file = vma->vm_file; > struct mm_struct *mm = vma->vm_mm; > struct kioctx_table *table; > int i, res = -EINVAL; > > + if (flags & MREMAP_DONTUNMAP) > + return -EINVAL; > + > spin_lock(&mm->ioctx_lock); > rcu_read_lock(); > table = 
rcu_dereference(mm->ioctx_table); > diff --git a/include/linux/mm.h b/include/linux/mm.h > index 16b799a0522c..fd51a4a1f722 100644 > --- a/include/linux/mm.h > +++ b/include/linux/mm.h > @@ -550,7 +550,7 @@ struct vm_operations_struct { > void (*open)(struct vm_area_struct * area); > void (*close)(struct vm_area_struct * area); > int (*split)(struct vm_area_struct * area, unsigned long addr); > - int (*mremap)(struct vm_area_struct * area); > + int (*mremap)(struct vm_area_struct *area, unsigned long flags); > vm_fault_t (*fault)(struct vm_fault *vmf); > vm_fault_t (*huge_fault)(struct vm_fault *vmf, > enum page_entry_size pe_size); > diff --git a/mm/mmap.c b/mm/mmap.c > index bdd19f5b994e..50f853b0ec39 100644 > --- a/mm/mmap.c > +++ b/mm/mmap.c > @@ -3372,10 +3372,14 @@ static const char *special_mapping_name(struct > vm_area_struct *vma) > return ((struct vm_special_mapping *)vma->vm_private_data)->name; > } > > -static int special_mapping_mremap(struct vm_area_struct *new_vma) > +static int special_mapping_mremap(struct vm_area_struct *new_vma, > + unsigned long flags) > { > struct vm_special_mapping *sm = new_vma->vm_private_data; > > + if (flags & MREMAP_DONTUNMAP) > + return -EINVAL; > + > if (WARN_ON_ONCE(current->mm != new_vma->vm_mm)) > return -EFAULT; > > diff --git a/mm/mremap.c b/mm/mremap.c > index c248f9a52125..898e9818ba6d 100644 > --- a/mm/mremap.c > +++ b/mm/mremap.c > @@ -384,7 +384,7 @@ static unsigned long move_vma(struct vm_area_struct *vma, > if (moved_len < old_len) { > err = -ENOMEM; > } else if (vma->vm_ops && vma->vm_ops->mremap) { > - err = vma->vm_ops->mremap(new_vma); > + err = vma->vm_ops->mremap(new_vma, flags); > } > > if (unlikely(err)) { > -- > 2.28.0 >
[tip: x86/urgent] fanotify: Fix sys_fanotify_mark() on native x86-32
The following commit has been merged into the x86/urgent branch of tip: Commit-ID: 2ca408d9c749c32288bc28725f9f12ba30299e8f Gitweb: https://git.kernel.org/tip/2ca408d9c749c32288bc28725f9f12ba30299e8f Author:Brian Gerst AuthorDate:Mon, 30 Nov 2020 17:30:59 -05:00 Committer: Borislav Petkov CommitterDate: Mon, 28 Dec 2020 11:58:59 +01:00 fanotify: Fix sys_fanotify_mark() on native x86-32 Commit 121b32a58a3a ("x86/entry/32: Use IA32-specific wrappers for syscalls taking 64-bit arguments") converted native x86-32 which take 64-bit arguments to use the compat handlers to allow conversion to passing args via pt_regs. sys_fanotify_mark() was however missed, as it has a general compat handler. Add a config option that will use the syscall wrapper that takes the split args for native 32-bit. [ bp: Fix typo in Kconfig help text. ] Fixes: 121b32a58a3a ("x86/entry/32: Use IA32-specific wrappers for syscalls taking 64-bit arguments") Reported-by: Paweł Jasiak Signed-off-by: Brian Gerst Signed-off-by: Borislav Petkov Acked-by: Jan Kara Acked-by: Andy Lutomirski Link: https://lkml.kernel.org/r/20201130223059.101286-1-brge...@gmail.com --- arch/Kconfig | 6 ++ arch/x86/Kconfig | 1 + fs/notify/fanotify/fanotify_user.c | 17 +++-- include/linux/syscalls.h | 24 4 files changed, 38 insertions(+), 10 deletions(-) diff --git a/arch/Kconfig b/arch/Kconfig index 78c6f05..24862d1 100644 --- a/arch/Kconfig +++ b/arch/Kconfig @@ -1105,6 +1105,12 @@ config HAVE_ARCH_PFN_VALID config ARCH_SUPPORTS_DEBUG_PAGEALLOC bool +config ARCH_SPLIT_ARG64 + bool + help + If a 32-bit architecture requires 64-bit arguments to be split into + pairs of 32-bit arguments, select this option. 
+ source "kernel/gcov/Kconfig" source "scripts/gcc-plugins/Kconfig" diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index 7b6dd10..21f8511 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -19,6 +19,7 @@ config X86_32 select KMAP_LOCAL select MODULES_USE_ELF_REL select OLD_SIGACTION + select ARCH_SPLIT_ARG64 config X86_64 def_bool y diff --git a/fs/notify/fanotify/fanotify_user.c b/fs/notify/fanotify/fanotify_user.c index 3e01d8f..dcab112 100644 --- a/fs/notify/fanotify/fanotify_user.c +++ b/fs/notify/fanotify/fanotify_user.c @@ -1285,26 +1285,23 @@ fput_and_out: return ret; } +#ifndef CONFIG_ARCH_SPLIT_ARG64 SYSCALL_DEFINE5(fanotify_mark, int, fanotify_fd, unsigned int, flags, __u64, mask, int, dfd, const char __user *, pathname) { return do_fanotify_mark(fanotify_fd, flags, mask, dfd, pathname); } +#endif -#ifdef CONFIG_COMPAT -COMPAT_SYSCALL_DEFINE6(fanotify_mark, +#if defined(CONFIG_ARCH_SPLIT_ARG64) || defined(CONFIG_COMPAT) +SYSCALL32_DEFINE6(fanotify_mark, int, fanotify_fd, unsigned int, flags, - __u32, mask0, __u32, mask1, int, dfd, + SC_ARG64(mask), int, dfd, const char __user *, pathname) { - return do_fanotify_mark(fanotify_fd, flags, -#ifdef __BIG_ENDIAN - ((__u64)mask0 << 32) | mask1, -#else - ((__u64)mask1 << 32) | mask0, -#endif -dfd, pathname); + return do_fanotify_mark(fanotify_fd, flags, SC_VAL64(__u64, mask), + dfd, pathname); } #endif diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h index f3929af..7688bc9 100644 --- a/include/linux/syscalls.h +++ b/include/linux/syscalls.h @@ -251,6 +251,30 @@ static inline int is_syscall_trace_event(struct trace_event_call *tp_event) static inline long __do_sys##name(__MAP(x,__SC_DECL,__VA_ARGS__)) #endif /* __SYSCALL_DEFINEx */ +/* For split 64-bit arguments on 32-bit architectures */ +#ifdef __LITTLE_ENDIAN +#define SC_ARG64(name) u32, name##_lo, u32, name##_hi +#else +#define SC_ARG64(name) u32, name##_hi, u32, name##_lo +#endif +#define SC_VAL64(type, name) ((type) name##_hi << 32 | 
name##_lo) + +#ifdef CONFIG_COMPAT +#define SYSCALL32_DEFINE1 COMPAT_SYSCALL_DEFINE1 +#define SYSCALL32_DEFINE2 COMPAT_SYSCALL_DEFINE2 +#define SYSCALL32_DEFINE3 COMPAT_SYSCALL_DEFINE3 +#define SYSCALL32_DEFINE4 COMPAT_SYSCALL_DEFINE4 +#define SYSCALL32_DEFINE5 COMPAT_SYSCALL_DEFINE5 +#define SYSCALL32_DEFINE6 COMPAT_SYSCALL_DEFINE6 +#else +#define SYSCALL32_DEFINE1 SYSCALL_DEFINE1 +#define SYSCALL32_DEFINE2 SYSCALL_DEFINE2 +#define SYSCALL32_DEFINE3 SYSCALL_DEFINE3 +#define SYSCALL32_DEFINE4 SYSCALL_DEFINE4 +#define SYSCALL32_DEFINE5 SYSCALL_DEFINE5 +#define SYSCALL32_DEFINE6 SYSCALL_DEFINE6 +#endif + /* * Called before coming back to user-mode. Returning to user
Re: v5.10.1 xfs deadlock
On Thu, Dec 17, 2020 at 10:30:37PM +0100, Donald Buczek wrote: > On 17.12.20 20:43, Brian Foster wrote: > > On Thu, Dec 17, 2020 at 06:44:51PM +0100, Donald Buczek wrote: > > > Dear xfs developer, > > > > > > I was doing some testing on a Linux 5.10.1 system with two 100 TB xfs > > > filesystems on md raid6 raids. > > > > > > The stress test was essentially `cp -a`ing a Linux source repository with > > > two threads in parallel on each filesystem. > > > > > > After about on hour, the processes to one filesystem (md1) blocked, 30 > > > minutes later the process to the other filesystem (md0) did. > > > > > > root 7322 2167 0 Dec16 pts/100:00:06 cp -a > > > /jbod/M8068/scratch/linux /jbod/M8068/scratch/1/linux.018.TMP > > > root 7329 2169 0 Dec16 pts/100:00:05 cp -a > > > /jbod/M8068/scratch/linux /jbod/M8068/scratch/2/linux.019.TMP > > > root 13856 2170 0 Dec16 pts/100:00:08 cp -a > > > /jbod/M8067/scratch/linux /jbod/M8067/scratch/2/linux.028.TMP > > > root 13899 2168 0 Dec16 pts/100:00:05 cp -a > > > /jbod/M8067/scratch/linux /jbod/M8067/scratch/1/linux.027.TMP > > > Do you have any indication of whether these workloads actually hung or just became incredibly slow? 
> > > Some info from the system (all stack traces, slabinfo) is available here: > > > https://owww.molgen.mpg.de/~buczek/2020-12-16.info.txt > > > > > > It stands out, that there are many (549 for md0, but only 10 for md1) > > > "xfs-conv" threads all with stacks like this > > > > > > [<0>] xfs_log_commit_cil+0x6cc/0x7c0 > > > [<0>] __xfs_trans_commit+0xab/0x320 > > > [<0>] xfs_iomap_write_unwritten+0xcb/0x2e0 > > > [<0>] xfs_end_ioend+0xc6/0x110 > > > [<0>] xfs_end_io+0xad/0xe0 > > > [<0>] process_one_work+0x1dd/0x3e0 > > > [<0>] worker_thread+0x2d/0x3b0 > > > [<0>] kthread+0x118/0x130 > > > [<0>] ret_from_fork+0x22/0x30 > > > > > > xfs_log_commit_cil+0x6cc is > > > > > >xfs_log_commit_cil() > > > xlog_cil_push_background(log) > > >xlog_wait(&cil->xc_push_wait, &cil->xc_push_lock); > > > This looks like the transaction commit throttling code. That was introduced earlier this year in v5.7 via commit 0e7ab7efe7745 ("xfs: Throttle commits on delayed background CIL push"). The purpose of that change was to prevent the CIL from growing too large. FWIW, I don't recall that being a functional problem so it should be possible to simply remove that blocking point and see if that avoids the problem or if we simply stall out somewhere else, if you wanted to give that a test. 
> > > Some other threads, including the four "cp" commands are also blocking at > > > xfs_log_commit_cil+0x6cc > > > > > > There are also single "flush" process for each md device with this stack > > > signature: > > > > > > [<0>] xfs_map_blocks+0xbf/0x400 > > > [<0>] iomap_do_writepage+0x15e/0x880 > > > [<0>] write_cache_pages+0x175/0x3f0 > > > [<0>] iomap_writepages+0x1c/0x40 > > > [<0>] xfs_vm_writepages+0x59/0x80 > > > [<0>] do_writepages+0x4b/0xe0 > > > [<0>] __writeback_single_inode+0x42/0x300 > > > [<0>] writeback_sb_inodes+0x198/0x3f0 > > > [<0>] __writeback_inodes_wb+0x5e/0xc0 > > > [<0>] wb_writeback+0x246/0x2d0 > > > [<0>] wb_workfn+0x26e/0x490 > > > [<0>] process_one_work+0x1dd/0x3e0 > > > [<0>] worker_thread+0x2d/0x3b0 > > > [<0>] kthread+0x118/0x130 > > > [<0>] ret_from_fork+0x22/0x30 > > > Is writeback still blocked as such or was this just a transient stack? > > > xfs_map_blocks+0xbf is the > > > > > > xfs_ilock(ip, XFS_ILOCK_SHARED); > > > > > > in xfs_map_blocks(). > > > > > > The system is low on free memory > > > > > > MemTotal: 197587764 kB > > > MemFree: 2196496 kB > > > MemAvailable: 189895408 kB > > > > > > but responsive. > > > > > > I have an out of tree driver for the HBA ( smartpqi 2.1.6-005 pulled from > > > linux-scsi) , but it is unlikely that this blocking is related
Re: [PATCH 0/3] mac80211: Trigger disconnect for STA during recovery
On Thu, Dec 17, 2020 at 2:57 PM Ben Greear wrote: > On 12/17/20 2:24 PM, Brian Norris wrote: > > I'd also note that we don't operate in AP mode -- only STA -- and IIRC > > Ben, you've complained about AP mode in the past. > > I complain about all sorts of things, but I'm usually running > station mode :) Hehe, fair :) Maybe I'm mixed up. But I do get the feeling that specifically within the ath10k family, there are wildly different use cases (mobile, PC, AP) and chips (and firmware) that tend to go along with them, and that those use cases get a fairly different population of {developers, testers, reporters}. So claiming "feature X works" pretty much always has to be couched in which chips, firmware, and use case. And there's certainly some wisdom in these sections: https://wireless.wiki.kernel.org/en/users/drivers/ath10k/submittingpatches#hardware_families https://wireless.wiki.kernel.org/en/users/drivers/ath10k/submittingpatches#tested-on_tag > Do you actually see iwlwifi stations stay associated through > firmware crashes? Yes. > Anyway, happy to hear some have seamless recovery, and in that case, > I have no objections to the patch. OK! I hope I'm not the only one with such results, because then I still might question my sanity (and test coverage), but that's still my understanding. BTW, I haven't yet closely reviewed the patch series myself, but I ACK the concept. Brian
Re: [PATCH 0/3] mac80211: Trigger disconnect for STA during recovery
On Tue, Dec 15, 2020 at 10:51:13PM +0530, Youghandhar Chintala wrote: > From: Rakesh Pillai I meant to mention in my other reply: the threading on this series is broken (as in, it doesn't exist). It looks like you're using git-send-email (good!), but somehow it doesn't have any In-Reply-To or References (bad!). Did you send all your mail in one invocation, or did you send them as separate git-send-email commands? Anyway, please investigate what went wrong so you can get this right in the future. For one, this affects Patchwork's ability to group patch series (not to mention everybody who uses a decent mail reader, with proper threading). See for example the lore archive, which is only threading replies to this cover letter: https://lore.kernel.org/linux-wireless/20201215172113.5038-1-yough...@codeaurora.org/ Regards, Brian
Re: [PATCH 0/3] mac80211: Trigger disconnect for STA during recovery
On Tue, Dec 15, 2020 at 10:23:33AM -0800, Ben Greear wrote: > On 12/15/20 9:21 AM, Youghandhar Chintala wrote: > > From: Rakesh Pillai > > > > Currently in case of target hardware restart ,we just reconfig and > > re-enable the security keys and enable the network queues to start > > data traffic back from where it was interrupted. > > Are there any known mac80211 radios/drivers that *can* support seamless > restarts? > > If not, then just could always enable this feature in mac80211? I'm quite sure that iwlwifi intentionally supports a seamless restart. From my experience with dealing with user reports, I don't recall any issues where restart didn't function as expected, unless there was some deeper underlying failure (e.g., hardware/power failure; driver bugs / lockups). I don't have very good stats for ath10k/QCA6174, but it survives our testing OK and I again don't recall any user-reported complaints in this area. I'd say this is a weaker example though, as I don't have as clear of data. (By contrast, ath10k/WCN399x, which Rakesh, et al, are patching here, does not pass our tests at all, and clearly fails to recover from "seamless" restarts, as noted in patch 3.) I'd also note that we don't operate in AP mode -- only STA -- and IIRC Ben, you've complained about AP mode in the past. Brian
Re: v5.10.1 xfs deadlock
On Thu, Dec 17, 2020 at 06:44:51PM +0100, Donald Buczek wrote:
> Dear xfs developers,
>
> I was doing some testing on a Linux 5.10.1 system with two 100 TB xfs
> filesystems on md raid6 raids.
>
> The stress test was essentially `cp -a`ing a Linux source repository with
> two threads in parallel on each filesystem.
>
> After about an hour, the processes to one filesystem (md1) blocked, 30
> minutes later the processes to the other filesystem (md0) did.
>
> root      7322  2167  0 Dec16 pts/1    00:00:06 cp -a /jbod/M8068/scratch/linux /jbod/M8068/scratch/1/linux.018.TMP
> root      7329  2169  0 Dec16 pts/1    00:00:05 cp -a /jbod/M8068/scratch/linux /jbod/M8068/scratch/2/linux.019.TMP
> root     13856  2170  0 Dec16 pts/1    00:00:08 cp -a /jbod/M8067/scratch/linux /jbod/M8067/scratch/2/linux.028.TMP
> root     13899  2168  0 Dec16 pts/1    00:00:05 cp -a /jbod/M8067/scratch/linux /jbod/M8067/scratch/1/linux.027.TMP
>
> Some info from the system (all stack traces, slabinfo) is available here:
> https://owww.molgen.mpg.de/~buczek/2020-12-16.info.txt
>
> It stands out that there are many (549 for md0, but only 10 for md1)
> "xfs-conv" threads, all with stacks like this:
>
> [<0>] xfs_log_commit_cil+0x6cc/0x7c0
> [<0>] __xfs_trans_commit+0xab/0x320
> [<0>] xfs_iomap_write_unwritten+0xcb/0x2e0
> [<0>] xfs_end_ioend+0xc6/0x110
> [<0>] xfs_end_io+0xad/0xe0
> [<0>] process_one_work+0x1dd/0x3e0
> [<0>] worker_thread+0x2d/0x3b0
> [<0>] kthread+0x118/0x130
> [<0>] ret_from_fork+0x22/0x30
>
> xfs_log_commit_cil+0x6cc is
>
> xfs_log_commit_cil()
>   xlog_cil_push_background(log)
>     xlog_wait(&cil->xc_push_wait, &cil->xc_push_lock);
>
> Some other threads, including the four "cp" commands, are also blocking at
> xfs_log_commit_cil+0x6cc.
>
> There is also a single "flush" process for each md device with this stack
> signature:
>
> [<0>] xfs_map_blocks+0xbf/0x400
> [<0>] iomap_do_writepage+0x15e/0x880
> [<0>] write_cache_pages+0x175/0x3f0
> [<0>] iomap_writepages+0x1c/0x40
> [<0>] xfs_vm_writepages+0x59/0x80
> [<0>] do_writepages+0x4b/0xe0
> [<0>] __writeback_single_inode+0x42/0x300
> [<0>] writeback_sb_inodes+0x198/0x3f0
> [<0>] __writeback_inodes_wb+0x5e/0xc0
> [<0>] wb_writeback+0x246/0x2d0
> [<0>] wb_workfn+0x26e/0x490
> [<0>] process_one_work+0x1dd/0x3e0
> [<0>] worker_thread+0x2d/0x3b0
> [<0>] kthread+0x118/0x130
> [<0>] ret_from_fork+0x22/0x30
>
> xfs_map_blocks+0xbf is the
>
> xfs_ilock(ip, XFS_ILOCK_SHARED);
>
> in xfs_map_blocks().
>
> The system is low on free memory
>
> MemTotal:       197587764 kB
> MemFree:          2196496 kB
> MemAvailable:   189895408 kB
>
> but responsive.
>
> I have an out-of-tree driver for the HBA (smartpqi 2.1.6-005 pulled from
> linux-scsi), but it is unlikely that this blocking is related to that,
> because the md block devices themselves are responsive (`xxd /dev/md0`).
>
> I can keep the system in this state for a while. Is there an idea what was
> going on, or an idea what data I could collect from the running system to
> help? I have full debug info and could walk lists or retrieve data
> structures with gdb.

It might be useful to dump the values under /sys/fs/xfs/<dev>/log/* for each
fs to get an idea of the state of the logs as well...

Brian

> Best
>   Donald
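Brian's suggestion above can be scripted. The following is a minimal sketch, assuming the /sys/fs/xfs/<dev>/log/* sysfs layout present in v5.10 kernels; the helper name dump_xfs_log_state is illustrative, not an existing tool:

```shell
#!/bin/sh
# Print every per-filesystem XFS log attribute as a "path: value" pair.
# An alternate base directory may be passed for testing; it defaults to
# the real sysfs location.
dump_xfs_log_state() {
	base="${1:-/sys/fs/xfs}"
	for f in "$base"/*/log/*; do
		[ -f "$f" ] || continue	# skip if the glob matched nothing
		printf '%s: %s\n' "$f" "$(cat "$f")"
	done
}
```

Running it on the hung system should show the log head/tail and grant-head state for both md0 and md1, which is the information Brian is after.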
[PATCH net-next v2 4/4] net: indirect call helpers for ipv4/ipv6 dst_check functions
From: brianvv

This patch avoids the indirect call for the common case:
ip6_dst_check and ipv4_dst_check

Signed-off-by: brianvv
---
 include/net/dst.h   |  7 ++-
 net/core/sock.c     | 12 ++--
 net/ipv4/route.c    |  7 +--
 net/ipv4/tcp_ipv4.c |  5 -
 net/ipv6/route.c    |  7 +--
 net/ipv6/tcp_ipv6.c |  5 -
 6 files changed, 34 insertions(+), 9 deletions(-)

diff --git a/include/net/dst.h b/include/net/dst.h
index 9f474a79ed7d..26f134ad3a25 100644
--- a/include/net/dst.h
+++ b/include/net/dst.h
@@ -459,10 +459,15 @@ static inline int dst_input(struct sk_buff *skb)
 			   ip6_input, ip_local_deliver, skb);
 }
 
+INDIRECT_CALLABLE_DECLARE(struct dst_entry *ip6_dst_check(struct dst_entry *,
+							  u32));
+INDIRECT_CALLABLE_DECLARE(struct dst_entry *ipv4_dst_check(struct dst_entry *,
+							   u32));
 static inline struct dst_entry *dst_check(struct dst_entry *dst, u32 cookie)
 {
 	if (dst->obsolete)
-		dst = dst->ops->check(dst, cookie);
+		dst = INDIRECT_CALL_INET(dst->ops->check, ip6_dst_check,
+					 ipv4_dst_check, dst, cookie);
 
 	return dst;
 }
diff --git a/net/core/sock.c b/net/core/sock.c
index 4fd7e785f177..753b831a9d70 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -526,11 +526,17 @@ int __sk_receive_skb(struct sock *sk, struct sk_buff *skb,
 }
 EXPORT_SYMBOL(__sk_receive_skb);
 
+INDIRECT_CALLABLE_DECLARE(struct dst_entry *ip6_dst_check(struct dst_entry *,
+							  u32));
+INDIRECT_CALLABLE_DECLARE(struct dst_entry *ipv4_dst_check(struct dst_entry *,
+							   u32));
 struct dst_entry *__sk_dst_check(struct sock *sk, u32 cookie)
 {
 	struct dst_entry *dst = __sk_dst_get(sk);
 
-	if (dst && dst->obsolete && dst->ops->check(dst, cookie) == NULL) {
+	if (dst && dst->obsolete &&
+	    INDIRECT_CALL_INET(dst->ops->check, ip6_dst_check, ipv4_dst_check,
+			       dst, cookie) == NULL) {
 		sk_tx_queue_clear(sk);
 		sk->sk_dst_pending_confirm = 0;
 		RCU_INIT_POINTER(sk->sk_dst_cache, NULL);
@@ -546,7 +552,9 @@ struct dst_entry *sk_dst_check(struct sock *sk, u32 cookie)
 {
 	struct dst_entry *dst = sk_dst_get(sk);
 
-	if (dst && dst->obsolete && dst->ops->check(dst, cookie) == NULL) {
+	if (dst && dst->obsolete &&
+	    INDIRECT_CALL_INET(dst->ops->check, ip6_dst_check, ipv4_dst_check,
+			       dst, cookie) == NULL) {
 		sk_dst_reset(sk);
 		dst_release(dst);
 		return NULL;
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 4fac91f8bd6c..9e6537709794 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -133,7 +133,8 @@ static int ip_rt_gc_timeout __read_mostly = RT_GC_TIMEOUT;
  *	Interface to generic destination cache.
  */
 
-static struct dst_entry *ipv4_dst_check(struct dst_entry *dst, u32 cookie);
+INDIRECT_CALLABLE_SCOPE
+struct dst_entry	*ipv4_dst_check(struct dst_entry *dst, u32 cookie);
 static unsigned int	 ipv4_default_advmss(const struct dst_entry *dst);
 INDIRECT_CALLABLE_SCOPE unsigned int	ipv4_mtu(const struct dst_entry *dst);
@@ -1188,7 +1189,8 @@ void ipv4_sk_redirect(struct sk_buff *skb, struct sock *sk)
 }
 EXPORT_SYMBOL_GPL(ipv4_sk_redirect);
 
-static struct dst_entry *ipv4_dst_check(struct dst_entry *dst, u32 cookie)
+INDIRECT_CALLABLE_SCOPE struct dst_entry *ipv4_dst_check(struct dst_entry *dst,
+							 u32 cookie)
 {
 	struct rtable *rt = (struct rtable *) dst;
 
@@ -1204,6 +1206,7 @@ static struct dst_entry *ipv4_dst_check(struct dst_entry *dst, u32 cookie)
 		return NULL;
 	return dst;
 }
+EXPORT_SYMBOL(ipv4_dst_check);
 
 static void ipv4_send_dest_unreach(struct sk_buff *skb)
 {
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index af2338294598..aba5061024c7 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1646,6 +1646,8 @@ u16 tcp_v4_get_syncookie(struct sock *sk, struct iphdr *iph,
 	return mss;
 }
 
+INDIRECT_CALLABLE_DECLARE(struct dst_entry *ipv4_dst_check(struct dst_entry *,
+							   u32));
 /* The socket must have it's spinlock held when we get
  * here, unless it is a TCP_LISTEN socket.
  *
@@ -1665,7 +1667,8 @@ int tcp_v4_do_rcv(struct sock *sk, struct sk_buff *skb)
 		sk_mark_napi_id(sk, skb);
 		if (dst) {
 			if (inet_sk(sk)->rx_dst_ifindex != skb->skb_iif ||
-			    !dst->ops->check(dst, 0)) {
+			    !INDIRECT_CALL_1(dst->ops->check, ipv4_dst_check,
+					     dst, 0)
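For readers without the tree at hand, the INDIRECT_CALL_1() helper used above essentially compares the function pointer against one likely target and, on a match, calls that target directly so the compiler can inline it and retpoline kernels skip the indirect branch. A minimal user-space sketch of the idea (simplified from include/linux/indirect_call_wrapper.h; likely_check/rare_check/demo are illustrative stand-ins, not kernel symbols):

```c
/* Sketch of INDIRECT_CALL_1(): direct call when the pointer matches
 * the expected target, genuine indirect call otherwise. */
#define INDIRECT_CALL_1(f, f1, ...) \
	((f) == (f1) ? f1(__VA_ARGS__) : (f)(__VA_ARGS__))

/* Stand-ins for ipv4_dst_check() and some rare ->check implementation. */
static int likely_check(int cookie) { return cookie + 1; }
static int rare_check(int cookie)   { return cookie + 100; }

static int demo(void)
{
	int (*check)(int) = likely_check;
	int direct, indirect;

	/* Pointer matches the expected target: direct, inlinable call. */
	direct = INDIRECT_CALL_1(check, likely_check, 1);

	/* Any other target still goes through the indirect call. */
	check = rare_check;
	indirect = INDIRECT_CALL_1(check, likely_check, 1);

	return direct * 1000 + indirect;
}
```

The win comes precisely from the equality test compiling to a cheap compare-and-branch, while the fallback keeps correctness for any other dst_ops.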
[PATCH net-next v2 3/4] net: use indirect call helpers for dst_mtu
From: brianvv

This patch avoids the indirect call for the common case:
ip6_mtu and ipv4_mtu

Signed-off-by: brianvv
---
 include/net/dst.h | 4 +++-
 net/ipv4/route.c  | 6 --
 net/ipv6/route.c  | 6 --
 3 files changed, 11 insertions(+), 5 deletions(-)

diff --git a/include/net/dst.h b/include/net/dst.h
index 3932e9931f08..9f474a79ed7d 100644
--- a/include/net/dst.h
+++ b/include/net/dst.h
@@ -194,9 +194,11 @@ dst_feature(const struct dst_entry *dst, u32 feature)
 	return dst_metric(dst, RTAX_FEATURES) & feature;
 }
 
+INDIRECT_CALLABLE_DECLARE(unsigned int ip6_mtu(const struct dst_entry *));
+INDIRECT_CALLABLE_DECLARE(unsigned int ipv4_mtu(const struct dst_entry *));
 static inline u32 dst_mtu(const struct dst_entry *dst)
 {
-	return dst->ops->mtu(dst);
+	return INDIRECT_CALL_INET(dst->ops->mtu, ip6_mtu, ipv4_mtu, dst);
 }
 
 /* RTT metrics are stored in milliseconds for user ABI, but used as jiffies */
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index e26652ff7059..4fac91f8bd6c 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -135,7 +135,8 @@ static int ip_rt_gc_timeout __read_mostly = RT_GC_TIMEOUT;
 
 static struct dst_entry *ipv4_dst_check(struct dst_entry *dst, u32 cookie);
 static unsigned int	 ipv4_default_advmss(const struct dst_entry *dst);
-static unsigned int	 ipv4_mtu(const struct dst_entry *dst);
+INDIRECT_CALLABLE_SCOPE
+unsigned int		 ipv4_mtu(const struct dst_entry *dst);
 static struct dst_entry *ipv4_negative_advice(struct dst_entry *dst);
 static void		 ipv4_link_failure(struct sk_buff *skb);
 static void		 ip_rt_update_pmtu(struct dst_entry *dst, struct sock *sk,
@@ -1311,7 +1312,7 @@ static unsigned int ipv4_default_advmss(const struct dst_entry *dst)
 	return min(advmss, IPV4_MAX_PMTU - header_size);
 }
 
-static unsigned int ipv4_mtu(const struct dst_entry *dst)
+INDIRECT_CALLABLE_SCOPE unsigned int ipv4_mtu(const struct dst_entry *dst)
 {
 	const struct rtable *rt = (const struct rtable *)dst;
 	unsigned int mtu = rt->rt_pmtu;
@@ -1333,6 +1334,7 @@ static unsigned int ipv4_mtu(const struct dst_entry *dst)
 	return mtu - lwtunnel_headroom(dst->lwtstate, mtu);
 }
+EXPORT_SYMBOL(ipv4_mtu);
 
 static void ip_del_fnhe(struct fib_nh_common *nhc, __be32 daddr)
 {
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 188e114b29b4..22caee290b6c 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -83,7 +83,8 @@ enum rt6_nud_state {
 
 static struct dst_entry	*ip6_dst_check(struct dst_entry *dst, u32 cookie);
 static unsigned int	 ip6_default_advmss(const struct dst_entry *dst);
-static unsigned int	 ip6_mtu(const struct dst_entry *dst);
+INDIRECT_CALLABLE_SCOPE
+unsigned int		 ip6_mtu(const struct dst_entry *dst);
 static struct dst_entry *ip6_negative_advice(struct dst_entry *);
 static void		ip6_dst_destroy(struct dst_entry *);
 static void		ip6_dst_ifdown(struct dst_entry *,
@@ -3089,7 +3090,7 @@ static unsigned int ip6_default_advmss(const struct dst_entry *dst)
 	return mtu;
 }
 
-static unsigned int ip6_mtu(const struct dst_entry *dst)
+INDIRECT_CALLABLE_SCOPE unsigned int ip6_mtu(const struct dst_entry *dst)
 {
 	struct inet6_dev *idev;
 	unsigned int mtu;
@@ -3111,6 +3112,7 @@ static unsigned int ip6_mtu(const struct dst_entry *dst)
 	return mtu - lwtunnel_headroom(dst->lwtstate, mtu);
 }
+EXPORT_SYMBOL(ip6_mtu);
 
 /* MTU selection:
  * 1. mtu on route is locked - use it
-- 
2.29.2.576.ga3fc446d84-goog
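INDIRECT_CALL_INET() used in dst_mtu() is the two-target flavor of the same trick: with both families compiled in it tries ip6_mtu first, then ipv4_mtu, before falling back to a true indirect call. A hedged user-space sketch of that cascade (illustrative names; the real macro in include/linux/indirect_call_wrapper.h additionally degrades to one target on CONFIG_IPV6=n builds):

```c
/* Two-target cascade in the spirit of INDIRECT_CALL_2()/INDIRECT_CALL_INET():
 * try each likely target in turn, fall back to an indirect call otherwise. */
#define INDIRECT_CALL_2(f, f2, f1, ...)			\
	((f) == (f2) ? f2(__VA_ARGS__) :		\
	 (f) == (f1) ? f1(__VA_ARGS__) : (f)(__VA_ARGS__))

/* Stand-ins for ip6_mtu(), ipv4_mtu(), and an unrelated ->mtu method. */
static unsigned int mtu_v6(unsigned int base)    { return base + 6; }
static unsigned int mtu_v4(unsigned int base)    { return base + 4; }
static unsigned int mtu_other(unsigned int base) { return base; }

/* Dispatch through the macro exactly as the patched dst_mtu() would. */
static unsigned int pick_mtu(unsigned int (*op)(unsigned int))
{
	return INDIRECT_CALL_2(op, mtu_v6, mtu_v4, 1500);
}
```

Ordering the IPv6 target first matches the kernel macro; the extra compare is negligible next to the retpoline cost the common cases now avoid.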