Re: [PATCH v2 4/6] virtiofs: support bounce buffer backed by scattered pages

2024-03-13 Thread Brian Foster
On Sat, Mar 09, 2024 at 12:14:23PM +0800, Hou Tao wrote:
> Hi,
> 
> On 2/29/2024 11:01 PM, Brian Foster wrote:
> > On Wed, Feb 28, 2024 at 10:41:24PM +0800, Hou Tao wrote:
> >> From: Hou Tao 
> >>
> >> When reading a file kept in virtiofs from kernel (e.g., insmod a kernel
> >> module), if the cache of virtiofs is disabled, the read buffer will be
> >> passed to virtiofs through out_args[0].value instead of pages. Because
> >> virtiofs can't get the pages for the read buffer, virtio_fs_argbuf_new()
> >> will create a bounce buffer for the read buffer by using kmalloc() and
> >> copy the read buffer into bounce buffer. If the read buffer is large
> >> (e.g., 1MB), the allocation will incur significant stress on the memory
> >> subsystem.
> >>
> >> So instead of allocating bounce buffer by using kmalloc(), allocate a
> >> bounce buffer which is backed by scattered pages. The original idea is
> >> to use vmap(), but the use of GFP_ATOMIC is not possible for vmap(). To
> >> simplify the copy operations in the bounce buffer, use a bio_vec flex
> >> array to represent the argbuf. Also add an is_flat field in struct
> >> virtio_fs_argbuf to distinguish between kmalloc-ed and scattered bounce
> >> buffer.
> >>
> >> Signed-off-by: Hou Tao 
> >> ---
> >>  fs/fuse/virtio_fs.c | 163 
> >>  1 file changed, 149 insertions(+), 14 deletions(-)
> >>
> >> diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
> >> index f10fff7f23a0f..ffea684bd100d 100644
> >> --- a/fs/fuse/virtio_fs.c
> >> +++ b/fs/fuse/virtio_fs.c
> > ...
> >> @@ -408,42 +425,143 @@ static void virtio_fs_request_dispatch_work(struct 
> >> work_struct *work)
> >>}
> >>  }
> >>  
> > ...  
> >>  static void virtio_fs_argbuf_copy_from_in_arg(struct virtio_fs_argbuf 
> >> *argbuf,
> >>  unsigned int offset,
> >>  const void *src, unsigned int len)
> >>  {
> >> -  memcpy(argbuf->buf + offset, src, len);
> >> +  struct iov_iter iter;
> >> +  unsigned int copied;
> >> +
> >> +  if (argbuf->is_flat) {
> >> +  memcpy(argbuf->f.buf + offset, src, len);
> >> +  return;
> >> +  }
> >> +
> >> +  iov_iter_bvec(&iter, ITER_DEST, argbuf->s.bvec,
> >> +argbuf->s.nr, argbuf->s.size);
> >> +  iov_iter_advance(&iter, offset);
> > Hi Hou,
> >
> > Just a random comment, but it seems a little inefficient to reinit and
> > readvance the iter like this on every copy/call. It looks like offset is
> > already incremented in the callers of the argbuf copy helpers. Perhaps
> > iov_iter could be lifted into the callers and passed down, or even just
> > include it in the argbuf structure and init it at alloc time?
> 
> Sorry for the late reply. Being busy with off-site workshop these days.
> 

No worries.

> I have tried a similar idea before in which the iov_iter was saved directly
> in the argbuf struct, but it didn't work out well. The reason is that an
> iov_iter is needed for copying both in_args and out_args, but the direction
> differs. A bi-directional iov_iter is not currently supported, so the code
> has to initialize the iov_iter twice: once for copying from in_args and
> once for copying to out_args.
> 

Ok, seems reasonable enough.

> For a dio read initiated from the kernel, both in_numargs and
> out_numargs are 1, so there will be only one iov_iter_advance() in
> virtio_fs_argbuf_copy_to_out_arg() with offset 64, and I think that
> overhead is fine. For a dio write initiated from the kernel, in_numargs
> is 2 and out_numargs is 1, so there will be two invocations of
> iov_iter_advance(): the first with offset=64, and the other with
> offset=round_up(64 + write_size, PAGE_SIZE), so the latter may
> introduce extra overhead. But compared with the overhead of the data
> copy itself, I still think the cost of calling iov_iter_advance() is fine.
> 

I'm not claiming the overhead is some practical problem here, but rather
that we shouldn't need to be concerned with it in the first place with
some cleaner code. It's been a bit since I first looked at this. I was
originally wondering about defining the iov_iter in the caller and passing
it as a param, but on another look, do the lowest-level helpers really need
to exist?

If you don't anticipate further users, IMO something like the diff below
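A hypothetical sketch of that kind of consolidation, folding the per-arg
copy helpers into a single pass that owns the iov_iter (an illustration
only, not the actual diff referenced above; names mirror the patch):

/*
 * Hypothetical sketch only: copy all in_args in one pass with a single
 * iov_iter, instead of re-initializing and re-advancing it per helper
 * call.
 */
static void virtio_fs_argbuf_copy_in_args(struct virtio_fs_argbuf *argbuf,
					  const struct fuse_args *args)
{
	struct iov_iter iter;
	unsigned int i, num_in = args->in_numargs - args->in_pages;

	if (argbuf->is_flat) {
		unsigned int offset = 0;

		for (i = 0; i < num_in; i++) {
			memcpy(argbuf->f.buf + offset,
			       args->in_args[i].value, args->in_args[i].size);
			offset += args->in_args[i].size;
		}
		return;
	}

	iov_iter_bvec(&iter, ITER_DEST, argbuf->s.bvec,
		      argbuf->s.nr, argbuf->s.size);
	for (i = 0; i < num_in; i++) {
		unsigned int len = args->in_args[i].size;

		/* The iter advances itself; no per-arg reinit needed. */
		WARN_ON_ONCE(_copy_to_iter(args->in_args[i].value, len,
					   &iter) != len);
	}
}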

Re: [PATCH v2 4/6] virtiofs: support bounce buffer backed by scattered pages

2024-02-29 Thread Brian Foster
On Wed, Feb 28, 2024 at 10:41:24PM +0800, Hou Tao wrote:
> From: Hou Tao 
> 
> When reading a file kept in virtiofs from kernel (e.g., insmod a kernel
> module), if the cache of virtiofs is disabled, the read buffer will be
> passed to virtiofs through out_args[0].value instead of pages. Because
> virtiofs can't get the pages for the read buffer, virtio_fs_argbuf_new()
> will create a bounce buffer for the read buffer by using kmalloc() and
> copy the read buffer into bounce buffer. If the read buffer is large
> (e.g., 1MB), the allocation will incur significant stress on the memory
> subsystem.
> 
> So instead of allocating bounce buffer by using kmalloc(), allocate a
> bounce buffer which is backed by scattered pages. The original idea is
> to use vmap(), but the use of GFP_ATOMIC is not possible for vmap(). To
> simplify the copy operations in the bounce buffer, use a bio_vec flex
> array to represent the argbuf. Also add an is_flat field in struct
> virtio_fs_argbuf to distinguish between kmalloc-ed and scattered bounce
> buffer.
> 
> Signed-off-by: Hou Tao 
> ---
>  fs/fuse/virtio_fs.c | 163 
>  1 file changed, 149 insertions(+), 14 deletions(-)
> 
> diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
> index f10fff7f23a0f..ffea684bd100d 100644
> --- a/fs/fuse/virtio_fs.c
> +++ b/fs/fuse/virtio_fs.c
...
> @@ -408,42 +425,143 @@ static void virtio_fs_request_dispatch_work(struct 
> work_struct *work)
>   }
>  }
>  
...  
>  static void virtio_fs_argbuf_copy_from_in_arg(struct virtio_fs_argbuf 
> *argbuf,
> unsigned int offset,
> const void *src, unsigned int len)
>  {
> - memcpy(argbuf->buf + offset, src, len);
> + struct iov_iter iter;
> + unsigned int copied;
> +
> + if (argbuf->is_flat) {
> + memcpy(argbuf->f.buf + offset, src, len);
> + return;
> + }
> +
> + iov_iter_bvec(&iter, ITER_DEST, argbuf->s.bvec,
> +   argbuf->s.nr, argbuf->s.size);
> + iov_iter_advance(&iter, offset);

Hi Hou,

Just a random comment, but it seems a little inefficient to reinit and
readvance the iter like this on every copy/call. It looks like offset is
already incremented in the callers of the argbuf copy helpers. Perhaps
iov_iter could be lifted into the callers and passed down, or even just
include it in the argbuf structure and init it at alloc time?

Brian

> +
> + copied = _copy_to_iter(src, len, &iter);
> + WARN_ON_ONCE(copied != len);
>  }
>  
>  static unsigned int
> @@ -451,15 +569,32 @@ virtio_fs_argbuf_out_args_offset(struct 
> virtio_fs_argbuf *argbuf,
>const struct fuse_args *args)
>  {
>   unsigned int num_in = args->in_numargs - args->in_pages;
> + unsigned int offset = fuse_len_args(num_in,
> + (struct fuse_arg *)args->in_args);
>  
> - return fuse_len_args(num_in, (struct fuse_arg *)args->in_args);
> + if (argbuf->is_flat)
> + return offset;
> + return round_up(offset, PAGE_SIZE);
>  }
>  
>  static void virtio_fs_argbuf_copy_to_out_arg(struct virtio_fs_argbuf *argbuf,
>unsigned int offset, void *dst,
>unsigned int len)
>  {
> - memcpy(dst, argbuf->buf + offset, len);
> + struct iov_iter iter;
> + unsigned int copied;
> +
> + if (argbuf->is_flat) {
> + memcpy(dst, argbuf->f.buf + offset, len);
> + return;
> + }
> +
> + iov_iter_bvec(&iter, ITER_SOURCE, argbuf->s.bvec,
> +   argbuf->s.nr, argbuf->s.size);
> + iov_iter_advance(&iter, offset);
> +
> + copied = _copy_from_iter(dst, len, &iter);
> + WARN_ON_ONCE(copied != len);
>  }
>  
>  /*
> @@ -1154,7 +1289,7 @@ static unsigned int sg_init_fuse_args(struct 
> scatterlist *sg,
>   len = fuse_len_args(numargs - argpages, args);
>   if (len)
>   total_sgs += virtio_fs_argbuf_setup_sg(req->argbuf, *len_used,
> -len, &sg[total_sgs]);
> +&len, &sg[total_sgs]);
>  
>   if (argpages)
>   total_sgs += sg_init_fuse_pages(&sg[total_sgs],
> @@ -1199,7 +1334,7 @@ static int virtio_fs_enqueue_req(struct virtio_fs_vq 
> *fsvq,
>   }
>  
>   /* Use a bounce buffer since stack args cannot be mapped */
> - req->argbuf = virtio_fs_argbuf_new(args, GFP_ATOMIC);
> + req->argbuf = virtio_fs_argbuf_new(args, GFP_ATOMIC, true);
>   if (!req->argbuf) {
>   ret = -ENOMEM;
>   goto out;
> -- 
> 2.29.2
> 
> 




RE: [PATCH v3 4/5] arch,locking/atomic: hexagon: add arch_cmpxchg[64]_local

2023-11-22 Thread Brian Cain



> -Original Message-
> From: wuqiang.matt 
> Sent: Tuesday, November 21, 2023 8:24 AM
> To: ubiz...@gmail.com; mark.rutl...@arm.com; vgu...@kernel.org; Brian
> Cain ; jo...@southpole.se;
> stefan.kristians...@saunalahti.fi; sho...@gmail.com; ch...@zankel.net;
> jcmvb...@gmail.com; ge...@linux-m68k.org; andi.sh...@linux.intel.com;
> mi...@kernel.org; pal...@rivosinc.com; andrzej.ha...@intel.com;
> a...@arndb.de; pet...@infradead.org; mhira...@kernel.org
> Cc: linux-a...@vger.kernel.org; linux-snps-...@lists.infradead.org; linux-
> ker...@vger.kernel.org; linux-hexa...@vger.kernel.org; linux-
> openr...@vger.kernel.org; linux-trace-ker...@vger.kernel.org;
> mat...@163.com; li...@roeck-us.net; wuqiang.matt
> ; kernel test robot 
> Subject: [PATCH v3 4/5] arch,locking/atomic: hexagon: add
> arch_cmpxchg[64]_local
> 
> hexagon doesn't have arch_cmpxchg_local implemented, which causes
> build failures for any reference to try_cmpxchg_local, as
> reported by the kernel test robot.
> 
> This patch implements arch_cmpxchg[64]_local with the native
> cmpxchg variant if the corresponding data size is supported,
> otherwise generic_cmpxchg[64]_local is to be used.
> 
> Reported-by: kernel test robot 
> Closes: https://lore.kernel.org/oe-kbuild-all/202310272207.tLPflya4-
> l...@intel.com/
> 
> Signed-off-by: wuqiang.matt 
> Reviewed-by: Masami Hiramatsu (Google) 
> ---
>  arch/hexagon/include/asm/cmpxchg.h | 51 +-
>  1 file changed, 50 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/hexagon/include/asm/cmpxchg.h
> b/arch/hexagon/include/asm/cmpxchg.h
> index bf6cf5579cf4..302fa30f25aa 100644
> --- a/arch/hexagon/include/asm/cmpxchg.h
> +++ b/arch/hexagon/include/asm/cmpxchg.h
> @@ -8,6 +8,8 @@
>  #ifndef _ASM_CMPXCHG_H
>  #define _ASM_CMPXCHG_H
> 
> +#include 
> +
>  /*
>   * __arch_xchg - atomically exchange a register and a memory location
>   * @x: value to swap
> @@ -51,13 +53,15 @@ __arch_xchg(unsigned long x, volatile void *ptr, int
> size)
>   *  variable casting.
>   */
> 
> -#define arch_cmpxchg(ptr, old, new)\
> +#define __cmpxchg_32(ptr, old, new)\
>  ({ \
> __typeof__(ptr) __ptr = (ptr);  \
> __typeof__(*(ptr)) __old = (old);   \
> __typeof__(*(ptr)) __new = (new);   \
> __typeof__(*(ptr)) __oldval = 0;\
> \
> +   BUILD_BUG_ON(sizeof(*(ptr)) != 4);  \
> +   \
> asm volatile(   \
> "1: %0 = memw_locked(%1);\n"\
> "   { P0 = cmp.eq(%0,%2);\n"\
> @@ -72,4 +76,49 @@ __arch_xchg(unsigned long x, volatile void *ptr, int size)
> __oldval;   \
>  })
> 
> +#define __cmpxchg(ptr, old, val, size) \
> +({ \
> +   __typeof__(*(ptr)) oldval;  \
> +   \
> +   switch (size) { \
> +   case 4: \
> +   oldval = __cmpxchg_32(ptr, old, val);   \
> +   break;  \
> +   default:\
> +   BUILD_BUG();\
> +   oldval = val;   \
> +   break;  \
> +   }   \
> +   \
> +   oldval; \
> +})
> +
> +#define arch_cmpxchg(ptr, o, n)	__cmpxchg((ptr), (o), (n), sizeof(*(ptr)))
> +
> +/*
> + * always make arch_cmpxchg[64]_local available, native cmpxchg
> + * will be used if available, then generic_cmpxchg[64]_local
> + */
> +#include 
> +
> +#define arch_cmpxchg_local(ptr, old, val)  \
> +({

Re: [PATCH 1/2] dt-bindings: mwifiex: document use with the SD8777 chipset

2023-11-02 Thread Brian Norris
On Sun, Oct 29, 2023 at 12:08:16PM +0100, Karel Balej wrote:
> Document the corresponding compatible string for the use of this driver
> with the Marvell SD8777 wireless chipset.
> 
> Signed-off-by: Karel Balej 

FWIW, the binding looks fine from mwifiex point of view, so:

Acked-by: Brian Norris 

But see cover letter. We can't merge driver support without a
linux-firmware-compatible (or otherwise redistributable) firmware, so
NAK for the series.


Re: [PATCH 0/2] net: mwifiex: add support for the SD8777 chipset

2023-11-02 Thread Brian Norris
On Sun, Oct 29, 2023 at 12:08:15PM +0100, Karel Balej wrote:
> The driver requires proprietary firmware which is not yet part of
> linux-firmware, but it is packaged in postmarketOS.

You gotta get that done:

https://wireless.wiki.kernel.org/en/developers/documentation/submittingpatches#new_driver

"have firmware images submitted for linux-firmware with an acceptable
license allowing redistribution"

We can't have a driver requesting a mrvl/sd8777_uapsta.bin firmware that
isn't available for anyone [1].

Until that's done, NAK.

[1] I think you might be referring to this:
https://github.com/xcover3/android_vendor_samsung_xcover3lte/commit/6e324b43b32dc607327d89148dd5d83a14429ee6

But I don't see any license info, so I don't think that's going to be
appropriate for linux-firmware.


Re: [PATCH] xfs: fail dax mount if reflink is enabled on a partition

2022-09-14 Thread Brian Foster
On Wed, Sep 14, 2022 at 05:38:02PM +0800, Yang, Xiao/杨 晓 wrote:
> On 2022/9/14 14:44, Yang, Xiao/杨 晓 wrote:
> > On 2022/9/9 21:01, Brian Foster wrote:
> > > Yes.. I don't recall all the internals of the tools and test, but IIRC
> > > it relied on discard to perform zeroing between checkpoints or some such
> > > and avoid spurious failures. The purpose of running on dm-thin was
> > > merely to provide reliable discard zeroing behavior on the target device
> > > and thus to allow the test to run reliably.
> > Hi Brian,
> > 
> > As far as I know, generic/470 was originally designed to verify
> > mmap(MAP_SYNC) on a dm-log-writes device with DAX enabled. For that
> > reason, we need to ensure that all devices underlying the
> > dm-log-writes device support DAX. However, the dm-thin device never
> > supports DAX, so running generic/470 with a dm-thin device always
> > returns "not run".
> > 
> > Please see the difference between old and new logic:
> > 
> >old logic  new logic
> > ---
> > log-writes device(DAX) log-writes device(DAX)
> >  |   |
> > PMEM0(DAX) + PMEM1(DAX)   Thin device(non-DAX) + PMEM1(DAX)
> >|
> >  PMEM0(DAX)
> > ---
> > 
> > We think the dm-thin device is not a good solution for generic/470; is there
> > any other solution that supports both discard zeroing and DAX?
> 
> Hi Brian,
> 
> I have sent a patch[1] to revert your fix because I think it's not good for
> generic/470 to use a thin volume, as my revert patch[1] describes:
> [1] 
> https://lore.kernel.org/fstests/20220914090625.32207-1-yangx...@fujitsu.com/T/#u
> 

I think the history here is that generic/482 was changed over first in
commit 65cc9a235919 ("generic/482: use thin volume as data device"), and
then sometime later we realized generic/455,457,470 had the same general
flaw and were switched over. The dm/dax compatibility thing was probably
just an oversight, but I am a little curious about that because it should
have been obvious that the change caused the test to no longer run. Did
something change after that to trigger that change in behavior?

> With the revert, generic/470 always runs successfully in my environment,
> so I wonder how to reproduce the out-of-order replay issue on an XFS v5
> filesystem?
> 

I don't quite recall the characteristics of the failures beyond that we
were seeing spurious test failures with generic/482 that were due to
essentially putting the fs/log back in time in a way that wasn't quite
accurate due to the clearing by the logwrites tool not taking place. If
you wanted to reproduce in order to revisit that, perhaps start with
generic/482 and let it run in a loop for a while and see if it
eventually triggers a failure/corruption..?

> PS: I want to reproduce the issue and try to find a better solution to fix
> it.
> 

It's been a while since I looked at any of this tooling to semi-grok how
it works. Perhaps it could learn to rely on something more explicit like
zero range (instead of discard?) or fall back to manual zeroing? If the
eventual solution is simple and low enough overhead, it might make some
sense to replace the dmthin hack across the set of tests mentioned
above.
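As a rough illustration of the zero range idea, a sketch under the
assumption that the replay tooling can call fallocate() on the replay
target (zero_range is a hypothetical helper, not part of the existing
tools):

#define _GNU_SOURCE
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Sketch: zero a range with FALLOC_FL_ZERO_RANGE, falling back to
 * manual zeroing when the filesystem/device doesn't support it.
 */
static int zero_range(int fd, off_t off, off_t len)
{
	char buf[4096];

	if (fallocate(fd, FALLOC_FL_ZERO_RANGE, off, len) == 0)
		return 0;

	/* Fallback: write zeroes by hand. */
	memset(buf, 0, sizeof(buf));
	while (len > 0) {
		ssize_t n = pwrite(fd, buf,
				   len > (off_t)sizeof(buf) ? sizeof(buf) : len,
				   off);
		if (n < 0)
			return -1;
		off += n;
		len -= n;
	}
	return 0;
}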

Brian

> Best Regards,
> Xiao Yang
> 
> > 
> > BTW, only log-writes, stripe and linear support DAX for now.
> 




Re: [PATCH] xfs: fail dax mount if reflink is enabled on a partition

2022-09-09 Thread Brian Foster
On Thu, Sep 08, 2022 at 09:46:04PM +0800, Shiyang Ruan wrote:
> 
> 
> > On 2022/8/4 8:51, Darrick J. Wong wrote:
> > On Wed, Aug 03, 2022 at 06:47:24AM +, ruansy.f...@fujitsu.com wrote:
> 
> ...
> 
> > > > > > > 
> > > > > > > BTW, since these patches (dax&reflink&rmap + THIS + pmem-unbind) 
> > > > > > > are
> > > > > > > waiting to be merged, is it time to think about "removing the
> > > > > > > experimental tag" again?  :)
> > > > > > 
> > > > > > It's probably time to take up that question again.
> > > > > > 
> > > > > > Yesterday I tried running generic/470 (aka the MAP_SYNC test) and it
> > > > > > didn't succeed because it sets up dmlogwrites atop dmthinp atop 
> > > > > > pmem,
> > > > > > and at least one of those dm layers no longer allows fsdax 
> > > > > > pass-through,
> > > > > > so XFS silently turned mount -o dax into -o dax=never. :(
> > > > > 
> > > > > Hi Darrick,
> > > > > 
> > > > > I tried generic/470 but it didn't run:
> > > > >  [not run] Cannot use thin-pool devices on DAX capable block 
> > > > > devices.
> > > > > 
> > > > > Did you modify the _require_dm_target() in common/rc?  I added 
> > > > > thin-pool
> > > > > to not to check dax capability:
> > > > > 
> > > > >case $target in
> > > > >stripe|linear|log-writes|thin-pool)  # add thin-pool here
> > > > >;;
> > > > > 
> > > > > then the case finally ran and it silently turned off dax as you said.
> > > > > 
> > > > > Are the steps for reproduction correct? If so, I will continue to
> > > > > investigate this problem.
> > > > 
> > > > Ah, yes, I did add thin-pool to that case statement.  Sorry I forgot to
> > > > mention that.  I suspect that the removal of dm support for pmem is
> > > > going to force us to completely redesign this test.  I can't really
> > > > think of how, though, since there's no good way that I know of to gain a
> > > > point-in-time snapshot of a pmem device.
> > > 
> > > Hi Darrick,
> > > 
> > >   > removal of dm support for pmem
> > > I think we are talking about xfstests removing the support here, not
> > > the kernel?
> > > 
> > > I found some xfstests commits:
> > > fc7b3903894a6213c765d64df91847f4460336a2  # common/rc: add the 
> > > restriction.
> > > fc5870da485aec0f9196a0f2bed32f73f6b2c664  # generic/470: use thin-pool
> > > 
> > > So, this case was never able to run since the second commit?  (I didn't
> > > notice the not run case.  I thought it was expected to be not run.)
> > > 
> > > And according to the first commit, the restriction was added because
> > > some of dm devices don't support dax.  So my understanding is: we should
> > > redesign the case to make the it work, and firstly, we should add dax
> > > support for dm devices in kernel.
> > 
> > dm devices used to have fsdax support; I think Christoph is actively
> > removing (or already has removed) all that support.
> > 
> > > In addition, is there any other testcase has the same problem?  so that
> > > we can deal with them together.
> > 
> > The last I checked, there aren't any that require MAP_SYNC or pmem aside
> > from g/470 and the three poison notification tests that you sent a few
> > days ago.
> > 
> > --D
> > 
> 
> Hi Darrick, Brian
> 
> I made a little investigation on generic/470.
> 
> This case was able to run before thin-pool was introduced[1], but since then
> it has become 'Failed'/'Not Run' because thin-pool does not support DAX.  I
> have checked the history of thin-pool; it has never supported DAX, and it's
> not that someone removed the fsdax support.  So, I think it's not correct to
> bypass the requirement conditions by adding 'thin-pool' to _require_dm_target().
> 
> As far as I know, to prevent out-of-order replay of dm-log-writes, thin-pool
> was introduced (to provide discard zeroing).  Should we solve the
> 'out-of-order replay' issue instead of avoiding it with thin-pool? @Brian
> 

Yes.. I don't recall all the internals of the tools and test, but IIRC
it relied on discard to perform zeroing between checkpoints or some such
and avoid spurious failures. The purpose of running on dm-thin was
merely to provide reliable discard zeroing behavior on the target device
and thus to allow the test to run reliably.

Brian

> Besides, since it's not a fsdax problem, I think there is nothing that needs
> to be fixed in fsdax.  I'd like to help get it solved, but I'm still wondering
> if we could get back to the original topic ("Remove Experimental Tag") first? :)
> 
> 
> [1] fc5870da485aec0f9196a0f2bed32f73f6b2c664 generic/470: use thin volume
> for dmlogwrites target device
> 
> 
> --
> Thanks,
> Ruan.
> 
> 




Re: [PATCH bpf-next v4 0/3] add batched ops for percpu array

2021-04-19 Thread Brian Vazquez
Sorry I missed this.
I don't recall why I didn't add the array percpu variant, but LGTM.

Acked-by: Brian Vazquez 

On Fri, Apr 16, 2021 at 3:09 PM Martin KaFai Lau  wrote:
>
> On Thu, Apr 15, 2021 at 02:46:16PM -0300, Pedro Tammela wrote:
> > This patchset introduces batched operations for the per-cpu variant of
> > the array map.
> >
> > It also removes the percpu macros from 'bpf_util.h'. This change was
> > suggested by Andrii in a earlier iteration of this patchset.
> >
> > The tests were updated to reflect all the new changes.
> Acked-by: Martin KaFai Lau 
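For reference, a minimal userspace sketch of what the batched lookup
enables on a per-cpu array (libbpf API; u64 values are an assumption
made for brevity, and the one-value-per-possible-CPU layout is how
per-cpu map values are exposed to the syscall interface):

#include <bpf/bpf.h>
#include <stdlib.h>

/* Sketch: one batched lookup over a BPF_MAP_TYPE_PERCPU_ARRAY. */
int dump_percpu_array(int map_fd, __u32 max_entries, int ncpus)
{
	__u32 out_batch, count = max_entries;
	__u32 *keys = calloc(max_entries, sizeof(*keys));
	__u64 *vals = calloc((size_t)max_entries * ncpus, sizeof(*vals));
	int err = -1;

	if (keys && vals)
		/* NULL in_batch starts iteration from the beginning; on
		 * success, count holds the number of entries returned and
		 * vals[i * ncpus + cpu] is keys[i]'s value on that cpu. */
		err = bpf_map_lookup_batch(map_fd, NULL, &out_batch,
					   keys, vals, &count, NULL);
	free(keys);
	free(vals);
	return err;
}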


Re: [PATCH] wireless/marvell/mwifiex: Fix a double free in mwifiex_send_tdls_action_frame

2021-04-13 Thread Brian Norris
Hi,

On Wed, Apr 14, 2021 at 12:08:32AM +0800, lyl2...@mail.ustc.edu.cn wrote:
> 
> Hi,
>   maintainers.
> 
>   Sorry to disturb you, but this patch seems to have been missed for more than two weeks.
>   Could you help review this patch? I am sure it won't take you much time.

You might take a look here:
https://wireless.wiki.kernel.org/en/developers/documentation/submittingpatches#checking_state_of_patches_from_patchwork
https://patchwork.kernel.org/project/linux-wireless/list/

Two weeks is not unheard of for waiting.

Anyway, you *did* succeed in catching my attention today, so I'll give
you a review, below:

> > -----Original Message-----
> > From: "Lv Yunlong" 
> > Sent: 2021-03-29 19:24:35 (Monday)
> > To: amitkar...@gmail.com, ganapathi.b...@nxp.com, huxinming...@gmail.com, 
> > kv...@codeaurora.org, da...@davemloft.net, k...@kernel.org
> > Cc: linux-wirel...@vger.kernel.org, net...@vger.kernel.org, 
> > linux-kernel@vger.kernel.org, "Lv Yunlong" 
> > Subject: [PATCH] wireless/marvell/mwifiex: Fix a double free in 
> > mwifiex_send_tdls_action_frame
> > 
> > In mwifiex_send_tdls_action_frame, it calls
> > mwifiex_construct_tdls_action_frame(..,skb). The skb will be freed in
> > mwifiex_construct_tdls_action_frame() when it fails. But when
> > mwifiex_construct_tdls_action_frame() returns an error, the skb will
> > be freed a second time by dev_kfree_skb_any(skb).
> > 
> > My patch removes the redundant dev_kfree_skb_any(skb) when
> > mwifiex_construct_tdls_action_frame() fails.
> > 
> > Fixes: b23bce2965680 ("mwifiex: add tdls_mgmt handler support")
> > Signed-off-by: Lv Yunlong 
> > ---
> >  drivers/net/wireless/marvell/mwifiex/tdls.c | 1 -
> >  1 file changed, 1 deletion(-)
> > 
> > diff --git a/drivers/net/wireless/marvell/mwifiex/tdls.c 
> > b/drivers/net/wireless/marvell/mwifiex/tdls.c
> > index 97bb87c3676b..8d4d0a9cf6ac 100644
> > --- a/drivers/net/wireless/marvell/mwifiex/tdls.c
> > +++ b/drivers/net/wireless/marvell/mwifiex/tdls.c
> > @@ -856,7 +856,6 @@ int mwifiex_send_tdls_action_frame(struct 
> > mwifiex_private *priv, const u8 *peer,
> > if (mwifiex_construct_tdls_action_frame(priv, peer, action_code,
> > dialog_token, status_code,
> > skb)) {
> > -   dev_kfree_skb_any(skb);

Good catch, and this looks correct for most cases, but I'll note that
you missed one issue: mwifiex_construct_tdls_action_frame() has a
default/error case ("Unknown TDLS action frame type") which does *not*
free this SKB, so you should patch that up at the same time. Otherwise,
you're trading a double-free for a potential memory leak.
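For illustration, the companion hunk would look roughly like this
(hypothetical hunk header and placement, based on the default case
described above):

--- a/drivers/net/wireless/marvell/mwifiex/tdls.c
+++ b/drivers/net/wireless/marvell/mwifiex/tdls.c
@@ (hypothetical) @@ static int mwifiex_construct_tdls_action_frame(...)
 	default:
 		mwifiex_dbg(priv->adapter, ERROR, "Unknown TDLS action frame type\n");
+		dev_kfree_skb_any(skb);
 		return -EINVAL;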

With that fixed:

Reviewed-by: Brian Norris 

Thanks.

> > return -EINVAL;
> > }
> >  
> > -- 
> > 2.25.1
> > 


Re: [PATCH] xfs: fix return of uninitialized value in variable error

2021-04-09 Thread Brian Foster
On Fri, Apr 09, 2021 at 03:18:34PM +0100, Colin King wrote:
> From: Colin Ian King 
> 
> A previous commit removed a call to xfs_attr3_leaf_read that
> assigned an error return code to variable error. We now have
> a few early error return paths to label 'out' that return
> error if error is set; however, error is now uninitialized,
> so potentially garbage is being returned.  Fix this by setting
> error to zero to restore the original behaviour where error
> was zero at the label 'restart'.
> 
> Addresses-Coverity: ("Uninitialized scalar variable")
> Fixes: 07120f1abdff ("xfs: Add xfs_has_attr and subroutines")
> Signed-off-by: Colin Ian King 
> ---
>  fs/xfs/libxfs/xfs_attr.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/fs/xfs/libxfs/xfs_attr.c b/fs/xfs/libxfs/xfs_attr.c
> index 472b3039eabb..902e5f7e6642 100644
> --- a/fs/xfs/libxfs/xfs_attr.c
> +++ b/fs/xfs/libxfs/xfs_attr.c
> @@ -928,6 +928,7 @@ xfs_attr_node_addname(
>* Search to see if name already exists, and get back a pointer
>* to where it should go.
>*/
> + error = 0;
>   retval = xfs_attr_node_hasname(args, &state);
>   if (retval != -ENOATTR && retval != -EEXIST)
>   goto out;

I think it would be nicer to initialize at the top of the function as
opposed to trying to "preserve" historical behavior, but that nit aside:

Reviewed-by: Brian Foster 

> -- 
> 2.30.2
> 



Re: [PATCH] mremap.2: MREMAP_DONTUNMAP to reflect to supported mappings

2021-03-26 Thread Brian Geffon
Hi Alex,
It has not landed yet, it's currently in Andrew's mm tree. I can reach
out again when it makes it into Linus' tree.

Brian


On Thu, Mar 25, 2021 at 2:34 PM Alejandro Colomar (man-pages)
 wrote:
>
> Hello Brian,
>
> Is this already merged in Linux?  I guess not, as I've seen a patch of
> yous for the kernel, right?
>
> Thanks,
>
> Alex
>
> On 3/23/21 7:25 PM, Brian Geffon wrote:
> > mremap(2) now supports MREMAP_DONTUNMAP with mapping types other
> > than private anonymous.
> >
> > Signed-off-by: Brian Geffon 
> > ---
> >   man2/mremap.2 | 13 ++---
> >   1 file changed, 2 insertions(+), 11 deletions(-)
> >
> > diff --git a/man2/mremap.2 b/man2/mremap.2
> > index 3ed0c0c0a..72acbc111 100644
> > --- a/man2/mremap.2
> > +++ b/man2/mremap.2
> > @@ -118,16 +118,6 @@ This flag, which must be used in conjunction with
> >   remaps a mapping to a new address but does not unmap the mapping at
> >   .IR old_address .
> >   .IP
> > -The
> > -.B MREMAP_DONTUNMAP
> > -flag can be used only with private anonymous mappings
> > -(see the description of
> > -.BR MAP_PRIVATE
> > -and
> > -.BR MAP_ANONYMOUS
> > -in
> > -.BR mmap (2)).
> > -.IP
> >   After completion,
> >   any access to the range specified by
> >   .IR old_address
> > @@ -227,7 +217,8 @@ was specified, but one or more pages in the range 
> > specified by
> >   .IR old_address
> >   and
> >   .IR old_size
> > -were not private anonymous;
> > +were part of a special mapping or the mapping is one that
> > +does not support merging or expanding;
> >   .IP *
> >   .B MREMAP_DONTUNMAP
> >   was specified and
> >
>
> --
> Alejandro Colomar
> Linux man-pages comaintainer; https://www.kernel.org/doc/man-pages/
> http://www.alejandro-colomar.es/


[PATCH] mremap.2: MREMAP_DONTUNMAP to reflect to supported mappings

2021-03-23 Thread Brian Geffon
mremap(2) now supports MREMAP_DONTUNMAP with mapping types other
than private anonymous.

Signed-off-by: Brian Geffon 
---
 man2/mremap.2 | 13 ++---
 1 file changed, 2 insertions(+), 11 deletions(-)

diff --git a/man2/mremap.2 b/man2/mremap.2
index 3ed0c0c0a..72acbc111 100644
--- a/man2/mremap.2
+++ b/man2/mremap.2
@@ -118,16 +118,6 @@ This flag, which must be used in conjunction with
 remaps a mapping to a new address but does not unmap the mapping at
 .IR old_address .
 .IP
-The
-.B MREMAP_DONTUNMAP
-flag can be used only with private anonymous mappings
-(see the description of
-.BR MAP_PRIVATE
-and
-.BR MAP_ANONYMOUS
-in
-.BR mmap (2)).
-.IP
 After completion,
 any access to the range specified by
 .IR old_address
@@ -227,7 +217,8 @@ was specified, but one or more pages in the range specified 
by
 .IR old_address
 and
 .IR old_size
-were not private anonymous;
+were part of a special mapping or the mapping is one that
+does not support merging or expanding;
 .IP *
 .B MREMAP_DONTUNMAP
 was specified and
-- 
2.31.0.rc2.261.g7f71774620-goog
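For context on the behavior the text above now documents, a minimal
userspace sketch (assuming a kernel with the widened MREMAP_DONTUNMAP
support; the fallback #define carries the uapi value, and error handling
is trimmed for brevity):

#define _GNU_SOURCE
#include <sys/mman.h>
#include <stdio.h>
#include <unistd.h>

#ifndef MREMAP_DONTUNMAP
#define MREMAP_DONTUNMAP 4	/* uapi value */
#endif

int main(void)
{
	size_t len = 4 * 4096;
	int fd = memfd_create("demo", MFD_CLOEXEC);
	void *src, *dst;

	if (fd < 0 || ftruncate(fd, len) < 0)
		return 1;
	src = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (src == MAP_FAILED)
		return 1;
	/* Move the page tables; src stays mapped, and because the region
	 * is shmem-backed both mappings still show the same contents. */
	dst = mremap(src, len, len, MREMAP_MAYMOVE | MREMAP_DONTUNMAP, NULL);
	if (dst == MAP_FAILED)
		perror("mremap");	/* EINVAL on an older kernel */
	return 0;
}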



[PATCH v5 3/3] selftests: Add a MREMAP_DONTUNMAP selftest for shmem

2021-03-23 Thread Brian Geffon
This test extends the current mremap tests to validate that
the MREMAP_DONTUNMAP operation can be performed on shmem mappings.

Signed-off-by: Brian Geffon 
---
 tools/testing/selftests/vm/mremap_dontunmap.c | 52 +++
 1 file changed, 52 insertions(+)

diff --git a/tools/testing/selftests/vm/mremap_dontunmap.c 
b/tools/testing/selftests/vm/mremap_dontunmap.c
index 3a7b5ef0b0c6..f01dc4a85b0b 100644
--- a/tools/testing/selftests/vm/mremap_dontunmap.c
+++ b/tools/testing/selftests/vm/mremap_dontunmap.c
@@ -127,6 +127,57 @@ static void mremap_dontunmap_simple()
   "unable to unmap source mapping");
 }
 
+// This test validates that MREMAP_DONTUNMAP on a shared mapping works as 
expected.
+static void mremap_dontunmap_simple_shmem()
+{
+   unsigned long num_pages = 5;
+
+   int mem_fd = memfd_create("memfd", MFD_CLOEXEC);
+   BUG_ON(mem_fd < 0, "memfd_create");
+
+   BUG_ON(ftruncate(mem_fd, num_pages * page_size) < 0,
+   "ftruncate");
+
+   void *source_mapping =
+   mmap(NULL, num_pages * page_size, PROT_READ | PROT_WRITE,
+MAP_FILE | MAP_SHARED, mem_fd, 0);
+   BUG_ON(source_mapping == MAP_FAILED, "mmap");
+
+   BUG_ON(close(mem_fd) < 0, "close");
+
+   memset(source_mapping, 'a', num_pages * page_size);
+
+   // Try to just move the whole mapping anywhere (not fixed).
+   void *dest_mapping =
+   mremap(source_mapping, num_pages * page_size, num_pages * page_size,
+  MREMAP_DONTUNMAP | MREMAP_MAYMOVE, NULL);
+   if (dest_mapping == MAP_FAILED && errno == EINVAL) {
+   // Old kernel which doesn't support MREMAP_DONTUNMAP on shmem.
+   BUG_ON(munmap(source_mapping, num_pages * page_size) == -1,
+   "unable to unmap source mapping");
+   return;
+   }
+
+   BUG_ON(dest_mapping == MAP_FAILED, "mremap");
+
+   // Validate that the pages have been moved, we know they were moved if
+   // the dest_mapping contains a's.
+   BUG_ON(check_region_contains_byte
+  (dest_mapping, num_pages * page_size, 'a') != 0,
+  "pages did not migrate");
+
+   // Because the region is backed by shmem, we will actually see the same
+   // memory at the source location still.
+   BUG_ON(check_region_contains_byte
+  (source_mapping, num_pages * page_size, 'a') != 0,
+  "source should have no ptes");
+
+   BUG_ON(munmap(dest_mapping, num_pages * page_size) == -1,
+  "unable to unmap destination mapping");
+   BUG_ON(munmap(source_mapping, num_pages * page_size) == -1,
+  "unable to unmap source mapping");
+}
+
 // This test validates MREMAP_DONTUNMAP will move page tables to a specific
 // destination using MREMAP_FIXED, also while validating that the source
 // remains intact.
@@ -300,6 +351,7 @@ int main(void)
BUG_ON(page_buffer == MAP_FAILED, "unable to mmap a page.");
 
mremap_dontunmap_simple();
+   mremap_dontunmap_simple_shmem();
mremap_dontunmap_simple_fixed();
mremap_dontunmap_partial_mapping();
mremap_dontunmap_partial_mapping_overwrite();
-- 
2.31.0.291.g576ba9dcdaf-goog



[PATCH v5 2/3] Revert "mremap: don't allow MREMAP_DONTUNMAP on special_mappings and aio"

2021-03-23 Thread Brian Geffon
This reverts commit cd544fd1dc9293c6702fab6effa63dac1cc67e99.

As discussed in [1] this commit was a no-op because the mapping type was
checked in vma_to_resize before move_vma is ever called. This meant that
vm_ops->mremap() would never be called on such mappings. Furthermore,
we've since expanded support of MREMAP_DONTUNMAP to non-anonymous
mappings, and these special mappings are still protected by the existing
check of !VM_DONTEXPAND and !VM_PFNMAP which will result in a -EINVAL.

1. https://lkml.org/lkml/2020/12/28/2340

Signed-off-by: Brian Geffon 
Acked-by: Hugh Dickins 
---
 arch/x86/kernel/cpu/resctrl/pseudo_lock.c | 2 +-
 fs/aio.c  | 5 +
 include/linux/mm.h| 2 +-
 mm/mmap.c | 6 +-
 mm/mremap.c   | 2 +-
 5 files changed, 5 insertions(+), 12 deletions(-)

diff --git a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c 
b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
index e916646adc69..0daf2f1cf7a8 100644
--- a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
+++ b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
@@ -1458,7 +1458,7 @@ static int pseudo_lock_dev_release(struct inode *inode, 
struct file *filp)
return 0;
 }
 
-static int pseudo_lock_dev_mremap(struct vm_area_struct *area, unsigned long 
flags)
+static int pseudo_lock_dev_mremap(struct vm_area_struct *area)
 {
/* Not supported */
return -EINVAL;
diff --git a/fs/aio.c b/fs/aio.c
index 1f32da13d39e..76ce0cc3ee4e 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -323,16 +323,13 @@ static void aio_free_ring(struct kioctx *ctx)
}
 }
 
-static int aio_ring_mremap(struct vm_area_struct *vma, unsigned long flags)
+static int aio_ring_mremap(struct vm_area_struct *vma)
 {
struct file *file = vma->vm_file;
struct mm_struct *mm = vma->vm_mm;
struct kioctx_table *table;
int i, res = -EINVAL;
 
-   if (flags & MREMAP_DONTUNMAP)
-   return -EINVAL;
-
spin_lock(&mm->ioctx_lock);
rcu_read_lock();
table = rcu_dereference(mm->ioctx_table);
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 64a71bf20536..ecdc6e8dc5af 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -570,7 +570,7 @@ struct vm_operations_struct {
void (*close)(struct vm_area_struct * area);
/* Called any time before splitting to check if it's allowed */
int (*may_split)(struct vm_area_struct *area, unsigned long addr);
-   int (*mremap)(struct vm_area_struct *area, unsigned long flags);
+   int (*mremap)(struct vm_area_struct *area);
/*
 * Called by mprotect() to make driver-specific permission
 * checks before mprotect() is finalised.   The VMA must not
diff --git a/mm/mmap.c b/mm/mmap.c
index 3f287599a7a3..9d7651e4e1fe 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -3403,14 +3403,10 @@ static const char *special_mapping_name(struct 
vm_area_struct *vma)
return ((struct vm_special_mapping *)vma->vm_private_data)->name;
 }
 
-static int special_mapping_mremap(struct vm_area_struct *new_vma,
- unsigned long flags)
+static int special_mapping_mremap(struct vm_area_struct *new_vma)
 {
struct vm_special_mapping *sm = new_vma->vm_private_data;
 
-   if (flags & MREMAP_DONTUNMAP)
-   return -EINVAL;
-
if (WARN_ON_ONCE(current->mm != new_vma->vm_mm))
return -EFAULT;
 
diff --git a/mm/mremap.c b/mm/mremap.c
index db5b8b28c2dd..d22629ff8f3c 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -545,7 +545,7 @@ static unsigned long move_vma(struct vm_area_struct *vma,
if (moved_len < old_len) {
err = -ENOMEM;
} else if (vma->vm_ops && vma->vm_ops->mremap) {
-   err = vma->vm_ops->mremap(new_vma, flags);
+   err = vma->vm_ops->mremap(new_vma);
}
 
if (unlikely(err)) {
-- 
2.31.0.291.g576ba9dcdaf-goog



[PATCH v5 1/3] mm: Extend MREMAP_DONTUNMAP to non-anonymous mappings

2021-03-23 Thread Brian Geffon
Currently MREMAP_DONTUNMAP only accepts private anonymous mappings.
This restriction was placed initially for simplicity and not because
there exists a technical reason to do so.

This change will widen the support to include any mappings which are not
VM_DONTEXPAND or VM_PFNMAP. The primary use case is to support
MREMAP_DONTUNMAP on mappings which may have been created from a memfd.
This change will result in mremap(MREMAP_DONTUNMAP) returning -EINVAL
if VM_DONTEXPAND or VM_PFNMAP mappings are specified.

Lokesh Gidra who works on the Android JVM, provided an explanation of how
such a feature will improve Android JVM garbage collection:
"Android is developing a new garbage collector (GC), based on userfaultfd.
The garbage collector will use userfaultfd (uffd) on the java heap during
compaction. On accessing any uncompacted page, the application threads will
find it missing, at which point the thread will create the compacted page
and then use UFFDIO_COPY ioctl to get it mapped and then resume execution.
Before starting this compaction, in a stop-the-world pause the heap will be
mremap(MREMAP_DONTUNMAP) so that the java heap is ready to receive
UFFD_EVENT_PAGEFAULT events after resuming execution.

To speedup mremap operations, pagetable movement was optimized by moving
PUD entries instead of PTE entries [1]. It was necessary as mremap of even
modest sized memory ranges also took several milliseconds, and stopping the
application for that long isn't acceptable in response-time sensitive
cases.

With UFFDIO_CONTINUE feature [2], it will be even more efficient to
implement this GC, particularly the 'non-moveable' portions of the heap.
It will also help in reducing the need to copy (UFFDIO_COPY) the pages.
However, for this to work, the java heap has to be on a 'shared' vma.
Currently MREMAP_DONTUNMAP only supports private anonymous mappings, this
patch will enable using UFFDIO_CONTINUE for the new userfaultfd-based heap
compaction."

[1] 
https://lore.kernel.org/linux-mm/20201215030730.nc3cu98e4%25a...@linux-foundation.org/
[2] 
https://lore.kernel.org/linux-mm/20210302000133.272579-1-axelrasmus...@google.com/

Signed-off-by: Brian Geffon 
Acked-by: Hugh Dickins 
Tested-by: Lokesh Gidra 
---
 mm/mremap.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/mremap.c b/mm/mremap.c
index ec8f840399ed..db5b8b28c2dd 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -653,8 +653,8 @@ static struct vm_area_struct *vma_to_resize(unsigned long 
addr,
return ERR_PTR(-EINVAL);
}
 
-   if (flags & MREMAP_DONTUNMAP && (!vma_is_anonymous(vma) ||
-   vma->vm_flags & VM_SHARED))
+   if ((flags & MREMAP_DONTUNMAP) &&
+   (vma->vm_flags & (VM_DONTEXPAND | VM_PFNMAP)))
return ERR_PTR(-EINVAL);
 
if (is_vm_hugetlb_page(vma))
-- 
2.31.0.291.g576ba9dcdaf-goog




[PATCH v4 1/3] mm: Extend MREMAP_DONTUNMAP to non-anonymous mappings

2021-03-23 Thread Brian Geffon
Currently MREMAP_DONTUNMAP only accepts private anonymous mappings.
This restriction was placed initially for simplicity and not because
there exists a technical reason to do so.

This change will widen the support to include any mappings which are not
VM_DONTEXPAND or VM_PFNMAP. The primary use case is to support
MREMAP_DONTUNMAP on mappings which may have been created from a memfd.
This change will result in mremap(MREMAP_DONTUNMAP) returning -EINVAL
if VM_DONTEXPAND or VM_PFNMAP mappings are specified.

Lokesh Gidra who works on the Android JVM, provided an explanation of how
such a feature will improve Android JVM garbage collection:
"Android is developing a new garbage collector (GC), based on userfaultfd.
The garbage collector will use userfaultfd (uffd) on the java heap during
compaction. On accessing any uncompacted page, the application threads will
find it missing, at which point the thread will create the compacted page
and then use UFFDIO_COPY ioctl to get it mapped and then resume execution.
Before starting this compaction, in a stop-the-world pause the heap will be
mremap(MREMAP_DONTUNMAP) so that the java heap is ready to receive
UFFD_EVENT_PAGEFAULT events after resuming execution.

To speedup mremap operations, pagetable movement was optimized by moving
PUD entries instead of PTE entries [1]. It was necessary as mremap of even
modest sized memory ranges also took several milliseconds, and stopping the
application for that long isn't acceptable in response-time sensitive
cases.

With UFFDIO_CONTINUE feature [2], it will be even more efficient to
implement this GC, particularly the 'non-moveable' portions of the heap.
It will also help in reducing the need to copy (UFFDIO_COPY) the pages.
However, for this to work, the java heap has to be on a 'shared' vma.
Currently MREMAP_DONTUNMAP only supports private anonymous mappings, this
patch will enable using UFFDIO_CONTINUE for the new userfaultfd-based heap
compaction."

[1] 
https://lore.kernel.org/linux-mm/20201215030730.nc3cu98e4%25a...@linux-foundation.org/
[2] 
https://lore.kernel.org/linux-mm/20210302000133.272579-1-axelrasmus...@google.com/

Signed-off-by: Brian Geffon 
Acked-by: Hugh Dickins 
Tested-by: Lokesh Gidra 
---
 mm/mremap.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/mremap.c b/mm/mremap.c
index ec8f840399ed..db5b8b28c2dd 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -653,8 +653,8 @@ static struct vm_area_struct *vma_to_resize(unsigned long 
addr,
return ERR_PTR(-EINVAL);
}
 
-   if (flags & MREMAP_DONTUNMAP && (!vma_is_anonymous(vma) ||
-   vma->vm_flags & VM_SHARED))
+   if ((flags & MREMAP_DONTUNMAP) &&
+   (vma->vm_flags & (VM_DONTEXPAND | VM_PFNMAP)))
return ERR_PTR(-EINVAL);
 
if (is_vm_hugetlb_page(vma))
-- 
2.31.0.rc2.261.g7f71774620-goog



[PATCH v4 3/3] selftests: Add a MREMAP_DONTUNMAP selftest for shmem

2021-03-23 Thread Brian Geffon
This test extends the current mremap tests to validate that
the MREMAP_DONTUNMAP operation can be performed on shmem mappings.

Signed-off-by: Brian Geffon 
---
 tools/testing/selftests/vm/mremap_dontunmap.c | 52 +++
 1 file changed, 52 insertions(+)

diff --git a/tools/testing/selftests/vm/mremap_dontunmap.c 
b/tools/testing/selftests/vm/mremap_dontunmap.c
index 3a7b5ef0b0c6..f01dc4a85b0b 100644
--- a/tools/testing/selftests/vm/mremap_dontunmap.c
+++ b/tools/testing/selftests/vm/mremap_dontunmap.c
@@ -127,6 +127,57 @@ static void mremap_dontunmap_simple()
   "unable to unmap source mapping");
 }
 
+// This test validates that MREMAP_DONTUNMAP on a shared mapping works as 
expected.
+static void mremap_dontunmap_simple_shmem()
+{
+   unsigned long num_pages = 5;
+
+   int mem_fd = memfd_create("memfd", MFD_CLOEXEC);
+   BUG_ON(mem_fd < 0, "memfd_create");
+
+   BUG_ON(ftruncate(mem_fd, num_pages * page_size) < 0,
+   "ftruncate");
+
+   void *source_mapping =
+   mmap(NULL, num_pages * page_size, PROT_READ | PROT_WRITE,
+MAP_FILE | MAP_SHARED, mem_fd, 0);
+   BUG_ON(source_mapping == MAP_FAILED, "mmap");
+
+   BUG_ON(close(mem_fd) < 0, "close");
+
+   memset(source_mapping, 'a', num_pages * page_size);
+
+   // Try to just move the whole mapping anywhere (not fixed).
+   void *dest_mapping =
+   mremap(source_mapping, num_pages * page_size, num_pages * page_size,
+  MREMAP_DONTUNMAP | MREMAP_MAYMOVE, NULL);
+   if (dest_mapping == MAP_FAILED && errno == EINVAL) {
+   // Old kernel which doesn't support MREMAP_DONTUNMAP on shmem.
+   BUG_ON(munmap(source_mapping, num_pages * page_size) == -1,
+   "unable to unmap source mapping");
+   return;
+   }
+
+   BUG_ON(dest_mapping == MAP_FAILED, "mremap");
+
+   // Validate that the pages have been moved, we know they were moved if
+   // the dest_mapping contains a's.
+   BUG_ON(check_region_contains_byte
+  (dest_mapping, num_pages * page_size, 'a') != 0,
+  "pages did not migrate");
+
+   // Because the region is backed by shmem, we will actually see the same
+   // memory at the source location still.
+   BUG_ON(check_region_contains_byte
+  (source_mapping, num_pages * page_size, 'a') != 0,
+  "source should have no ptes");
+
+   BUG_ON(munmap(dest_mapping, num_pages * page_size) == -1,
+  "unable to unmap destination mapping");
+   BUG_ON(munmap(source_mapping, num_pages * page_size) == -1,
+  "unable to unmap source mapping");
+}
+
 // This test validates MREMAP_DONTUNMAP will move page tables to a specific
 // destination using MREMAP_FIXED, also while validating that the source
 // remains intact.
@@ -300,6 +351,7 @@ int main(void)
BUG_ON(page_buffer == MAP_FAILED, "unable to mmap a page.");
 
mremap_dontunmap_simple();
+   mremap_dontunmap_simple_shmem();
mremap_dontunmap_simple_fixed();
mremap_dontunmap_partial_mapping();
mremap_dontunmap_partial_mapping_overwrite();
-- 
2.31.0.rc2.261.g7f71774620-goog



[PATCH v4 2/3] Revert "mremap: don't allow MREMAP_DONTUNMAP on special_mappings and aio"

2021-03-23 Thread Brian Geffon
This reverts commit cd544fd1dc9293c6702fab6effa63dac1cc67e99.

As discussed in [1] this commit was a no-op because the mapping type was
checked in vma_to_resize before move_vma is ever called. This meant that
vm_ops->mremap() would never be called on such mappings. Furthermore,
we've since expanded support of MREMAP_DONTUNMAP to non-anonymous
mappings, and these special mappings are still protected by the existing
check of !VM_DONTEXPAND and !VM_PFNMAP which will result in a -EFAULT.

1. https://lkml.org/lkml/2020/12/28/2340

Signed-off-by: Brian Geffon 
Acked-by: Hugh Dickins 
---
 arch/x86/kernel/cpu/resctrl/pseudo_lock.c | 2 +-
 fs/aio.c  | 5 +
 include/linux/mm.h| 2 +-
 mm/mmap.c | 6 +-
 mm/mremap.c   | 2 +-
 5 files changed, 5 insertions(+), 12 deletions(-)

diff --git a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c 
b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
index e916646adc69..0daf2f1cf7a8 100644
--- a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
+++ b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
@@ -1458,7 +1458,7 @@ static int pseudo_lock_dev_release(struct inode *inode, 
struct file *filp)
return 0;
 }
 
-static int pseudo_lock_dev_mremap(struct vm_area_struct *area, unsigned long 
flags)
+static int pseudo_lock_dev_mremap(struct vm_area_struct *area)
 {
/* Not supported */
return -EINVAL;
diff --git a/fs/aio.c b/fs/aio.c
index 1f32da13d39e..76ce0cc3ee4e 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -323,16 +323,13 @@ static void aio_free_ring(struct kioctx *ctx)
}
 }
 
-static int aio_ring_mremap(struct vm_area_struct *vma, unsigned long flags)
+static int aio_ring_mremap(struct vm_area_struct *vma)
 {
struct file *file = vma->vm_file;
struct mm_struct *mm = vma->vm_mm;
struct kioctx_table *table;
int i, res = -EINVAL;
 
-   if (flags & MREMAP_DONTUNMAP)
-   return -EINVAL;
-
spin_lock(&mm->ioctx_lock);
rcu_read_lock();
table = rcu_dereference(mm->ioctx_table);
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 64a71bf20536..ecdc6e8dc5af 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -570,7 +570,7 @@ struct vm_operations_struct {
void (*close)(struct vm_area_struct * area);
/* Called any time before splitting to check if it's allowed */
int (*may_split)(struct vm_area_struct *area, unsigned long addr);
-   int (*mremap)(struct vm_area_struct *area, unsigned long flags);
+   int (*mremap)(struct vm_area_struct *area);
/*
 * Called by mprotect() to make driver-specific permission
 * checks before mprotect() is finalised.   The VMA must not
diff --git a/mm/mmap.c b/mm/mmap.c
index 3f287599a7a3..9d7651e4e1fe 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -3403,14 +3403,10 @@ static const char *special_mapping_name(struct 
vm_area_struct *vma)
return ((struct vm_special_mapping *)vma->vm_private_data)->name;
 }
 
-static int special_mapping_mremap(struct vm_area_struct *new_vma,
- unsigned long flags)
+static int special_mapping_mremap(struct vm_area_struct *new_vma)
 {
struct vm_special_mapping *sm = new_vma->vm_private_data;
 
-   if (flags & MREMAP_DONTUNMAP)
-   return -EINVAL;
-
if (WARN_ON_ONCE(current->mm != new_vma->vm_mm))
return -EFAULT;
 
diff --git a/mm/mremap.c b/mm/mremap.c
index db5b8b28c2dd..d22629ff8f3c 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -545,7 +545,7 @@ static unsigned long move_vma(struct vm_area_struct *vma,
if (moved_len < old_len) {
err = -ENOMEM;
} else if (vma->vm_ops && vma->vm_ops->mremap) {
-   err = vma->vm_ops->mremap(new_vma, flags);
+   err = vma->vm_ops->mremap(new_vma);
}
 
if (unlikely(err)) {
-- 
2.31.0.rc2.261.g7f71774620-goog



Re: [RFC 2/7] ath10k: Add support to process rx packet in thread

2021-03-22 Thread Brian Norris
On Mon, Mar 22, 2021 at 4:58 PM Ben Greear  wrote:
> On 7/22/20 6:00 AM, Felix Fietkau wrote:
> > On 2020-07-22 14:55, Johannes Berg wrote:
> >> On Wed, 2020-07-22 at 14:27 +0200, Felix Fietkau wrote:
> >>
> >>> I'm considering testing a different approach (with mt76 initially):
> >>> - Add a mac80211 rx function that puts processed skbs into a list
> >>> instead of handing them to the network stack directly.
> >>
> >> Would this be *after* all the mac80211 processing, i.e. in place of the
> >> rx-up-to-stack?
> > Yes, it would run all the rx handlers normally and then put the
> > resulting skbs into a list instead of calling netif_receive_skb or
> > napi_gro_frags.
>
> Whatever came of this?  I realized I'm running Felix's patch since his mt76
> driver needs it.  Any chance it will go upstream?

If you're asking about $subject (moving NAPI/RX to a thread), this
landed upstream recently:
http://git.kernel.org/linus/adbb4fb028452b1b0488a1a7b66ab856cdf20715

It needs a bit of coaxing to work on a WiFi driver (including: WiFi
drivers tend to have a different netdev for NAPI than they expose to
/sys/class/net/), but it's there.
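For drivers wanting to opt in, the hook itself is small; the sketch below
assumes the driver already knows which netdev owns its NAPI instance
(dev_set_threaded() is the in-kernel API from the commit linked above,
and userspace can also toggle /sys/class/net/<dev>/threaded):

#include <linux/netdevice.h>

/* Sketch: opt a driver's RX NAPI into threaded mode. For WiFi drivers,
 * napi_dev may be an internal netdev rather than the one exposed in
 * /sys/class/net.
 */
static int example_enable_threaded_napi(struct net_device *napi_dev)
{
	return dev_set_threaded(napi_dev, true);
}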

I'm not sure if people had something else in mind in the stuff you're
quoting though.

Brian


Re: [PATCH 2/2] ibmvfc: make ibmvfc_wait_for_ops MQ aware

2021-03-19 Thread Brian King
Reviewed-by: Brian King 


-- 
Brian King
Power Linux I/O
IBM Linux Technology Center



Re: [PATCH 1/2] ibmvfc: fix potential race in ibmvfc_wait_for_ops

2021-03-19 Thread Brian King
Reviewed-by: Brian King 

-- 
Brian King
Power Linux I/O
IBM Linux Technology Center



[PATCH v3 2/2] Revert "mremap: don't allow MREMAP_DONTUNMAP on special_mappings and aio"

2021-03-17 Thread Brian Geffon
This reverts commit cd544fd1dc9293c6702fab6effa63dac1cc67e99.

As discussed in [1] this commit was a no-op because the mapping type was
checked in vma_to_resize before move_vma is ever called. This meant that
vm_ops->mremap() would never be called on such mappings. Furthermore,
we've since expanded support of MREMAP_DONTUNMAP to non-anonymous
mappings, and these special mappings are still protected by the existing
check of !VM_DONTEXPAND and !VM_PFNMAP which will result in a -EFAULT.

1. https://lkml.org/lkml/2020/12/28/2340

Signed-off-by: Brian Geffon 
---
 arch/x86/kernel/cpu/resctrl/pseudo_lock.c | 2 +-
 fs/aio.c  | 5 +
 include/linux/mm.h| 2 +-
 mm/mmap.c | 6 +-
 mm/mremap.c   | 2 +-
 5 files changed, 5 insertions(+), 12 deletions(-)

diff --git a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c 
b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
index e916646adc69..0daf2f1cf7a8 100644
--- a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
+++ b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
@@ -1458,7 +1458,7 @@ static int pseudo_lock_dev_release(struct inode *inode, 
struct file *filp)
return 0;
 }
 
-static int pseudo_lock_dev_mremap(struct vm_area_struct *area, unsigned long 
flags)
+static int pseudo_lock_dev_mremap(struct vm_area_struct *area)
 {
/* Not supported */
return -EINVAL;
diff --git a/fs/aio.c b/fs/aio.c
index 1f32da13d39e..76ce0cc3ee4e 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -323,16 +323,13 @@ static void aio_free_ring(struct kioctx *ctx)
}
 }
 
-static int aio_ring_mremap(struct vm_area_struct *vma, unsigned long flags)
+static int aio_ring_mremap(struct vm_area_struct *vma)
 {
struct file *file = vma->vm_file;
struct mm_struct *mm = vma->vm_mm;
struct kioctx_table *table;
int i, res = -EINVAL;
 
-   if (flags & MREMAP_DONTUNMAP)
-   return -EINVAL;
-
spin_lock(&mm->ioctx_lock);
rcu_read_lock();
table = rcu_dereference(mm->ioctx_table);
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 77e64e3eac80..8c3729eb3e38 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -570,7 +570,7 @@ struct vm_operations_struct {
void (*close)(struct vm_area_struct * area);
/* Called any time before splitting to check if it's allowed */
int (*may_split)(struct vm_area_struct *area, unsigned long addr);
-   int (*mremap)(struct vm_area_struct *area, unsigned long flags);
+   int (*mremap)(struct vm_area_struct *area);
/*
 * Called by mprotect() to make driver-specific permission
 * checks before mprotect() is finalised.   The VMA must not
diff --git a/mm/mmap.c b/mm/mmap.c
index 3f287599a7a3..9d7651e4e1fe 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -3403,14 +3403,10 @@ static const char *special_mapping_name(struct 
vm_area_struct *vma)
return ((struct vm_special_mapping *)vma->vm_private_data)->name;
 }
 
-static int special_mapping_mremap(struct vm_area_struct *new_vma,
- unsigned long flags)
+static int special_mapping_mremap(struct vm_area_struct *new_vma)
 {
struct vm_special_mapping *sm = new_vma->vm_private_data;
 
-   if (flags & MREMAP_DONTUNMAP)
-   return -EINVAL;
-
if (WARN_ON_ONCE(current->mm != new_vma->vm_mm))
return -EFAULT;
 
diff --git a/mm/mremap.c b/mm/mremap.c
index db5b8b28c2dd..d22629ff8f3c 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -545,7 +545,7 @@ static unsigned long move_vma(struct vm_area_struct *vma,
if (moved_len < old_len) {
err = -ENOMEM;
} else if (vma->vm_ops && vma->vm_ops->mremap) {
-   err = vma->vm_ops->mremap(new_vma, flags);
+   err = vma->vm_ops->mremap(new_vma);
}
 
if (unlikely(err)) {
-- 
2.31.0.rc2.261.g7f71774620-goog



[PATCH v3 1/2] mm: Allow non-VM_DONTEXPAND and VM_PFNMAP mappings with MREMAP_DONTUNMAP

2021-03-17 Thread Brian Geffon
Currently MREMAP_DONTUNMAP only accepts private anonymous mappings. This
change will widen the support to include any mappings which are not
VM_DONTEXPAND or VM_PFNMAP. The primary use case is to support
MREMAP_DONTUNMAP on mappings which may have been created from a memfd.

This change will result in mremap(MREMAP_DONTUNMAP) returning -EINVAL
if VM_DONTEXPAND or VM_PFNMAP mappings are specified.

Lokesh Gidra who works on the Android JVM, provided an explanation of how
such a feature will improve Android JVM garbage collection:
"Android is developing a new garbage collector (GC), based on userfaultfd.
The garbage collector will use userfaultfd (uffd) on the java heap during
compaction. On accessing any uncompacted page, the application threads will
find it missing, at which point the thread will create the compacted page
and then use UFFDIO_COPY ioctl to get it mapped and then resume execution.
Before starting this compaction, in a stop-the-world pause the heap will be
mremap(MREMAP_DONTUNMAP) so that the java heap is ready to receive
UFFD_EVENT_PAGEFAULT events after resuming execution.

To speedup mremap operations, pagetable movement was optimized by moving
PUD entries instead of PTE entries [1]. It was necessary as mremap of even
modest sized memory ranges also took several milliseconds, and stopping the
application for that long isn't acceptable in response-time sensitive
cases.

With UFFDIO_CONTINUE feature [2], it will be even more efficient to
implement this GC, particularly the 'non-moveable' portions of the heap.
It will also help in reducing the need to copy (UFFDIO_COPY) the pages.
However, for this to work, the java heap has to be on a 'shared' vma.
Currently MREMAP_DONTUNMAP only supports private anonymous mappings, this
patch will enable using UFFDIO_CONTINUE for the new userfaultfd-based heap
compaction."

[1] 
https://lore.kernel.org/linux-mm/20201215030730.nc3cu98e4%25a...@linux-foundation.org/
[2] 
https://lore.kernel.org/linux-mm/20210302000133.272579-1-axelrasmus...@google.com/

Signed-off-by: Brian Geffon 
---
 mm/mremap.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/mremap.c b/mm/mremap.c
index ec8f840399ed..db5b8b28c2dd 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -653,8 +653,8 @@ static struct vm_area_struct *vma_to_resize(unsigned long addr,
return ERR_PTR(-EINVAL);
}
 
-   if (flags & MREMAP_DONTUNMAP && (!vma_is_anonymous(vma) ||
-   vma->vm_flags & VM_SHARED))
+   if ((flags & MREMAP_DONTUNMAP) &&
+   (vma->vm_flags & (VM_DONTEXPAND | VM_PFNMAP)))
return ERR_PTR(-EINVAL);
 
if (is_vm_hugetlb_page(vma))
-- 
2.31.0.rc2.261.g7f71774620-goog



Re: [PATCH v2 1/2] mm: Allow non-VM_DONTEXPAND and VM_PFNMAP mappings with MREMAP_DONTUNMAP

2021-03-17 Thread Brian Geffon
You're 100% correct, I'll mail a new patch in a few

Brian


On Wed, Mar 17, 2021 at 5:19 PM Peter Xu  wrote:
>
> On Wed, Mar 17, 2021 at 04:44:25PM -0400, Brian Geffon wrote:
> > Hi Peter,
>
> Hi, Brian,
>
> > Thank you as always for taking a look. This change relies on the
> > existing check in vma_to_resize on line 686:
> > https://elixir.bootlin.com/linux/v5.12-rc3/source/mm/mremap.c#L686
> > which returns -EFAULT when the vma is VM_DONTEXPAND or VM_PFNMAP.
>
> Do you mean line 676?
>
> https://elixir.bootlin.com/linux/v5.12-rc3/source/mm/mremap.c#L676
>
> I'm not sure whether it'll work for MREMAP_DONTUNMAP, since IIUC
> MREMAP_DONTUNMAP only works for the remap case with no size change, however in
> that case in vma_to_resize() we'll bail out even earlier than line 676 when
> checking against the size:
>
> https://elixir.bootlin.com/linux/v5.12-rc3/source/mm/mremap.c#L667
>
> So IIUC we'll still need the change as Hugh suggested previously.
>
> Thanks,
>
> --
> Peter Xu
>
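
For reference, the ordering Peter describes, paraphrased from
vma_to_resize() in v5.12-rc3 (see the elixir links above; line numbers
approximate, not a verbatim copy):

```c
	if (old_len > vma->vm_end - addr)
		return ERR_PTR(-EFAULT);

	if (new_len == old_len)		/* ~line 667: MREMAP_DONTUNMAP   */
		return vma;		/* requires equal sizes, so it   */
					/* always returns here...        */

	/* ...and this check (~line 676) is never reached for it: */
	if (vma->vm_flags & (VM_DONTEXPAND | VM_PFNMAP))
		return ERR_PTR(-EFAULT);
```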


Re: [PATCH v2 1/2] mm: Allow non-VM_DONTEXPAND and VM_PFNMAP mappings with MREMAP_DONTUNMAP

2021-03-17 Thread Brian Geffon
Hi Peter,
Thank you as always for taking a look. This change relies on the
existing check in vma_to_resize on line 686:
https://elixir.bootlin.com/linux/v5.12-rc3/source/mm/mremap.c#L686
which returns -EFAULT when the vma is VM_DONTEXPAND or VM_PFNMAP.

Thanks
Brian

On Wed, Mar 17, 2021 at 4:40 PM Peter Xu  wrote:
>
> Hi, Brian,
>
> On Wed, Mar 17, 2021 at 12:13:33PM -0700, Brian Geffon wrote:
> > Currently MREMAP_DONTUNMAP only accepts private anonymous mappings. This
> > change will widen the support to include any mappings which are not
> > VM_DONTEXPAND or VM_PFNMAP. The primary use case is to support
> > MREMAP_DONTUNMAP on mappings which may have been created from a memfd.
> >
> > This change which takes advantage of the existing check in vma_to_resize
> > for non-VM_DONTEXPAND and non-VM_PFNMAP mappings will cause
> > MREMAP_DONTUNMAP to return -EFAULT if such mappings are remapped. This
> > behavior is consistent with existing behavior when using mremap with
> > such mappings.
> >
> > Lokesh Gidra who works on the Android JVM, provided an explanation of how
> > such a feature will improve Android JVM garbage collection:
> > "Android is developing a new garbage collector (GC), based on userfaultfd.
> > The garbage collector will use userfaultfd (uffd) on the java heap during
> > compaction. On accessing any uncompacted page, the application threads will
> > find it missing, at which point the thread will create the compacted page
> > and then use UFFDIO_COPY ioctl to get it mapped and then resume execution.
> > Before starting this compaction, in a stop-the-world pause the heap will be
> > mremap(MREMAP_DONTUNMAP) so that the java heap is ready to receive
> > UFFD_EVENT_PAGEFAULT events after resuming execution.
> >
> > To speedup mremap operations, pagetable movement was optimized by moving
> > PUD entries instead of PTE entries [1]. It was necessary as mremap of even
> > modest sized memory ranges also took several milliseconds, and stopping the
> > application for that long isn't acceptable in response-time sensitive
> > cases.
> >
> > With UFFDIO_CONTINUE feature [2], it will be even more efficient to
> > implement this GC, particularly the 'non-moveable' portions of the heap.
> > It will also help in reducing the need to copy (UFFDIO_COPY) the pages.
> > However, for this to work, the java heap has to be on a 'shared' vma.
> > Currently MREMAP_DONTUNMAP only supports private anonymous mappings, this
> > patch will enable using UFFDIO_CONTINUE for the new userfaultfd-based heap
> > compaction."
> >
> > [1] 
> > https://lore.kernel.org/linux-mm/20201215030730.nc3cu98e4%25a...@linux-foundation.org/
> > [2] 
> > https://lore.kernel.org/linux-mm/20210302000133.272579-1-axelrasmus...@google.com/
> >
> > Signed-off-by: Brian Geffon 
> > ---
> >  mm/mremap.c | 4 
> >  1 file changed, 4 deletions(-)
> >
> > diff --git a/mm/mremap.c b/mm/mremap.c
> > index ec8f840399ed..2c57dc4bc8b6 100644
> > --- a/mm/mremap.c
> > +++ b/mm/mremap.c
> > @@ -653,10 +653,6 @@ static struct vm_area_struct *vma_to_resize(unsigned long addr,
> >   return ERR_PTR(-EINVAL);
> >   }
> >
> > - if (flags & MREMAP_DONTUNMAP && (!vma_is_anonymous(vma) ||
> > - vma->vm_flags & VM_SHARED))
> > - return ERR_PTR(-EINVAL);
> > -
> >   if (is_vm_hugetlb_page(vma))
> >   return ERR_PTR(-EINVAL);
>
> The code change seems to be not aligned with what the commit message said.  
> Did
> you perhaps forget to add the checks against VM_DONTEXPAND | VM_PFNMAP?  I'm
> guessing that (instead of commit message to be touched up) because you still
> attached the revert patch, then that check seems to be needed.  Thanks,
>
> --
> Peter Xu
>


[PATCH v2 2/2] Revert "mremap: don't allow MREMAP_DONTUNMAP on special_mappings and aio"

2021-03-17 Thread Brian Geffon
This reverts commit cd544fd1dc9293c6702fab6effa63dac1cc67e99.

As discussed in [1] this commit was a no-op because the mapping type was
checked in vma_to_resize before move_vma is ever called. This meant that
vm_ops->mremap() would never be called on such mappings. Furthermore,
we've since expanded support of MREMAP_DONTUNMAP to non-anonymous
mappings, and these special mappings are still protected by the existing
check of !VM_DONTEXPAND and !VM_PFNMAP, which will result in -EFAULT.

1. https://lkml.org/lkml/2020/12/28/2340
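
In other words (paraphrased call flow from mm/mremap.c, not a verbatim
copy):

```c
/*
 *   sys_mremap()
 *     -> vma_to_resize()   rejects VM_DONTEXPAND/VM_PFNMAP with -EFAULT
 *     -> move_vma()        the only caller of vm_ops->mremap()
 *
 * vma_to_resize() always runs first, so the MREMAP_DONTUNMAP checks
 * removed below could never trigger.
 */
```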

Signed-off-by: Brian Geffon 
---
 arch/x86/kernel/cpu/resctrl/pseudo_lock.c | 2 +-
 fs/aio.c  | 5 +
 include/linux/mm.h| 2 +-
 mm/mmap.c | 6 +-
 mm/mremap.c   | 2 +-
 5 files changed, 5 insertions(+), 12 deletions(-)

diff --git a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
index e916646adc69..0daf2f1cf7a8 100644
--- a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
+++ b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
@@ -1458,7 +1458,7 @@ static int pseudo_lock_dev_release(struct inode *inode, struct file *filp)
return 0;
 }
 
-static int pseudo_lock_dev_mremap(struct vm_area_struct *area, unsigned long flags)
+static int pseudo_lock_dev_mremap(struct vm_area_struct *area)
 {
/* Not supported */
return -EINVAL;
diff --git a/fs/aio.c b/fs/aio.c
index 1f32da13d39e..76ce0cc3ee4e 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -323,16 +323,13 @@ static void aio_free_ring(struct kioctx *ctx)
}
 }
 
-static int aio_ring_mremap(struct vm_area_struct *vma, unsigned long flags)
+static int aio_ring_mremap(struct vm_area_struct *vma)
 {
struct file *file = vma->vm_file;
struct mm_struct *mm = vma->vm_mm;
struct kioctx_table *table;
int i, res = -EINVAL;
 
-   if (flags & MREMAP_DONTUNMAP)
-   return -EINVAL;
-
spin_lock(&mm->ioctx_lock);
rcu_read_lock();
table = rcu_dereference(mm->ioctx_table);
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 77e64e3eac80..8c3729eb3e38 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -570,7 +570,7 @@ struct vm_operations_struct {
void (*close)(struct vm_area_struct * area);
/* Called any time before splitting to check if it's allowed */
int (*may_split)(struct vm_area_struct *area, unsigned long addr);
-   int (*mremap)(struct vm_area_struct *area, unsigned long flags);
+   int (*mremap)(struct vm_area_struct *area);
/*
 * Called by mprotect() to make driver-specific permission
 * checks before mprotect() is finalised.   The VMA must not
diff --git a/mm/mmap.c b/mm/mmap.c
index 3f287599a7a3..9d7651e4e1fe 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -3403,14 +3403,10 @@ static const char *special_mapping_name(struct vm_area_struct *vma)
return ((struct vm_special_mapping *)vma->vm_private_data)->name;
 }
 
-static int special_mapping_mremap(struct vm_area_struct *new_vma,
- unsigned long flags)
+static int special_mapping_mremap(struct vm_area_struct *new_vma)
 {
struct vm_special_mapping *sm = new_vma->vm_private_data;
 
-   if (flags & MREMAP_DONTUNMAP)
-   return -EINVAL;
-
if (WARN_ON_ONCE(current->mm != new_vma->vm_mm))
return -EFAULT;
 
diff --git a/mm/mremap.c b/mm/mremap.c
index 2c57dc4bc8b6..b1f7bc43ece9 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -545,7 +545,7 @@ static unsigned long move_vma(struct vm_area_struct *vma,
if (moved_len < old_len) {
err = -ENOMEM;
} else if (vma->vm_ops && vma->vm_ops->mremap) {
-   err = vma->vm_ops->mremap(new_vma, flags);
+   err = vma->vm_ops->mremap(new_vma);
}
 
if (unlikely(err)) {
-- 
2.31.0.rc2.261.g7f71774620-goog



[PATCH v2 1/2] mm: Allow non-VM_DONTEXPAND and VM_PFNMAP mappings with MREMAP_DONTUNMAP

2021-03-17 Thread Brian Geffon
Currently MREMAP_DONTUNMAP only accepts private anonymous mappings. This
change will widen the support to include any mappings which are not
VM_DONTEXPAND or VM_PFNMAP. The primary use case is to support
MREMAP_DONTUNMAP on mappings which may have been created from a memfd.

This change, which takes advantage of the existing check in vma_to_resize
for non-VM_DONTEXPAND and non-VM_PFNMAP mappings, will cause
MREMAP_DONTUNMAP to return -EFAULT if such mappings are remapped. This
behavior is consistent with existing behavior when using mremap with
such mappings.

Lokesh Gidra, who works on the Android JVM, provided an explanation of how
such a feature will improve Android JVM garbage collection:
"Android is developing a new garbage collector (GC), based on userfaultfd.
The garbage collector will use userfaultfd (uffd) on the java heap during
compaction. On accessing any uncompacted page, the application threads will
find it missing, at which point the thread will create the compacted page
and then use UFFDIO_COPY ioctl to get it mapped and then resume execution.
Before starting this compaction, in a stop-the-world pause the heap will be
mremap(MREMAP_DONTUNMAP) so that the java heap is ready to receive
UFFD_EVENT_PAGEFAULT events after resuming execution.

To speed up mremap operations, pagetable movement was optimized by moving
PUD entries instead of PTE entries [1]. It was necessary as mremap of even
modest-sized memory ranges also took several milliseconds, and stopping the
application for that long isn't acceptable in response-time sensitive
cases.

With UFFDIO_CONTINUE feature [2], it will be even more efficient to
implement this GC, particularly the 'non-moveable' portions of the heap.
It will also help in reducing the need to copy (UFFDIO_COPY) the pages.
However, for this to work, the java heap has to be on a 'shared' vma.
Currently MREMAP_DONTUNMAP only supports private anonymous mappings; this
patch will enable using UFFDIO_CONTINUE for the new userfaultfd-based heap
compaction."

[1] 
https://lore.kernel.org/linux-mm/20201215030730.nc3cu98e4%25a...@linux-foundation.org/
[2] 
https://lore.kernel.org/linux-mm/20210302000133.272579-1-axelrasmus...@google.com/

Signed-off-by: Brian Geffon 
---
 mm/mremap.c | 4 
 1 file changed, 4 deletions(-)

diff --git a/mm/mremap.c b/mm/mremap.c
index ec8f840399ed..2c57dc4bc8b6 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -653,10 +653,6 @@ static struct vm_area_struct *vma_to_resize(unsigned long addr,
return ERR_PTR(-EINVAL);
}
 
-   if (flags & MREMAP_DONTUNMAP && (!vma_is_anonymous(vma) ||
-   vma->vm_flags & VM_SHARED))
-   return ERR_PTR(-EINVAL);
-
if (is_vm_hugetlb_page(vma))
return ERR_PTR(-EINVAL);
 
-- 
2.31.0.rc2.261.g7f71774620-goog



Re: [PATCH] mm: Allow shmem mappings with MREMAP_DONTUNMAP

2021-03-16 Thread Brian Geffon
Hi Hugh,
Thanks for this suggestion, responses in line.

> A better patch would say:
>
> -   if (flags & MREMAP_DONTUNMAP && (!vma_is_anonymous(vma) ||
> -   vma->vm_flags & VM_SHARED))
> +   if ((flags & MREMAP_DONTUNMAP) &&
> +   (vma->vm_flags & (VM_DONTEXPAND | VM_PFNMAP)))
> return ERR_PTR(-EINVAL);
>
> VM_DONTEXPAND is what has long been used on special mappings, to prevent
> surprises from mremap changing the size of the mapping: MREMAP_DONTUNMAP
> introduced a different way of expanding the mapping, so VM_DONTEXPAND
> still seems a reasonable name (I've thrown in VM_PFNMAP there because
> it's in the VM_DONTEXPAND test lower down: for safety I guess, and best
> if both behave the same - though one says -EINVAL and the other -EFAULT).

I like this idea and am happy to mail a new patch. I think it may make
sense to bring the lower block up here so that it becomes clearer
that it's not duplicate code and that the MREMAP_DONTUNMAP case
returns -EINVAL and other cases return -EFAULT. I wonder if the
-EFAULT error code would have made more sense from the start for both
cases, do you have any thoughts on changing the error code at this
point?

> With that VM_DONTEXPAND check in, Dmitry's commit cd544fd1dc92
> ("mremap: don't allow MREMAP_DONTUNMAP on special_mappings and aio")
> can still be reverted (as you agreed on 28th December), even though
> vma_is_anonymous() will no longer protect it.

I agree and if Dmitry does not have time I would be happy to mail a
revert to cd544fd1dc92 as we discussed in [1]. Dmitry, would you like
me to do that?

> Was there an mremap(2) man page update for MREMAP_DONTUNMAP?
> Whether or not there was before, it ought to get one now.

Yes, the mremap(2) man page was updated when this flag was added and
it will require a further update to reflect this expanded mapping
support.

Thanks
Brian

1. https://lkml.org/lkml/2020/12/28/2340


[PATCH] eventfd: Introduce EFD_ZERO_ON_WAKE

2021-03-10 Thread Brian Geffon
This patch introduces a new flag to eventfd, called EFD_ZERO_ON_WAKE.
This change is primarily introduced for use cases which do not care about
the value stored in the eventfd itself. Such existing use cases require an
additional read syscall to clear the count.

This flag provides the following guarantees:

(1) Writes can never block or return EAGAIN.
This is true because we don't actually need to store the
value and as a result the internal value is only changed between 0 and
1 and back to 0. Therefore POLLERR and POLLOUT are never possible
outcomes. A poll with POLLOUT or a write will always immediately
return regardless of EFD_NONBLOCK.

(2) Read / POLLIN result in the internal value being reset to 0.
When EFD_NONBLOCK is set, reads when the internal value is 0 will
immediately return with EAGAIN, as they always have. Similarly, when
a read is performed without EFD_NONBLOCK it will block until a write
occurs. In both cases after the read completes successfully the
internal value is reset to 0. When polling with POLLIN, upon return
of a POLLIN event the internal value will be reset to 0.
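
A minimal usage sketch of what this saves callers (hypothetical, since the
flag is only proposed here; the fallback define mirrors this patch's
repurposing of O_NOFOLLOW):

```c
#include <sys/eventfd.h>
#include <fcntl.h>
#include <poll.h>

#ifndef EFD_ZERO_ON_WAKE
#define EFD_ZERO_ON_WAKE O_NOFOLLOW	/* value proposed by this patch */
#endif

int main(void)
{
	int efd = eventfd(0, EFD_ZERO_ON_WAKE);
	struct pollfd pfd = { .fd = efd, .events = POLLIN };

	eventfd_write(efd, 1);	/* per (1): never blocks or EAGAINs */
	poll(&pfd, 1, -1);	/* per (2): POLLIN resets the count to 0, */
	return 0;		/* so no draining read() is needed */
}
```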

Signed-off-by: Brian Geffon 
---
 fs/eventfd.c| 39 +--
 include/linux/eventfd.h |  8 +++-
 2 files changed, 44 insertions(+), 3 deletions(-)

diff --git a/fs/eventfd.c b/fs/eventfd.c
index e265b6dd4f34..56bf04d6461e 100644
--- a/fs/eventfd.c
+++ b/fs/eventfd.c
@@ -172,8 +172,21 @@ static __poll_t eventfd_poll(struct file *file, poll_table *wait)
 */
count = READ_ONCE(ctx->count);
 
-   if (count > 0)
+   if (count > 0) {
+   if ((ctx->flags & EFD_ZERO_ON_WAKE) &&
+   (poll_requested_events(wait) & EPOLLIN)) {
+		/*
+		 * We're going to cause a wake on EPOLLIN, we need to zero the count.
+		 * We validate that EPOLLIN is a requested event because if the user
+		 * did something odd like POLLPRI we wouldn't want to zero the count
+		 * if no wake happens.
+		 */
+   spin_lock_irq(&ctx->wqh.lock);
+   ctx->count = 0;
+   spin_unlock_irq(&ctx->wqh.lock);
+   }
events |= EPOLLIN;
+   }
if (count == ULLONG_MAX)
events |= EPOLLERR;
if (ULLONG_MAX - 1 > count)
@@ -239,8 +252,11 @@ static ssize_t eventfd_read(struct kiocb *iocb, struct iov_iter *to)
__add_wait_queue(&ctx->wqh, &wait);
for (;;) {
set_current_state(TASK_INTERRUPTIBLE);
-   if (ctx->count)
+   if (ctx->count) {
+   if (ctx->flags & EFD_ZERO_ON_WAKE)
+   ctx->count = 0;
break;
+   }
if (signal_pending(current)) {
__remove_wait_queue(&ctx->wqh, &wait);
__set_current_state(TASK_RUNNING);
@@ -280,6 +296,18 @@ static ssize_t eventfd_write(struct file *file, const char __user *buf, size_t c
return -EINVAL;
spin_lock_irq(&ctx->wqh.lock);
res = -EAGAIN;
+
+	/*
+	 * In the case of EFD_ZERO_ON_WAKE the actual count is never needed, for this
+	 * reason we only adjust it to set it from 0 to 1 or 1 to 0. This means that
+	 * write will never return EWOULDBLOCK or block, because there is always
+	 * going to be enough space to write as the amount we will increment could
+	 * be at most 1 as it's clamped below. Additionally, we know that POLLERR
+	 * cannot be returned when EFD_ZERO_ON_WAKE is used for the same reason.
+	 */
+   if (ctx->flags & EFD_ZERO_ON_WAKE)
+   ucnt = (ctx->count == 0) ? 1 : 0;
+
if (ULLONG_MAX - ctx->count > ucnt)
res = sizeof(ucnt);
else if (!(file->f_flags & O_NONBLOCK)) {
@@ -414,9 +442,16 @@ static int do_eventfd(unsigned int count, int flags)
BUILD_BUG_ON(EFD_CLOEXEC != O_CLOEXEC);
BUILD_BUG_ON(EFD_NONBLOCK != O_NONBLOCK);
 
+   /* O_NOFOLLOW has been repurposed as EFD_ZERO_ON_WAKE */
+   BUILD_BUG_ON(EFD_ZERO_ON_WAKE != O_NOFOLLOW);
+
if (flags & ~EFD_FLAGS_SET)
return -EINVAL;
 
+   /* The semaphore semantics would be lost if using EFD_ZERO_ON_WAKE */
+   if ((flags & EFD_ZERO_ON_WAKE) && (flags & EFD_SEMAPHORE))
+   return -EINVAL;
+
ctx = kmalloc(sizeof(*ctx), GFP_KERNEL);
if (!ctx)
return -ENOMEM;
diff --git a/include/linux/eventfd.h b/include

Re: [PATCH v3 05/11] x86/entry: Convert ret_from_fork to C

2021-03-04 Thread Brian Gerst
On Thu, Mar 4, 2021 at 2:16 PM Andy Lutomirski  wrote:
>
> ret_from_fork is written in asm, slightly differently, for x86_32 and
> x86_64.  Convert it to C.
>
> This is a straight conversion without any particular cleverness.  As a
> further cleanup, the code that sets up the ret_from_fork argument registers
> could be adjusted to put the arguments in the correct registers.

An alternative would be to stash the function pointer and argument in
the pt_regs of the new kthread task, and test for PF_KTHREAD instead.
That would eliminate the need to pass those two values to the C
version of ret_from_fork().
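
Something like the following (a rough sketch; the pt_regs fields chosen
here, bx for the function and r12 for the argument, echo the registers the
old assembly used and are otherwise hypothetical):

```c
__visible void noinstr ret_from_fork(struct task_struct *prev,
				     struct pt_regs *regs)
{
	instrumentation_begin();
	schedule_tail(prev);

	/* Rare path: recover fn/arg stashed by copy_thread(). */
	if (unlikely(current->flags & PF_KTHREAD)) {
		int (*fn)(void *) = (int (*)(void *))regs->bx;

		fn((void *)regs->r12);
	}
	/* ...then return to user mode as the patch already does... */
	instrumentation_end();
}
```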

> This will cause the ORC unwinder to find pt_regs even for kernel threads on
> x86_64.  This seems harmless.
>
> The 32-bit comment above the now-deleted schedule_tail_wrapper was
> obsolete: the encode_frame_pointer mechanism (see copy_thread()) solves the
> same problem more cleanly.
>
> Cc: Josh Poimboeuf 
> Signed-off-by: Andy Lutomirski 
> ---
>  arch/x86/entry/common.c  | 23 ++
>  arch/x86/entry/entry_32.S| 51 +---
>  arch/x86/entry/entry_64.S| 33 +
>  arch/x86/include/asm/switch_to.h |  2 +-
>  arch/x86/kernel/process.c|  2 +-
>  arch/x86/kernel/process_32.c |  2 +-
>  arch/x86/kernel/unwind_orc.c |  2 +-
>  7 files changed, 43 insertions(+), 72 deletions(-)
>
> diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
> index 95776f16c1cb..ef1c65938a6b 100644
> --- a/arch/x86/entry/common.c
> +++ b/arch/x86/entry/common.c
> @@ -214,6 +214,29 @@ SYSCALL_DEFINE0(ni_syscall)
> return -ENOSYS;
>  }
>
> +void ret_from_fork(struct task_struct *prev,
> +  int (*kernel_thread_fn)(void *),
> +  void *kernel_thread_arg,
> +  struct pt_regs *user_regs);
> +
> +__visible void noinstr ret_from_fork(struct task_struct *prev,
> +int (*kernel_thread_fn)(void *),
> +void *kernel_thread_arg,
> +struct pt_regs *user_regs)
> +{
> +   instrumentation_begin();
> +
> +   schedule_tail(prev);
> +
> +   if (kernel_thread_fn) {

This should have an unlikely(), as kernel threads should be the rare case.

--
Brian Gerst


Re: [PATCH] mm: Allow shmem mappings with MREMAP_DONTUNMAP

2021-03-03 Thread Brian Geffon
I apologize, this patch didn't include my Signed-off-by; here it is:

Signed-off-by: Brian Geffon 


On Wed, Mar 3, 2021 at 9:53 AM Brian Geffon  wrote:
>
> Currently MREMAP_DONTUNMAP only accepts private anonymous mappings. This 
> change
> will widen the support to include shmem mappings. The primary use case
> is to support MREMAP_DONTUNMAP on mappings which may have been created from
> a memfd.
>
> Lokesh Gidra who works on the Android JVM, provided an explanation of how such
> a feature will improve Android JVM garbage collection:
> "Android is developing a new garbage collector (GC), based on userfaultfd. The
> garbage collector will use userfaultfd (uffd) on the java heap during 
> compaction.
> On accessing any uncompacted page, the application threads will find it 
> missing,
> at which point the thread will create the compacted page and then use 
> UFFDIO_COPY
> ioctl to get it mapped and then resume execution. Before starting this 
> compaction,
> in a stop-the-world pause the heap will be mremap(MREMAP_DONTUNMAP) so that 
> the
> java heap is ready to receive UFFD_EVENT_PAGEFAULT events after resuming 
> execution.
>
> To speedup mremap operations, pagetable movement was optimized by moving PUD 
> entries
> instead of PTE entries [1]. It was necessary as mremap of even modest sized 
> memory
> ranges also took several milliseconds, and stopping the application for that 
> long
> isn't acceptable in response-time sensitive cases. With UFFDIO_CONTINUE 
> feature [2],
> it will be even more efficient to implement this GC, particularly the 
> 'non-moveable'
> portions of the heap. It will also help in reducing the need to copy 
> (UFFDIO_COPY)
> the pages. However, for this to work, the java heap has to be on a 'shared' 
> vma.
> Currently MREMAP_DONTUNMAP only supports private anonymous mappings, this 
> patch will
> enable using UFFDIO_CONTINUE for the new userfaultfd-based heap compaction."
>
> [1] 
> https://lore.kernel.org/linux-mm/20201215030730.nc3cu98e4%25a...@linux-foundation.org/
> [2] 
> https://lore.kernel.org/linux-mm/20210302000133.272579-1-axelrasmus...@google.com/
> ---
>  mm/mremap.c | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
>
> diff --git a/mm/mremap.c b/mm/mremap.c
> index ec8f840399ed..6934d199da54 100644
> --- a/mm/mremap.c
> +++ b/mm/mremap.c
> @@ -653,8 +653,7 @@ static struct vm_area_struct *vma_to_resize(unsigned long addr,
> return ERR_PTR(-EINVAL);
> }
>
> -   if (flags & MREMAP_DONTUNMAP && (!vma_is_anonymous(vma) ||
> -   vma->vm_flags & VM_SHARED))
> +   if (flags & MREMAP_DONTUNMAP && !(vma_is_anonymous(vma) || vma_is_shmem(vma)))
> return ERR_PTR(-EINVAL);
>
> if (is_vm_hugetlb_page(vma))
> --
> 2.31.0.rc0.254.gbdcc3b1a9d-goog
>


[PATCH] mm: Allow shmem mappings with MREMAP_DONTUNMAP

2021-03-03 Thread Brian Geffon
Currently MREMAP_DONTUNMAP only accepts private anonymous mappings. This change
will widen the support to include shmem mappings. The primary use case
is to support MREMAP_DONTUNMAP on mappings which may have been created from
a memfd.

Lokesh Gidra, who works on the Android JVM, provided an explanation of how such
a feature will improve Android JVM garbage collection:
"Android is developing a new garbage collector (GC), based on userfaultfd. The
garbage collector will use userfaultfd (uffd) on the java heap during 
compaction.
On accessing any uncompacted page, the application threads will find it missing,
at which point the thread will create the compacted page and then use 
UFFDIO_COPY
ioctl to get it mapped and then resume execution. Before starting this 
compaction,
in a stop-the-world pause the heap will be mremap(MREMAP_DONTUNMAP) so that the
java heap is ready to receive UFFD_EVENT_PAGEFAULT events after resuming 
execution.

To speedup mremap operations, pagetable movement was optimized by moving PUD 
entries
instead of PTE entries [1]. It was necessary as mremap of even modest sized 
memory
ranges also took several milliseconds, and stopping the application for that 
long
isn't acceptable in response-time sensitive cases. With UFFDIO_CONTINUE feature 
[2],
it will be even more efficient to implement this GC, particularly the 
'non-moveable'
portions of the heap. It will also help in reducing the need to copy 
(UFFDIO_COPY)
the pages. However, for this to work, the java heap has to be on a 'shared' vma.
Currently MREMAP_DONTUNMAP only supports private anonymous mappings, this patch 
will
enable using UFFDIO_CONTINUE for the new userfaultfd-based heap compaction."

[1] 
https://lore.kernel.org/linux-mm/20201215030730.nc3cu98e4%25a...@linux-foundation.org/
[2] 
https://lore.kernel.org/linux-mm/20210302000133.272579-1-axelrasmus...@google.com/
---
 mm/mremap.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/mm/mremap.c b/mm/mremap.c
index ec8f840399ed..6934d199da54 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -653,8 +653,7 @@ static struct vm_area_struct *vma_to_resize(unsigned long addr,
return ERR_PTR(-EINVAL);
}
 
-   if (flags & MREMAP_DONTUNMAP && (!vma_is_anonymous(vma) ||
-   vma->vm_flags & VM_SHARED))
+   if (flags & MREMAP_DONTUNMAP && !(vma_is_anonymous(vma) || vma_is_shmem(vma)))
return ERR_PTR(-EINVAL);
 
if (is_vm_hugetlb_page(vma))
-- 
2.31.0.rc0.254.gbdcc3b1a9d-goog



Re: [PATCH v4 5/5] ibmvfc: reinitialize sub-CRQs and perform channel enquiry after LPM

2021-02-26 Thread Brian King
Reviewed-by: Brian King 


-- 
Brian King
Power Linux I/O
IBM Linux Technology Center



Re: [PATCH v4 2/5] ibmvfc: fix invalid sub-CRQ handles after hard reset

2021-02-26 Thread Brian King
Reviewed-by: Brian King 

-- 
Brian King
Power Linux I/O
IBM Linux Technology Center



Re: [PATCH v2 5/5] ibmvfc: reinitialize sub-CRQs and perform channel enquiry after LPM

2021-02-25 Thread Brian King
On 2/25/21 2:48 PM, Tyrel Datwyler wrote:
> A live partition migration (LPM) results in a CRQ disconnect similar to
> a hard reset. In this LPM case the hypervisor mostly preserves the CRQ
> transport such that it simply needs to be reenabled. However, the
> capabilities may have changed, such as fewer channels, or no channels at
> all. Further, it's possible that there may be sub-CRQ support, but no
> channel support. The CRQ reenable path currently doesn't take any of
> this into consideration.
> 
> For simplicity, release and reinitialize sub-CRQs during reenable, and
> set do_enquiry and using_channels with the appropriate values to trigger
> channel renegotiation.
> 
> Signed-off-by: Tyrel Datwyler 
> ---
>  drivers/scsi/ibmvscsi/ibmvfc.c | 13 +
>  1 file changed, 13 insertions(+)
> 
> diff --git a/drivers/scsi/ibmvscsi/ibmvfc.c b/drivers/scsi/ibmvscsi/ibmvfc.c
> index 4ac2c442e1e2..9ae6be56e375 100644
> --- a/drivers/scsi/ibmvscsi/ibmvfc.c
> +++ b/drivers/scsi/ibmvscsi/ibmvfc.c
> @@ -903,6 +903,9 @@ static int ibmvfc_reenable_crq_queue(struct ibmvfc_host *vhost)
>  {
>   int rc = 0;
>   struct vio_dev *vdev = to_vio_dev(vhost->dev);
> + unsigned long flags;
> +
> + ibmvfc_release_sub_crqs(vhost);
>  
>   /* Re-enable the CRQ */
>   do {
> @@ -914,6 +917,16 @@ static int ibmvfc_reenable_crq_queue(struct ibmvfc_host *vhost)
>   if (rc)
>   dev_err(vhost->dev, "Error enabling adapter (rc=%d)\n", rc);
>  
> + spin_lock_irqsave(vhost->host->host_lock, flags);
> + spin_lock(vhost->crq.q_lock);
> + vhost->do_enquiry = 1;
> + vhost->using_channels = 0;
> +
> + ibmvfc_init_sub_crqs(vhost);
> +
> + spin_unlock(vhost->crq.q_lock);
> + spin_unlock_irqrestore(vhost->host->host_lock, flags);

ibmvfc_init_sub_crqs can sleep, for multiple reasons, so you can't hold
a lock when you call it. There is a GFP_KERNEL allocation in it, and the
patch before this one adds an msleep in an error path.
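
One way to address it (a sketch, not a tested fix): keep the flag updates
under the locks, but defer the sleeping init until after they are dropped:

```c
	spin_lock_irqsave(vhost->host->host_lock, flags);
	spin_lock(vhost->crq.q_lock);
	vhost->do_enquiry = 1;
	vhost->using_channels = 0;
	spin_unlock(vhost->crq.q_lock);
	spin_unlock_irqrestore(vhost->host->host_lock, flags);

	ibmvfc_init_sub_crqs(vhost);	/* may sleep: GFP_KERNEL, msleep */
```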

Thanks,

Brian

-- 
Brian King
Power Linux I/O
IBM Linux Technology Center



Re: [PATCH v2 4/5] ibmvfc: store return code of H_FREE_SUB_CRQ during cleanup

2021-02-25 Thread Brian King
Reviewed-by: Brian King 

-- 
Brian King
Power Linux I/O
IBM Linux Technology Center



[PATCH] mwifiex: don't print SSID to logs

2021-02-24 Thread Brian Norris
There are a few reasons not to dump SSIDs as-is in kernel logs:

1) they're not guaranteed to be any particular text encoding (UTF-8,
   ASCII, ...) in general
2) it's somewhat redundant; the BSSID should be enough to uniquely
   identify the AP/STA to which we're connecting
3) BSSIDs have an easily-recognized format, whereas SSIDs do not (they
   are free-form)
4) other common drivers (e.g., everything based on mac80211) get along
   just fine by only including BSSIDs when logging state transitions

Additional notes on reason #3: this is important for the
privacy-conscious, especially when providing tools that convey
kernel logs on behalf of a user -- e.g., when reporting bugs. So for
example, it's easy to automatically filter logs for MAC addresses, but
it's much harder to filter SSIDs out of unstructured text.

Signed-off-by: Brian Norris 
---
 drivers/net/wireless/marvell/mwifiex/cfg80211.c | 11 ---
 1 file changed, 4 insertions(+), 7 deletions(-)

diff --git a/drivers/net/wireless/marvell/mwifiex/cfg80211.c b/drivers/net/wireless/marvell/mwifiex/cfg80211.c
index a2ed268ce0da..0961f4a5e415 100644
--- a/drivers/net/wireless/marvell/mwifiex/cfg80211.c
+++ b/drivers/net/wireless/marvell/mwifiex/cfg80211.c
@@ -2300,8 +2300,7 @@ mwifiex_cfg80211_assoc(struct mwifiex_private *priv, size_t ssid_len,
is_scanning_required = 1;
} else {
mwifiex_dbg(priv->adapter, MSG,
-   "info: trying to associate to '%.*s' bssid 
%pM\n",
-   req_ssid.ssid_len, (char *)req_ssid.ssid,
+   "info: trying to associate to bssid %pM\n",
bss->bssid);
memcpy(&priv->cfg_bssid, bss->bssid, ETH_ALEN);
break;
@@ -2378,8 +2377,7 @@ mwifiex_cfg80211_connect(struct wiphy *wiphy, struct net_device *dev,
}
 
mwifiex_dbg(adapter, INFO,
-   "info: Trying to associate to %.*s and bssid %pM\n",
-   (int)sme->ssid_len, (char *)sme->ssid, sme->bssid);
+   "info: Trying to associate to bssid %pM\n", sme->bssid);
 
if (!mwifiex_stop_bg_scan(priv))
cfg80211_sched_scan_stopped_locked(priv->wdev.wiphy, 0);
@@ -2512,9 +2510,8 @@ mwifiex_cfg80211_join_ibss(struct wiphy *wiphy, struct net_device *dev,
goto done;
}
 
-   mwifiex_dbg(priv->adapter, MSG,
-   "info: trying to join to %.*s and bssid %pM\n",
-   params->ssid_len, (char *)params->ssid, params->bssid);
+   mwifiex_dbg(priv->adapter, MSG, "info: trying to join to bssid %pM\n",
+   params->bssid);
 
mwifiex_set_ibss_params(priv, params);
 
-- 
2.30.0.617.g56c4b15f3c-goog



Re: [PATCH v2] nand: brcmnand: fix OOB R/W with Hamming ECC

2021-02-24 Thread Brian Norris
On Wed, Feb 24, 2021 at 12:02 AM Álvaro Fernández Rojas
 wrote:
> Fixes: 27c5b17cd1b1 ("mtd: nand: add NAND driver "library" for Broadcom STB NAND controller")

FWIW, I could believe this was broken. We weren't testing Hamming ECC
(nor JFFS2) at the time, so it could easily have obvious bugs like
this.

And since I got this far, and I'm still in MAINTAINERS, I guess:

Acked-by: Brian Norris 


Re: [PATCH 4/4] ibmvfc: store return code of H_FREE_SUB_CRQ during cleanup

2021-02-16 Thread Brian King
On 2/11/21 12:57 PM, Tyrel Datwyler wrote:
> diff --git a/drivers/scsi/ibmvscsi/ibmvfc.c b/drivers/scsi/ibmvscsi/ibmvfc.c
> index ba6fcf9cbc57..23b803ac4a13 100644
> --- a/drivers/scsi/ibmvscsi/ibmvfc.c
> +++ b/drivers/scsi/ibmvscsi/ibmvfc.c
> @@ -5670,7 +5670,7 @@ static int ibmvfc_register_scsi_channel(struct ibmvfc_host *vhost,
>  
>  irq_failed:
>   do {
> - plpar_hcall_norets(H_FREE_SUB_CRQ, vdev->unit_address, scrq->cookie);
> + rc = plpar_hcall_norets(H_FREE_SUB_CRQ, vdev->unit_address, scrq->cookie);
>   } while (rc == H_BUSY || H_IS_LONG_BUSY(rc));

Other places in the driver where we get a busy return code back, we have an msleep(100).
Should we be doing that here as well?
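
i.e., something along these lines (a sketch mirroring the driver's other
busy-retry loops):

```c
	do {
		rc = plpar_hcall_norets(H_FREE_SUB_CRQ, vdev->unit_address,
					scrq->cookie);
		if (rc == H_BUSY || H_IS_LONG_BUSY(rc))
			msleep(100);
	} while (rc == H_BUSY || H_IS_LONG_BUSY(rc));
```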

Thanks,

Brian

-- 
Brian King
Power Linux I/O
IBM Linux Technology Center



Re: [PATCH 3/4] ibmvfc: treat H_CLOSED as success during sub-CRQ registration

2021-02-16 Thread Brian King
Reviewed-by: Brian King 


-- 
Brian King
Power Linux I/O
IBM Linux Technology Center



Re: [PATCH 2/4] ibmvfc: fix invalid sub-CRQ handles after hard reset

2021-02-16 Thread Brian King
Reviewed-by: Brian King 


-- 
Brian King
Power Linux I/O
IBM Linux Technology Center



Re: [PATCH] xfs: Wake CIL push waiters more reliably

2021-02-16 Thread Brian Foster
On Mon, Feb 15, 2021 at 02:36:38PM +0100, Donald Buczek wrote:
> On 13.01.21 22:53, Dave Chinner wrote:
> > [...]
> > I agree that a throttling fix is needed, but I'm trying to
> > understand the scope and breadth of the problem first instead of
> > jumping the gun and making the wrong fix for the wrong reasons that
> > just papers over the underlying problems that the throttling bug has
> > made us aware of...
> 
> Are you still working on this?
> 
> If it takes more time to understand the potential underlying problem, the fix 
> for the problem at hand should be applied.
> 
> This is a real world problem, accidentally found in the wild. It appears very 
> rarely, but it freezes a filesystem or the whole system. It exists in 5.7,
> 5.8, 5.9, 5.10 and 5.11 and is caused by c7f87f3984cf ("xfs: fix 
> use-after-free on CIL context on shutdown") which silently added a condition 
> to the wakeup. The condition is based on a wrong assumption.
> 
> Why is this "papering over"? If a reminder was needed, there were better ways 
> than randomly hanging the system.
> 
> Why is
> 
> if (ctx->space_used >= XLOG_CIL_BLOCKING_SPACE_LIMIT(log))
> wake_up_all(&cil->xc_push_wait);
> 
> , which doesn't work reliably, preferable to
> 
> if (waitqueue_active(&cil->xc_push_wait))
> wake_up_all(&cil->xc_push_wait);
> 
> which does?
> 

JFYI, Dave followed up with a patch a couple weeks or so ago:

https://lore.kernel.org/linux-xfs/20210128044154.806715-5-da...@fromorbit.com/

Brian

> Best
>   Donald
> 
> > Cheers,
> > 
> > Dave
> 



Re: [PATCH 1/4] ibmvfc: simplify handling of sub-CRQ initialization

2021-02-15 Thread Brian King
Reviewed-by: Brian King 


-- 
Brian King
Power Linux I/O
IBM Linux Technology Center



Re: [PATCH] xfs: fix boolreturn.cocci warnings

2021-02-10 Thread Brian Foster
On Thu, Feb 11, 2021 at 04:09:16AM +0800, kernel test robot wrote:
> From: kernel test robot 
> 
> fs/xfs/xfs_log.c:1062:9-10: WARNING: return of 0/1 in function 'xfs_log_need_covered' with return type bool
> 
>  Return statements in functions returning bool should use
>  true/false instead of 1/0.
> Generated by: scripts/coccinelle/misc/boolreturn.cocci
> 
> Fixes: 37444fc4cc39 ("xfs: lift writable fs check up into log worker task")
> CC: Brian Foster 
> Reported-by: kernel test robot 
> Signed-off-by: kernel test robot 
> ---

Reviewed-by: Brian Foster 

> 
> tree:   https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git 
> xfs-5.12-merge
> head:   560ab6c0d12ebccabb83638abe23a7875b946f9a
> commit: 37444fc4cc398266fe0f71a9c0925620d44fb76a [25/36] xfs: lift writable 
> fs check up into log worker task
> 
>  xfs_log.c |2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> --- a/fs/xfs/xfs_log.c
> +++ b/fs/xfs/xfs_log.c
> @@ -1059,7 +1059,7 @@ xfs_log_need_covered(
>   boolneeded = false;
>  
>   if (!xlog_cil_empty(log))
> - return 0;
> + return false;
>  
>   spin_lock(&log->l_icloglock);
>   switch (log->l_covered_state) {
> 



Re: rcu: INFO: rcu_sched self-detected stall on CPU: Workqueue: xfs-conv/md0 xfs_end_io

2021-02-08 Thread Brian Foster
On Tue, Feb 09, 2021 at 07:43:14AM +1100, Dave Chinner wrote:
> On Mon, Feb 08, 2021 at 09:28:24AM -0800, Darrick J. Wong wrote:
> > On Mon, Feb 09, 2021 at 09:11:40AM -0800, Paul E. McKenney wrote:
> > > On Mon, Feb 08, 2021 at 10:44:58AM -0500, Brian Foster wrote:
> > > > There was a v2 inline that incorporated some directed feedback.
> > > > Otherwise there were questions and ideas about making the whole thing
> > > > faster, but I've no idea if that addresses the problem or not (if so,
> > > > that would be an entirely different set of patches). I'll wait and see
> > > > what Darrick thinks about this and rebase/repost if the approach is
> > > > agreeable..
> > > 
> > > There is always the school of thought that says that the best way to
> > > get people to focus on this is to rebase and repost.  Otherwise, they
> > > are all too likely to assume that you lost interest in this.
> > 
> > I was hoping that a better solution would emerge for clearing
> > PageWriteback on hundreds of thousands of pages, but nothing easy popped
> > out.
> > 
> > The hardcoded threshold in "[PATCH v2 2/2] xfs: kick extra large ioends
> > to completion workqueue" gives me unease because who's to say if marking
> > 262,144 pages on a particular CPU will actually stall it long enough to
> > trip the hangcheck?  Is the number lower on (say) some pokey NAS box
> > with a lot of storage but a slow CPU?
> 
> It's also not the right thing to do given the IO completion
> workqueue is a bound workqueue. Anything that is doing large amounts
> of CPU intensive work should be on a unbound workqueue so that the
> scheduler can bounce it around different CPUs as needed.
> 
> Quite frankly, the problem is a huge long ioend chain being built by
> the submission code. We need to keep ioend completion overhead down.
> It runs in either softirq or bound workqueue context and so
> individual items of work that are performed in this context must not
> be -unbounded- in size or time. Unbounded ioend chains are bad for
> IO latency, they are bad for memory reclaim and they are bad for CPU
> scheduling.
> 
> As I've said previously, we gain nothing by aggregating ioends past
> a few tens of megabytes of submitted IO. The batching gains are
> completely diminished once we've got enough IO in flight to keep the
> submission queue full. We're talking here about gigabytes of
> sequential IOs in a single ioend chain which are 2-3 orders of
> magnitude larger than needed for optimal background IO submission
> and completion efficiency and throughput. IOWs, we really should be
> limiting the ioend chain length at submission time, not trying to
> patch over bad completion behaviour that results from sub-optimal IO
> submission behaviour...
> 

That was the patch I posted prior to the aforementioned set. Granted, it
was an RFC, but for reference:

https://lore.kernel.org/linux-fsdevel/20200825144917.GA321765@bfoster/

(IIRC, you also had a variant that was essentially the same change.)

The discussion that followed in that thread was around the preference to
move completion of large chains into workqueue context instead of
breaking up the chains. The series referenced in my first reply fell out
of that as a targeted fix for the stall warning.

> > That said, /some/ threshold is probably better than no threshold.  Could
> > someone try to confirm if that series of Brian's fixes this problem too?
> 
> 262144 pages is still too much work to be doing in a single softirq
> IO completion callback. It's likely to be too much work for a bound
> workqueue, too, especially when you consider that the workqueue
> completion code will merge sequential ioends into one ioend, hence
> making the IO completion loop counts bigger and latency problems worse
> rather than better...
> 

That was just a conservative number picked based on observation of the
original report (10+ GB ioends IIRC). I figured the review cycle would
involve narrowing it down to something more generically reasonable
(10s-100s of MB?) once we found an acceptable approach (and hopefully
received some testing feedback), but we've never really got to that
point.
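
For the archives, a hypothetical sketch of the submission-side cap being
discussed (the helper name and threshold are invented for illustration;
see the linked RFC for the real patch):

```c
#define XFS_IOEND_MAX_BYTES	(16 << 20)	/* tens of MB, per above */

static bool
xfs_ioend_can_merge_size(struct iomap_ioend *ioend, loff_t offset)
{
	if (offset != ioend->io_offset + ioend->io_size)
		return false;	/* not contiguous with the chain */
	if (ioend->io_size >= XFS_IOEND_MAX_BYTES)
		return false;	/* cap reached: start a new ioend */
	return true;
}
```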

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> da...@fromorbit.com
> 



Re: rcu: INFO: rcu_sched self-detected stall on CPU: Workqueue: xfs-conv/md0 xfs_end_io

2021-02-08 Thread Brian Foster
On Mon, Feb 08, 2021 at 09:28:24AM -0800, Darrick J. Wong wrote:
> On Mon, Feb 09, 2021 at 09:11:40AM -0800, Paul E. McKenney wrote:
> > On Mon, Feb 08, 2021 at 10:44:58AM -0500, Brian Foster wrote:
> > > On Mon, Feb 08, 2021 at 06:57:24AM -0800, Paul E. McKenney wrote:
> > > > On Mon, Feb 08, 2021 at 09:07:24AM -0500, Brian Foster wrote:
> > > > > On Fri, Feb 05, 2021 at 09:12:40AM -0800, Paul E. McKenney wrote:
> > > > > > On Fri, Feb 05, 2021 at 08:29:06AM +0100, Paul Menzel wrote:
> > > > > > > Dear Linux folks,
> > > > > > > 
> > > > > > > 
> > > > > > > On a Dell PowerEdge T630/0NT78X, BIOS 2.8.0 05/23/2018 with Linux 
> > > > > > > 5.4.57, we
> > > > > > > twice saw a self-detected stall on a CPU (October 27th, 2020, 
> > > > > > > January 18th,
> > > > > > > 2021).
> > > > > > > 
> > > > > > > Both times, the workqueue is `xfs-conv/md0 xfs_end_io`.
> > > > > > > 
> > > > > > > ```
> > > > > > > [0.00] Linux version 5.4.57.mx64.340
> > > > > > > (r...@theinternet.molgen.mpg.de) (gcc version 7.5.0 (GCC)) #1 SMP 
> > > > > > > Tue Aug 11
> > > > > > > 13:20:33 CEST 2020
> > > > > > > […]
> > > > > > > [48962.981257] rcu: INFO: rcu_sched self-detected stall on CPU
> > > > > > > [48962.987511] rcu:   4-: (20999 ticks this GP)
> > > > > > > idle=fe6/1/0x4002 softirq=3630188/3630188 fqs=4696
> > > > > > > [48962.998805](t=21017 jiffies g=14529009 q=32263)
> > > > > > > [48963.004074] Task dump for CPU 4:
> > > > > > > [48963.007689] kworker/4:2 R  running task0 25587 
> > > > > > >  2
> > > > > > > 0x80004008
> > > > > > > [48963.015591] Workqueue: xfs-conv/md0 xfs_end_io
> > > > > > > [48963.020570] Call Trace:
> > > > > > > [48963.023311]  
> > > > > > > [48963.025560]  sched_show_task+0x11e/0x150
> > > > > > > [48963.029957]  rcu_dump_cpu_stacks+0x70/0xa0
> > > > > > > [48963.034545]  rcu_sched_clock_irq+0x502/0x770
> > > > > > > [48963.039322]  ? tick_sched_do_timer+0x60/0x60
> > > > > > > [48963.044106]  update_process_times+0x24/0x60
> > > > > > > [48963.048791]  tick_sched_timer+0x37/0x70
> > > > > > > [48963.053089]  __hrtimer_run_queues+0x11f/0x2b0
> > > > > > > [48963.057960]  ? recalibrate_cpu_khz+0x10/0x10
> > > > > > > [48963.062744]  hrtimer_interrupt+0xe5/0x240
> > > > > > > [48963.067235]  smp_apic_timer_interrupt+0x6f/0x130
> > > > > > > [48963.072407]  apic_timer_interrupt+0xf/0x20
> > > > > > > [48963.076994]  
> > > > > > > [48963.079347] RIP: 0010:_raw_spin_unlock_irqrestore+0xa/0x10
> > > > > > > [48963.085491] Code: f3 90 83 e8 01 75 e8 65 8b 3d 42 0f 56 7e e8 
> > > > > > > ed ea 5e
> > > > > > > ff 48 29 e8 4c 39 e8 76 cf 80 0b 08 eb 8c 0f 1f 44 00 00 c6 07 00 
> > > > > > > 56 9d 
> > > > > > > 0f 1f 44 00 00 0f 1f 44 00 00 b8 00 fe ff ff f0 0f c1 07 56 9d
> > > > > > > [48963.106524] RSP: 0018:c9000738fd40 EFLAGS: 0202 
> > > > > > > ORIG_RAX:
> > > > > > > ff13
> > > > > > > [48963.115003] RAX: 82407588 RBX: 82407580 RCX:
> > > > > > > 82407588
> > > > > > > [48963.122994] RDX: 82407588 RSI: 0202 RDI:
> > > > > > > 82407580
> > > > > > > [48963.130989] RBP: 0202 R08: 8203ea00 R09:
> > > > > > > 0001
> > > > > > > [48963.138982] R10: c9000738fbb8 R11: 0001 R12:
> > > > > > > 82407588
> > > > > > > [48963.146976] R13: ea005e7ae600 R14: 8897b7e5a040 R15:
> > > > > > > ea005e7ae600
> > > > > > > [48963.154971]  wake_up_page_bit+0xe0/0x100
> > > > > > > [48963.159366]  xfs_destroy_ioend+0xce/0x1c0
> > > > > > > [48963.163857]  xfs_end_ioend+0xcf/0x1a0
> &

Re: rcu: INFO: rcu_sched self-detected stall on CPU: Workqueue: xfs-conv/md0 xfs_end_io

2021-02-08 Thread Brian Foster
On Mon, Feb 08, 2021 at 06:57:24AM -0800, Paul E. McKenney wrote:
> On Mon, Feb 08, 2021 at 09:07:24AM -0500, Brian Foster wrote:
> > On Fri, Feb 05, 2021 at 09:12:40AM -0800, Paul E. McKenney wrote:
> > > On Fri, Feb 05, 2021 at 08:29:06AM +0100, Paul Menzel wrote:
> > > > Dear Linux folks,
> > > > 
> > > > 
> > > > On a Dell PowerEdge T630/0NT78X, BIOS 2.8.0 05/23/2018 with Linux 
> > > > 5.4.57, we
> > > > twice saw a self-detected stall on a CPU (October 27th, 2020, January 
> > > > 18th,
> > > > 2021).
> > > > 
> > > > Both times, the workqueue is `xfs-conv/md0 xfs_end_io`.
> > > > 
> > > > ```
> > > > [0.00] Linux version 5.4.57.mx64.340
> > > > (r...@theinternet.molgen.mpg.de) (gcc version 7.5.0 (GCC)) #1 SMP Tue 
> > > > Aug 11
> > > > 13:20:33 CEST 2020
> > > > […]
> > > > [48962.981257] rcu: INFO: rcu_sched self-detected stall on CPU
> > > > [48962.987511] rcu: 4-: (20999 ticks this GP)
> > > > idle=fe6/1/0x4002 softirq=3630188/3630188 fqs=4696
> > > > [48962.998805]  (t=21017 jiffies g=14529009 q=32263)
> > > > [48963.004074] Task dump for CPU 4:
> > > > [48963.007689] kworker/4:2 R  running task0 25587  2
> > > > 0x80004008
> > > > [48963.015591] Workqueue: xfs-conv/md0 xfs_end_io
> > > > [48963.020570] Call Trace:
> > > > [48963.023311]  
> > > > [48963.025560]  sched_show_task+0x11e/0x150
> > > > [48963.029957]  rcu_dump_cpu_stacks+0x70/0xa0
> > > > [48963.034545]  rcu_sched_clock_irq+0x502/0x770
> > > > [48963.039322]  ? tick_sched_do_timer+0x60/0x60
> > > > [48963.044106]  update_process_times+0x24/0x60
> > > > [48963.048791]  tick_sched_timer+0x37/0x70
> > > > [48963.053089]  __hrtimer_run_queues+0x11f/0x2b0
> > > > [48963.057960]  ? recalibrate_cpu_khz+0x10/0x10
> > > > [48963.062744]  hrtimer_interrupt+0xe5/0x240
> > > > [48963.067235]  smp_apic_timer_interrupt+0x6f/0x130
> > > > [48963.072407]  apic_timer_interrupt+0xf/0x20
> > > > [48963.076994]  
> > > > [48963.079347] RIP: 0010:_raw_spin_unlock_irqrestore+0xa/0x10
> > > > [48963.085491] Code: f3 90 83 e8 01 75 e8 65 8b 3d 42 0f 56 7e e8 ed ea 
> > > > 5e
> > > > ff 48 29 e8 4c 39 e8 76 cf 80 0b 08 eb 8c 0f 1f 44 00 00 c6 07 00 56 9d 
> > > > 
> > > > 0f 1f 44 00 00 0f 1f 44 00 00 b8 00 fe ff ff f0 0f c1 07 56 9d
> > > > [48963.106524] RSP: 0018:c9000738fd40 EFLAGS: 0202 ORIG_RAX:
> > > > ff13
> > > > [48963.115003] RAX: 82407588 RBX: 82407580 RCX:
> > > > 82407588
> > > > [48963.122994] RDX: 82407588 RSI: 0202 RDI:
> > > > 82407580
> > > > [48963.130989] RBP: 0202 R08: 8203ea00 R09:
> > > > 0001
> > > > [48963.138982] R10: c9000738fbb8 R11: 0001 R12:
> > > > 82407588
> > > > [48963.146976] R13: ea005e7ae600 R14: 8897b7e5a040 R15:
> > > > ea005e7ae600
> > > > [48963.154971]  wake_up_page_bit+0xe0/0x100
> > > > [48963.159366]  xfs_destroy_ioend+0xce/0x1c0
> > > > [48963.163857]  xfs_end_ioend+0xcf/0x1a0
> > > > [48963.167958]  xfs_end_io+0xa4/0xd0
> > > > [48963.171672]  process_one_work+0x1e5/0x410
> > > > [48963.176163]  worker_thread+0x2d/0x3c0
> > > > [48963.180265]  ? cancel_delayed_work+0x90/0x90
> > > > [48963.185048]  kthread+0x117/0x130
> > > > [48963.188663]  ? kthread_create_worker_on_cpu+0x70/0x70
> > > > [48963.194321]  ret_from_fork+0x35/0x40
> > > > ```
> > > > 
> > > > As it’s just log level INFO, is there anything what should be done, or 
> > > > was
> > > > the system probably just “overloaded”?
> > > 
> > > I am assuming that you are building your kernel with CONFIG_PREEMPT_NONE=y
> > > rather than CONFIG_PREEMPT_VOLUNTARY=y.
> > > 
> > > If so, and if the problem is that you are temporarily overdriving xfs I/O,
> > > one approach would be as follows:
> > > 
> > > 
> > > 
> > > diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> > > index f16d5f1..06be426 100644
> > > --- a/fs/xfs/xfs_aops.c
&

Re: rcu: INFO: rcu_sched self-detected stall on CPU: Workqueue: xfs-conv/md0 xfs_end_io

2021-02-08 Thread Brian Foster
On Fri, Feb 05, 2021 at 09:12:40AM -0800, Paul E. McKenney wrote:
> On Fri, Feb 05, 2021 at 08:29:06AM +0100, Paul Menzel wrote:
> > Dear Linux folks,
> > 
> > 
> > On a Dell PowerEdge T630/0NT78X, BIOS 2.8.0 05/23/2018 with Linux 5.4.57, we
> > twice saw a self-detected stall on a CPU (October 27th, 2020, January 18th,
> > 2021).
> > 
> > Both times, the workqueue is `xfs-conv/md0 xfs_end_io`.
> > 
> > ```
> > [0.00] Linux version 5.4.57.mx64.340
> > (r...@theinternet.molgen.mpg.de) (gcc version 7.5.0 (GCC)) #1 SMP Tue Aug 11
> > 13:20:33 CEST 2020
> > […]
> > [48962.981257] rcu: INFO: rcu_sched self-detected stall on CPU
> > [48962.987511] rcu: 4-: (20999 ticks this GP)
> > idle=fe6/1/0x4002 softirq=3630188/3630188 fqs=4696
> > [48962.998805]  (t=21017 jiffies g=14529009 q=32263)
> > [48963.004074] Task dump for CPU 4:
> > [48963.007689] kworker/4:2 R  running task0 25587  2
> > 0x80004008
> > [48963.015591] Workqueue: xfs-conv/md0 xfs_end_io
> > [48963.020570] Call Trace:
> > [48963.023311]  
> > [48963.025560]  sched_show_task+0x11e/0x150
> > [48963.029957]  rcu_dump_cpu_stacks+0x70/0xa0
> > [48963.034545]  rcu_sched_clock_irq+0x502/0x770
> > [48963.039322]  ? tick_sched_do_timer+0x60/0x60
> > [48963.044106]  update_process_times+0x24/0x60
> > [48963.048791]  tick_sched_timer+0x37/0x70
> > [48963.053089]  __hrtimer_run_queues+0x11f/0x2b0
> > [48963.057960]  ? recalibrate_cpu_khz+0x10/0x10
> > [48963.062744]  hrtimer_interrupt+0xe5/0x240
> > [48963.067235]  smp_apic_timer_interrupt+0x6f/0x130
> > [48963.072407]  apic_timer_interrupt+0xf/0x20
> > [48963.076994]  
> > [48963.079347] RIP: 0010:_raw_spin_unlock_irqrestore+0xa/0x10
> > [48963.085491] Code: f3 90 83 e8 01 75 e8 65 8b 3d 42 0f 56 7e e8 ed ea 5e
> > ff 48 29 e8 4c 39 e8 76 cf 80 0b 08 eb 8c 0f 1f 44 00 00 c6 07 00 56 9d 
> > 0f 1f 44 00 00 0f 1f 44 00 00 b8 00 fe ff ff f0 0f c1 07 56 9d
> > [48963.106524] RSP: 0018:c9000738fd40 EFLAGS: 0202 ORIG_RAX:
> > ff13
> > [48963.115003] RAX: 82407588 RBX: 82407580 RCX:
> > 82407588
> > [48963.122994] RDX: 82407588 RSI: 0202 RDI:
> > 82407580
> > [48963.130989] RBP: 0202 R08: 8203ea00 R09:
> > 0001
> > [48963.138982] R10: c9000738fbb8 R11: 0001 R12:
> > 82407588
> > [48963.146976] R13: ea005e7ae600 R14: 8897b7e5a040 R15:
> > ea005e7ae600
> > [48963.154971]  wake_up_page_bit+0xe0/0x100
> > [48963.159366]  xfs_destroy_ioend+0xce/0x1c0
> > [48963.163857]  xfs_end_ioend+0xcf/0x1a0
> > [48963.167958]  xfs_end_io+0xa4/0xd0
> > [48963.171672]  process_one_work+0x1e5/0x410
> > [48963.176163]  worker_thread+0x2d/0x3c0
> > [48963.180265]  ? cancel_delayed_work+0x90/0x90
> > [48963.185048]  kthread+0x117/0x130
> > [48963.188663]  ? kthread_create_worker_on_cpu+0x70/0x70
> > [48963.194321]  ret_from_fork+0x35/0x40
> > ```
> > 
> > As it’s just log level INFO, is there anything what should be done, or was
> > the system probably just “overloaded”?
> 
> I am assuming that you are building your kernel with CONFIG_PREEMPT_NONE=y
> rather than CONFIG_PREEMPT_VOLUNTARY=y.
> 
> If so, and if the problem is that you are temporarily overdriving xfs I/O,
> one approach would be as follows:
> 
> 
> 
> diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> index f16d5f1..06be426 100644
> --- a/fs/xfs/xfs_aops.c
> +++ b/fs/xfs/xfs_aops.c
> @@ -390,6 +390,7 @@ xfs_end_io(
>   list_del_init(&ioend->io_list);
>   xfs_ioend_try_merge(ioend, &completion_list);
>   xfs_end_ioend(ioend);
> + cond_resched();
>   }
>  }
>  
> 

FWIW, this looks quite similar to the problem I attempted to fix with
these patches:

https://lore.kernel.org/linux-xfs/20201002153357.56409-1-bfos...@redhat.com/

Brian

> 
> If you have instead built with CONFIG_PREEMPT_VOLUNTARY=y, then your
> problem is likely massive lock contention in wake_up_page_bit(), or
> perhaps someone having failed to release that lock.  The usual way to
> work this out is by enabling lockdep (CONFIG_PROVE_LOCKING=y), but this
> is often not what you want enabled in production.
> 
> Darrick, thoughts from an xfs perspective?
> 
>   Thanx, Paul
> 



[PATCH net-next 1/2] net: add EXPORT_INDIRECT_CALLABLE wrapper

2021-02-04 Thread Brian Vazquez
When a static function is annotated with INDIRECT_CALLABLE_SCOPE and
CONFIG_RETPOLINE is set, the static keyword is removed. Sometimes the
function needs to be exported, but EXPORT_SYMBOL can't be used because, if
CONFIG_RETPOLINE is not set, it would attempt to export a static symbol.

This patch introduces a new indirect call wrapper:
EXPORT_INDIRECT_CALLABLE. This basically does EXPORT_SYMBOL when
CONFIG_RETPOLINE is set, but does nothing when it's not.
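
Usage then looks like the following (taken from patch 2/2 below); the
export only materializes when CONFIG_RETPOLINE has made the function
non-static:

```c
INDIRECT_CALLABLE_SCOPE unsigned int ipv4_mtu(const struct dst_entry *dst)
{
	/* ... */
}
EXPORT_INDIRECT_CALLABLE(ipv4_mtu);
```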

Reported-by: Stephen Rothwell 
Signed-off-by: Brian Vazquez 
---
 include/linux/indirect_call_wrapper.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/include/linux/indirect_call_wrapper.h b/include/linux/indirect_call_wrapper.h
index 54c02c84906a..a8345c8a613d 100644
--- a/include/linux/indirect_call_wrapper.h
+++ b/include/linux/indirect_call_wrapper.h
@@ -36,6 +36,7 @@
 
 #define INDIRECT_CALLABLE_DECLARE(f)   f
 #define INDIRECT_CALLABLE_SCOPE
+#define EXPORT_INDIRECT_CALLABLE(f)EXPORT_SYMBOL(f)
 
 #else
 #define INDIRECT_CALL_1(f, f1, ...) f(__VA_ARGS__)
@@ -44,6 +45,7 @@
 #define INDIRECT_CALL_4(f, f4, f3, f2, f1, ...) f(__VA_ARGS__)
 #define INDIRECT_CALLABLE_DECLARE(f)
 #define INDIRECT_CALLABLE_SCOPEstatic
+#define EXPORT_INDIRECT_CALLABLE(f)
 #endif
 
 /*
-- 
2.30.0.365.g02bc693789-goog



[PATCH net-next 2/2] net: fix building errors on powerpc when CONFIG_RETPOLINE is not set

2021-02-04 Thread Brian Vazquez
This commit fixes the errors reported when building for powerpc:

 ERROR: modpost: "ip6_dst_check" [vmlinux] is a static EXPORT_SYMBOL
 ERROR: modpost: "ipv4_dst_check" [vmlinux] is a static EXPORT_SYMBOL
 ERROR: modpost: "ipv4_mtu" [vmlinux] is a static EXPORT_SYMBOL
 ERROR: modpost: "ip6_mtu" [vmlinux] is a static EXPORT_SYMBOL

Fixes: f67fbeaebdc0 ("net: use indirect call helpers for dst_mtu")
Fixes: bbd807dfbf20 ("net: indirect call helpers for ipv4/ipv6 dst_check functions")
Reported-by: Stephen Rothwell 
Signed-off-by: Brian Vazquez 
---
 net/ipv4/route.c | 4 ++--
 net/ipv6/route.c | 4 ++--
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 9e6537709794..be31e2446470 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -1206,7 +1206,7 @@ INDIRECT_CALLABLE_SCOPE struct dst_entry *ipv4_dst_check(struct dst_entry *dst,
return NULL;
return dst;
 }
-EXPORT_SYMBOL(ipv4_dst_check);
+EXPORT_INDIRECT_CALLABLE(ipv4_dst_check);
 
 static void ipv4_send_dest_unreach(struct sk_buff *skb)
 {
@@ -1337,7 +1337,7 @@ INDIRECT_CALLABLE_SCOPE unsigned int ipv4_mtu(const struct dst_entry *dst)
 
return mtu - lwtunnel_headroom(dst->lwtstate, mtu);
 }
-EXPORT_SYMBOL(ipv4_mtu);
+EXPORT_INDIRECT_CALLABLE(ipv4_mtu);
 
 static void ip_del_fnhe(struct fib_nh_common *nhc, __be32 daddr)
 {
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 8d9e053dc071..0d1784b0d65d 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -2644,7 +2644,7 @@ INDIRECT_CALLABLE_SCOPE struct dst_entry *ip6_dst_check(struct dst_entry *dst,
 
return dst_ret;
 }
-EXPORT_SYMBOL(ip6_dst_check);
+EXPORT_INDIRECT_CALLABLE(ip6_dst_check);
 
 static struct dst_entry *ip6_negative_advice(struct dst_entry *dst)
 {
@@ -3115,7 +3115,7 @@ INDIRECT_CALLABLE_SCOPE unsigned int ip6_mtu(const struct dst_entry *dst)
 
return mtu - lwtunnel_headroom(dst->lwtstate, mtu);
 }
-EXPORT_SYMBOL(ip6_mtu);
+EXPORT_INDIRECT_CALLABLE(ip6_mtu);
 
 /* MTU selection:
  * 1. mtu on route is locked - use it
-- 
2.30.0.365.g02bc693789-goog



Re: [PATCH] xfs: fix unused variable warning

2021-02-04 Thread Brian Foster
On Thu, Feb 04, 2021 at 05:03:44PM +0100, Arnd Bergmann wrote:
> From: Arnd Bergmann 
> 
> When debugging is disabled, the ASSERT() is left out and
> the 'log' variable becomes unused:
> 
> fs/xfs/xfs_log.c::16: error: unused variable 'log' [-Werror,-Wunused-variable]
> 
> Remove the variable declaration and open-code it inside
> of the assertion.
> 
> Fixes: 303591a0a947 ("xfs: cover the log during log quiesce")
> Signed-off-by: Arnd Bergmann 
> ---

I sent basically the same patch[1] about a week ago, but either one is
fine with me:

Reviewed-by: Brian Foster 

[1] https://lore.kernel.org/linux-xfs/20210125132616.GB2047559@bfoster/

>  fs/xfs/xfs_log.c | 5 ++---
>  1 file changed, 2 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
> index 58699881c100..d8b814227734 100644
> --- a/fs/xfs/xfs_log.c
> +++ b/fs/xfs/xfs_log.c
> @@ -1108,12 +1108,11 @@ static int
>  xfs_log_cover(
>   struct xfs_mount*mp)
>  {
> - struct xlog *log = mp->m_log;
>   int error = 0;
>   bool need_covered;
>  
> - ASSERT((xlog_cil_empty(log) && xlog_iclogs_empty(log) &&
> - !xfs_ail_min_lsn(log->l_ailp)) ||
> + ASSERT((xlog_cil_empty(mp->m_log) && xlog_iclogs_empty(mp->m_log) &&
> + !xfs_ail_min_lsn(mp->m_log->l_ailp)) ||
>  XFS_FORCED_SHUTDOWN(mp));
>  
>   if (!xfs_log_writable(mp))
> -- 
> 2.29.2
> 



Re: [PATCH net-next 1/2] net: add EXPORT_INDIRECT_CALLABLE wrapper

2021-02-04 Thread Brian Vazquez
Yeah, I'm also not seeing it on patchwork. But I did get the email on
both corp and personal email. So maybe something is failing at
patchwork?


On Thu, Feb 4, 2021 at 9:50 AM George McCollister
 wrote:
>
> I don't see the second patch.
>
> Regards,
> George McCollister


Re: linux-next: build failure after merge of the net-next tree

2021-02-04 Thread Brian Vazquez
Hi all, I've just sent the patch series:
https://patchwork.kernel.org/project/netdevbpf/list/?series=428099.

Thanks,
Brian

On Thu, Feb 4, 2021 at 7:36 AM Vladimir Oltean  wrote:
>
> Hi Brian,
>
> On Wed, Feb 03, 2021 at 07:52:08PM -0800, Brian Vazquez wrote:
> > Hi Stephen, thanks for the report. I'm having trouble trying to
> > compile for ppc, but I believe this should fix the problem, could you
> > test this patch, please? Thanks!
>
> Could you please submit the patch formally to net-next? Thanks.


[PATCH net-next 1/2] net: add EXPORT_INDIRECT_CALLABLE wrapper

2021-02-04 Thread Brian Vazquez
When a static function is annotated with INDIRECT_CALLABLE_SCOPE and
CONFIG_RETPOLINE is set, the static keyword is removed. Sometimes the
function needs to be exported but EXPORT_SYMBOL can't be used because if
CONFIG_RETPOLINE is not set, we will attempt to export a static symbol.

This patch introduces a new indirect call wrapper:
EXPORT_INDIRECT_CALLABLE. This basically does EXPORT_SYMBOL when
CONFIG_RETPOLINE is set, but does nothing when it's not.

Reported-by: Stephen Rothwell 
Signed-off-by: Brian Vazquez 
---
 include/linux/indirect_call_wrapper.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/include/linux/indirect_call_wrapper.h b/include/linux/indirect_call_wrapper.h
index 54c02c84906a..a8345c8a613d 100644
--- a/include/linux/indirect_call_wrapper.h
+++ b/include/linux/indirect_call_wrapper.h
@@ -36,6 +36,7 @@
 
 #define INDIRECT_CALLABLE_DECLARE(f)   f
 #define INDIRECT_CALLABLE_SCOPE
+#define EXPORT_INDIRECT_CALLABLE(f)    EXPORT_SYMBOL(f)
 
 #else
 #define INDIRECT_CALL_1(f, f1, ...) f(__VA_ARGS__)
@@ -44,6 +45,7 @@
 #define INDIRECT_CALL_4(f, f4, f3, f2, f1, ...) f(__VA_ARGS__)
 #define INDIRECT_CALLABLE_DECLARE(f)
 #define INDIRECT_CALLABLE_SCOPE        static
+#define EXPORT_INDIRECT_CALLABLE(f)
 #endif
 
 /*
-- 
2.30.0.365.g02bc693789-goog
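For readers skimming the archive: a minimal sketch of how the new wrapper is
meant to be used at a definition site. my_dst_check is a made-up stand-in for
ipv4_dst_check/ip6_dst_check; the two macros are the real ones added above.

	/* With CONFIG_RETPOLINE=y, INDIRECT_CALLABLE_SCOPE expands to
	 * nothing (the function gets external linkage) and
	 * EXPORT_INDIRECT_CALLABLE() becomes EXPORT_SYMBOL(); without
	 * it, the function stays static and the export line compiles
	 * away, avoiding the modpost error on static symbols.
	 */
	INDIRECT_CALLABLE_SCOPE struct dst_entry *my_dst_check(struct dst_entry *dst,
							       u32 cookie)
	{
		return dst;	/* placeholder body */
	}
	EXPORT_INDIRECT_CALLABLE(my_dst_check);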



Re: linux-next: build failure after merge of the net-next tree

2021-02-03 Thread Brian Vazquez
Hi Stephen, thanks for the report. I'm having trouble trying to
compile for ppc, but I believe this should fix the problem, could you
test this patch, please? Thanks!

diff --git a/include/linux/indirect_call_wrapper.h b/include/linux/indirect_call_wrapper.h
index 54c02c84906a..077f96be65c6 100644
--- a/include/linux/indirect_call_wrapper.h
+++ b/include/linux/indirect_call_wrapper.h
@@ -36,6 +36,7 @@

 #define INDIRECT_CALLABLE_DECLARE(f)   f
 #define INDIRECT_CALLABLE_SCOPE
+#define INDIRECT_CALLABLE_EXPORT(f)    EXPORT_SYMBOL(f)

 #else
 #define INDIRECT_CALL_1(f, f1, ...) f(__VA_ARGS__)
@@ -44,6 +45,7 @@
 #define INDIRECT_CALL_4(f, f4, f3, f2, f1, ...) f(__VA_ARGS__)
 #define INDIRECT_CALLABLE_DECLARE(f)
 #define INDIRECT_CALLABLE_SCOPE        static
+#define INDIRECT_CALLABLE_EXPORT(f)
 #endif

 /*
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 9e6537709794..9dd8ff3887b7 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -1206,7 +1206,7 @@ INDIRECT_CALLABLE_SCOPE struct dst_entry *ipv4_dst_check(struct dst_entry *dst,
return NULL;
return dst;
 }
-EXPORT_SYMBOL(ipv4_dst_check);
+INDIRECT_CALLABLE_EXPORT(ipv4_dst_check);

 static void ipv4_send_dest_unreach(struct sk_buff *skb)
 {
@@ -1337,7 +1337,7 @@ INDIRECT_CALLABLE_SCOPE unsigned int ipv4_mtu(const struct dst_entry *dst)

return mtu - lwtunnel_headroom(dst->lwtstate, mtu);
 }
-EXPORT_SYMBOL(ipv4_mtu);
+INDIRECT_CALLABLE_EXPORT(ipv4_mtu);

 static void ip_del_fnhe(struct fib_nh_common *nhc, __be32 daddr)
 {
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index f447f82e6819..75d6a0db1fa6 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -2644,7 +2644,7 @@ INDIRECT_CALLABLE_SCOPE struct dst_entry *ip6_dst_check(struct dst_entry *dst,

return dst_ret;
 }
-EXPORT_SYMBOL(ip6_dst_check);
+INDIRECT_CALLABLE_EXPORT(ip6_dst_check);

 static struct dst_entry *ip6_negative_advice(struct dst_entry *dst)
 {
@@ -3115,7 +3115,7 @@ INDIRECT_CALLABLE_SCOPE unsigned int ip6_mtu(const struct dst_entry *dst)

return mtu - lwtunnel_headroom(dst->lwtstate, mtu);
 }
-EXPORT_SYMBOL(ip6_mtu);
+INDIRECT_CALLABLE_EXPORT(ip6_mtu);

 /* MTU selection:
  * 1. mtu on route is locked - use it


On Wed, Feb 3, 2021 at 5:33 PM Stephen Rothwell  wrote:
>
> Hi all,
>
> After merging the net-next tree, today's linux-next build (powerpc
> ppc64_defconfig) failed like this:
>
> ERROR: modpost: "ip6_dst_check" [vmlinux] is a static EXPORT_SYMBOL
> ERROR: modpost: "ipv4_dst_check" [vmlinux] is a static EXPORT_SYMBOL
> ERROR: modpost: "ipv4_mtu" [vmlinux] is a static EXPORT_SYMBOL
> ERROR: modpost: "ip6_mtu" [vmlinux] is a static EXPORT_SYMBOL
>
> Caused by commits
>
>   f67fbeaebdc0 ("net: use indirect call helpers for dst_mtu")
>   bbd807dfbf20 ("net: indirect call helpers for ipv4/ipv6 dst_check functions")
>
> I have used the net-next tree from next-20210203 for today.
>
> --
> Cheers,
> Stephen Rothwell


Re: [PATCH] mwifiex: Report connected BSS with cfg80211_connect_bss()

2021-02-01 Thread Brian Norris
On Sun, Jan 31, 2021 at 11:07 PM Yen-lin Lai  wrote:
> When a network is moved or reconfigured on the different channel, there
> can be multiple BSSes with the same BSSID and SSID in scan result
> before the old one expires. Then, it can cause cfg80211_connect_result
> to map current_bss to a bss with the wrong channel.
>
> Let mwifiex_cfg80211_assoc return the selected BSS and then the caller
> can report it to cfg80211_connect_bss.
>
> Signed-off-by: Yen-lin Lai 

This seems sane to me:

Reviewed-by: Brian Norris 


[PATCH net-next v3 2/4] net: use indirect call helpers for dst_output

2021-02-01 Thread Brian Vazquez
This patch avoids the indirect call for the common case:
ip6_output and ip_output

Signed-off-by: Brian Vazquez 
---
 include/net/dst.h | 8 +++-
 net/ipv4/ip_output.c  | 1 +
 net/ipv6/ip6_output.c | 1 +
 3 files changed, 9 insertions(+), 1 deletion(-)

diff --git a/include/net/dst.h b/include/net/dst.h
index 98cf6e8c06c4..3932e9931f08 100644
--- a/include/net/dst.h
+++ b/include/net/dst.h
@@ -436,10 +436,16 @@ static inline void dst_set_expires(struct dst_entry *dst, int timeout)
dst->expires = expires;
 }
 
+INDIRECT_CALLABLE_DECLARE(int ip6_output(struct net *, struct sock *,
+struct sk_buff *));
+INDIRECT_CALLABLE_DECLARE(int ip_output(struct net *, struct sock *,
+struct sk_buff *));
 /* Output packet to network from transport.  */
 static inline int dst_output(struct net *net, struct sock *sk, struct sk_buff *skb)
 {
-   return skb_dst(skb)->output(net, sk, skb);
+   return INDIRECT_CALL_INET(skb_dst(skb)->output,
+ ip6_output, ip_output,
+ net, sk, skb);
 }
 
 INDIRECT_CALLABLE_DECLARE(int ip6_input(struct sk_buff *));
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 959b94e32f2b..3aab53beb4ea 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -434,6 +434,7 @@ int ip_output(struct net *net, struct sock *sk, struct sk_buff *skb)
ip_finish_output,
!(IPCB(skb)->flags & IPSKB_REROUTED));
 }
+EXPORT_SYMBOL(ip_output);
 
 /*
  * copy saddr and daddr, possibly using 64bit load/stores
diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index 117cd95df213..ff4f9ebcf7f6 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -217,6 +217,7 @@ int ip6_output(struct net *net, struct sock *sk, struct sk_buff *skb)
ip6_finish_output,
!(IP6CB(skb)->flags & IP6SKB_REROUTED));
 }
+EXPORT_SYMBOL(ip6_output);
 
 bool ip6_autoflowlabel(struct net *net, const struct ipv6_pinfo *np)
 {
-- 
2.30.0.365.g02bc693789-goog
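As a rough sketch of what the INDIRECT_CALL_INET() used in dst_output() boils
down to when both IPv4 and IPv6 are built in (simplified and renamed here; the
exact macro text lives in include/linux/indirect_call_wrapper.h):

	/* Two predictable pointer compares plus a direct call in the
	 * common case; only unknown handlers fall back to the
	 * retpoline-thunked indirect call.
	 */
	#define SKETCH_INDIRECT_CALL_INET(f, f6, f4, ...)		\
		(likely(f == f6) ? f6(__VA_ARGS__) :			\
		 likely(f == f4) ? f4(__VA_ARGS__) :			\
				   f(__VA_ARGS__))

This is presumably also why ip_output/ip6_output gain EXPORT_SYMBOL() in this
patch: the inline dst_output() in dst.h now references them by name from any
module that calls it.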



[PATCH net-next v3 3/4] net: use indirect call helpers for dst_mtu

2021-02-01 Thread Brian Vazquez
This patch avoids the indirect call for the common case:
ip6_mtu and ipv4_mtu

Signed-off-by: Brian Vazquez 
---
 include/net/dst.h | 4 +++-
 net/ipv4/route.c  | 6 --
 net/ipv6/route.c  | 6 --
 3 files changed, 11 insertions(+), 5 deletions(-)

diff --git a/include/net/dst.h b/include/net/dst.h
index 3932e9931f08..9f474a79ed7d 100644
--- a/include/net/dst.h
+++ b/include/net/dst.h
@@ -194,9 +194,11 @@ dst_feature(const struct dst_entry *dst, u32 feature)
return dst_metric(dst, RTAX_FEATURES) & feature;
 }
 
+INDIRECT_CALLABLE_DECLARE(unsigned int ip6_mtu(const struct dst_entry *));
+INDIRECT_CALLABLE_DECLARE(unsigned int ipv4_mtu(const struct dst_entry *));
 static inline u32 dst_mtu(const struct dst_entry *dst)
 {
-   return dst->ops->mtu(dst);
+   return INDIRECT_CALL_INET(dst->ops->mtu, ip6_mtu, ipv4_mtu, dst);
 }
 
 /* RTT metrics are stored in milliseconds for user ABI, but used as jiffies */
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index e26652ff7059..4fac91f8bd6c 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -135,7 +135,8 @@ static int ip_rt_gc_timeout __read_mostly   = RT_GC_TIMEOUT;
 
 static struct dst_entry *ipv4_dst_check(struct dst_entry *dst, u32 cookie);
 static unsigned int ipv4_default_advmss(const struct dst_entry *dst);
-static unsigned int ipv4_mtu(const struct dst_entry *dst);
+INDIRECT_CALLABLE_SCOPE
+unsigned int   ipv4_mtu(const struct dst_entry *dst);
 static struct dst_entry *ipv4_negative_advice(struct dst_entry *dst);
 static void ipv4_link_failure(struct sk_buff *skb);
 static void ip_rt_update_pmtu(struct dst_entry *dst, struct sock *sk,
@@ -1311,7 +1312,7 @@ static unsigned int ipv4_default_advmss(const struct dst_entry *dst)
return min(advmss, IPV4_MAX_PMTU - header_size);
 }
 
-static unsigned int ipv4_mtu(const struct dst_entry *dst)
+INDIRECT_CALLABLE_SCOPE unsigned int ipv4_mtu(const struct dst_entry *dst)
 {
const struct rtable *rt = (const struct rtable *)dst;
unsigned int mtu = rt->rt_pmtu;
@@ -1333,6 +1334,7 @@ static unsigned int ipv4_mtu(const struct dst_entry *dst)
 
return mtu - lwtunnel_headroom(dst->lwtstate, mtu);
 }
+EXPORT_SYMBOL(ipv4_mtu);
 
 static void ip_del_fnhe(struct fib_nh_common *nhc, __be32 daddr)
 {
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 41d8f801b75f..4d83700d5602 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -83,7 +83,8 @@ enum rt6_nud_state {
 
 static struct dst_entry    *ip6_dst_check(struct dst_entry *dst, u32 cookie);
 static unsigned int ip6_default_advmss(const struct dst_entry *dst);
-static unsigned int ip6_mtu(const struct dst_entry *dst);
+INDIRECT_CALLABLE_SCOPE
+unsigned int   ip6_mtu(const struct dst_entry *dst);
 static struct dst_entry *ip6_negative_advice(struct dst_entry *);
 static void    ip6_dst_destroy(struct dst_entry *);
 static void    ip6_dst_ifdown(struct dst_entry *,
@@ -3089,7 +3090,7 @@ static unsigned int ip6_default_advmss(const struct dst_entry *dst)
return mtu;
 }
 
-static unsigned int ip6_mtu(const struct dst_entry *dst)
+INDIRECT_CALLABLE_SCOPE unsigned int ip6_mtu(const struct dst_entry *dst)
 {
struct inet6_dev *idev;
unsigned int mtu;
@@ -3111,6 +3112,7 @@ static unsigned int ip6_mtu(const struct dst_entry *dst)
 
return mtu - lwtunnel_headroom(dst->lwtstate, mtu);
 }
+EXPORT_SYMBOL(ip6_mtu);
 
 /* MTU selection:
  * 1. mtu on route is locked - use it
-- 
2.30.0.365.g02bc693789-goog



[PATCH net-next v3 4/4] net: indirect call helpers for ipv4/ipv6 dst_check functions

2021-02-01 Thread Brian Vazquez
This patch avoids the indirect call for the common case:
ip6_dst_check and ipv4_dst_check

Signed-off-by: Brian Vazquez 
---
 include/net/dst.h   |  7 ++-
 net/core/sock.c | 12 ++--
 net/ipv4/route.c|  7 +--
 net/ipv4/tcp_ipv4.c |  5 -
 net/ipv6/route.c|  7 +--
 net/ipv6/tcp_ipv6.c |  5 -
 6 files changed, 34 insertions(+), 9 deletions(-)

diff --git a/include/net/dst.h b/include/net/dst.h
index 9f474a79ed7d..26f134ad3a25 100644
--- a/include/net/dst.h
+++ b/include/net/dst.h
@@ -459,10 +459,15 @@ static inline int dst_input(struct sk_buff *skb)
  ip6_input, ip_local_deliver, skb);
 }
 
+INDIRECT_CALLABLE_DECLARE(struct dst_entry *ip6_dst_check(struct dst_entry *,
+ u32));
+INDIRECT_CALLABLE_DECLARE(struct dst_entry *ipv4_dst_check(struct dst_entry *,
+  u32));
 static inline struct dst_entry *dst_check(struct dst_entry *dst, u32 cookie)
 {
if (dst->obsolete)
-   dst = dst->ops->check(dst, cookie);
+   dst = INDIRECT_CALL_INET(dst->ops->check, ip6_dst_check,
+ipv4_dst_check, dst, cookie);
return dst;
 }
 
diff --git a/net/core/sock.c b/net/core/sock.c
index 648a5cb46209..0ed98f20448a 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -526,11 +526,17 @@ int __sk_receive_skb(struct sock *sk, struct sk_buff *skb,
 }
 EXPORT_SYMBOL(__sk_receive_skb);
 
+INDIRECT_CALLABLE_DECLARE(struct dst_entry *ip6_dst_check(struct dst_entry *,
+ u32));
+INDIRECT_CALLABLE_DECLARE(struct dst_entry *ipv4_dst_check(struct dst_entry *,
+  u32));
 struct dst_entry *__sk_dst_check(struct sock *sk, u32 cookie)
 {
struct dst_entry *dst = __sk_dst_get(sk);
 
-   if (dst && dst->obsolete && dst->ops->check(dst, cookie) == NULL) {
+   if (dst && dst->obsolete &&
+   INDIRECT_CALL_INET(dst->ops->check, ip6_dst_check, ipv4_dst_check,
+  dst, cookie) == NULL) {
sk_tx_queue_clear(sk);
sk->sk_dst_pending_confirm = 0;
RCU_INIT_POINTER(sk->sk_dst_cache, NULL);
@@ -546,7 +552,9 @@ struct dst_entry *sk_dst_check(struct sock *sk, u32 cookie)
 {
struct dst_entry *dst = sk_dst_get(sk);
 
-   if (dst && dst->obsolete && dst->ops->check(dst, cookie) == NULL) {
+   if (dst && dst->obsolete &&
+   INDIRECT_CALL_INET(dst->ops->check, ip6_dst_check, ipv4_dst_check,
+  dst, cookie) == NULL) {
sk_dst_reset(sk);
dst_release(dst);
return NULL;
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 4fac91f8bd6c..9e6537709794 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -133,7 +133,8 @@ static int ip_rt_gc_timeout __read_mostly   = RT_GC_TIMEOUT;
  * Interface to generic destination cache.
  */
 
-static struct dst_entry *ipv4_dst_check(struct dst_entry *dst, u32 cookie);
+INDIRECT_CALLABLE_SCOPE
+struct dst_entry   *ipv4_dst_check(struct dst_entry *dst, u32 cookie);
 static unsigned int ipv4_default_advmss(const struct dst_entry *dst);
 INDIRECT_CALLABLE_SCOPE
 unsigned int   ipv4_mtu(const struct dst_entry *dst);
@@ -1188,7 +1189,8 @@ void ipv4_sk_redirect(struct sk_buff *skb, struct sock *sk)
 }
 EXPORT_SYMBOL_GPL(ipv4_sk_redirect);
 
-static struct dst_entry *ipv4_dst_check(struct dst_entry *dst, u32 cookie)
+INDIRECT_CALLABLE_SCOPE struct dst_entry *ipv4_dst_check(struct dst_entry *dst,
+u32 cookie)
 {
struct rtable *rt = (struct rtable *) dst;
 
@@ -1204,6 +1206,7 @@ static struct dst_entry *ipv4_dst_check(struct dst_entry *dst, u32 cookie)
return NULL;
return dst;
 }
+EXPORT_SYMBOL(ipv4_dst_check);
 
 static void ipv4_send_dest_unreach(struct sk_buff *skb)
 {
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 777306b5bc22..611039207d30 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1649,6 +1649,8 @@ u16 tcp_v4_get_syncookie(struct sock *sk, struct iphdr *iph,
return mss;
 }
 
+INDIRECT_CALLABLE_DECLARE(struct dst_entry *ipv4_dst_check(struct dst_entry *,
+  u32));
 /* The socket must have it's spinlock held when we get
  * here, unless it is a TCP_LISTEN socket.
  *
@@ -1668,7 +1670,8 @@ int tcp_v4_do_rcv(struct sock *sk, struct sk_buff *skb)
sk_mark_napi_id(sk, skb);
if (dst) {
if (inet_sk(sk)->rx_dst_ifindex != skb->skb_iif ||
-   !dst->ops->check(ds

[PATCH net-next v3 0/4] net: use INDIRECT_CALL in some dst_ops

2021-02-01 Thread Brian Vazquez
This patch series uses the INDIRECT_CALL wrappers in some dst_ops
functions to mitigate retpoline costs. Benefits depend on the
platform as described below.

Background: The kernel rewrites the retpoline code at
__x86_indirect_thunk_r11 depending on the CPU's requirements.
The INDIRECT_CALL wrappers provide hints on possible targets and
save the retpoline overhead using a direct call in case the
target matches one of the hints.

The retpoline overhead for the following three cases has been
measured by Luigi Rizzo in microbenchmarks, using CPU performance
counters, and cover reasonably well the range of possible retpoline
overheads compared to a plain indirect call (in equal conditions,
specifically with predicted branch, hot cache):

- just "jmp *(%r11)" on modern platforms like Intel Cascadelake.
  In this case the overhead is just 2 clock cycles.

- "lfence; jmp *(%r11)" on e.g. some recent AMD CPUs.
  In this case the lfence is blocked until pending reads complete,
  so the actual overhead depends on previous instructions.
  The best case we have measured 15 clock cycles of overhead.

- worst case, e.g. skylake, the full retpoline is used

__x86_indirect_thunk_r11: call set_up_target
capture_speculation:      pause
                          lfence
                          jmp capture_speculation
.align 16
set_up_target:            mov %r11, (%rsp)
                          ret

   In this case the overhead has been measured in 35-40 clock cycles.

The actual time saved hence depends on the platform and current
clock speed (which varies heavily, especially when C-states are active).
Also note that actual benefit might be lower than expected if the
longer retpoline overlaps with some pending memory read.

MEASUREMENTS:
The INDIRECT_CALL wrappers in this patchset involve the processing
of incoming SYN and generation of syncookies. Hence, the test has been
run by configuring a receiving host with a single NIC rx queue, disabling
RPS and RFS so that all processing occurs on the same core.
An external source generates SYN fast enough to saturate the receiving CPU.
We ran two sets of experiments, with and without the dst_output patch,
comparing the number of syncookies generated over a 20s period
in multiple runs.


Assuming the CPU is saturated, the time per packet is
   t = total_time / number_of_packets
and if the two datasets have statistically meaningful difference,
the difference in times between the two cases gives an estimate
of the benefits from one INDIRECT_CALL.

Here are the experimental results:

Skylake Syncookies over 20s (5 tests)
---
indirect    9166325 9182023 9170093 9134014 9171082
retpoline   9099308 9126350 9154841 9056377 9122376

Computing the stats on the ns_pkt = 20e6/total_packets gives the following:

$ ministat -c 95 -w 70 /tmp/sk-indirect /tmp/sk-retp
x /tmp/sk-indirect
+ /tmp/sk-retp
+--+
|x xx x +  x+ +   +   +|
||__M__A___|_|M_A___|  |
+--+
N   Min   MaxMedian   AvgStddev
x   5   2.17817e-06   2.18962e-06 2.181e-06  2.182292e-06 4.3252133e-09
+   5   2.18464e-06   2.20839e-06   2.19241e-06  2.194974e-06 8.8695958e-09
Difference at 95.0% confidence
1.2682e-08 +/- 1.01766e-08
0.581132% +/- 0.466326%
(Student's t, pooled s = 6.97772e-09)

This suggests a difference of 13ns +/- 10ns
Our expectation from microbenchmarks was 35-40 cycles per call,
but part of the gains may be eaten by stalls from pending memory reads.
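To make the arithmetic concrete with the Skylake medians above:

    indirect:  20 s / 9170093 pkts ~= 2.181 us per packet
    retpoline: 20 s / 9122376 pkts ~= 2.192 us per packet
    delta      ~= 11 ns per packet

which is consistent with the 13ns +/- 10ns ministat estimate.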

For Cascadelake:
Cascadelake Syncookies over 20s (5 tests)
-
indirect 10339797 10297547 10366826 10378891 10384854
retpoline   10332674 10366805 10320374 10334272 10374087

Computing the stats on the ns_pkt = 20e6/total_packets gives no
meaningful difference even at just 80% (this was expected):

$ ministat -c 80 -w 70 /tmp/cl-indirect /tmp/cl-retp
x /tmp/cl-indirect
+ /tmp/cl-retp
+--+
|   xx  + *   x   + ++x|
||__|_M_A_A___M|___|   |
+--+
N   Min   MaxMedian   AvgStddev
x   5   1.92588e-06   1.94221e-06   1.92923e-06  1.931716e-06 6.6936746e-09
+   5   1.92788e-06   1.93791e-06   1.93531e-06  1.933188e-06 4.3734106e-09
No difference proven at 80.0% confidence

Changed in v3:
- fix From: tag
- provide measurements

Changed in v2:
- fix build issues reported by kernel test robot

Brian Vazquez (4):
  net:

[PATCH net-next v3 1/4] net: use indirect call helpers for dst_input

2021-02-01 Thread Brian Vazquez
This patch avoids the indirect call for the common case:
ip_local_deliver and ip6_input

Signed-off-by: Brian Vazquez 
---
 include/net/dst.h   | 6 +-
 net/ipv4/ip_input.c | 1 +
 2 files changed, 6 insertions(+), 1 deletion(-)

diff --git a/include/net/dst.h b/include/net/dst.h
index 10f0a8399867..98cf6e8c06c4 100644
--- a/include/net/dst.h
+++ b/include/net/dst.h
@@ -18,6 +18,7 @@
 #include 
 #include 
 #include 
+#include 
 
 struct sk_buff;
 
@@ -441,10 +442,13 @@ static inline int dst_output(struct net *net, struct sock *sk, struct sk_buff *s
return skb_dst(skb)->output(net, sk, skb);
 }
 
+INDIRECT_CALLABLE_DECLARE(int ip6_input(struct sk_buff *));
+INDIRECT_CALLABLE_DECLARE(int ip_local_deliver(struct sk_buff *));
 /* Input packet from network to transport.  */
 static inline int dst_input(struct sk_buff *skb)
 {
-   return skb_dst(skb)->input(skb);
+   return INDIRECT_CALL_INET(skb_dst(skb)->input,
+ ip6_input, ip_local_deliver, skb);
 }
 
 static inline struct dst_entry *dst_check(struct dst_entry *dst, u32 cookie)
diff --git a/net/ipv4/ip_input.c b/net/ipv4/ip_input.c
index b0c244af1e4d..3a025c011971 100644
--- a/net/ipv4/ip_input.c
+++ b/net/ipv4/ip_input.c
@@ -253,6 +253,7 @@ int ip_local_deliver(struct sk_buff *skb)
   net, NULL, skb, skb->dev, NULL,
   ip_local_deliver_finish);
 }
+EXPORT_SYMBOL(ip_local_deliver);
 
 static inline bool ip_rcv_options(struct sk_buff *skb, struct net_device *dev)
 {
-- 
2.30.0.365.g02bc693789-goog



Re: [PATCH] drm/msm/mdp5: Fix wait-for-commit for cmd panels

2021-01-31 Thread Brian Masney
On Wed, Jan 27, 2021 at 05:24:40PM +0200, Iskren Chernev wrote:
> Before the offending commit, wait_flush was called once per frame in
> msm_atomic_commit_tail, after the commit was submitted. After it,
> wait_flush is also called at the beginning to ensure previous,
> potentially async commits are done.
> 
> For cmd panels the source of wait_flush is a ping-pong irq notifying
> a completion. The completion needs to be notified with complete_all so
> multiple waiting parties (new async committers) can proceed.
> 
> Signed-off-by: Iskren Chernev 
> Suggested-by: Rob Clark 
> Fixes: 2d99ced787e3d ("drm/msm: async commit support")

Nice job tracking down this fix!

Reviewed-by: Brian Masney 
Tested-by: Brian Masney 
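The complete() vs complete_all() distinction the fix hinges on, in miniature
(generic completion API usage, not actual mdp5 code):

	DECLARE_COMPLETION(pp_done);

	/* each waiter, e.g. a new committer doing wait_flush */
	wait_for_completion(&pp_done);

	/* ping-pong irq handler: complete() would wake one waiter and
	 * consume the done count; complete_all() latches the completion
	 * so every current and future waiter proceeds until
	 * reinit_completion() is called.
	 */
	complete_all(&pp_done);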



Re: [PATCH] soc: qcom: ocmem: don't return NULL in of_get_ocmem

2021-01-31 Thread Brian Masney
On Sat, Jan 30, 2021 at 03:23:49PM +0100, Luca Weiss wrote:
> If ocmem probe fails for whatever reason, of_get_ocmem returned NULL.
> Without this, users must check for both NULL and IS_ERR on the returned
> pointer - which didn't happen in drivers/gpu/drm/msm/adreno/adreno_gpu.c
> leading to a NULL pointer dereference.
> 
> Fixes: 88c1e9404f1d ("soc: qcom: add OCMEM driver")
> Signed-off-by: Luca Weiss 

Reviewed-by: Brian Masney 



Re: [PATCH V2 3/6] x86_32/sysenter: switch to the task stack without emptying the entry stack

2021-01-26 Thread Brian Gerst
On Mon, Jan 25, 2021 at 11:35 AM Lai Jiangshan  wrote:
>
> From: Lai Jiangshan 
>
> Like the way x86_64 uses the "old" stack, we can save the entry stack
> pointer to a register and switch to the task stack.  So that we have
> space on the "old" stack to save more things or scratch registers.
>
> Signed-off-by: Lai Jiangshan 
> ---
>  arch/x86/entry/entry_32.S | 11 +--
>  1 file changed, 5 insertions(+), 6 deletions(-)
>
> diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S
> index 3b4d1a63d1f0..4513702ba45d 100644
> --- a/arch/x86/entry/entry_32.S
> +++ b/arch/x86/entry/entry_32.S
> @@ -905,19 +905,18 @@ SYM_FUNC_START(entry_SYSENTER_32)
> pushl   %eax
> BUG_IF_WRONG_CR3 no_user_check=1
> SWITCH_TO_KERNEL_CR3 scratch_reg=%eax
> -   popl    %eax
> -   popfl
>
> -   /* Stack empty again, switch to task stack */
> -   movl    TSS_entry2task_stack(%esp), %esp
> +   /* Switch to task stack */
> +   movl    %esp, %eax
> +   movl    (2*4+TSS_entry2task_stack)(%esp), %esp
>
>  .Lsysenter_past_esp:
> pushl   $__USER_DS  /* pt_regs->ss */
> pushl   $0  /* pt_regs->sp (placeholder) */
> -   pushfl  /* pt_regs->flags (except IF = 0) */
> +   pushl   4(%eax) /* pt_regs->flags (except IF = 0) */

__KERNEL_DS isn't loaded at this point, so this needs an explicit %ss:
override.  You probably didn't catch this because the default
__USER_DS was still loaded.

> pushl   $__USER_CS  /* pt_regs->cs */
> pushl   $0  /* pt_regs->ip = 0 (placeholder) */
> -   pushl   %eax/* pt_regs->orig_ax */
> +   pushl   (%eax)  /* pt_regs->orig_ax */

Add an %ss: override here too.

> SAVE_ALL pt_regs_ax=$-ENOSYS    /* save rest, stack already switched */
>
> /*
> --
> 2.19.1.6.gb485710b
>

--
Brian Gerst
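For clarity, the two pushes with the suggested overrides would look roughly
like this (sketch, untested; SYSENTER has loaded the kernel %ss, but %ds still
holds the user selector, and (%eax)-based memory operands default to %ds):

	pushl	%ss:4(%eax)	/* pt_regs->flags (except IF = 0) */
	...
	pushl	%ss:(%eax)	/* pt_regs->orig_ax */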


Re: linux-next: build warning after merge of the xfs tree

2021-01-25 Thread Brian Foster
On Mon, Jan 25, 2021 at 09:55:32AM +1100, Stephen Rothwell wrote:
> Hi all,
> 
> After merging the xfs tree, today's linux-next build (powerpc
> ppc64_defconfig) produced this warning:
> 
> fs/xfs/xfs_log.c: In function 'xfs_log_cover':
> fs/xfs/xfs_log.c::16: warning: unused variable 'log' [-Wunused-variable]
>   |  struct xlog  *log = mp->m_log;
>   |^~~
> 
> Introduced by commit
> 
>   303591a0a947 ("xfs: cover the log during log quiesce")
> 

Oops, patch below. Feel free to apply or squash into the original
commit.

Brian

--- 8< ---

From 6078f06e2bd4c82111a85a2032c39a56654b0be6 Mon Sep 17 00:00:00 2001
From: Brian Foster 
Date: Mon, 25 Jan 2021 08:22:56 -0500
Subject: [PATCH] xfs: fix unused log variable in xfs_log_cover()

The log variable is only used in kernels with asserts enabled.
Remove it and open code the dereference to avoid unused variable
warnings.

Signed-off-by: Brian Foster 
---
 fs/xfs/xfs_log.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
index 58699881c100..d8b814227734 100644
--- a/fs/xfs/xfs_log.c
+++ b/fs/xfs/xfs_log.c
@@ -1108,12 +1108,11 @@ static int
 xfs_log_cover(
struct xfs_mount*mp)
 {
-   struct xlog *log = mp->m_log;
int error = 0;
  bool need_covered;
 
-   ASSERT((xlog_cil_empty(log) && xlog_iclogs_empty(log) &&
-   !xfs_ail_min_lsn(log->l_ailp)) ||
+   ASSERT((xlog_cil_empty(mp->m_log) && xlog_iclogs_empty(mp->m_log) &&
+   !xfs_ail_min_lsn(mp->m_log->l_ailp)) ||
   XFS_FORCED_SHUTDOWN(mp));
 
if (!xfs_log_writable(mp))
-- 
2.26.2
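The failure mode in miniature, for anyone hitting the same -Wunused-variable
class of warning (sketch, not actual xfs code):

	void f(struct foo *mp)
	{
		struct xlog *log = mp->m_log;	/* only read inside ASSERT() */

		ASSERT(xlog_cil_empty(log));	/* compiled out without DEBUG,
						 * leaving 'log' unused */
	}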



Re: [PATCH 1/4] ARM: dts: qcom: msm8974: add gpu support

2021-01-24 Thread Brian Masney
On Sun, Jan 24, 2021 at 03:56:06PM +0100, Konrad Dybcio wrote:
> Hi,
> 
> 
> > +   gpu_opp_table: opp_table {
> > +   status = "disabled";
> 
> 
> Is there a good reason to disable this?
> 
> 
> > +   opp-8 {
> > +   opp-hz = /bits/ 64 <8>;
> > +   };
> 
> No. A330 can't go lightspeed (unless there's some secret ultra-binned 
> msm8974pro-v4-ad-5g). [1]

Hi Iskren,

I believe that the Nexus 5 GPU frequencies are documented in the
downstream kernel at:
https://github.com/AICP/kernel_lge_hammerhead/blob/n7.1/arch/arm/boot/dts/msm8974-gpu.dtsi#L67

I am fairly certain that the qcom,bus-freq property is an index into the
qcom,msm-bus,vectors-KBps property above. This will map to the
interconnect and operating points in the upstream kernel.

Note that the actual implementation in a3xx_gpu.c and a4xx_gpu.c
currently has this snippet to set the bus speed:

/*
 * Set the ICC path to maximum speed for now by multiplying the fastest
 * frequency by the bus width (8). We'll want to scale this later on to
 * improve battery life.
 */
icc_set_bw(icc_path, 0, Bps_to_icc(gpu->fast_rate) * 8);
icc_set_bw(ocmem_icc_path, 0, Bps_to_icc(gpu->fast_rate) * 8);

This should be fine for the time being. You'll want to document it
correctly in device tree though.

If the v2 changes too much, then feel free to drop my name from the
patch. I thought that I had made these changes already but apparently
not. :/

Brian


Re: [PATCH] xfs: set inode size after creating symlink

2021-01-21 Thread Brian Foster
On Thu, Jan 21, 2021 at 09:19:12AM -0600, Jeffrey Mitchell wrote:
> When XFS creates a new symlink, it writes its size to disk but not to the
> VFS inode. This causes i_size_read() to return 0 for that symlink until
> it is re-read from disk, for example when the system is rebooted.
> 
> I found this inconsistency while protecting directories with eCryptFS.
> The command "stat path/to/symlink/in/ecryptfs" will report "Size: 0" if
> the symlink was created after the last reboot on an XFS root.
> 
> Call i_size_write() in xfs_symlink()
> 
> Signed-off-by: Jeffrey Mitchell 
> ---

Reviewed-by: Brian Foster 

>  fs/xfs/xfs_symlink.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/fs/xfs/xfs_symlink.c b/fs/xfs/xfs_symlink.c
> index 1f43fd7f3209..c835827ae389 100644
> --- a/fs/xfs/xfs_symlink.c
> +++ b/fs/xfs/xfs_symlink.c
> @@ -300,6 +300,7 @@ xfs_symlink(
>   }
>   ASSERT(pathlen == 0);
>   }
> + i_size_write(VFS_I(ip), ip->i_d.di_size);
>  
>   /*
>* Create the directory entry for the symlink.
> -- 
> 2.25.1
> 
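A quick way to see the symptom being fixed (commands illustrative, paraphrasing
the commit message; device and mount point are made up):

	# ln -s target /mnt/xfs/link
	# stat -c %s /mnt/xfs/link    # reports 0 before the fix
	# umount /mnt/xfs && mount /dev/sdb1 /mnt/xfs
	# stat -c %s /mnt/xfs/link    # reports 6 once re-read from disk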



Re: [PATCH] marvell/mwifiex: replace one-element array with flexible-array member.

2021-01-19 Thread Brian Norris
One more thing, for context:

On Tue, Jan 19, 2021 at 11:11 AM Brian Norris  wrote:
> On Fri, Jan 15, 2021 at 1:39 AM Jiapeng Zhong
>  wrote:
> >
> > Fix the follow coccicheck warnings:
> >
> > ./drivers/net/wireless/marvell/mwifiex/fw.h: WARNING use flexible-array
> > member instead(https://www.kernel.org/doc/html/latest/process/
> > deprecated.html#zero-length-and-one-element-arrays)
> >
> > Reported-by: Abaci Robot 
> > Signed-off-by: Jiapeng Zhong 
>
> Past experience unfortunately requires me to ask: did you test your
> changes? I understand that's a mostly legit warning, and a good
> deprecation notice, but that doesn't mean this is the right fix. One
> probably should instead audit the usage sites to see if they *are*
> already making proper sizeof (or other) comparisons, and if not, fix
> those first. And if any sites *are* doing correct sizeof computations
> using the existing struct layouts, then you are probably breaking
> them.
>
> Or if you have audited the usage of these structs, it's nice to make a
> small explanation of why this is correct, so I (and other readers)
> don't have to ask these questions :)

FYI, there are others who I believe are making similar blind changes
to this code:

[PATCH 1/1] mwifiex: Fix possible buffer overflows in mwifiex_config_scan
https://lore.kernel.org/linux-wireless/CA+ASDXPkLg2GGFJTt25YO7wae==yahftf8jxu520pl_vzat...@mail.gmail.com/

For that particular case (the 'ssid' field in
mwifiex_ie_types_wildcard_ssid_params), the previous patch-er was
incorrect, and I believe your change is a better one. But neither of
you provided useful analysis.

Brian


Re: [PATCH] marvell/mwifiex: replace one-element array with flexible-array member.

2021-01-19 Thread Brian Norris
Hi,

On Fri, Jan 15, 2021 at 1:39 AM Jiapeng Zhong
 wrote:
>
> Fix the follow coccicheck warnings:
>
> ./drivers/net/wireless/marvell/mwifiex/fw.h: WARNING use flexible-array
> member instead(https://www.kernel.org/doc/html/latest/process/
> deprecated.html#zero-length-and-one-element-arrays)
>
> Reported-by: Abaci Robot 
> Signed-off-by: Jiapeng Zhong 

Past experience unfortunately requires me to ask: did you test your
changes? I understand that's a mostly legit warning, and a good
deprecation notice, but that doesn't mean this is the right fix. One
probably should instead audit the usage sites to see if they *are*
already making proper sizeof (or other) comparisons, and if not, fix
those first. And if any sites *are* doing correct sizeof computations
using the existing struct layouts, then you are probably breaking
them.

Or if you have audited the usage of these structs, it's nice to make a
small explanation of why this is correct, so I (and other readers)
don't have to ask these questions :)

Regards,
Brian


RE: [PATCH 04/18] arch: hexagon: Don't select HAVE_OPROFILE

2021-01-15 Thread Brian Cain
> -Original Message-
> From: Viresh Kumar 
> Sent: Thursday, January 14, 2021 5:35 AM
...
> The "oprofile" user-space tools don't use the kernel OPROFILE support any
> more, and haven't in a long time. User-space has been converted to the
> perf
> interfaces.
> 
> Don't select HAVE_OPROFILE for hexagon anymore.

Acked-by: Brian Cain 

> Suggested-by: Christoph Hellwig 
> Suggested-by: Linus Torvalds 
> Signed-off-by: Viresh Kumar 
> ---
>  arch/hexagon/Kconfig | 1 -
>  1 file changed, 1 deletion(-)
> 
> diff --git a/arch/hexagon/Kconfig b/arch/hexagon/Kconfig index
> 6e00c16a36b5..44a409967af1 100644
> --- a/arch/hexagon/Kconfig
> +++ b/arch/hexagon/Kconfig
> @@ -7,7 +7,6 @@ config HEXAGON
>   select ARCH_32BIT_OFF_T
>   select ARCH_HAS_SYNC_DMA_FOR_DEVICE
>   select ARCH_NO_PREEMPT
> - select HAVE_OPROFILE
>   # Other pending projects/to-do items.
>   # select HAVE_REGS_AND_STACK_ACCESS_API
>   # select HAVE_HW_BREAKPOINT if PERF_EVENTS
> --
> 2.25.0.rc1.19.g042ed3e048af




RE: [PATCH] hexagon: remove CONFIG_EXPERIMENTAL from defconfigs

2021-01-15 Thread Brian Cain
> -Original Message-
> From: Randy Dunlap 
...
> Since CONFIG_EXPERIMENTAL was removed in 2013, go ahead and drop it
> from any defconfig files.

Acked-by: Brian Cain 

> Fixes: 3d374d09f16f ("final removal of CONFIG_EXPERIMENTAL")
> Signed-off-by: Randy Dunlap 
> Cc: Kees Cook 
> Cc: Greg Kroah-Hartman 
> Cc: Brian Cain 
> Cc: linux-hexa...@vger.kernel.org
> Cc: Andrew Morton 
> ---
>  arch/hexagon/configs/comet_defconfig |1 -
>  1 file changed, 1 deletion(-)
> 
> --- linux-next-20210114.orig/arch/hexagon/configs/comet_defconfig
> +++ linux-next-20210114/arch/hexagon/configs/comet_defconfig
> @@ -1,7 +1,6 @@
>  CONFIG_SMP=y
>  CONFIG_DEFAULT_MMAP_MIN_ADDR=0
>  CONFIG_HZ_100=y
> -CONFIG_EXPERIMENTAL=y
>  CONFIG_CROSS_COMPILE="hexagon-"
>  CONFIG_LOCALVERSION="-smp"
>  # CONFIG_LOCALVERSION_AUTO is not set



Re: [PATCH v5 00/21] ibmvfc: initial MQ development/enablement

2021-01-14 Thread Brian King
Tyrel,

I think this patch series is looking pretty good. I don't think we need
to wait for resolution of the can_queue issue being discussed, since
that is an issue that exists prior to this patch series and this patch
series doesn't make the issue any worse. Let's work that separately.

Thanks,

Brian

-- 
Brian King
Power Linux I/O
IBM Linux Technology Center



Re: [PATCH v5 21/21] ibmvfc: provide modules parameters for MQ settings

2021-01-14 Thread Brian King
Reviewed-by: Brian King 


-- 
Brian King
Power Linux I/O
IBM Linux Technology Center



Re: [PATCH v5 18/21] ibmvfc: send Cancel MAD down each hw scsi channel

2021-01-14 Thread Brian King
Reviewed-by: Brian King 


-- 
Brian King
Power Linux I/O
IBM Linux Technology Center



Re: [PATCH v4 01/21] ibmvfc: add vhost fields and defaults for MQ enablement

2021-01-14 Thread Brian King
On 1/13/21 7:27 PM, Ming Lei wrote:
> On Wed, Jan 13, 2021 at 11:13:07AM -0600, Brian King wrote:
>> On 1/12/21 6:33 PM, Tyrel Datwyler wrote:
>>> On 1/12/21 2:54 PM, Brian King wrote:
>>>> On 1/11/21 5:12 PM, Tyrel Datwyler wrote:
>>>>> Introduce several new vhost fields for managing MQ state of the adapter
>>>>> as well as initial defaults for MQ enablement.
>>>>>
>>>>> Signed-off-by: Tyrel Datwyler 
>>>>> ---
>>>>>  drivers/scsi/ibmvscsi/ibmvfc.c | 8 
>>>>>  drivers/scsi/ibmvscsi/ibmvfc.h | 9 +
>>>>>  2 files changed, 17 insertions(+)
>>>>>
>>>>> diff --git a/drivers/scsi/ibmvscsi/ibmvfc.c 
>>>>> b/drivers/scsi/ibmvscsi/ibmvfc.c
>>>>> index ba95438a8912..9200fe49c57e 100644
>>>>> --- a/drivers/scsi/ibmvscsi/ibmvfc.c
>>>>> +++ b/drivers/scsi/ibmvscsi/ibmvfc.c
>>>>> @@ -3302,6 +3302,7 @@ static struct scsi_host_template driver_template = {
>>>>>   .max_sectors = IBMVFC_MAX_SECTORS,
>>>>>   .shost_attrs = ibmvfc_attrs,
>>>>>   .track_queue_depth = 1,
>>>>> + .host_tagset = 1,
>>>>
>>>> This doesn't seem right. You are setting host_tagset, which means you want 
>>>> a
>>>> shared, host wide, tag set for commands. It also means that the total
>>>> queue depth for the host is can_queue. However, it looks like you are 
>>>> allocating
>>>> max_requests events for each sub crq, which means you are over allocating 
>>>> memory.
>>>
>>> With the shared tagset yes the queue depth for the host is can_queue, but 
>>> this
>>> also implies that the max queue depth for each hw queue is also can_queue. 
>>> So,
>>> in the worst case that all commands are queued down the same hw queue we 
>>> need an
>>> event pool with can_queue commands.
>>>
>>>>
>>>> Looking at this closer, we might have bigger problems. There is a host wide
>>>> max number of commands that the VFC host supports, which gets returned on
>>>> NPIV Login. This value can change across a live migration event.
>>>
>>> From what I understand the max commands can only become less.
>>>
>>>>
>>>> The ibmvfc driver, which does the same thing the lpfc driver does, modifies
>>>> can_queue on the scsi_host *after* the tag set has been allocated. This 
>>>> looks
>>>> to be a concern with ibmvfc, not sure about lpfc, as it doesn't look like
>>>> we look at can_queue once the tag set is setup, and I'm not seeing a good 
>>>> way
>>>> to dynamically change the host queue depth once the tag set is setup. 
>>>>
>>>> Unless I'm missing something, our best options appear to either be to 
>>>> implement
>>>> our own host wide busy reference counting, which doesn't sound very good, 
>>>> or
>>>> we need to add some API to block / scsi that allows us to dynamically 
>>>> change
>>>> can_queue.
>>>
>>> Changing can_queue won't do use any good with the shared tagset becasue each
>>> queue still needs to be able to queue can_queue number of commands in the 
>>> worst
>>> case.
>>
>> The issue I'm trying to highlight here is the following scenario:
>>
>> 1. We set shost->can_queue, then call scsi_add_host, which allocates the tag 
>> set.
>>
>> 2. On our NPIV login response from the VIOS, we might get a lower value than 
>> we
>> initially set in shost->can_queue, so we update it, but nobody ever looks at 
>> it
>> again, and we don't have any protection against sending too many commands to 
>> the host.
>>
>>
>> Basically, we no longer have any code that ensures we don't send more
>> commands to the VIOS than we are told it supports. According to the 
>> architecture,
>> if we actually do this, the VIOS will do an h_free_crq, which would be a bit
>> of a bug on our part.
>>
>> I don't think it was ever clearly defined in the API that a driver can
>> change shost->can_queue after calling scsi_add_host, but up until
>> commit 6eb045e092efefafc6687409a6fa6d1dabf0fb69, this worked and now
>> it doesn't. 
> 
> Actually it isn't related with commit 6eb045e092ef, because 
> blk_mq_alloc_tag_set()
> uses .can_queue to create driver tag sbitmap and requ

Re: [PATCH v4 00/21] ibmvfc: initial MQ development

2021-01-13 Thread Brian King
With the exception of the few comments I've shared, the rest of this looks
good to me and you can add my:

Reviewed-by: Brian King 

Thanks,

Brian

-- 
Brian King
Power Linux I/O
IBM Linux Technology Center



Re: [PATCH v4 21/21] ibmvfc: provide modules parameters for MQ settings

2021-01-13 Thread Brian King
On 1/11/21 5:12 PM, Tyrel Datwyler wrote:
> @@ -5880,12 +5936,13 @@ static int ibmvfc_probe(struct vio_dev *vdev, const struct vio_device_id *id)
>  
>   shost->transportt = ibmvfc_transport_template;
>   shost->can_queue = max_requests;
> + shost->can_queue = (max_requests / nr_scsi_hw_queues);

This looks to be in conflict with the fact that the first patch requested a 
shared tag set, right?

>   shost->max_lun = max_lun;
>   shost->max_id = max_targets;
>   shost->max_sectors = IBMVFC_MAX_SECTORS;
>   shost->max_cmd_len = IBMVFC_MAX_CDB_LEN;
>   shost->unique_id = shost->host_no;
> - shost->nr_hw_queues = IBMVFC_MQ ? IBMVFC_SCSI_HW_QUEUES : 1;
> + shost->nr_hw_queues = mq_enabled ? nr_scsi_hw_queues : 1;

You might want to range check this, to make sure it's sane.
>  
>   vhost = shost_priv(shost);
>   INIT_LIST_HEAD(&vhost->targets);
> @@ -5897,8 +5954,8 @@ static int ibmvfc_probe(struct vio_dev *vdev, const struct vio_device_id *id)
>   vhost->log_level = log_level;
>   vhost->task_set = 1;
>  
> - vhost->mq_enabled = IBMVFC_MQ;
> - vhost->client_scsi_channels = IBMVFC_SCSI_CHANNELS;
> + vhost->mq_enabled = mq_enabled;
> + vhost->client_scsi_channels = min(nr_scsi_hw_queues, nr_scsi_channels);
>   vhost->using_channels = 0;
>   vhost->do_enquiry = 1;
>  
> 


-- 
Brian King
Power Linux I/O
IBM Linux Technology Center



Re: [PATCH v4 01/21] ibmvfc: add vhost fields and defaults for MQ enablement

2021-01-13 Thread Brian King
On 1/12/21 6:33 PM, Tyrel Datwyler wrote:
> On 1/12/21 2:54 PM, Brian King wrote:
>> On 1/11/21 5:12 PM, Tyrel Datwyler wrote:
>>> Introduce several new vhost fields for managing MQ state of the adapter
>>> as well as initial defaults for MQ enablement.
>>>
>>> Signed-off-by: Tyrel Datwyler 
>>> ---
>>>  drivers/scsi/ibmvscsi/ibmvfc.c | 8 
>>>  drivers/scsi/ibmvscsi/ibmvfc.h | 9 +
>>>  2 files changed, 17 insertions(+)
>>>
>>> diff --git a/drivers/scsi/ibmvscsi/ibmvfc.c b/drivers/scsi/ibmvscsi/ibmvfc.c
>>> index ba95438a8912..9200fe49c57e 100644
>>> --- a/drivers/scsi/ibmvscsi/ibmvfc.c
>>> +++ b/drivers/scsi/ibmvscsi/ibmvfc.c
>>> @@ -3302,6 +3302,7 @@ static struct scsi_host_template driver_template = {
>>> .max_sectors = IBMVFC_MAX_SECTORS,
>>> .shost_attrs = ibmvfc_attrs,
>>> .track_queue_depth = 1,
>>> +   .host_tagset = 1,
>>
>> This doesn't seem right. You are setting host_tagset, which means you want a
>> shared, host wide, tag set for commands. It also means that the total
>> queue depth for the host is can_queue. However, it looks like you are 
>> allocating
>> max_requests events for each sub crq, which means you are over allocating 
>> memory.
> 
> With the shared tagset yes the queue depth for the host is can_queue, but this
> also implies that the max queue depth for each hw queue is also can_queue. So,
> in the worst case that all commands are queued down the same hw queue we need 
> an
> event pool with can_queue commands.
> 
>>
>> Looking at this closer, we might have bigger problems. There is a host wide
>> max number of commands that the VFC host supports, which gets returned on
>> NPIV Login. This value can change across a live migration event.
> 
> From what I understand the max commands can only become less.
> 
>>
>> The ibmvfc driver, which does the same thing the lpfc driver does, modifies
>> can_queue on the scsi_host *after* the tag set has been allocated. This looks
>> to be a concern with ibmvfc, not sure about lpfc, as it doesn't look like
>> we look at can_queue once the tag set is setup, and I'm not seeing a good way
>> to dynamically change the host queue depth once the tag set is setup. 
>>
>> Unless I'm missing something, our best options appear to either be to 
>> implement
>> our own host wide busy reference counting, which doesn't sound very good, or
>> we need to add some API to block / scsi that allows us to dynamically change
>> can_queue.
> 
> Changing can_queue won't do use any good with the shared tagset becasue each
> queue still needs to be able to queue can_queue number of commands in the 
> worst
> case.

The issue I'm trying to highlight here is the following scenario:

1. We set shost->can_queue, then call scsi_add_host, which allocates the tag 
set.

2. On our NPIV login response from the VIOS, we might get a lower value than we
initially set in shost->can_queue, so we update it, but nobody ever looks at it
again, and we don't have any protection against sending too many commands to 
the host.


Basically, we no longer have any code that ensures we don't send more
commands to the VIOS than we are told it supports. According to the 
architecture,
if we actually do this, the VIOS will do an h_free_crq, which would be a bit
of a bug on our part.

I don't think it was ever clearly defined in the API that a driver can
change shost->can_queue after calling scsi_add_host, but up until
commit 6eb045e092efefafc6687409a6fa6d1dabf0fb69, this worked and now
it doesn't. 

I started looking through drivers that do this, and so far, it looks like the
following drivers do: ibmvfc, lpfc, aix94xx, libfc, BusLogic, and likely 
others...

We probably need an API that lets us change shost->can_queue dynamically.

Thanks,

Brian

-- 
Brian King
Power Linux I/O
IBM Linux Technology Center
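In outline, the ordering being described is (ibmvfc-flavoured pseudocode,
names abbreviated, not actual driver code):

	shost->can_queue = max_requests;  /* initial guess */
	scsi_add_host(shost, dev);        /* tag set sized from can_queue */
	/* ... NPIV login completes, VIOS reports a lower limit ... */
	shost->can_queue = npiv_max_cmds; /* updated, but the tag set was
	                                   * already allocated: nothing now
	                                   * bounds in-flight commands to
	                                   * npiv_max_cmds */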



Re: [PATCH v4 01/21] ibmvfc: add vhost fields and defaults for MQ enablement

2021-01-12 Thread Brian King
On 1/11/21 5:12 PM, Tyrel Datwyler wrote:
> Introduce several new vhost fields for managing MQ state of the adapter
> as well as initial defaults for MQ enablement.
> 
> Signed-off-by: Tyrel Datwyler 
> ---
>  drivers/scsi/ibmvscsi/ibmvfc.c | 8 
>  drivers/scsi/ibmvscsi/ibmvfc.h | 9 +
>  2 files changed, 17 insertions(+)
> 
> diff --git a/drivers/scsi/ibmvscsi/ibmvfc.c b/drivers/scsi/ibmvscsi/ibmvfc.c
> index ba95438a8912..9200fe49c57e 100644
> --- a/drivers/scsi/ibmvscsi/ibmvfc.c
> +++ b/drivers/scsi/ibmvscsi/ibmvfc.c
> @@ -3302,6 +3302,7 @@ static struct scsi_host_template driver_template = {
>   .max_sectors = IBMVFC_MAX_SECTORS,
>   .shost_attrs = ibmvfc_attrs,
>   .track_queue_depth = 1,
> + .host_tagset = 1,

This doesn't seem right. You are setting host_tagset, which means you want a
shared, host wide, tag set for commands. It also means that the total
queue depth for the host is can_queue. However, it looks like you are allocating
max_requests events for each sub crq, which means you are over allocating 
memory.

Looking at this closer, we might have bigger problems. There is a host wide
max number of commands that the VFC host supports, which gets returned on
NPIV Login. This value can change across a live migration event. 

The ibmvfc driver, which does the same thing the lpfc driver does, modifies
can_queue on the scsi_host *after* the tag set has been allocated. This looks
to be a concern with ibmvfc, not sure about lpfc, as it doesn't look like
we look at can_queue once the tag set is setup, and I'm not seeing a good way
to dynamically change the host queue depth once the tag set is setup. 

Unless I'm missing something, our best options appear to either be to implement
our own host wide busy reference counting, which doesn't sound very good, or
we need to add some API to block / scsi that allows us to dynamically change
can_queue.

Thanks,

Brian


-- 
Brian King
Power Linux I/O
IBM Linux Technology Center



Re: [PATCH v4 18/21] ibmvfc: send Cancel MAD down each hw scsi channel

2021-01-12 Thread Brian King
On 1/11/21 5:12 PM, Tyrel Datwyler wrote:
> diff --git a/drivers/scsi/ibmvscsi/ibmvfc.c b/drivers/scsi/ibmvscsi/ibmvfc.c
> index b0b0212344f3..24e1278acfeb 100644
> --- a/drivers/scsi/ibmvscsi/ibmvfc.c
> +++ b/drivers/scsi/ibmvscsi/ibmvfc.c
> @@ -2418,18 +2418,79 @@ static struct ibmvfc_event *ibmvfc_init_tmf(struct ibmvfc_queue *queue,
>   return evt;
>  }
>  
> -/**
> - * ibmvfc_cancel_all - Cancel all outstanding commands to the device
> - * @sdev:scsi device to cancel commands
> - * @type:type of error recovery being performed
> - *
> - * This sends a cancel to the VIOS for the specified device. This does
> - * NOT send any abort to the actual device. That must be done separately.
> - *
> - * Returns:
> - *   0 on success / other on failure
> - **/
> -static int ibmvfc_cancel_all(struct scsi_device *sdev, int type)
> +static int ibmvfc_cancel_all_mq(struct scsi_device *sdev, int type)
> +{
> + struct ibmvfc_host *vhost = shost_priv(sdev->host);
> + struct ibmvfc_event *evt, *found_evt, *temp;
> + struct ibmvfc_queue *queues = vhost->scsi_scrqs.scrqs;
> + unsigned long flags;
> + int num_hwq, i;
> + LIST_HEAD(cancelq);
> + u16 status;
> +
> + ENTER;
> + spin_lock_irqsave(vhost->host->host_lock, flags);
> + num_hwq = vhost->scsi_scrqs.active_queues;
> + for (i = 0; i < num_hwq; i++) {
> + spin_lock(queues[i].q_lock);
> + spin_lock(&queues[i].l_lock);
> + found_evt = NULL;
> + list_for_each_entry(evt, &queues[i].sent, queue_list) {
> + if (evt->cmnd && evt->cmnd->device == sdev) {
> + found_evt = evt;
> + break;
> + }
> + }
> + spin_unlock(&queues[i].l_lock);
> +

I really don't like the way the first for loop grabs all the q_locks, and then
there is a second for loop that drops all these locks... I think this code would
be cleaner if it looked like:

		if (found_evt && vhost->logged_in) {
			evt = ibmvfc_init_tmf(&queues[i], sdev, type);
			evt->sync_iu = &queues[i].cancel_rsp;
			ibmvfc_send_event(evt, vhost, default_timeout);
			list_add_tail(&evt->cancel, &cancelq);
		}

		spin_unlock(queues[i].q_lock);
	}

> + if (!found_evt)
> + continue;
> +
> + if (vhost->logged_in) {
> + evt = ibmvfc_init_tmf(&queues[i], sdev, type);
> + evt->sync_iu = &queues[i].cancel_rsp;
> + ibmvfc_send_event(evt, vhost, default_timeout);
> + list_add_tail(&evt->cancel, &cancelq);
> + }
> + }
> +
> + for (i = 0; i < num_hwq; i++)
> + spin_unlock(queues[i].q_lock);
> + spin_unlock_irqrestore(vhost->host->host_lock, flags);
> +
> + if (list_empty(&cancelq)) {
> + if (vhost->log_level > IBMVFC_DEFAULT_LOG_LEVEL)
> + sdev_printk(KERN_INFO, sdev, "No events found to 
> cancel\n");
> + return 0;
> + }
> +
> + sdev_printk(KERN_INFO, sdev, "Cancelling outstanding commands.\n");
> +
> + list_for_each_entry_safe(evt, temp, &cancelq, cancel) {
> + wait_for_completion(&evt->comp);
> + status = be16_to_cpu(evt->queue->cancel_rsp.mad_common.status);

You probably want a list_del(&evt->cancel) here.

> + ibmvfc_free_event(evt);
> +
> + if (status != IBMVFC_MAD_SUCCESS) {
> + sdev_printk(KERN_WARNING, sdev, "Cancel failed with rc=%x\n", status);
> + switch (status) {
> + case IBMVFC_MAD_DRIVER_FAILED:
> + case IBMVFC_MAD_CRQ_ERROR:
> + /* Host adapter most likely going through reset, return success to
> +  * the caller will wait for the command being cancelled to get returned
> +  */
> + break;
> + default:
> + break;

I think this default break needs to be a return -EIO.

> + }
> + }
> + }
> +
> + sdev_printk(KERN_INFO, sdev, "Successfully cancelled outstanding commands\n");
> + return 0;
> +}
> +
> +static int ibmvfc_cancel_all_sq(struct scsi_device *sdev, int type)
>  {
>   struct ibmvfc_host *vhost = shost_priv(sdev->host);
>   struct ibmvfc_event *evt, *found_evt;
> @@ -2498,6 +2559,27 @@ static int ibmvfc_cancel_all(struct scsi_device *sdev, int type)
>   return 0;
>  }
>  



-- 
Brian King
Power Linux I/O
IBM Linux Technology Center
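Folding the three suggestions above together, the completion loop would look
roughly like this (illustrative sketch against the quoted code, not
compile-tested):

	list_for_each_entry_safe(evt, temp, &cancelq, cancel) {
		wait_for_completion(&evt->comp);
		status = be16_to_cpu(evt->queue->cancel_rsp.mad_common.status);
		list_del(&evt->cancel);
		ibmvfc_free_event(evt);

		if (status != IBMVFC_MAD_SUCCESS) {
			sdev_printk(KERN_WARNING, sdev, "Cancel failed with rc=%x\n", status);
			switch (status) {
			case IBMVFC_MAD_DRIVER_FAILED:
			case IBMVFC_MAD_CRQ_ERROR:
				/* host likely resetting; treat as success */
				break;
			default:
				return -EIO;
			}
		}
	}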



Re: [PATCH 1/1] mwifiex: Fix possible buffer overflows in mwifiex_config_scan

2021-01-11 Thread Brian Norris
(Note: this is version 1; there's a later version posted, which does
not have a v2 tag...)

https://lore.kernel.org/linux-wireless/20201208150951.35866-1-ruc_zhangxiao...@163.com/

On Sat, Jan 9, 2021 at 7:11 AM Peter Seiderer  wrote:
> On Tue,  8 Dec 2020 20:45:23 +0800, Xiaohui Zhang  
> wrote:
> > From: Zhang Xiaohui 
> > mwifiex_config_scan() calls memcpy() without checking
> > the destination size may trigger a buffer overflower,
> > which a local user could use to cause denial of service
> > or the execution of arbitrary code.
> > Fix it by putting the length check before calling memcpy().
> >
> > Signed-off-by: Zhang Xiaohui 
> > ---
> >  drivers/net/wireless/marvell/mwifiex/scan.c | 2 ++
> >  1 file changed, 2 insertions(+)
> >
> > diff --git a/drivers/net/wireless/marvell/mwifiex/scan.c 
> > b/drivers/net/wireless/marvell/mwifiex/scan.c
> > index c2a685f63..b1d90678f 100644
> > --- a/drivers/net/wireless/marvell/mwifiex/scan.c
> > +++ b/drivers/net/wireless/marvell/mwifiex/scan.c
> > @@ -930,6 +930,8 @@ mwifiex_config_scan(struct mwifiex_private *priv,
> >   "DIRECT-", 7))
> >   wildcard_ssid_tlv->max_ssid_length = 0xfe;
> >
> > + if (ssid_len > 1)
> > + ssid_len = 1;
>
> Why do you believe the available size is only '1'? An SSID is expected
> to be of size IEEE80211_MAX_SSID_LEN/32 and the wildcard_ssid_tlv pointer
> is casted from tlv_pos (some lines above) which is a pointer/index into
> scan_cfg_out->tlv_buf...

I pointed out this discrepancy already, taking a slightly different approach:

https://lore.kernel.org/linux-wireless/ca+asdxpvu5s0vm0aocyqln090u3bwa_nv358ywkpxuu223u...@mail.gmail.com/

> And the define (line 44) indicates there should be enough space for a SSID:
>
>   42 /* Memory needed to store a max number/size WildCard SSID TLV for a 
> firmware
>   43 scan */
>   44 #define WILDCARD_SSID_TLV_MAX_SIZE  \
>   45 (MWIFIEX_MAX_SSID_LIST_LENGTH * \
>   46 (sizeof(struct mwifiex_ie_types_wildcard_ssid_params)   \
>   47 + IEEE80211_MAX_SSID_LEN))

Ah, good catch. So this may not be a true overflow issue at all, even
if it's confusing and bad code. The "problem" is that this piece of
the struct is variable-length, and unless we're going to dynamically
resize/reallocate the whole buffer, we have to assume the initial
allocation was large enough (and per your note, it is).

> For sure something to improve here instead of using a confusing 'u8 ssid[1]'
> in struct mwifiex_ie_types_wildcard_ssid_params...

Yep.

Brian
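For reference, the struct under discussion looks like this (paraphrased from
drivers/net/wireless/marvell/mwifiex/fw.h; layout from memory, so treat as
illustrative rather than authoritative):

	struct mwifiex_ie_types_wildcard_ssid_params {
		struct mwifiex_ie_types_header header;
		u8 max_ssid_length;
		u8 ssid[1];	/* one-element array: sizeof() already counts
				 * one SSID byte, which any sizeof-based length
				 * math must shed if this becomes u8 ssid[] */
	} __packed;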


Re: [PATCH] xfs: Wake CIL push waiters more reliably

2021-01-11 Thread Brian Foster
On Fri, Jan 08, 2021 at 11:56:57AM -0500, Brian Foster wrote:
> On Fri, Jan 08, 2021 at 08:54:44AM +1100, Dave Chinner wrote:
> > On Mon, Jan 04, 2021 at 11:23:53AM -0500, Brian Foster wrote:
> > > On Thu, Dec 31, 2020 at 09:16:11AM +1100, Dave Chinner wrote:
> > > > On Wed, Dec 30, 2020 at 12:56:27AM +0100, Donald Buczek wrote:
> > > > > If the value goes below the limit while some threads are
> > > > > already waiting but before the push worker gets to it, these threads 
> > > > > are
> > > > > not woken.
> > > > > 
> > > > > Always wake all CIL push waiters. Test with waitqueue_active() as an
> > > > > optimization. This is possible, because we hold the xc_push_lock
> > > > > spinlock, which prevents additions to the waitqueue.
> > > > > 
> > > > > Signed-off-by: Donald Buczek 
> > > > > ---
> > > > >  fs/xfs/xfs_log_cil.c | 2 +-
> > > > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > > > > 
> > > > > diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
> > > > > index b0ef071b3cb5..d620de8e217c 100644
> > > > > --- a/fs/xfs/xfs_log_cil.c
> > > > > +++ b/fs/xfs/xfs_log_cil.c
> > > > > @@ -670,7 +670,7 @@ xlog_cil_push_work(
> > > > >   /*
> > > > >* Wake up any background push waiters now this context is 
> > > > > being pushed.
> > > > >*/
> > > > > - if (ctx->space_used >= XLOG_CIL_BLOCKING_SPACE_LIMIT(log))
> > > > > + if (waitqueue_active(&cil->xc_push_wait))
> > > > >   wake_up_all(&cil->xc_push_wait);
> > > > 
> > > > That just smells wrong to me. It *might* be correct, but this
> > > > condition should pair with the sleep condition, as space used by a
> > > > CIL context should never actually decrease
> > > > 
> > > 
> > > ... but I'm a little confused by this assertion. The shadow buffer
> > > allocation code refers to the possibility of shadow buffers falling out
> > > that are smaller than currently allocated buffers. Further, the
> > > _insert_format_items() code appears to explicitly optimize for this
> > > possibility by reusing the active buffer, subtracting the old size/count
> > > values from the diff variables and then reformatting the latest
> > > (presumably smaller) item to the lv.
> > 
> > Individual items might shrink, but the overall transaction should
> > grow. Think of an extent to btree conversion of an inode fork. The
> > data in the inode fork decreases from a list of extents to a btree
> > root block pointer, so the inode item shrinks. But then we add a new
> > btree root block that contains all the extents + the btree block
> > header, and it gets rounded up to the 128 byte buffer logging chunk
> > size.
> > 
> > IOWs, while the inode item has decreased in size, the overall
> > space consumed by the transaction has gone up and so the CIL ctx
> > used_space should increase. Hence we can't just look at individual
> > log items and whether they have decreased in size - we have to look
> > at all the items in the transaction to understand how the space used
> > in that transaction has changed. i.e. it's the aggregation of all
> > items in the transaction that matter here, not so much the
> > individual items.
> > 
> 
> Ok, that makes more sense...
> 
> > > Of course this could just be implementation detail. I haven't dug into
> > > the details in the remainder of this thread and I don't have specific
> > > examples off the top of my head, but perhaps based on the ability of
> > > various structures to change formats and the ability of log vectors to
> > > shrink in size, shouldn't we expect the possibility of a CIL context to
> > > shrink in size as well? Just from poking around the CIL it seems like
> > > the surrounding code supports it (xlog_cil_insert_items() checks len > 0
> > > for recalculating split res as well)...
> > 
> > Yes, there may be situations where it decreases. It may be this is
> > fine, but the assumption *I've made* in lots of the CIL push code is
> > that ctx->used_space rarely, if ever, will go backwards.
> > 
> 
> ... and rarely seems a bit more pragmatic than never.
> 

FWIW, a cursory look at the inode size/format code (motivated by
Donald's recent log dump that appear

Re: [PATCH] x86/vm86/32: Remove VM86_SCREEN_BITMAP support

2021-01-08 Thread Brian Gerst
On Fri, Jan 8, 2021 at 1:59 PM Andy Lutomirski  wrote:
>
> The implementation was rather buggy.  It unconditionally marked PTEs
> read-only, even for VM_SHARED mappings.  I'm not sure whether this is
> actually a problem, but it certainly seems unwise.  More importantly, it
> released the mmap lock before flushing the TLB, which could allow a racing
> CoW operation to falsely believe that the underlying memory was not
> writable.
>
> I can't find any users at all of this mechanism, so just remove it.
>
> Cc: Andrea Arcangeli 
> Cc: Linux-MM 
> Cc: Jason Gunthorpe 
> Cc: x...@kernel.org
> Cc: Linus Torvalds 
> Cc: Matthew Wilcox 
> Cc: Jann Horn 
> Cc: Jan Kara 
> Cc: Yu Zhao 
> Cc: Peter Xu 
> Signed-off-by: Andy Lutomirski 
> ---
>  arch/x86/include/uapi/asm/vm86.h |  2 +-
>  arch/x86/kernel/vm86_32.c| 55 ++--
>  2 files changed, 10 insertions(+), 47 deletions(-)
>
> diff --git a/arch/x86/include/uapi/asm/vm86.h b/arch/x86/include/uapi/asm/vm86.h
> index d2ee4e307ef8..50004fb4590d 100644
> --- a/arch/x86/include/uapi/asm/vm86.h
> +++ b/arch/x86/include/uapi/asm/vm86.h
> @@ -106,7 +106,7 @@ struct vm86_struct {
>  /*
>   * flags masks
>   */
> -#define VM86_SCREEN_BITMAP 0x0001
> +#define VM86_SCREEN_BITMAP 0x0001  /* no longer supported */
>
>  struct vm86plus_info_struct {
> unsigned long force_return_for_pic:1;
> diff --git a/arch/x86/kernel/vm86_32.c b/arch/x86/kernel/vm86_32.c
> index 764573de3996..28b9e8d511e1 100644
> --- a/arch/x86/kernel/vm86_32.c
> +++ b/arch/x86/kernel/vm86_32.c
> @@ -160,49 +160,6 @@ void save_v86_state(struct kernel_vm86_regs *regs, int retval)
> do_exit(SIGSEGV);
>  }
>
> -static void mark_screen_rdonly(struct mm_struct *mm)
> -{
> -   struct vm_area_struct *vma;
> -   spinlock_t *ptl;
> -   pgd_t *pgd;
> -   p4d_t *p4d;
> -   pud_t *pud;
> -   pmd_t *pmd;
> -   pte_t *pte;
> -   int i;
> -
> -   mmap_write_lock(mm);
> -   pgd = pgd_offset(mm, 0xA0000);
> -   if (pgd_none_or_clear_bad(pgd))
> -   goto out;
> -   p4d = p4d_offset(pgd, 0xA0000);
> -   if (p4d_none_or_clear_bad(p4d))
> -   goto out;
> -   pud = pud_offset(p4d, 0xA0000);
> -   if (pud_none_or_clear_bad(pud))
> -   goto out;
> -   pmd = pmd_offset(pud, 0xA0000);
> -
> -   if (pmd_trans_huge(*pmd)) {
> -   vma = find_vma(mm, 0xA0000);
> -   split_huge_pmd(vma, pmd, 0xA0000);
> -   }
> -   if (pmd_none_or_clear_bad(pmd))
> -   goto out;
> -   pte = pte_offset_map_lock(mm, pmd, 0xA0000, &ptl);
> -   for (i = 0; i < 32; i++) {
> -   if (pte_present(*pte))
> -   set_pte(pte, pte_wrprotect(*pte));
> -   pte++;
> -   }
> -   pte_unmap_unlock(pte, ptl);
> -out:
> -   mmap_write_unlock(mm);
> -   flush_tlb_mm_range(mm, 0xA0000, 0xA0000 + 32*PAGE_SIZE, PAGE_SHIFT, false);
> -}
> -
> -
> -
>  static int do_vm86_irq_handling(int subfunction, int irqnumber);
>  static long do_sys_vm86(struct vm86plus_struct __user *user_vm86, bool plus);
>
> @@ -282,6 +239,15 @@ static long do_sys_vm86(struct vm86plus_struct __user *user_vm86, bool plus)
> offsetof(struct vm86_struct, int_revectored)))
> return -EFAULT;
>
> +
> +   /* VM86_SCREEN_BITMAP had numerous bugs and appears to have no users. */
> +   if (v.flags & VM86_SCREEN_BITMAP) {
> +   char comm[TASK_COMM_LEN];
> +
> +   pr_info_once("vm86: '%s' uses VM86_SCREEN_BITMAP, which is no longer supported\n", get_task_comm(comm, current));
> +   return -EINVAL;
> +   }
> +
> memset(&vm86regs, 0, sizeof(vm86regs));
>
> vm86regs.pt.bx = v.regs.ebx;
> @@ -370,9 +336,6 @@ static long do_sys_vm86(struct vm86plus_struct __user *user_vm86, bool plus)
>     update_task_stack(tsk);
> preempt_enable();
>
> -   if (vm86->flags & VM86_SCREEN_BITMAP)
> -   mark_screen_rdonly(tsk->mm);
> -
> memcpy((struct kernel_vm86_regs *)regs, &vm86regs, sizeof(vm86regs));
> return regs->ax;
>  }

You can also remove screen_bitmap from struct vm86 (the kernel
internal structure), and the check_v8086_mode() function.
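
i.e. something along these lines (untested sketch, from memory of where
the field lives, so double-check the exact header and context):

	--- a/arch/x86/include/asm/vm86.h
	+++ b/arch/x86/include/asm/vm86.h
	@@ struct vm86 {
	 	unsigned long flags;
	-	unsigned long screen_bitmap;
	 	unsigned long cpu_type;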

--
Brian Gerst


Re: [PATCH] xfs: Wake CIL push waiters more reliably

2021-01-08 Thread Brian Foster
On Fri, Jan 08, 2021 at 08:54:44AM +1100, Dave Chinner wrote:
> On Mon, Jan 04, 2021 at 11:23:53AM -0500, Brian Foster wrote:
> > On Thu, Dec 31, 2020 at 09:16:11AM +1100, Dave Chinner wrote:
> > > On Wed, Dec 30, 2020 at 12:56:27AM +0100, Donald Buczek wrote:
> > > > If the value goes below the limit while some threads are
> > > > already waiting but before the push worker gets to it, these threads are
> > > > not woken.
> > > > 
> > > > Always wake all CIL push waiters. Test with waitqueue_active() as an
> > > > optimization. This is possible, because we hold the xc_push_lock
> > > > spinlock, which prevents additions to the waitqueue.
> > > > 
> > > > Signed-off-by: Donald Buczek 
> > > > ---
> > > >  fs/xfs/xfs_log_cil.c | 2 +-
> > > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > > > 
> > > > diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
> > > > index b0ef071b3cb5..d620de8e217c 100644
> > > > --- a/fs/xfs/xfs_log_cil.c
> > > > +++ b/fs/xfs/xfs_log_cil.c
> > > > @@ -670,7 +670,7 @@ xlog_cil_push_work(
> > > > /*
> > > >  * Wake up any background push waiters now this context is being pushed.
> > > >  */
> > > > -   if (ctx->space_used >= XLOG_CIL_BLOCKING_SPACE_LIMIT(log))
> > > > +   if (waitqueue_active(&cil->xc_push_wait))
> > > > wake_up_all(&cil->xc_push_wait);
> > > 
> > > That just smells wrong to me. It *might* be correct, but this
> > > condition should pair with the sleep condition, as space used by a
> > > CIL context should never actually decrease
> > > 
> > 
> > ... but I'm a little confused by this assertion. The shadow buffer
> > allocation code refers to the possibility of shadow buffers falling out
> > that are smaller than currently allocated buffers. Further, the
> > _insert_format_items() code appears to explicitly optimize for this
> > possibility by reusing the active buffer, subtracting the old size/count
> > values from the diff variables and then reformatting the latest
> > (presumably smaller) item to the lv.
> 
> Individual items might shrink, but the overall transaction should
> grow. Think of an extent to btree conversion of an inode fork. The
> data in the inode fork decreases from a list of extents to a btree
> root block pointer, so the inode item shrinks. But then we add a new
> btree root block that contains all the extents + the btree block
> header, and it gets rounded up to the 128 byte buffer logging chunk
> size.
> 
> IOWs, while the inode item has decreased in size, the overall
> space consumed by the transaction has gone up and so the CIL ctx
> used_space should increase. Hence we can't just look at individual
> log items and whether they have decreased in size - we have to look
> at all the items in the transaction to understand how the space used
> in that transaction has changed. i.e. it's the aggregation of all
> items in the transaction that matter here, not so much the
> individual items.
> 

Ok, that makes more sense...

> > Of course this could just be implementation detail. I haven't dug into
> > the details in the remainder of this thread and I don't have specific
> > examples off the top of my head, but perhaps based on the ability of
> > various structures to change formats and the ability of log vectors to
> > shrink in size, shouldn't we expect the possibility of a CIL context to
> > shrink in size as well? Just from poking around the CIL it seems like
> > the surrounding code supports it (xlog_cil_insert_items() checks len > 0
> > for recalculating split res as well)...
> 
> Yes, there may be situations where it decreases. It may be this is
> fine, but the assumption *I've made* in lots of the CIL push code is
> that ctx->used_space rarely, if ever, will go backwards.
> 

... and rarely seems a bit more pragmatic than never.

> e.g. we run the first transaction into the CIL, it steals the space
> needed for the CIL checkpoint headers for the transaction. Then if
> the space returned by the item formatting is negative (because it is
> in the AIL and being relogged), the CIL checkpoint now doesn't have
> the space reserved it needs to run a checkpoint. That transaction is
> a sync transaction, so it forces the log, and now we push the CIL
> without sufficient reservation to write out the log headers and the
> items we just formatted
> 

Hmmm... that seem

Re: [PATCH] xfs: Wake CIL push waiters more reliably

2021-01-04 Thread Brian Foster
On Thu, Dec 31, 2020 at 09:16:11AM +1100, Dave Chinner wrote:
> On Wed, Dec 30, 2020 at 12:56:27AM +0100, Donald Buczek wrote:
> > Threads, which committed items to the CIL, wait in the xc_push_wait
> > waitqueue when used_space in the push context goes over a limit. These
> > threads need to be woken when the CIL is pushed.
> > 
> > The CIL push worker tries to avoid the overhead of calling wake_all()
> > when there are no waiters waiting. It does so by checking the same
> > condition which caused the waits to happen. This, however, is
> > unreliable, because ctx->space_used can actually decrease when items are
> > recommitted.
> 
> When does this happen?
> 
> Do you have tracing showing the operation where the relogged item
> has actually gotten smaller? By definition, relogging in the CIL
> should only grow the size of the object in the CIL because it must
> relog all the existing changes on top of the new changes being made
> to the object. Hence the CIL reservation should only ever grow.
> 
> IOWs, returning negative lengths from the formatting code is
> unexpected and probably a bug and requires further investigation,
> not papering over the occurrence with broadcast wakeups...
> 

I agree that this warrants a bit more explanation and analysis before
changing the current code...

> > If the value goes below the limit while some threads are
> > already waiting but before the push worker gets to it, these threads are
> > not woken.
> > 
> > Always wake all CIL push waiters. Test with waitqueue_active() as an
> > optimization. This is possible, because we hold the xc_push_lock
> > spinlock, which prevents additions to the waitqueue.
> > 
> > Signed-off-by: Donald Buczek 
> > ---
> >  fs/xfs/xfs_log_cil.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
> > index b0ef071b3cb5..d620de8e217c 100644
> > --- a/fs/xfs/xfs_log_cil.c
> > +++ b/fs/xfs/xfs_log_cil.c
> > @@ -670,7 +670,7 @@ xlog_cil_push_work(
> > /*
> >  * Wake up any background push waiters now this context is being pushed.
> >  */
> > -   if (ctx->space_used >= XLOG_CIL_BLOCKING_SPACE_LIMIT(log))
> > +   if (waitqueue_active(&cil->xc_push_wait))
> > wake_up_all(&cil->xc_push_wait);
> 
> That just smells wrong to me. It *might* be correct, but this
> condition should pair with the sleep condition, as space used by a
> CIL context should never actually decrease
> 

... but I'm a little confused by this assertion. The shadow buffer
allocation code refers to the possibility of shadow buffers falling out
that are smaller than currently allocated buffers. Further, the
_insert_format_items() code appears to explicitly optimize for this
possibility by reusing the active buffer, subtracting the old size/count
values from the diff variables and then reformatting the latest
(presumably smaller) item to the lv.

Of course this could just be implementation detail. I haven't dug into
the details in the remainder of this thread and I don't have specific
examples off the top of my head, but perhaps based on the ability of
various structures to change formats and the ability of log vectors to
shrink in size, shouldn't we expect the possibility of a CIL context to
shrink in size as well? Just from poking around the CIL it seems like
the surrounding code supports it (xlog_cil_insert_items() checks len > 0
for recalculating split res as well)...

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> da...@fromorbit.com
> 



Re: [PATCH 1/4] ARM: dts: qcom: msm8974: add gpu support

2020-12-30 Thread Brian Masney
On Wed, Dec 30, 2020 at 05:51:29PM +0200, Iskren Chernev wrote:
> From: Brian Masney 
> 
> Add support for the a3xx GPU
> 
> Signed-off-by: Brian Masney 

Reviewed-by: Brian Masney 



Re: [PATCH 1/2] drm/msm: Call msm_init_vram before binding the gpu

2020-12-30 Thread Brian Masney
On Wed, Dec 30, 2020 at 05:29:42PM +0200, Iskren Chernev wrote:
> From: Craig Tatlor 
> 
> vram.size is needed when binding a gpu without an iommu and is defined
> in msm_init_vram(), so run that before binding it.
> 
> Signed-off-by: Craig Tatlor 

For the series:

Reviewed-by: Brian Masney 

Next time, please include a cover letter so that tags added to the cover
letter can be applied to all patches in the series via patchwork.

Brian


Re: [PATCH 2/6] mm/mremap: For MREMAP_DONTUNMAP check security_vm_enough_memory_mm()

2020-12-30 Thread Brian Geffon
Ah, you're right. This individual patch looks good to me.

Brian

On Mon, Dec 28, 2020 at 11:12 AM Dmitry Safonov  wrote:
>
> On 12/28/20 6:21 PM, Brian Geffon wrote:
> > This looks good to me with a small comment.
> >
> >> if (do_munmap(mm, old_addr, old_len, uf_unmap) < 0) {
> >> /* OOM: unable to split vma, just get accounts right */
> >> -   if (vm_flags & VM_ACCOUNT)
> >> +   if (vm_flags & VM_ACCOUNT && !(flags & MREMAP_DONTUNMAP))
> >> vm_acct_memory(new_len >> PAGE_SHIFT);
> >
> > Checking MREMAP_DONTUNMAP in the do_munmap path is unnecessary as
> > MREMAP_DONTUNMAP will have already returned by this point.
>
> In this code it is also used as err-path. In case move_page_tables()
> fails to move all page tables or .mremap() callback fails, the new VMA
> is unmapped.
>
> IOW, MREMAP_DONTUNMAP returns under:
> :   if (unlikely(!err && (flags & MREMAP_DONTUNMAP))) {
>
> --
>   Dima


Re: [PATCH 2/6] mm/mremap: For MREMAP_DONTUNMAP check security_vm_enough_memory_mm()

2020-12-28 Thread Brian Geffon
This looks good to me with a small comment.

> if (do_munmap(mm, old_addr, old_len, uf_unmap) < 0) {
> /* OOM: unable to split vma, just get accounts right */
> -   if (vm_flags & VM_ACCOUNT)
> +   if (vm_flags & VM_ACCOUNT && !(flags & MREMAP_DONTUNMAP))
> vm_acct_memory(new_len >> PAGE_SHIFT);

Checking MREMAP_DONTUNMAP in the do_munmap path is unnecessary as
MREMAP_DONTUNMAP will have already returned by this point.


Re: [PATCH 3/6] mremap: Don't allow MREMAP_DONTUNMAP on special_mappings and aio

2020-12-28 Thread Brian Geffon
I don't think this situation can ever happen: MREMAP_DONTUNMAP is
already restricted to anonymous mappings (defined as not having
vm_ops), and vma_to_resize() checks that the mapping is anonymous
before move_vma() is called.
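
For reference, the guard I mean looks roughly like this (paraphrased
from memory of vma_to_resize(), so treat the exact condition as an
assumption):

	/* MREMAP_DONTUNMAP only makes sense for private anonymous VMAs. */
	if (flags & MREMAP_DONTUNMAP &&
	    (!vma_is_anonymous(vma) || vma->vm_flags & VM_SHARED))
		return ERR_PTR(-EINVAL);

Anything with vm_ops (the aio ring, special mappings) fails that
check, so move_vma() never reaches the ->mremap() callback with
MREMAP_DONTUNMAP set.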



On Mon, Oct 12, 2020 at 6:34 PM Dmitry Safonov  wrote:
>
> As the kernel expects to see only one such mapping, any further
> operations on the VMA-copy may be unexpected by the kernel.
> Maybe it's being on the safe side, but there doesn't seem to be any
> expected use-case for this, so restrict it now.
>
> Fixes: commit e346b3813067 ("mm/mremap: add MREMAP_DONTUNMAP to mremap()")
> Signed-off-by: Dmitry Safonov 
> ---
>  arch/x86/kernel/cpu/resctrl/pseudo_lock.c | 2 +-
>  fs/aio.c  | 5 -
>  include/linux/mm.h| 2 +-
>  mm/mmap.c | 6 +-
>  mm/mremap.c   | 2 +-
>  5 files changed, 12 insertions(+), 5 deletions(-)
>
> diff --git a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
> index 0daf2f1cf7a8..e916646adc69 100644
> --- a/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
> +++ b/arch/x86/kernel/cpu/resctrl/pseudo_lock.c
> @@ -1458,7 +1458,7 @@ static int pseudo_lock_dev_release(struct inode *inode, struct file *filp)
> return 0;
>  }
>
> -static int pseudo_lock_dev_mremap(struct vm_area_struct *area)
> +static int pseudo_lock_dev_mremap(struct vm_area_struct *area, unsigned long flags)
>  {
> /* Not supported */
> return -EINVAL;
> diff --git a/fs/aio.c b/fs/aio.c
> index d5ec30385566..3be3c0f77548 100644
> --- a/fs/aio.c
> +++ b/fs/aio.c
> @@ -324,13 +324,16 @@ static void aio_free_ring(struct kioctx *ctx)
> }
>  }
>
> -static int aio_ring_mremap(struct vm_area_struct *vma)
> +static int aio_ring_mremap(struct vm_area_struct *vma, unsigned long flags)
>  {
> struct file *file = vma->vm_file;
> struct mm_struct *mm = vma->vm_mm;
> struct kioctx_table *table;
> int i, res = -EINVAL;
>
> +   if (flags & MREMAP_DONTUNMAP)
> +   return -EINVAL;
> +
> spin_lock(&mm->ioctx_lock);
> rcu_read_lock();
> table = rcu_dereference(mm->ioctx_table);
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 16b799a0522c..fd51a4a1f722 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -550,7 +550,7 @@ struct vm_operations_struct {
> void (*open)(struct vm_area_struct * area);
> void (*close)(struct vm_area_struct * area);
> int (*split)(struct vm_area_struct * area, unsigned long addr);
> -   int (*mremap)(struct vm_area_struct * area);
> +   int (*mremap)(struct vm_area_struct *area, unsigned long flags);
> vm_fault_t (*fault)(struct vm_fault *vmf);
> vm_fault_t (*huge_fault)(struct vm_fault *vmf,
> enum page_entry_size pe_size);
> diff --git a/mm/mmap.c b/mm/mmap.c
> index bdd19f5b994e..50f853b0ec39 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -3372,10 +3372,14 @@ static const char *special_mapping_name(struct vm_area_struct *vma)
> return ((struct vm_special_mapping *)vma->vm_private_data)->name;
>  }
>
> -static int special_mapping_mremap(struct vm_area_struct *new_vma)
> +static int special_mapping_mremap(struct vm_area_struct *new_vma,
> + unsigned long flags)
>  {
> struct vm_special_mapping *sm = new_vma->vm_private_data;
>
> +   if (flags & MREMAP_DONTUNMAP)
> +   return -EINVAL;
> +
> if (WARN_ON_ONCE(current->mm != new_vma->vm_mm))
> return -EFAULT;
>
> diff --git a/mm/mremap.c b/mm/mremap.c
> index c248f9a52125..898e9818ba6d 100644
> --- a/mm/mremap.c
> +++ b/mm/mremap.c
> @@ -384,7 +384,7 @@ static unsigned long move_vma(struct vm_area_struct *vma,
> if (moved_len < old_len) {
> err = -ENOMEM;
> } else if (vma->vm_ops && vma->vm_ops->mremap) {
> -   err = vma->vm_ops->mremap(new_vma);
> +   err = vma->vm_ops->mremap(new_vma, flags);
> }
>
> if (unlikely(err)) {
> --
> 2.28.0
>


[tip: x86/urgent] fanotify: Fix sys_fanotify_mark() on native x86-32

2020-12-28 Thread tip-bot2 for Brian Gerst
The following commit has been merged into the x86/urgent branch of tip:

Commit-ID:     2ca408d9c749c32288bc28725f9f12ba30299e8f
Gitweb:        https://git.kernel.org/tip/2ca408d9c749c32288bc28725f9f12ba30299e8f
Author:        Brian Gerst
AuthorDate:    Mon, 30 Nov 2020 17:30:59 -05:00
Committer:     Borislav Petkov
CommitterDate: Mon, 28 Dec 2020 11:58:59 +01:00

fanotify: Fix sys_fanotify_mark() on native x86-32

Commit

  121b32a58a3a ("x86/entry/32: Use IA32-specific wrappers for syscalls taking 64-bit arguments")

converted native x86-32 which take 64-bit arguments to use the
compat handlers to allow conversion to passing args via pt_regs.
sys_fanotify_mark() was however missed, as it has a general compat
handler. Add a config option that will use the syscall wrapper that
takes the split args for native 32-bit.
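
To illustrate, the SC_ARG64()/SC_VAL64() helpers below just do the
usual two-register split; the reassembly amounts to (little-endian
case, with illustrative names, not identifiers from this patch):

	static inline __u64 join_u64(__u32 lo, __u32 hi)
	{
		return ((__u64)hi << 32) | lo;
	}

which matches the open-coded ((__u64)mask1 << 32) | mask0 being
removed from the old compat handler.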

 [ bp: Fix typo in Kconfig help text. ]

Fixes: 121b32a58a3a ("x86/entry/32: Use IA32-specific wrappers for syscalls taking 64-bit arguments")
Reported-by: Paweł Jasiak 
Signed-off-by: Brian Gerst 
Signed-off-by: Borislav Petkov 
Acked-by: Jan Kara 
Acked-by: Andy Lutomirski 
Link: https://lkml.kernel.org/r/20201130223059.101286-1-brge...@gmail.com
---
 arch/Kconfig   |  6 ++
 arch/x86/Kconfig   |  1 +
 fs/notify/fanotify/fanotify_user.c | 17 +++--
 include/linux/syscalls.h   | 24 
 4 files changed, 38 insertions(+), 10 deletions(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index 78c6f05..24862d1 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -1105,6 +1105,12 @@ config HAVE_ARCH_PFN_VALID
 config ARCH_SUPPORTS_DEBUG_PAGEALLOC
bool
 
+config ARCH_SPLIT_ARG64
+   bool
+   help
+  If a 32-bit architecture requires 64-bit arguments to be split into
+  pairs of 32-bit arguments, select this option.
+
 source "kernel/gcov/Kconfig"
 
 source "scripts/gcc-plugins/Kconfig"
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 7b6dd10..21f8511 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -19,6 +19,7 @@ config X86_32
select KMAP_LOCAL
select MODULES_USE_ELF_REL
select OLD_SIGACTION
+   select ARCH_SPLIT_ARG64
 
 config X86_64
def_bool y
diff --git a/fs/notify/fanotify/fanotify_user.c b/fs/notify/fanotify/fanotify_user.c
index 3e01d8f..dcab112 100644
--- a/fs/notify/fanotify/fanotify_user.c
+++ b/fs/notify/fanotify/fanotify_user.c
@@ -1285,26 +1285,23 @@ fput_and_out:
return ret;
 }
 
+#ifndef CONFIG_ARCH_SPLIT_ARG64
 SYSCALL_DEFINE5(fanotify_mark, int, fanotify_fd, unsigned int, flags,
  __u64, mask, int, dfd,
  const char  __user *, pathname)
 {
return do_fanotify_mark(fanotify_fd, flags, mask, dfd, pathname);
 }
+#endif
 
-#ifdef CONFIG_COMPAT
-COMPAT_SYSCALL_DEFINE6(fanotify_mark,
+#if defined(CONFIG_ARCH_SPLIT_ARG64) || defined(CONFIG_COMPAT)
+SYSCALL32_DEFINE6(fanotify_mark,
int, fanotify_fd, unsigned int, flags,
-   __u32, mask0, __u32, mask1, int, dfd,
+   SC_ARG64(mask), int, dfd,
const char  __user *, pathname)
 {
-   return do_fanotify_mark(fanotify_fd, flags,
-#ifdef __BIG_ENDIAN
-   ((__u64)mask0 << 32) | mask1,
-#else
-   ((__u64)mask1 << 32) | mask0,
-#endif
-dfd, pathname);
+   return do_fanotify_mark(fanotify_fd, flags, SC_VAL64(__u64, mask),
+   dfd, pathname);
 }
 #endif
 
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index f3929af..7688bc9 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -251,6 +251,30 @@ static inline int is_syscall_trace_event(struct trace_event_call *tp_event)
static inline long __do_sys##name(__MAP(x,__SC_DECL,__VA_ARGS__))
 #endif /* __SYSCALL_DEFINEx */
 
+/* For split 64-bit arguments on 32-bit architectures */
+#ifdef __LITTLE_ENDIAN
+#define SC_ARG64(name) u32, name##_lo, u32, name##_hi
+#else
+#define SC_ARG64(name) u32, name##_hi, u32, name##_lo
+#endif
+#define SC_VAL64(type, name) ((type) name##_hi << 32 | name##_lo)
+
+#ifdef CONFIG_COMPAT
+#define SYSCALL32_DEFINE1 COMPAT_SYSCALL_DEFINE1
+#define SYSCALL32_DEFINE2 COMPAT_SYSCALL_DEFINE2
+#define SYSCALL32_DEFINE3 COMPAT_SYSCALL_DEFINE3
+#define SYSCALL32_DEFINE4 COMPAT_SYSCALL_DEFINE4
+#define SYSCALL32_DEFINE5 COMPAT_SYSCALL_DEFINE5
+#define SYSCALL32_DEFINE6 COMPAT_SYSCALL_DEFINE6
+#else
+#define SYSCALL32_DEFINE1 SYSCALL_DEFINE1
+#define SYSCALL32_DEFINE2 SYSCALL_DEFINE2
+#define SYSCALL32_DEFINE3 SYSCALL_DEFINE3
+#define SYSCALL32_DEFINE4 SYSCALL_DEFINE4
+#define SYSCALL32_DEFINE5 SYSCALL_DEFINE5
+#define SYSCALL32_DEFINE6 SYSCALL_DEFINE6
+#endif
+
 /*
  * Called before coming back to user-mode. Returning to user

Re: v5.10.1 xfs deadlock

2020-12-18 Thread Brian Foster
On Thu, Dec 17, 2020 at 10:30:37PM +0100, Donald Buczek wrote:
> On 17.12.20 20:43, Brian Foster wrote:
> > On Thu, Dec 17, 2020 at 06:44:51PM +0100, Donald Buczek wrote:
> > > Dear xfs developer,
> > > 
> > > I was doing some testing on a Linux 5.10.1 system with two 100 TB xfs 
> > > filesystems on md raid6 raids.
> > > 
> > > The stress test was essentially `cp -a`ing a Linux source repository with 
> > > two threads in parallel on each filesystem.
> > > 
> > > After about an hour, the processes to one filesystem (md1) blocked, 30
> > > minutes later the process to the other filesystem (md0) did.
> > > 
> > >  root  7322  2167  0 Dec16 pts/1    00:00:06 cp -a /jbod/M8068/scratch/linux /jbod/M8068/scratch/1/linux.018.TMP
> > >  root  7329  2169  0 Dec16 pts/1    00:00:05 cp -a /jbod/M8068/scratch/linux /jbod/M8068/scratch/2/linux.019.TMP
> > >  root 13856  2170  0 Dec16 pts/1    00:00:08 cp -a /jbod/M8067/scratch/linux /jbod/M8067/scratch/2/linux.028.TMP
> > >  root 13899  2168  0 Dec16 pts/1    00:00:05 cp -a /jbod/M8067/scratch/linux /jbod/M8067/scratch/1/linux.027.TMP
> > > 

Do you have any indication of whether these workloads actually hung or
just became incredibly slow?

> > > Some info from the system (all stack traces, slabinfo) is available here: 
> > > https://owww.molgen.mpg.de/~buczek/2020-12-16.info.txt
> > > 
> > > It stands out that there are many (549 for md0, but only 10 for md1)
> > > "xfs-conv" threads all with stacks like this
> > > 
> > >  [<0>] xfs_log_commit_cil+0x6cc/0x7c0
> > >  [<0>] __xfs_trans_commit+0xab/0x320
> > >  [<0>] xfs_iomap_write_unwritten+0xcb/0x2e0
> > >  [<0>] xfs_end_ioend+0xc6/0x110
> > >  [<0>] xfs_end_io+0xad/0xe0
> > >  [<0>] process_one_work+0x1dd/0x3e0
> > >  [<0>] worker_thread+0x2d/0x3b0
> > >  [<0>] kthread+0x118/0x130
> > >  [<0>] ret_from_fork+0x22/0x30
> > > 
> > > xfs_log_commit_cil+0x6cc is
> > > 
> > >xfs_log_commit_cil()
> > >  xlog_cil_push_background(log)
> > >xlog_wait(&cil->xc_push_wait, &cil->xc_push_lock);
> > > 

This looks like the transaction commit throttling code. That was
introduced earlier this year in v5.7 via commit 0e7ab7efe7745 ("xfs:
Throttle commits on delayed background CIL push"). The purpose of that
change was to prevent the CIL from growing too large. FWIW, I don't
recall that being a functional problem so it should be possible to
simply remove that blocking point and see if that avoids the problem or
if we simply stall out somewhere else, if you wanted to give that a
test.
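
Untested sketch of what I mean (going from memory of the v5.10 code,
so the exact context in xlog_cil_push_background() may differ):

	--- a/fs/xfs/xfs_log_cil.c
	+++ b/fs/xfs/xfs_log_cil.c
	@@ xlog_cil_push_background(
	-	if (cil->xc_ctx->space_used >= XLOG_CIL_BLOCKING_SPACE_LIMIT(log)) {
	-		trace_xfs_log_cil_wait(log, cil->xc_ctx->ticket);
	-		xlog_wait(&cil->xc_push_wait, &cil->xc_push_lock);
	-		return;
	-	}
	-
	 	spin_unlock(&cil->xc_push_lock);

That leaves the background push queueing alone and only drops the
sleep, so it should tell us whether the stall is the throttle itself
or something underneath it. Note xlog_wait() drops xc_push_lock, so
removing the whole block (rather than just the wait) keeps the
locking balanced.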

> > > Some other threads, including the four "cp" commands are also blocking at 
> > > xfs_log_commit_cil+0x6cc
> > > 
> > > There are also single "flush" process for each md device with this stack 
> > > signature:
> > > 
> > >  [<0>] xfs_map_blocks+0xbf/0x400
> > >  [<0>] iomap_do_writepage+0x15e/0x880
> > >  [<0>] write_cache_pages+0x175/0x3f0
> > >  [<0>] iomap_writepages+0x1c/0x40
> > >  [<0>] xfs_vm_writepages+0x59/0x80
> > >  [<0>] do_writepages+0x4b/0xe0
> > >  [<0>] __writeback_single_inode+0x42/0x300
> > >  [<0>] writeback_sb_inodes+0x198/0x3f0
> > >  [<0>] __writeback_inodes_wb+0x5e/0xc0
> > >  [<0>] wb_writeback+0x246/0x2d0
> > >  [<0>] wb_workfn+0x26e/0x490
> > >  [<0>] process_one_work+0x1dd/0x3e0
> > >  [<0>] worker_thread+0x2d/0x3b0
> > >  [<0>] kthread+0x118/0x130
> > >  [<0>] ret_from_fork+0x22/0x30
> > > 

Is writeback still blocked as such or was this just a transient stack?

> > > xfs_map_blocks+0xbf is the
> > > 
> > >  xfs_ilock(ip, XFS_ILOCK_SHARED);
> > > 
> > > in xfs_map_blocks().
> > > 
> > > The system is low on free memory
> > > 
> > >  MemTotal:   197587764 kB
> > >  MemFree:  2196496 kB
> > >  MemAvailable:   189895408 kB
> > > 
> > > but responsive.
> > > 
> > > I have an out of tree driver for the HBA ( smartpqi 2.1.6-005 pulled from 
> > > linux-scsi) , but it is unlikely that this blocking is related

Re: [PATCH 0/3] mac80211: Trigger disconnect for STA during recovery

2020-12-17 Thread Brian Norris
On Thu, Dec 17, 2020 at 2:57 PM Ben Greear  wrote:
> On 12/17/20 2:24 PM, Brian Norris wrote:
> > I'd also note that we don't operate in AP mode -- only STA -- and IIRC
> > Ben, you've complained about AP mode in the past.
>
> I complain about all sorts of things, but I'm usually running
> station mode :)

Hehe, fair :) Maybe I'm mixed up.

But I do get the feeling that specifically within the ath10k family,
there are wildly different use cases (mobile, PC, AP) and chips (and
firmware) that tend to go along with them, and that those use cases
get a fairly different population of {developers, testers, reporters}.
So claiming "feature X works" pretty much always has to be couched in
which chips, firmware, and use case. And there's certainly some wisdom
in these sections:

https://wireless.wiki.kernel.org/en/users/drivers/ath10k/submittingpatches#hardware_families
https://wireless.wiki.kernel.org/en/users/drivers/ath10k/submittingpatches#tested-on_tag

> Do you actually see iwlwifi stations stay associated through
> firmware crashes?

Yes.

> Anyway, happy to hear some have seamless recovery, and in that case,
> I have no objections to the patch.

OK! I hope I'm not the only one with such results, because then I
still might question my sanity (and test coverage), but that's still
my understanding.

BTW, I haven't yet closely reviewed the patch series myself, but I ACK
the concept.

Brian


Re: [PATCH 0/3] mac80211: Trigger disconnect for STA during recovery

2020-12-17 Thread Brian Norris
On Tue, Dec 15, 2020 at 10:51:13PM +0530, Youghandhar Chintala wrote:
> From: Rakesh Pillai 

I meant to mention in my other reply: the threading on this series is
broken (as in, it doesn't exist). It looks like you're using
git-send-email (good!), but somehow it doesn't have any In-Reply-To or
References (bad!). Did you send all your mail in one invocation, or did
you send them as separate git-send-email commands? Anyway, please
investigate what went wrong so you can get this right in the future.

For one, this affects Patchwork's ability to group patch series (not to
mention everybody who uses a decent mail reader, with proper threading).
See for example the lore archive, which only is threading replies to
this cover letter:

https://lore.kernel.org/linux-wireless/20201215172113.5038-1-yough...@codeaurora.org/

Regards,
Brian


Re: [PATCH 0/3] mac80211: Trigger disconnect for STA during recovery

2020-12-17 Thread Brian Norris
On Tue, Dec 15, 2020 at 10:23:33AM -0800, Ben Greear wrote:
> On 12/15/20 9:21 AM, Youghandhar Chintala wrote:
> > From: Rakesh Pillai 
> > 
> > Currently in case of target hardware restart, we just reconfig and
> > re-enable the security keys and enable the network queues to start
> > data traffic back from where it was interrupted.
> 
> Are there any known mac80211 radios/drivers that *can* support seamless 
> restarts?
> 
> If not, then just could always enable this feature in mac80211?

I'm quite sure that iwlwifi intentionally supports a seamless restart.
From my experience dealing with user reports, I don't recall any
issues where restart didn't function as expected, unless there was some
deeper underlying failure (e.g., hardware/power failure; driver bugs /
lockups).

I don't have very good stats for ath10k/QCA6174, but it survives
our testing OK and I again don't recall any user-reported complaints in
this area. I'd say this is a weaker example though, as I don't have as
clear of data. (By contrast, ath10k/WCN399x, which Rakesh, et al, are
patching here, does not pass our tests at all, and clearly fails to
recover from "seamless" restarts, as noted in patch 3.)

I'd also note that we don't operate in AP mode -- only STA -- and IIRC
Ben, you've complained about AP mode in the past.

Brian


Re: v5.10.1 xfs deadlock

2020-12-17 Thread Brian Foster
On Thu, Dec 17, 2020 at 06:44:51PM +0100, Donald Buczek wrote:
> Dear xfs developer,
> 
> I was doing some testing on a Linux 5.10.1 system with two 100 TB xfs 
> filesystems on md raid6 raids.
> 
> The stress test was essentially `cp -a`ing a Linux source repository with two 
> threads in parallel on each filesystem.
> 
> After about an hour, the processes to one filesystem (md1) blocked, 30
> minutes later the process to the other filesystem (md0) did.
> 
> root  7322  2167  0 Dec16 pts/1    00:00:06 cp -a /jbod/M8068/scratch/linux /jbod/M8068/scratch/1/linux.018.TMP
> root  7329  2169  0 Dec16 pts/1    00:00:05 cp -a /jbod/M8068/scratch/linux /jbod/M8068/scratch/2/linux.019.TMP
> root 13856  2170  0 Dec16 pts/1    00:00:08 cp -a /jbod/M8067/scratch/linux /jbod/M8067/scratch/2/linux.028.TMP
> root 13899  2168  0 Dec16 pts/1    00:00:05 cp -a /jbod/M8067/scratch/linux /jbod/M8067/scratch/1/linux.027.TMP
> 
> Some info from the system (all stack traces, slabinfo) is available here: 
> https://owww.molgen.mpg.de/~buczek/2020-12-16.info.txt
> 
> It stands out that there are many (549 for md0, but only 10 for md1)
> "xfs-conv" threads all with stacks like this
> 
> [<0>] xfs_log_commit_cil+0x6cc/0x7c0
> [<0>] __xfs_trans_commit+0xab/0x320
> [<0>] xfs_iomap_write_unwritten+0xcb/0x2e0
> [<0>] xfs_end_ioend+0xc6/0x110
> [<0>] xfs_end_io+0xad/0xe0
> [<0>] process_one_work+0x1dd/0x3e0
> [<0>] worker_thread+0x2d/0x3b0
> [<0>] kthread+0x118/0x130
> [<0>] ret_from_fork+0x22/0x30
> 
> xfs_log_commit_cil+0x6cc is
> 
>   xfs_log_commit_cil()
> xlog_cil_push_background(log)
>   xlog_wait(&cil->xc_push_wait, &cil->xc_push_lock);
> 
> Some other threads, including the four "cp" commands are also blocking at 
> xfs_log_commit_cil+0x6cc
> 
> There are also single "flush" process for each md device with this stack 
> signature:
> 
> [<0>] xfs_map_blocks+0xbf/0x400
> [<0>] iomap_do_writepage+0x15e/0x880
> [<0>] write_cache_pages+0x175/0x3f0
> [<0>] iomap_writepages+0x1c/0x40
> [<0>] xfs_vm_writepages+0x59/0x80
> [<0>] do_writepages+0x4b/0xe0
> [<0>] __writeback_single_inode+0x42/0x300
> [<0>] writeback_sb_inodes+0x198/0x3f0
> [<0>] __writeback_inodes_wb+0x5e/0xc0
> [<0>] wb_writeback+0x246/0x2d0
> [<0>] wb_workfn+0x26e/0x490
> [<0>] process_one_work+0x1dd/0x3e0
> [<0>] worker_thread+0x2d/0x3b0
> [<0>] kthread+0x118/0x130
> [<0>] ret_from_fork+0x22/0x30
> 
> xfs_map_blocks+0xbf is the
> 
> xfs_ilock(ip, XFS_ILOCK_SHARED);
> 
> in xfs_map_blocks().
> 
> The system is low on free memory
> 
> MemTotal:   197587764 kB
> MemFree:  2196496 kB
> MemAvailable:   189895408 kB
> 
> but responsive.
> 
> I have an out of tree driver for the HBA ( smartpqi 2.1.6-005 pulled from 
> linux-scsi), but it is unlikely that this blocking is related to that,
> because the md block devices themselves are responsive (`xxd /dev/md0`)
> 
> I can keep the system in this state for a while. Is there an idea what was
> going on, or an idea what data I could collect from the running system to
> help? I have full debug info and could walk lists or retrieve data structures 
> with gdb.
> 

It might be useful to dump the values under /sys/fs/xfs/<dev>/log/* for
each fs to get an idea of the state of the logs as well...

Brian

> Best
>   Donald
> 



[PATCH net-next v2 4/4] net: indirect call helpers for ipv4/ipv6 dst_check functions

2020-12-11 Thread Brian Vazquez
From: brianvv 

This patch avoids the indirect call for the common case:
ip6_dst_check and ipv4_dst_check
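
For context, INDIRECT_CALL_INET() from
include/linux/indirect_call_wrapper.h expands to pointer-compare fast
paths, roughly like this (simplified; the real macro also handles the
CONFIG_IPV6=n case):

	#define INDIRECT_CALL_INET(f, f6, f4, ...)			\
		(likely(f == f6) ? f6(__VA_ARGS__) :			\
		 likely(f == f4) ? f4(__VA_ARGS__) : f(__VA_ARGS__))

With retpolines enabled, the common ipv4/ipv6 case then costs a
predictable compare plus direct call instead of an indirect branch.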

Signed-off-by: brianvv 
---
 include/net/dst.h   |  7 ++-
 net/core/sock.c | 12 ++--
 net/ipv4/route.c|  7 +--
 net/ipv4/tcp_ipv4.c |  5 -
 net/ipv6/route.c|  7 +--
 net/ipv6/tcp_ipv6.c |  5 -
 6 files changed, 34 insertions(+), 9 deletions(-)

diff --git a/include/net/dst.h b/include/net/dst.h
index 9f474a79ed7d..26f134ad3a25 100644
--- a/include/net/dst.h
+++ b/include/net/dst.h
@@ -459,10 +459,15 @@ static inline int dst_input(struct sk_buff *skb)
  ip6_input, ip_local_deliver, skb);
 }
 
+INDIRECT_CALLABLE_DECLARE(struct dst_entry *ip6_dst_check(struct dst_entry *,
+ u32));
+INDIRECT_CALLABLE_DECLARE(struct dst_entry *ipv4_dst_check(struct dst_entry *,
+  u32));
 static inline struct dst_entry *dst_check(struct dst_entry *dst, u32 cookie)
 {
if (dst->obsolete)
-   dst = dst->ops->check(dst, cookie);
+   dst = INDIRECT_CALL_INET(dst->ops->check, ip6_dst_check,
+ipv4_dst_check, dst, cookie);
return dst;
 }
 
diff --git a/net/core/sock.c b/net/core/sock.c
index 4fd7e785f177..753b831a9d70 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -526,11 +526,17 @@ int __sk_receive_skb(struct sock *sk, struct sk_buff *skb,
 }
 EXPORT_SYMBOL(__sk_receive_skb);
 
+INDIRECT_CALLABLE_DECLARE(struct dst_entry *ip6_dst_check(struct dst_entry *,
+ u32));
+INDIRECT_CALLABLE_DECLARE(struct dst_entry *ipv4_dst_check(struct dst_entry *,
+  u32));
 struct dst_entry *__sk_dst_check(struct sock *sk, u32 cookie)
 {
struct dst_entry *dst = __sk_dst_get(sk);
 
-   if (dst && dst->obsolete && dst->ops->check(dst, cookie) == NULL) {
+   if (dst && dst->obsolete &&
+   INDIRECT_CALL_INET(dst->ops->check, ip6_dst_check, ipv4_dst_check,
+  dst, cookie) == NULL) {
sk_tx_queue_clear(sk);
sk->sk_dst_pending_confirm = 0;
RCU_INIT_POINTER(sk->sk_dst_cache, NULL);
@@ -546,7 +552,9 @@ struct dst_entry *sk_dst_check(struct sock *sk, u32 cookie)
 {
struct dst_entry *dst = sk_dst_get(sk);
 
-   if (dst && dst->obsolete && dst->ops->check(dst, cookie) == NULL) {
+   if (dst && dst->obsolete &&
+   INDIRECT_CALL_INET(dst->ops->check, ip6_dst_check, ipv4_dst_check,
+  dst, cookie) == NULL) {
sk_dst_reset(sk);
dst_release(dst);
return NULL;
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 4fac91f8bd6c..9e6537709794 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -133,7 +133,8 @@ static int ip_rt_gc_timeout __read_mostly   = RT_GC_TIMEOUT;
  * Interface to generic destination cache.
  */
 
-static struct dst_entry *ipv4_dst_check(struct dst_entry *dst, u32 cookie);
+INDIRECT_CALLABLE_SCOPE
+struct dst_entry   *ipv4_dst_check(struct dst_entry *dst, u32 cookie);
 static unsigned int ipv4_default_advmss(const struct dst_entry *dst);
 INDIRECT_CALLABLE_SCOPE
 unsigned int   ipv4_mtu(const struct dst_entry *dst);
@@ -1188,7 +1189,8 @@ void ipv4_sk_redirect(struct sk_buff *skb, struct sock *sk)
 }
 EXPORT_SYMBOL_GPL(ipv4_sk_redirect);
 
-static struct dst_entry *ipv4_dst_check(struct dst_entry *dst, u32 cookie)
+INDIRECT_CALLABLE_SCOPE struct dst_entry *ipv4_dst_check(struct dst_entry *dst,
+u32 cookie)
 {
struct rtable *rt = (struct rtable *) dst;
 
@@ -1204,6 +1206,7 @@ static struct dst_entry *ipv4_dst_check(struct dst_entry *dst, u32 cookie)
return NULL;
return dst;
 }
+EXPORT_SYMBOL(ipv4_dst_check);
 
 static void ipv4_send_dest_unreach(struct sk_buff *skb)
 {
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index af2338294598..aba5061024c7 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1646,6 +1646,8 @@ u16 tcp_v4_get_syncookie(struct sock *sk, struct iphdr 
*iph,
return mss;
 }
 
+INDIRECT_CALLABLE_DECLARE(struct dst_entry *ipv4_dst_check(struct dst_entry *,
+  u32));
 /* The socket must have it's spinlock held when we get
  * here, unless it is a TCP_LISTEN socket.
  *
@@ -1665,7 +1667,8 @@ int tcp_v4_do_rcv(struct sock *sk, struct sk_buff *skb)
sk_mark_napi_id(sk, skb);
if (dst) {
if (inet_sk(sk)->rx_dst_ifindex != skb->skb_iif ||
-   !dst->ops->check(dst, 0)) {
+   !INDIRECT_CALL_1(dst->ops->check, ipv4_dst_check,
+dst, 0)

[PATCH net-next v2 3/4] net: use indirect call helpers for dst_mtu

2020-12-11 Thread Brian Vazquez
From: brianvv 

This patch avoids the indirect call for the common case:
ip6_mtu and ipv4_mtu

Signed-off-by: brianvv 
---
 include/net/dst.h | 4 +++-
 net/ipv4/route.c  | 6 --
 net/ipv6/route.c  | 6 --
 3 files changed, 11 insertions(+), 5 deletions(-)

diff --git a/include/net/dst.h b/include/net/dst.h
index 3932e9931f08..9f474a79ed7d 100644
--- a/include/net/dst.h
+++ b/include/net/dst.h
@@ -194,9 +194,11 @@ dst_feature(const struct dst_entry *dst, u32 feature)
return dst_metric(dst, RTAX_FEATURES) & feature;
 }
 
+INDIRECT_CALLABLE_DECLARE(unsigned int ip6_mtu(const struct dst_entry *));
+INDIRECT_CALLABLE_DECLARE(unsigned int ipv4_mtu(const struct dst_entry *));
 static inline u32 dst_mtu(const struct dst_entry *dst)
 {
-   return dst->ops->mtu(dst);
+   return INDIRECT_CALL_INET(dst->ops->mtu, ip6_mtu, ipv4_mtu, dst);
 }
 
 /* RTT metrics are stored in milliseconds for user ABI, but used as jiffies */
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index e26652ff7059..4fac91f8bd6c 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -135,7 +135,8 @@ static int ip_rt_gc_timeout __read_mostly   = RT_GC_TIMEOUT;
 
 static struct dst_entry *ipv4_dst_check(struct dst_entry *dst, u32 cookie);
 static unsigned int ipv4_default_advmss(const struct dst_entry *dst);
-static unsigned int ipv4_mtu(const struct dst_entry *dst);
+INDIRECT_CALLABLE_SCOPE
+unsigned int   ipv4_mtu(const struct dst_entry *dst);
 static struct dst_entry *ipv4_negative_advice(struct dst_entry *dst);
 static void ipv4_link_failure(struct sk_buff *skb);
static void ip_rt_update_pmtu(struct dst_entry *dst, struct sock *sk,
@@ -1311,7 +1312,7 @@ static unsigned int ipv4_default_advmss(const struct dst_entry *dst)
return min(advmss, IPV4_MAX_PMTU - header_size);
 }
 
-static unsigned int ipv4_mtu(const struct dst_entry *dst)
+INDIRECT_CALLABLE_SCOPE unsigned int ipv4_mtu(const struct dst_entry *dst)
 {
const struct rtable *rt = (const struct rtable *)dst;
unsigned int mtu = rt->rt_pmtu;
@@ -1333,6 +1334,7 @@ static unsigned int ipv4_mtu(const struct dst_entry *dst)
 
return mtu - lwtunnel_headroom(dst->lwtstate, mtu);
 }
+EXPORT_SYMBOL(ipv4_mtu);
 
 static void ip_del_fnhe(struct fib_nh_common *nhc, __be32 daddr)
 {
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 188e114b29b4..22caee290b6c 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -83,7 +83,8 @@ enum rt6_nud_state {
 
static struct dst_entry *ip6_dst_check(struct dst_entry *dst, u32 cookie);
 static unsigned int ip6_default_advmss(const struct dst_entry *dst);
-static unsigned int ip6_mtu(const struct dst_entry *dst);
+INDIRECT_CALLABLE_SCOPE
+unsigned int   ip6_mtu(const struct dst_entry *dst);
 static struct dst_entry *ip6_negative_advice(struct dst_entry *);
static void ip6_dst_destroy(struct dst_entry *);
static void ip6_dst_ifdown(struct dst_entry *,
@@ -3089,7 +3090,7 @@ static unsigned int ip6_default_advmss(const struct dst_entry *dst)
return mtu;
 }
 
-static unsigned int ip6_mtu(const struct dst_entry *dst)
+INDIRECT_CALLABLE_SCOPE unsigned int ip6_mtu(const struct dst_entry *dst)
 {
struct inet6_dev *idev;
unsigned int mtu;
@@ -3111,6 +3112,7 @@ static unsigned int ip6_mtu(const struct dst_entry *dst)
 
return mtu - lwtunnel_headroom(dst->lwtstate, mtu);
 }
+EXPORT_SYMBOL(ip6_mtu);
 
 /* MTU selection:
  * 1. mtu on route is locked - use it
-- 
2.29.2.576.ga3fc446d84-goog


